17

I have a table with two columns:

+---------+--------+
| keyword | color  |
+---------+--------+
| foo     | red    |
| bar     | yellow |
| fobar   | red    |
| baz     | blue   |
| bazbaz  | green  |
+---------+--------+

I need to do some kind of one-hot encoding and transform table in PostgreSQL to:

+---------+-----+--------+-------+------+
| keyword | red | yellow | green | blue |
+---------+-----+--------+-------+------+
| foo     |   1 |      0 |     0 |    0 |
| bar     |   0 |      1 |     0 |    0 |
| fobar   |   1 |      0 |     0 |    0 |
| baz     |   0 |      0 |     0 |    1 |
| bazbaz  |   0 |      0 |     1 |    0 |
+---------+-----+--------+-------+------+

Is it possible to do with SQL only? Any tips on how to get started?

Ernest
  • 8,701
  • 5
  • 40
  • 51

3 Answers3

29

If I correctly understand, you need conditional aggregation:

select keyword,
count(case when color = 'red' then 1 end) as red,
count(case when color = 'yellow' then 1 end) as yellow
-- another colors here
from t
group by keyword
Oto Shavadze
  • 40,603
  • 55
  • 152
  • 236
  • 2
    Not sure why it was downvoted, it does exactly the job. Seems so obvious now, thank you. – Ernest Aug 10 '17 at 18:54
  • 5
    This does meet the criteria of the original question, but I'm looking for a way to dynamically one hot encode 100+ columns for a data scientist. It's basically needing to pivot each column's potential values and do it dynamically. Hoping someone would have already written out the dynamic SQL! – njkroes Sep 06 '18 at 17:37
  • 1
    @njkroes, did you ever find an answer to this? I'm running into a similar situation where the list of columns I wish to encode may change – njfrazie Feb 22 '19 at 16:59
  • @njfrazie I ended up just using a bunch of IIF statements to do this as outlined here. In a situation where you are looking a large set of data, you could take all represented answer and dynamically generate something, but there's really no way to do something for one record on the fly unless you already knew what to expect. In that case, you would just write out all expected outcomes like this answer suggests. – njkroes Mar 04 '19 at 17:11
  • This does not work if you have duplicates. You would end up with numbers greater than 1. First you need to get distinct elements of that class, apply the query above and then join with the original. – Cocomico Feb 16 '23 at 07:50
1

To use this code on a table with a high number of columns, use Python to generate your queries:

1) Create a list with unique variables that you want to have as your column names and import this to Python, say as: list.

for item in list:
 print('count(case when item=' +str(item)+ 'then 1 end) as is_'+str(item)+',')

2) Copy the output (minus the last comma on the last row)

3) Then:

select keyword,

OUTPUT FROM PYTHON

from t
group by keyword
Emil
  • 1,531
  • 3
  • 22
  • 47
1

Another way to achieve the goal in your test case using tablefunc extension and COALESCE() to fill all NULL fields:

postgres=# create table t(keyword varchar,color varchar);
CREATE TABLE
postgres=# insert into t values ('foo','red'),('bar','yellow'),('fobar','red'),('baz','blue'),('bazbaz','green');
INSERT 0 5
postgres=# SELECT keyword, COALESCE(red,0) red, 
 COALESCE(blue,0) blue, COALESCE(green,0) green, 
 COALESCE(yellow,0) yellow 
 FROM crosstab(                         
  $$select keyword, color, COALESCE('1',0) as onehot from test01
    group by 1, 2 order by 1, 2$$,
  $$select distinct color from test01 order by 1$$)
 AS result(keyword varchar, blue int, green int, red int, yellow int);
 keyword | red | blue | green | yellow 
---------+-----+------+-------+--------
 bar     |   0 |    0 |     0 |      1
 baz     |   0 |    1 |     0 |      0
 bazbaz  |   0 |    0 |     1 |      0
 fobar   |   1 |    0 |     0 |      0
 foo     |   1 |    0 |     0 |      0
(5 rows)

postgres=# 

And if you just to obtain the result under psql:

postgres=# select keyword, color, COALESCE('1',0) as onehot from t
  --group by 1, 2 order by 1, 2
  \crosstabview keyword color
 keyword | red | yellow | blue | green 
---------+-----+--------+------+-------
 foo     |   1 |        |      |      
 bar     |     |      1 |      |      
 fobar   |   1 |        |      |      
 baz     |     |        |    1 |      
 bazbaz  |     |        |      |     1
(5 rows)

postgres=# 
C.C. Hsu
  • 169
  • 2