Categorical variables into multiple columns

Question

I have dataframe with categorical variable Segment

ID  Segment Var
1   AAA     1
2   BBB     0
3   BBB     1
4   AAA     1
5   CCC     1
6   AAA     0 
7   AAA     1
8   AAA     0
9   BBB     0
10  CCC     0

And I would like to transform column Segment into 3 category like this:

ID  SegmentAAA  SegmentBBB  SegmentCCC
1   1           null        null
2   null        0           null
3   null        1           null
4   1           null        null
5   null        null        1
6   0           null        null
7   1           null        null
8   0           null        null
9   null        0           null
10  null        null        0

Could you please help me with that. Thank you very much.

score 5 · Answer 1 · answered Jun 14 '18 at 16:47

5

Is that pivot ?

df.pivot(*df.columns)
Out[70]: 
Segment  AAA  BBB  CCC
ID                    
1        1.0  NaN  NaN
2        NaN  0.0  NaN
3        NaN  1.0  NaN
4        1.0  NaN  NaN
5        NaN  NaN  1.0
6        0.0  NaN  NaN
7        1.0  NaN  NaN
8        0.0  NaN  NaN
9        NaN  0.0  NaN
10       NaN  NaN  0.0

answered Jun 14 '18 at 16:47

BENY

317,841
20
164
234

1

Ooo... that is a new trick for me. Unpack columns to pivot on. Brilliant short hand code. I like it. – Scott Boston Jun 14 '18 at 16:48
1

@ScottBoston THANK YOU MAN – BENY Jun 14 '18 at 17:10

Scott Boston · Accepted Answer · 2018-06-14T16:47:39.833

3

Use:

df.set_index(['ID','Segment'])['Var']\
  .unstack()\
  .add_prefix('Segment')\
  .rename_axis([None], axis=1)\
  .reset_index()

Output:

   ID  SegmentAAA  SegmentBBB  SegmentCCC
0   1         1.0         NaN         NaN
1   2         NaN         0.0         NaN
2   3         NaN         1.0         NaN
3   4         1.0         NaN         NaN
4   5         NaN         NaN         1.0
5   6         0.0         NaN         NaN
6   7         1.0         NaN         NaN
7   8         0.0         NaN         NaN
8   9         NaN         0.0         NaN
9  10         NaN         NaN         0.0

Option 2:

pd.crosstab(df.ID,df.Segment,df.Var,aggfunc='first')

edited Jun 14 '18 at 16:47

answered Jun 14 '18 at 16:38

Scott Boston

147,308
15
139
187

1

Thank you very much for your quick help. I used second option. May I ask you how i could do it also for more Segment columns? (E.g. Segment1, Segment2, Segment3 etc.) I tried this but it does not work: pd.crosstab(df.ID,df.Segment,[col for col in df.columns if 'Segm' in col],aggfunc='first') – hta Jun 14 '18 at 17:47
You could reshape the dataframe and stack all those Segment columns into one column and do the same. If would create a new question with the appropriate data, we'll see what is most efficient. – Scott Boston Jun 14 '18 at 17:50

score 2 · Answer 3 · answered Jun 14 '18 at 16:38

2

pd.get_dummies(df).drop('Var', axis=1)

answered Jun 14 '18 at 16:38

harvpan

8,571
2
18
36

Does this method confuse what is zero and what is null? Where are the nulls? – Scott Boston Jun 14 '18 at 16:42

score 0 · Answer 4 · answered Jun 14 '18 at 17:00

Pivot would be the best option. If you don't want ID to be the index and to get the column names as indicated, this is a little more specific than Wen's answer

df.pivot(index='ID',columns='Segment',values='Var').add_prefix('Segment').reset_index()
Segment  ID  SegmentAAA  SegmentBBB  SegmentCCC
0         1         1.0         NaN         NaN
1         2         NaN         0.0         NaN
2         3         NaN         1.0         NaN
3         4         1.0         NaN         NaN
4         5         NaN         NaN         1.0
5         6         0.0         NaN         NaN
6         7         1.0         NaN         NaN
7         8         0.0         NaN         NaN
8         9         NaN         0.0         NaN
9        10         NaN         NaN         0.0

Categorical variables into multiple columns

4 Answers4

Linked