Attributes/information contained in DataFrame column names

Question

I have some data imported from a csv, to create something similar I used this:

data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()

so the dataframe looks something like this:

           group0Low  group0High  group1Low  group1High
split sex                                              
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5

You'll notice that each column actually contains 2 variables (group# and height). (It was set up this way for running repeated measures anova in SPSS.)

I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):

            low high
split   sex group       
    0   0   95  265
0   0   1   123 54
    1   0   120 220
    1   1   98  111
1   0   0   150 190
    0   1   211 300
    1   0   139 86
    1   1   132 250

How do I achieve this?

FYI, many of those of us who follow the `pandas` tag skip over questions where the poster has inserted an image of their example (which can't be copy-pasted into a console) instead of text (which is easy to copy and work with.) — DSM, Aug 04 '15 at 05:34
Show us what you have so far (code) and where you get stuck / what fails. — hitzg, Aug 04 '15 at 07:19
Not simply related to this question, I really wish posters would include code that allowed easy replication of the datasets they include. Often that's most of the work of providing a verified answer. — holdenweb, Aug 04 '15 at 07:33
fair enough. I have added a way to reproduce a similar dataframe and have also removed the plotting part of the question. As of right now, I'm not even sure if its possible to split a column up like this, so I havent been able to try anything. If you can point me in some direction then I can at least research possibilities myself — Simon, Aug 04 '15 at 08:44

score 1 · Answer 1 · answered Aug 04 '15 at 11:00

This can be done by first construct multi-level index on column names and then reshape the dataframe by stack.

import pandas as pd
import numpy as np

# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df

           group0Low  group0High  group1Low  group1High
split sex                                              
0     0          222          97        167         242
      1          117         245        153          59
1     0          261          71        292          86
      1          137         120        266         138

# processing
# ==============================

level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')

multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')

val              high  low
split sex group           
0     0   0        97  222
          1       242  167
      1   0       245  117
          1        59  153
1     0   0        71  261
          1        86  292
      1   0       120  137
          1       138  266

I can see from your output that this is exactly what I want. However, I cant recreate it because our initial dataframes are created in different ways. I'm using a groupby to set the initial split and sex index. Then, if I use your code: level_group = np.where(df.columns.str.contains('0'), 0, 1) on my groupby (replacing the df), I get an error: Cannot access attribute 'columns' of 'DataFrameGroupBy' objects, try using the 'apply' method — Simon, Aug 04 '15 at 18:49
@Nem Could you please upload your raw data file and the full code script? (for example, via dropbox sharelink or google driver) I need to reproduce the error you saw in order to fix it. — Jianxun Li, Aug 04 '15 at 19:32
It seems the problem is related to whether these operations are being conducted on mean data or raw data. The code can be found here: https://www.dropbox.com/s/dw534rzmqf98ca5/pandas%20test%20original.html?dl=0. The data file here: https://www.dropbox.com/s/een5pkhjitzk81o/data.csv?dl=0 — Simon, Aug 04 '15 at 21:47
Im not sure if its best to do this column splitting on the filtered, condensed, or finalData dataframes (i.e. before or after groupby operations). My guess is doing it as early on in the process as possible would be better because then I can just create averages later on whatever I need — Simon, Aug 04 '15 at 21:50

score 1 · Accepted Answer · edited May 23 '17 at 11:51

The first trick is to gather the columns into a single column using stack:

In [6]: means
Out[6]: 
           group0Low  group0High  group1Low  group1High
split sex                                              
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5

In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]: 
          group_level  mean
split sex                  
0     0     group0Low     2
      0    group0High     3

Now we can do whatever string operations we want on group_level using pd.Series.str as follows:

In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]: 
          group_level  mean   group level
split sex                                
0     0     group0Low     2  group0   Low
      0    group0High     3  group0  High

Now you're in business and you can do whatever you want. For example, sum each group/level:

In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]: 
              mean
group  level      
group0 High     12
       Low       8
group1 High     20
       Low      16

How do I group by everything?

If you want to group by split, sex, group and level you can do:

In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]: 
                        mean
split sex group  level      
0     0   group0 High      3
                 Low       2
          group1 0High     5
                 0Low      4

What if the split is not always at location 6?

This SO answer will show you how to do the splitting more intelligently.

I got it working exactly as you described. But 2 questions: 1) When I get to the final stage I want to be able to groupby group, level, sex, and split (like in my original example). But when I add sex or split to groupby I get a key error.... 2) You're splitting the factor using stacked.group_level.str[:6], but this is dependent on the first factor being 6 chars long. What if the column names contained group25/group1 for example? In that case you cant just splice using a 6th index — Simon, Aug 06 '15 at 23:45
and finally what if I wanted to completely unstack everything so that split and sex become columns as well? — Simon, Aug 07 '15 at 20:06
nevermind, just tried stacked.reset_index() and it worked as I wanted. thanks! — Simon, Aug 07 '15 at 21:01

Attributes/information contained in DataFrame column names

2 Answers2

Linked