3

I have a test dataframe that looks something like this:

data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=["id", "sex", "split", "group0Low", "group0High", "group1Low", "group1High", "trim"])

grouped = data.groupby(['sex','split']).mean()

stacked = grouped.stack().reset_index(level=2)
stacked.columns = ['group_level', 'mean']

Next, I want to separate out group_level and stack those 2 new factors:

stacked['group'] = stacked.group_level.str[:6]
stacked['level'] = stacked.group_level.str[6:]

This all works fine. My question is this:

This works if my column names ("group0Low", "group0High", "group1Low", "group1High") have something in common with each other.

What if instead my column names were more like "routeLow", "routeHigh", "landmarkLow", "landmarkHigh"? How would I use str to split group_level in this case?

This question is similar to this one posted here: Slice/split string Series at various positions

The difference is all of my column subnames are different and have no commonality (whereas in the other post everything had group or class in the name). Is there a regex string, or some other method, I can use to do this stacking?

Community
  • 1
  • 1
Simon
  • 9,762
  • 15
  • 62
  • 119

2 Answers2

3

Here is another way. It assumes that low/high group ends with the words Low and High respectively, so that we can use .str.endswith() to identify which rows are Low/High.

Here is the sample data

df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df

    group_level
0     group0Low
1    group0High
2     group1Low
3    group1High
4      routeLow
5     routeHigh
6   landmarkLow
7  landmarkHigh

Use np.where, we can do the following

df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])

df

    group_level level     group
0     group0Low   Low    group0
1    group0High  High    group0
2     group1Low   Low    group1
3    group1High  High    group1
4      routeLow   Low     route
5     routeHigh  High     route
6   landmarkLow   Low  landmark
7  landmarkHigh  High  landmark
Jianxun Li
  • 24,004
  • 10
  • 58
  • 76
  • thanks for the idea. I adapted it by using str.contains() instead and everything works very well now! – Simon Aug 08 '15 at 00:40
2

I suppose it depends how general the strings you're working are. Assuming the only levels are always delimited by a capital letter you can do

In [30]:    
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh', 
               'landmarkLow', 'routeMid', 'group0Level'])
s.str.extract('([\d\w]*)([A-Z][\w\d]*)')

Out[30]:
    0       1
0   route   High
1   route   Low
2   landmark    High
3   landmark    Low
4   route   Mid
5   group0  Level

You can even name the columns of the result in the same line by doing

s.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')

So in your use case you can do

group_level_df = stacked.group_level.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
stacked = pd.concat([stacked, group_level_df])

Here's another approach which assumes only knowledge of the level names in advance. Suppose you have three levels:

lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:

    rows_in = lower.str.contains(level)
    stacked.loc[rows_in, 'level'] = level.capitalize()  
    stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '')

Which should work as long as the level doesn't appear in the group name as well, e.g. 'highballHigh'. In cases where group_level didn't contain any of these levels you would end up with null values in the corresponding rows

JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • I saw your pre-edit answer, which actually worked perfectly but I do see the limitations with using str.contains(). I'm interested in your str.extract approach where you're also naming the columns at the same time. However, it appears to only work on series and not dataframes (error: 'DataFrame' object has no attribute 'str'). Is there an alternative? – Simon Aug 07 '15 at 22:25
  • 1
    No you just need to do `stacked.group_level.str.extract`, that is, call it on the relevant column. It's a series method as you say. I used my own series to show some more examples which is why i didn't put in `stacked.group_level`. I removed the pre-edit answer as the strip method wasn't good at all but then I thought of using replace which is in the second approach above. If you do go that way, don't use strip, use replace. – JoeCondron Aug 07 '15 at 22:28
  • If I do something like this: stacked['condition'] = stacked.condition.str.extract('(?P[\d\w]*)(?P[A-Z][\w\d]*)') it will only store route/landmark in the column, it ignores the low/high factor. It seems its trying to split 2 things in the column name and store it into a single column called condition, when I really need to create 2 new columns, one for route/landmark and one for low/high – Simon Aug 08 '15 at 00:16
  • So really it seems I need one statement for stacked['group'] = ... and one for stacked['level'] = ... unless theres a way to create both new columns from a single extract/regex statement? – Simon Aug 08 '15 at 00:20
  • The extract method will create a dataframe with as many columns as groups specified in the pattern you pass, in this case two. Groups are delimited by brackets in the pattern. I've edited the question to show hoe to apply it in your case. – JoeCondron Aug 08 '15 at 08:04