
I'm new to Python and am currently working through examples from books and courses. In every case I struggle quite a lot with the DataFrame structure; it seems to have changed a great deal between Python 2.7 and 3.0.

In the current example, I want to add a total column (the total for each year), so I've done the following:

import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()

According to the example, the following line should work, but it doesn't in Python 3:

flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)

I found a few links that show how to add a column (link1, link2), but none of them works for me:

flights_unstacked["passengers"].insert(loc=0,column="total", value=flights_unstacked.sum(axis=1).values)

In both cases the error is the same: cannot insert an item into a CategoricalIndex that is not already an existing category

I have a feeling it must be something trickier, because my DataFrame is no longer completely flat: it is now grouped, and I want to add the total values specifically at the "month" level.

I would be super happy even if someone just told me how to google it!

Vladimir Semashkin

1 Answer


This happens because the 'month' column in the flights data is of type category. When it is unstacked, pandas creates a pd.CategoricalIndex for that column level, and 'total' is not one of its valid categories.
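To see this for yourself without downloading the seaborn dataset, here is a minimal sketch on a small hand-made frame. The data values and the abbreviated month names are made up for illustration; the point is only to inspect the column level that unstack() produces.

```python
import pandas as pd

# Hypothetical miniature of the flights data (values are illustrative only).
flights = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": pd.Categorical(["Jan", "Feb", "Jan", "Feb"], categories=["Jan", "Feb"]),
    "passengers": [112, 118, 115, 126],
})

unstacked = flights.set_index(["year", "month"]).unstack()

# The second column level comes from the categorical 'month' column.
month_level = unstacked.columns.levels[1]
print(type(month_level).__name__)
print(list(month_level.categories))

# 'total' is not among the categories, which is what the error complains about.
has_total = "total" in month_level.categories
print(has_total)
```

Whether assigning a new 'total' column actually raises may depend on your pandas version, so treat this as a diagnostic rather than a guaranteed reproduction of the error.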

Solution 1

The quickest and easiest fix is to cast that column to type object:

import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')

# Casting here
flights['month'] = flights.month.astype('O')

# Should work as intended now
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)

Here is some more information on categorical data.
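If you want to try Solution 1 without the seaborn download, here is a minimal sketch on a small hand-made frame. The data values and the abbreviated month names are made up for illustration; only the cast-then-assign pattern is taken from the code above.

```python
import pandas as pd

# Hypothetical miniature of the flights data (values are illustrative only).
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(["Jan", "Feb", "Mar"] * 2, categories=["Jan", "Feb", "Mar"]),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Cast the categorical column to plain Python objects (strings here).
flights["month"] = flights["month"].astype("O")

flights_unstacked = flights.set_index(["year", "month"]).unstack()

# Compute the row totals first, then attach them under the 'passengers' group.
totals = flights_unstacked.sum(axis=1)
flights_unstacked["passengers", "total"] = totals

print(flights_unstacked)
```

Computing the totals into a separate variable before the assignment just makes it explicit that the sum is taken over the month columns only.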


Solution 2

Alternatively, here is how you could handle this while keeping the categorical datatype.

import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')

flights.month.dtype

This shows the categories of this field as...

CategoricalDtype(categories=['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December'],
                 ordered=False)

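So in this case there are 12 categories: the months 'January' through 'December'.

You can add additional categories with add_categories. It returns a new object, so assign the result back to the column:

# The inplace=True argument used in older examples was removed in pandas 2.0
flights['month'] = flights.month.cat.add_categories('total')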

And checking the categories again...

flights.month.dtype

CategoricalDtype(categories=['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November',
                  'December', 'total'],
                 ordered=False)

'total' was added as a valid category.

The following should now work:

flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
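
Here is the same idea as a minimal sketch on a small hand-made frame, so it can be run without the seaborn download. The data values and the abbreviated month names are made up for illustration. This sketch assigns the result of add_categories back to the column rather than passing inplace=True, which pandas 2.0 and later no longer accept.

```python
import pandas as pd

# Hypothetical miniature of the flights data (values are illustrative only).
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(["Jan", "Feb", "Mar"] * 2, categories=["Jan", "Feb", "Mar"]),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Register 'total' as a valid category; add_categories returns a new Series.
flights["month"] = flights["month"].cat.add_categories("total")

flights_unstacked = flights.set_index(["year", "month"]).unstack()

# 'total' is now an allowed value in the month level, so the assignment is valid.
totals = flights_unstacked.sum(axis=1)
flights_unstacked["passengers", "total"] = totals

print(flights_unstacked)
```

The column stays categorical here, so the months keep the order in which the categories were declared.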
Chris Adams
  • Thanks for the quick answer! Could you please explain a bit what the trick with the conversion to an object is? – Vladimir Semashkin Mar 09 '19 at 16:00
  • @VladimirSemashkin Not a trick really, just removing all of the behaviour and nuances of `categorical` type by converting it to a string. I've added an alternative solution that maintains the categorical type to show how it works – Chris Adams Mar 09 '19 at 16:34