
How can I add a new column to a dataframe with the sum of values from a nested dataframe, without losing any of the other columns or the nested data, using pandas?

Specifically, I want to create a new column total_cost holding the sum over each row's nested dataframe.

I managed to create the following dataframe using a series of groupby and apply:

  user_id description                                       unit_summary
0     111         xxx  [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1     222         xxx  [{'total_period_cost': 100, 'unit_id': 'yyy', ...

I'm trying to add a column total_cost, which is the sum of total_period_cost over each nested dataframe (grouped by user_id). How can I achieve the following dataframe?

  user_id description  total_cost                                       unit_summary
0     111         xxx         300  [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1     222         xxx         100  [{'total_period_cost': 100, 'unit_id': 'yyy', ...

My code:

import pandas as pd

series = [{
    "user_id":"111", 
    "description": "xxx",
    "unit_summary":[
        {
        "total_period_cost":100,
        "unit_id":"xxx",
        "cost_per_unit":50,
        "total_period_usage":2
        },
        {
        "total_period_cost":200,
        "unit_id":"yyy",
        "cost_per_unit":25,
        "total_period_usage": 8
        }
    ]
},
{
    "user_id":"222",
    "description": "xxx",
    "unit_summary":[
        {
            "total_period_cost":100,
            "unit_id":"yyy",
            "cost_per_unit":25,
            "total_period_usage": 4
        }
    ]
}]

df = pd.DataFrame(series)

print(df)
print(df.to_dict(orient='records'))

Here is the groupby..apply code I used to produce the series JSON object above:

import pandas as pd

series = [
    {"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
    {"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
    {"user_id":"111", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 8},
    {"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 3},
    {"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 1}
]

df = pd.DataFrame(series)

sumc = (
    df.groupby(['user_id', 'unit_id', 'cost_per_unit'], as_index=False)
        .agg({'total_period_usage': 'sum'})
)

sumc['total_period_cost'] = sumc.total_period_usage * sumc.cost_per_unit

sumc = (
    sumc.groupby(['user_id'])
        .apply(lambda x: x[['total_period_cost', 'unit_id', 'cost_per_unit', 'total_period_usage']].to_dict('records'))
        .reset_index()
)

sumc = sumc.rename(columns={0:'unit_summary'})

sumc['description'] = 'xxx'

print(sumc)
print(sumc.to_dict(orient='records'))

I solved it by adding the following, from the answer by anky_91:

def myf(x):
    return pd.DataFrame(x).loc[:, 'total_period_cost'].sum()

# Sum all server subscriptions' total_period_cost for the row
sumc['total_period_cost'] = sumc['unit_summary'].apply(myf)
  • You shouldn't write code like this, there is never a need to store the intermediate result of the `groupby..apply` as a nested dataframe, it just makes your life harder. [**When you want to add a new (summary) column but also keep the existing rows and columns, use `.transform()`**](https://stackoverflow.com/questions/40957932/transform-vs-aggregate-in-pandas). If you post your original `groupby..apply` code we can implement things correctly. – smci Dec 31 '19 at 09:58
  • I posted my original groupby..apply, for now I got the expected result using the .apply() with a function on the existing group by. I read the transform but I don't know how to use it yet. – barracuda Dec 31 '19 at 10:38
  • @barracuda do you want the unit summary for any other purpose? if not I have added another method at the end of my answer – anky Dec 31 '19 at 11:27
  • @anky_91 Method1 worked perfectly to create the final dataset I was looking for. When converting the dataframe to a JSON object I am using the unit_summary to create the nested array in the final JSON object. – barracuda Dec 31 '19 at 12:07
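
For reference, a minimal sketch of the `.transform()` approach smci mentions, assuming the flat per-usage records from the question (column names mirror those used above); the per-user total is broadcast back onto every row, so no rows or columns are lost:

import pandas as pd

series = [
    {"user_id": "111", "unit_id": "xxx", "cost_per_unit": 50, "total_period_usage": 1},
    {"user_id": "111", "unit_id": "xxx", "cost_per_unit": 50, "total_period_usage": 1},
    {"user_id": "111", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 8},
    {"user_id": "222", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 3},
    {"user_id": "222", "unit_id": "yyy", "cost_per_unit": 25, "total_period_usage": 1},
]

df = pd.DataFrame(series)

# Per-row cost, then the per-user total broadcast back onto every row.
# Unlike agg, transform keeps the original shape of the dataframe.
df['total_period_cost'] = df['cost_per_unit'] * df['total_period_usage']
df['total_cost'] = df.groupby('user_id')['total_period_cost'].transform('sum')

print(df)

The nested unit_summary can still be built afterwards with the same groupby..apply as before; total_cost simply travels along with the other columns.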

1 Answer


You can read each row in the unit_summary column as a dataframe and sum the desired column:

Method1: apply

def myf(x):
    return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
df['total_cost'] = df['unit_summary'].apply(myf)

print(df)

Method2: similarly via list comprehension:

df['total_cost'] = [pd.DataFrame(i)['total_period_cost'].sum()
                    for i in df['unit_summary'].tolist()]

Method3: using explode:

m = df['unit_summary'].explode()
df['total_cost'] = pd.DataFrame(m.tolist(),index=m.index)['total_period_cost'].sum(level=0)

  user_id description                                       unit_summary  \
0     111         xxx  [{'total_period_cost': 100, 'unit_id': 'xxx', ...   
1     222         xxx  [{'total_period_cost': 100, 'unit_id': 'yyy', ...   

   total_cost  
0         300  
1         100 
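
Note: on newer pandas releases the level argument of sum() is no longer available, so Method3 can equivalently be written with a groupby on the index level; a sketch using the same df as above:

m = df['unit_summary'].explode()
df['total_cost'] = (
    pd.DataFrame(m.tolist(), index=m.index)['total_period_cost']
      .groupby(level=0)   # group by the original row index
      .sum()
)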

In addition to the above, starting from your original dataframe, we can also do something like the below to achieve the desired output; however, this would not give you the series with the dicts (`unit_summary`):

(df.assign(total_cost=df['cost_per_unit'] * df['total_period_usage'])
   .groupby(['user_id'], as_index=False)['total_cost'].sum()
   .assign(description='xxxx'))

  user_id  total_cost description
0     111         300        xxxx
1     222         100        xxxx