How can I add a new column to a dataframe with the sum of values from a nested dataframe, without losing any of the other columns and nested data using pandas?
Specifically, I want to create a new column total_cost
with the sum of all nested dataframes for a row.
I managed to create the following dataframe using a series of groupby
and apply
:
user_id description unit_summary
0 111 xxx [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx [{'total_period_cost': 100, 'unit_id': 'yyy', ...
I'm trying to add the column total_cost
which is the sum of total_period_cost
of each nested dataframe (grouped by user_id
). How can I achieve the following dataframe?
user_id description total_cost unit_summary
0 111 xxx 300 [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx 100 [{'total_period_cost': 100, 'unit_id': 'yyy', ...
My code:
import pandas as pd
series = [{
"user_id":"111",
"description": "xxx",
"unit_summary":[
{
"total_period_cost":100,
"unit_id":"xxx",
"cost_per_unit":50,
"total_period_usage":2
},
{
"total_period_cost":200,
"unit_id":"yyy",
"cost_per_unit":25,
"total_period_usage": 8
}
]
},
{
"user_id":"222",
"description": "xxx",
"unit_summary":[
{
"total_period_cost":100,
"unit_id":"yyy",
"cost_per_unit":25,
"total_period_usage": 4
}
]
}]
df = pd.DataFrame(series)
print(df)
print(df.to_dict(orient='records'))
Here is an example of the groupby..apply code I used to achieve the series
JSON object:
import pandas as pd
series = [
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 8},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 3},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 1}
]
df = pd.DataFrame(series)
sumc = (
df.groupby(['user_id', 'unit_id', 'cost_per_unit'], as_index=False)
.agg({'total_period_usage': 'sum'})
)
sumc['total_period_cost'] = sumc.total_period_usage * sumc.cost_per_unit
sumc = (
sumc.groupby(['user_id'])
.apply(lambda x: x[['total_period_cost', 'unit_id', 'cost_per_unit', 'total_period_usage']].to_dict('r'))
.reset_index()
)
sumc = sumc.rename(columns={0:'unit_summary'})
sumc['description'] = 'xxx'
print(sumc)
print(sumc.to_dict(orient='records'))
Solved it by adding the following from the answer by anky_91:
def myf(x):
return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
# Sum all server sumbscriptions total_period_cost
sumc['total_period_cost'] = sumc['unit_summary'].apply(myf)