1

Edit: clearified the question

I want to aggregate a pd.DataFrame called df by "Identifier" and sum the "Costs" columns. For the category columns I want to apply an aggregation function that can be spoken out loud like "aggregate and take the most frequent value (the mode) of the column, but if the mode is a blank than take the second most frequent column". In other words, I want to have the mode of the category (after aggregation) but the mode must not be a blank.

Result should be the pd.DataFrame new_df.

df
  Identifier  Cost  Cost2 Category1 Category2 Category3
0          A    10     10       one                 aaa
1          A    20     10                blue       aaa
2          B    10     20       two                 bbb
3          B    10     30               green       bbb
4          B    30     40                           bbb
5          C    20     50     three       red       ccc

--- aggregation process --->

new_df
  Identifier  Cost  Cost2 Category1 Category2 Category3
0          A    30     20       one      blue       aaa
1          B    50     90       two     green       bbb
2          C    20     50     three       red       ccc

Code to reproduce the example:

import pandas as pd

data_df = {       
           'Identifier': ['A', 'A', 'B', 'B', 'B', 'C'],
           'Cost': [10, 20, 10, 10, 30, 20],
           'Cost2':[10,10,20,30,40,50],
           'Category1' : ['one', '', 'two', '', '', 'three'],
           'Category2' : ['', 'blue', '', 'green', '', 'red'],
           'Category3' : ['aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc']
          }

df = pd.DataFrame(data_df)


data_new_df = {       
           'Identifier': ['A', 'B', 'C'],
           'Cost': [30, 50, 20],
           'Cost2' : [20,90,50],
           'Category1' : ['one', 'two', 'three'],
           'Category2' : ['blue', 'green', 'red'],
           'Category3' : ['aaa', 'bbb', 'ccc']
          }

new_df = pd.DataFrame(data_new_df)
constiii
  • 638
  • 3
  • 19
  • Possible duplicate of [GroupBy pandas DataFrame and select most common value](https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value) – G. Anderson Jul 11 '19 at 21:21
  • filter out the blank values and follow the answer given in the linked question above? – Quang Hoang Jul 11 '19 at 21:24
  • Is cost column of B going to 50 instead? – niraj Jul 11 '19 at 21:25
  • As far as I understand, it is no duplicate. I want to take the most frequent value of the column, but in case the most frequent value of the column is a blank, I want to have the second most frequent (non-blank) value. – constiii Jul 11 '19 at 21:52

2 Answers2

1

May be you can try groupby with sum following:

new_df = df.groupby('Identifier').apply(sum).drop('Identifier', axis=1).reset_index()

Result:

  Identifier  Cost Category1 Category2
0          A    30       one      blue
1          B    50       two     green
2          C    20     three       red
niraj
  • 17,498
  • 4
  • 33
  • 48
  • How would this answer work with multiple Cost columns? – constiii Jul 11 '19 at 21:59
  • Did you try the same code? I am little confused with your sum results, you have different values on result and scripts for values. – niraj Jul 11 '19 at 22:02
  • Sorry my fault, there was two little mistakes in the code. How can I use your code with two sum operations (on "Cost" and "Cost2")? Thanks in advance! – constiii Jul 12 '19 at 12:53
  • np, you can try same code, it should sum all the columns values. – niraj Jul 12 '19 at 14:31
  • works for the code I provides as an example, so answer accepted :) However if you dont mind, I got an error with my actual dataset: "ValueError: No axis named 1 for object type " Do you have an idea? – constiii Jul 12 '19 at 14:47
  • 1
    I identified the problem: For some category columns I dont have blanks and in this case I want to have the most frequent value (mode) - again the aggregation function a la "aggregate and take the most frequent value other than blank" should be applied. For this purpose your one-liner does unfortunately not work - I'll update the question. – constiii Jul 12 '19 at 14:54
0

You can try:

new_df = df.groupby('Identifier').sum().reset_index()
new_df['Category1'] = df.loc[df.Category1 != '', 'Category1'].reset_index(drop=True)
new_df['Category2'] = df.loc[df.Category2 != '', 'Category2'].reset_index(drop=True)
new_df

Result:

  Identifier  Cost Category1 Category2
0          A    30       one      blue
1          B    50       two     green
2          C    20     three       red
René
  • 4,594
  • 5
  • 23
  • 52