
I am trying to group my dataframe by the values in one of its columns, 'category'. However, another column, 'prob', contains a list of tuples in each row, and when I group by 'category' the 'prob' column disappears.

My current df:

 category          other           prob
   one              val         [(hi, hello), (jimbob, joe)]
   one              val2        [(this, not), (is, work), (now, any)]
   two              val2        [(bob, jones), (work, here)]
   three            val3        [(milk, coffee), (tea, bread)]
   two              val3        [(money, here), (job, money)]
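For anyone who wants to reproduce this, the dataframe above can be built like so (a minimal sketch; the values are copied from the table):

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    'category': ['one', 'one', 'two', 'three', 'two'],
    'other': ['val', 'val2', 'val2', 'val3', 'val3'],
    'prob': [
        [('hi', 'hello'), ('jimbob', 'joe')],
        [('this', 'not'), ('is', 'work'), ('now', 'any')],
        [('bob', 'jones'), ('work', 'here')],
        [('milk', 'coffee'), ('tea', 'bread')],
        [('money', 'here'), ('job', 'money')],
    ],
})
print(df)
```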

Expected output:

 category          other           prob
   one             val, val2     [(hi, hello), (jimbob, joe), (this, not), (is, work), (now, any)]
   two             val2, val3    [(bob, jones), (work, here), (money, here), (job, money)]
   three           val3          [(milk, coffee), (tea, bread)]

What is the best way to do this? Apologies if I have phrased this question badly; please let me know if you have any questions. Thank you!

jackiegirl89

3 Answers


You can aggregate with `GroupBy.agg`, using `join` for the string column and a flattening function for the tuple column. Three flattening solutions are shown below; use `sum` only if the data is small and performance is not important:

import functools
import operator
from itertools import chain

# flatten the lists of tuples with a list comprehension
f = lambda x: [z for y in x for z in y]
# faster alternative
# f = lambda x: list(chain.from_iterable(x))
# faster alternative 2
# f = lambda x: functools.reduce(operator.iadd, x, [])
# slow alternative
# f = lambda x: x.sum()
df = df.groupby('category', as_index=False).agg({'other': ', '.join, 'prob': f})

print(df)
  category       other                                               prob
0      one   val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1    three        val3                     [(milk, coffee), (tea, bread)]
2      two  val2, val3  [(bob, jones), (work, here), (money, here), (j...
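As a standalone sanity check (not part of the answer itself), the three flattening alternatives produce identical results on a small list of lists:

```python
import functools
import operator
from itertools import chain

# A small list of lists of tuples, mimicking a grouped 'prob' column
data = [[('a', 'b'), ('c', 'd')], [('e', 'f')]]

# The three flattening strategies from the answer above
listcomp = [z for y in data for z in y]
chained = list(chain.from_iterable(data))
# reduce with iadd extends the fresh [] accumulator in place,
# so the inner lists in data are never mutated
reduced = functools.reduce(operator.iadd, data, [])

print(listcomp)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
assert listcomp == chained == reduced
```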

Performance:

[perfplot performance graph: the `sum`-based solutions are slowest; the other variants scale similarly]

Code for testing:

import functools
import operator
import string
from itertools import chain

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)


default_value = 10

def iadd(df1):
    f = lambda x: functools.reduce(operator.iadd, x, [])
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def listcomp(df1):
    f = lambda x: [z for y in x for z in y]
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def from_iterable(df1):
    f = lambda x: list(chain.from_iterable(x))
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_series(df1):
    f = lambda x: x.sum()
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_cat(df1):
    d = {'other':lambda x: x.str.cat(sep=', '), 'prob':'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_join(df1):
    d = {'other': ', '.join, 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)


def make_df(n):
    a = np.random.randint(0, n / 10, n)
    b = np.random.choice(list('abcdef'), len(a))
    c = [tuple(np.random.choice(list(string.ascii_letters), 2)) for _ in a]
    df = pd.DataFrame({"category":a, "other":b, "prob":c})
    df1 = df.groupby(['category','other'])['prob'].apply(list).reset_index()
    return df1

perfplot.show(
    setup=make_df,
    kernels=[iadd, listcomp, from_iterable, sum_series,sum_groupby_cat,sum_groupby_join],
    n_range=[10**k for k in range(1, 8)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
jezrael

You could group by the 'category' column and aggregate with the following functions:

df.groupby('category', as_index=False).agg({'other':lambda x: x.str.cat(sep=', '),
                                            'prob':'sum'})

Which for the first rows gives:

   category   other                             prob
0      one  val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1      two      val2                       [(bob, jones), (work, here)]
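A side note on why `'sum'` works here at all: Python's `+` concatenates lists, so summing a column of lists chains them together. A minimal illustration (not from the original answer):

```python
# sum() with a [] start repeatedly applies list concatenation
lists = [[('a', 'b')], [('c', 'd'), ('e', 'f')]]
print(sum(lists, []))  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

Each `+` builds a new list, so this is quadratic in the total number of elements, which is consistent with `sum` being the slowest option in the benchmarks above.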
yatu
  • try `operator.iadd` with `reduce()`; I think it will be faster. :) – anky Mar 26 '19 at 13:04
  • Yeah, you mean for the strings? Yes, another option I guess. Would be interesting to see the timings @anky_91 – yatu Mar 26 '19 at 13:05
  • Yes, for flattening the lists that hold the tuples. I have tested it: after itertools, `operator.iadd()` is almost as fast. Check [this](https://stackoverflow.com/questions/54103720/appending-lists-of-words-or-characters-from-all-rows-in-a-dataframe/54103814#54103814) – anky Mar 26 '19 at 13:06
  • Also added a performance graph to my answer; `sum` is the slowest. – jezrael Mar 27 '19 at 10:21
  • Nice to see the timings @jezrael. Yes, indeed it looks like `sum` is slower than using a list comp. It seems they all scale similarly though. Thanks for sharing! – yatu Mar 27 '19 at 10:30

Try using `groupby()` + `agg()`:

df.groupby('category').agg({'other': ', '.join, 'prob': 'sum'})
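One caveat with this version (my note, not the answerer's): without `as_index=False`, 'category' becomes the index of the result; chaining `.reset_index()` keeps it as a regular column. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up sample rows, shaped like the question's dataframe
df = pd.DataFrame({
    'category': ['one', 'one', 'two'],
    'other': ['val', 'val2', 'val2'],
    'prob': [[('hi', 'hello')], [('this', 'not')], [('bob', 'jones')]],
})

# reset_index() moves 'category' from the index back into a column
out = df.groupby('category').agg({'other': ', '.join, 'prob': 'sum'}).reset_index()
print(out)
```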
Loochie
  • 3
    You need to provide a description of how your proposed solution meets the needs of the original poster. This helps both the original poster and anyone else who finds this question. – Bob Dalgleish Mar 26 '19 at 21:02