
Is there a way to append the lists stored in different rows of a dataframe without using a 'for' loop?

I am able to achieve it with a 'for' loop, but I want to do this more efficiently, ideally without the explicit loop:

import pandas as pd

d = {'col1': [1,2,3,4,5], 'col2': [['a'],['a','b','c'],['d'],['e'],['a','e','d']]}
df = pd.DataFrame(data=d)
word_list = []
for i in df['col2']:
    word_list = word_list + i

I want to get an output list like this: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

Amruth Lakkavaram
  • For all possible answers, have a look at this post https://stackoverflow.com/a/54089037/10734525 by jezrael. The notion of that question is a little different, but the approach is what you can learn from it. – meW Jan 09 '19 at 06:20

4 Answers


One way to do it is with pandas' sum function:

In [1]: import pandas as pd
   ...: d = {'col1': [1,2,3,4,5], 'col2': [['a'],['a','b','c'],['d'],['e'],['a','e','d']]}
   ...: df = pd.DataFrame(data=d)

In [2]: df['col2'].sum()
Out[2]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

However, itertools.chain.from_iterable is much faster:

In [3]: import itertools
   ...: list(itertools.chain.from_iterable(df['col2']))
Out[3]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

In [4]: %timeit df['col2'].sum()
92.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit list(itertools.chain.from_iterable(df['col2']))
20.4 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In my testing, itertools.chain.from_iterable can be up to 30x faster for larger dataframes (~1000 rows). Another option is:

import functools
import operator

functools.reduce(operator.iadd, df['col2'], [])

which is about as fast as itertools.chain.from_iterable. I made a graph comparing all of the answers that were posted:

[Plot: perfplot benchmark of each answer's runtime against dataframe length]

(The x-axis is the length of the dataframe)

As you can see, everything using sum, or functools.reduce with operator.add, quickly becomes unusable, with np.concatenate being slightly better. However, the three winners by far are itertools.chain, itertools.chain.from_iterable, and functools.reduce with operator.iadd. They take almost no time. Here is the code used to produce the plot:

import functools
import itertools
import operator
import random
import string

import numpy as np
import pandas as pd
import perfplot # see https://github.com/nschloe/perfplot for this awesome library


def gen_data(n):
    return pd.DataFrame(data={0: [
        [random.choice(string.ascii_lowercase) for _ in range(random.randint(10, 20))]
        for _ in range(n)
    ]})

def pd_sum(df):
    return df[0].sum()

def np_sum(df):
    return np.sum(df[0].values)

def np_concat(df):
    return np.concatenate(df[0]).tolist()

def functools_reduce_add(df):
    return functools.reduce(operator.add, df[0].values)

def functools_reduce_iadd(df):
    return functools.reduce(operator.iadd, df[0], [])

def itertools_chain(df):
    return list(itertools.chain(*(df[0])))

def itertools_chain_from_iterable(df):
    return list(itertools.chain.from_iterable(df[0]))

perfplot.show(
    setup=gen_data,
    kernels=[
        pd_sum,
        np_sum,
        np_concat,
        functools_reduce_add,
        functools_reduce_iadd,
        itertools_chain,
        itertools_chain_from_iterable
    ],
    n_range=[10, 50, 100, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000],
    equality_check=None
)
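The gap between the add and iadd variants comes down to copying; a minimal sketch outside pandas (using the question's own data) of why the in-place version wins:

```python
import functools
import operator

rows = [['a'], ['a', 'b', 'c'], ['d'], ['e'], ['a', 'e', 'd']]

# operator.add builds a brand-new list at every step, re-copying everything
# accumulated so far -- the quadratic behaviour that sum() also suffers from.
flat_add = functools.reduce(operator.add, rows, [])

# operator.iadd extends the accumulator in place (list __iadd__ behaves like
# list.extend), so each element is copied exactly once -- linear overall.
flat_iadd = functools.reduce(operator.iadd, rows, [])

print(flat_add == flat_iadd == ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd'])  # True
```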
iz_
  • `sum` has O(n^2) complexity, so that 30x will become much higher for larger dataframes; it is practically unusable for anything with 100k+ rows. – ayhan Jan 09 '19 at 06:07

Can't find a dupe; sum of lists will return a combined list:

df.col2.sum()

['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

Or use NumPy:

np.sum(df.col2.values)
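This works because col2 is an object-dtype array, so np.sum falls back to Python's + on the list elements (a quick sketch; it carries the same quadratic cost as the pandas sum above):

```python
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5],
     'col2': [['a'], ['a', 'b', 'c'], ['d'], ['e'], ['a', 'e', 'd']]}
df = pd.DataFrame(data=d)

# col2 has dtype=object, so np.sum reduces with Python's + operator,
# concatenating the lists rather than adding numbers.
flat = np.sum(df.col2.values)
print(flat)  # ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
```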
Vaishali

Or use numpy.concatenate:

print(np.concatenate(df['col2']).tolist())

Output:

['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
U13-Forward
  • This isn't working for me. I get `[['a'], ['a', 'b', 'c'], ['d'], ['e'], ['a', 'e', 'd']]` – iz_ Jan 09 '19 at 06:02

Okay, another way (just FYI):

from functools import reduce
reduce(lambda x, y: x + y, df.col2.values)

or:

from functools import reduce
import operator
reduce(operator.add, df.col2.values)

# ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
anky
  • great going genius :) – meW Jan 09 '19 at 06:25
  • Thank you @meW. :) glad you liked it. – anky Jan 09 '19 at 06:26
  • Sorry to one-up you again, but this is 30-40x slower than my `functools.reduce` solution above for a dataframe with 1000 rows. – iz_ Jan 09 '19 at 06:28
  • @anky_91 I tested both the first and the second; I was talking about the second. – iz_ Jan 09 '19 at 06:30
  • @Tomothy32 we both seem to be thinking along the same lines again; I didn't realise when you added the `functools.reduce` part since I was in typing mode then. :) – anky Jan 09 '19 at 07:08
  • Our answers were somewhat similar; if you are interested, I added a full performance comparison of all the answers. – iz_ Jan 09 '19 at 07:10
  • @Tomothy32: Nice one. :) Can't upvote again, already did so. :D – anky Jan 09 '19 at 07:11