Why the groupby.diff is so slower?

Question

I want to compute diff of a series per each group, something as following example:

In [24]: rnd_ser = pd.Series(np.random.randn(5000))
    ...: com_ser = pd.concat([rnd_ser] * 500, keys=np.arange(500), names=['Date', 'ID'])

In [25]: d1 = com_ser.groupby("Date").diff()

In [26]: d2 = com_ser - com_ser.groupby("Date").shift()

In [27]: np.allclose(d1.fillna(0), d2.fillna(0))
Out[27]: True

There are two ways to solve this problem, however, the first one has badly performance:

In [30]: %timeit d1 = com_ser.groupby("Date").diff()
616 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit d2 = com_ser - com_ser.groupby("Date").shift()
95 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Is this expected or a bug?

The details of my env:

In [23]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.3
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

score 3 · Answer 1 · answered Nov 05 '18 at 17:17

FWIW, I am seeing similar numbers on my machine

%timeit d1 = com_ser.groupby("Date").diff()
523 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit d2 = com_ser - com_ser.groupby("Date").shift()
80.8 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas implementation of diff() seems to be slow with the groupby()

For example if I make one big series

big_ser = pd.Series(np.random.randn(int(1e7)))

then compare a shift and subtract versus the Series.diff()

%timeit big_ser - big_ser.shift()
46.3 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit big_ser.diff()
41.6 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Then the times are identical between the implementations. This follows, as you look the internal source code for Series.diff it says explicitly in the comments

def diff(arr, n, axis=0):
    """
    difference of n between self,
    analogous to s-s.shift(n)

So I think its gotta be some overhead in the groupby specific to diff()

I have seen similar things like this " For some reason .groupby with .diff uses heaps of memory and it's all rather inefficient " https://stackoverflow.com/questions/20670726/computing-diffs-in-pandas-after-using-groupby-leads-to-unexpected-result — T. Scharf, Nov 05 '18 at 17:21

Why the groupby.diff is so slower?

1 Answers1