4

This is what I have:

df = pd.DataFrame({'item': [1,1,2,2,1,1],
                   'shop': ['A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,1,4,4,5]]),
                   'qty': [5,6,7,8,9,10]})
print(df)

   item shop       date  qty
0     1    A 2018-01-02    5
1     1    A 2018-01-03    6
2     2    A 2018-01-01    7
3     2    A 2018-01-04    8
4     1    B 2018-01-04    9
5     1    B 2018-01-05   10

This is what I want:

out = pd.DataFrame({'item': [1,1,1,1,2,2,2,2,2,1,1],
                   'shop': ['A','A','A','A','A','A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,4,5,1,2,3,4,5,4,5]]),
                   'qty': [5,6,0,0,7,0,0,8,0,9,10]})
print(out)

    item shop       date  qty
0      1    A 2018-01-02    5
1      1    A 2018-01-03    6
2      1    A 2018-01-04    0
3      1    A 2018-01-05    0
4      2    A 2018-01-01    7
5      2    A 2018-01-02    0
6      2    A 2018-01-03    0
7      2    A 2018-01-04    8
8      2    A 2018-01-05    0
9      1    B 2018-01-04    9
10     1    B 2018-01-05   10

This is what I achieved so far:

df.set_index('date').groupby(['item', 'shop']).resample("D")['qty'].sum().reset_index(name='qty')

   item shop       date  qty
0     1    A 2018-01-02    5
1     1    A 2018-01-03    6
2     1    B 2018-01-04    9
3     1    B 2018-01-05   10
4     2    A 2018-01-01    7
5     2    A 2018-01-02    0
6     2    A 2018-01-03    0
7     2    A 2018-01-04    8

I want to complete the missing dates (by day!) so that each [item, shop] group ends on the same final date.

Ideas?

Edo
  • Why does `1 A` finally generate `2018-01-02` - `2018-01-05`? – Ynjxsjmh Apr 28 '21 at 15:46
  • It doesn’t. Lines 2 and 3 relate to shop B. – Edo Apr 28 '21 at 16:02
  • Ciao Edo, it is not clear to me which kind of solution you are looking for. Do you mind editing your question with the expected output? – rpanai Apr 28 '21 at 16:29
  • The expected output is the `out` dataframe. I also wrote my expectations (last line) compared to the result I achieved so far (the third dataframe). What kind of information do you think is still missing? I think I’m missing the point of your question – Edo Apr 28 '21 at 16:58

4 Answers

5

The key here is to create the min (per group) and max (global) dates, then build the date range per group, explode it, and merge back:

# find the first date for each (item, shop) group
s = df.groupby(['item', 'shop'])[['date']].min()
# find the global last date
s['datemax'] = df['date'].max()
# combine the two into a per-group date range
s['date'] = [pd.date_range(x, y) for x, y in zip(s['date'], s['datemax'])]
out = s.explode('date').reset_index().merge(df, how='left').fillna(0)
out

    item shop       date    datemax   qty
0      1    A 2018-01-02 2018-01-05   5.0
1      1    A 2018-01-03 2018-01-05   6.0
2      1    A 2018-01-04 2018-01-05   0.0
3      1    A 2018-01-05 2018-01-05   0.0
4      1    B 2018-01-04 2018-01-05   9.0
5      1    B 2018-01-05 2018-01-05  10.0
6      2    A 2018-01-01 2018-01-05   7.0
7      2    A 2018-01-02 2018-01-05   0.0
8      2    A 2018-01-03 2018-01-05   0.0
9      2    A 2018-01-04 2018-01-05   8.0
10     2    A 2018-01-05 2018-01-05   0.0
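To end up with exactly the three original columns of out, the helper datemax column can be dropped at the end; as a minimal self-contained sketch (the extra pd.to_datetime step is there because explode leaves the date column with object dtype):

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2, 2, 1, 1],
                   'shop': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2, 3, 1, 4, 4, 5]]),
                   'qty': [5, 6, 7, 8, 9, 10]})

# per-group start date, global end date
s = df.groupby(['item', 'shop'])[['date']].min()
s['datemax'] = df['date'].max()
s['date'] = [pd.date_range(x, y) for x, y in zip(s['date'], s['datemax'])]

out = (s.explode('date')
        .reset_index()
        .assign(date=lambda d: pd.to_datetime(d['date']))  # explode leaves object dtype
        .merge(df, how='left')
        .fillna({'qty': 0})
        .drop(columns='datemax'))
```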
BENY
  • "the max here we need the upper universal max for each item". You need a universal max, but **not for each item**. My expected result is the second dataframe, not the third one. The third one is just an attempt of mine. – Edo Apr 29 '21 at 18:11
  • @Edo did you check my output? Try matching it with your 2nd df first. – BENY Apr 29 '21 at 20:08
  • My second df has 11 rows. Your output has 10. Am I missing something? – Edo Apr 29 '21 at 21:40
3

I think this gives you what you want (though the columns are ordered differently):

max_date = df.date.max()

def reindex_to_max_date(df):
    return df.set_index('date').reindex(pd.date_range(df.date.min(), max_date, name='date'), fill_value=0)

res = df.groupby(['shop', 'item']).apply(reindex_to_max_date)
res = res.qty.reset_index()

I grouped by shop, item to give the same sort order as in your out, but these can be swapped.
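Put together with the sample data from the question, a runnable sketch (how apply handles the grouping columns varies slightly across pandas versions, but only the qty column is kept at the end, so the result is the same):

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2, 2, 1, 1],
                   'shop': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2, 3, 1, 4, 4, 5]]),
                   'qty': [5, 6, 7, 8, 9, 10]})

max_date = df['date'].max()

def reindex_to_max_date(g):
    # pad each group from its own first date up to the global last date
    return g.set_index('date').reindex(
        pd.date_range(g['date'].min(), max_date, name='date'), fill_value=0)

res = (df.groupby(['shop', 'item'])
         .apply(reindex_to_max_date)['qty']
         .reset_index())
```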

JoeCondron
  • I liked your answer. Upvoted. The only problem is that with `apply` it makes a lot of copies and it exhausts the RAM pretty fast for big dataframes. Anyhow, thanks a lot! – Edo Apr 30 '21 at 08:24
1

Not sure if this is the most efficient way, but one idea is to create a dataframe with all the dates and do a left join at the shop-item level, as follows.

Initial data

import pandas as pd


df = pd.DataFrame({'item': [1,1,2,2,1,1],
                   'shop': ['A','A','A','A','B','B'],
                   'date': pd.to_datetime(['2018.01.'+ str(x) 
                                           for x in [2,3,1,4,4,5]]),
                   'qty': [5,6,7,8,9,10]})

df = df.set_index('date')\
       .groupby(['item', 'shop'])\
       .resample("D")['qty']\
       .sum()\
       .reset_index(name='qty')

Dataframe with all dates

We first get the min and max dates

rg = df.agg({"date":{"min", "max"}})

and then we create a df with all possible dates

df_dates = pd.DataFrame(
    {"date": pd.date_range(
        start=rg["date"]["min"],
        end=rg["date"]["max"])
    })

Complete dates

Now for every shop item we do a left join with all possible dates

def complete_dates(x, df_dates):
    item = x["item"].iloc[0]
    shop = x["shop"].iloc[0]
    x = pd.merge(df_dates, x,
                 on=["date"],
                 how="left")
    x["item"] = item
    x["shop"] = shop
    return x

And we finally apply this function to the original df.

df.groupby(["item", "shop"])\
  .apply(lambda x: 
         complete_dates(x, df_dates)
        )\
  .reset_index(drop=True)

         date  item shop   qty
0  2018-01-01     1    A   NaN
1  2018-01-02     1    A   5.0
2  2018-01-03     1    A   6.0
3  2018-01-04     1    A   NaN
4  2018-01-05     1    A   NaN
5  2018-01-01     1    B   NaN
6  2018-01-02     1    B   NaN
7  2018-01-03     1    B   NaN
8  2018-01-04     1    B   9.0
9  2018-01-05     1    B  10.0
10 2018-01-01     2    A   7.0
11 2018-01-02     2    A   0.0
12 2018-01-03     2    A   0.0
13 2018-01-04     2    A   8.0
14 2018-01-05     2    A   NaN
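The same left-join idea can be taken the rest of the way to reproduce the 11 rows of out: fill the NaNs with 0 and trim each group back to its own first sale date. As a sketch, using a single cross join (available in pandas >= 1.2) instead of the per-group apply:

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2, 2, 1, 1],
                   'shop': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2, 3, 1, 4, 4, 5]]),
                   'qty': [5, 6, 7, 8, 9, 10]})

# one row per (item, shop, date) over the global date range
all_dates = pd.DataFrame({'date': pd.date_range(df['date'].min(), df['date'].max())})
full = (df[['item', 'shop']].drop_duplicates()
          .merge(all_dates, how='cross')
          .merge(df, on=['item', 'shop', 'date'], how='left')
          .fillna({'qty': 0}))

# keep only the dates from each group's first sale onward
start = df.groupby(['item', 'shop'])['date'].min().rename('start')
out = (full.join(start, on=['item', 'shop'])
           .query('date >= start')
           .drop(columns='start')
           .reset_index(drop=True))
```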
rpanai
1

You could use the complete function from pyjanitor to expose the missing values; the end date is the max of date, the starting date varies per group of item and shop.

Create a dictionary that pairs the target column date to a new date range:

new_date = {"date" : lambda date: pd.date_range(date.min(), df['date'].max())}

Pass the new_date variable to complete:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

df.complete([new_date], by = ['item', 'shop']).fillna(0)

    item shop       date   qty
0      1    A 2018-01-02   5.0
1      1    A 2018-01-03   6.0
2      1    A 2018-01-04   0.0
3      1    A 2018-01-05   0.0
4      1    B 2018-01-04   9.0
5      1    B 2018-01-05  10.0
6      2    A 2018-01-01   7.0
7      2    A 2018-01-02   0.0
8      2    A 2018-01-03   0.0
9      2    A 2018-01-04   8.0
10     2    A 2018-01-05   0.0

complete is just an abstraction of pandas functions that makes it easier to explicitly expose missing values in a Pandas dataframe.
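For readers without pyjanitor, roughly the same expansion can be sketched in plain pandas (the loop below only illustrates what complete(..., by=...) does here, not its actual implementation):

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2, 2, 1, 1],
                   'shop': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['2018.01.' + str(x) for x in [2, 3, 1, 4, 4, 5]]),
                   'qty': [5, 6, 7, 8, 9, 10]})

parts = []
for (item, shop), g in df.groupby(['item', 'shop']):
    # per group: its own first date up to the global last date
    dates = pd.DataFrame({'date': pd.date_range(g['date'].min(), df['date'].max())})
    parts.append(dates.merge(g, on='date', how='left')
                      .assign(item=item, shop=shop))

out = pd.concat(parts, ignore_index=True).fillna({'qty': 0})
```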

sammywemmy