Count days by ID - Pandas

Question

By having the following table, how can I count the days by ID? without use of for or any loop because it's large size data.

ID  Date 
a   01/01/2020
a   05/01/2020
a   08/01/2020
a   10/01/2020
b   05/05/2020
b   08/05/2020
b   12/05/2020
c   08/08/2020
c   22/08/2020

to have this result

ID  Date        Days Evolved Since Inicial date
a   01/01/2020  1
a   05/01/2020  4
a   08/01/2020  7
a   10/01/2020  9
b   05/05/2020  1
b   08/05/2020  3
b   12/05/2020  7
c   08/08/2020  1
c   22/08/2020  14

Read this: https://stackoverflow.com/questions/35534152/how-to-calculate-days-difference-between-successive-pandas-dataframe-rows-with — Gedas Miksenas, Nov 09 '21 at 12:29
Where is this data coming from? Your best bet is to have whatever database you're pulling from do the heavy lifting for you in this case. — Jamie_D, Nov 09 '21 at 12:29
Why is "Days evolved since inicial date" == 1 and not 0 for the first date? — Thomas Weller, Nov 09 '21 at 12:30
I doubt this can be done without a loop. Either you implement the loop or someone else already did it in a method. — Thomas Weller, Nov 09 '21 at 12:31

jezrael · Accepted Answer · 2021-11-09T14:14:26.123

2

Use GroupBy.transform with GroupBy.first for first values to new column, so possible subtract. Then if not duplicated datetimes is possible replace 0:

df['new']=df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)

print (df)
  ID       Date  new
0  a 2020-01-01    1
1  a 2020-01-05    4
2  a 2020-01-08    7
3  a 2020-01-10    9
4  b 2020-05-05    1
5  b 2020-05-08    3
6  b 2020-05-12    7
7  c 2020-08-08    1
8  c 2020-08-22   14

Or set 1 for first value of group by Series.where and Series.duplicated:

df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
                       .dt.days.where(df['ID'].duplicated(), 1))

print (df)
  ID       Date  new
0  a 2020-01-01    1
1  a 2020-01-05    4
2  a 2020-01-08    7
3  a 2020-01-10    9
4  b 2020-05-05    1
5  b 2020-05-08    3
6  b 2020-05-12    7
7  c 2020-08-08    1
8  c 2020-08-22   14

edited Nov 09 '21 at 14:14

answered Nov 09 '21 at 13:33

jezrael

822,522
95
1,334
1,252

Interesting! Do you do it this way to avoid the `.apply` performance bottleneck? – Timus Nov 09 '21 at 17:41
1

@Timus exactly. – jezrael Nov 09 '21 at 17:45
1

Thanks, good to know! – Timus Nov 09 '21 at 17:50

Timus · Answer 2 · 2021-11-09T13:15:41.263

You could do something like (df your dataframe):

def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)

Result for the sample:

  ID       Date Days_evolved
0  a 2020-01-01       1 days
1  a 2020-01-05       4 days
2  a 2020-01-08       7 days
3  a 2020-01-10       9 days
4  b 2020-05-05       1 days
5  b 2020-05-08       3 days
6  b 2020-05-12       7 days
7  c 2020-08-08       1 days
8  c 2020-08-22      14 days

If you want int instead of pd.Timedelta then do

df["Days_evolved"] = df["Days_evolved"].dt.days

at the end.

Count days by ID - Pandas

2 Answers2