0

By having the following table, how can I count the days by ID? without use of for or any loop because it's large size data.

ID  Date 
a   01/01/2020
a   05/01/2020
a   08/01/2020
a   10/01/2020
b   05/05/2020
b   08/05/2020
b   12/05/2020
c   08/08/2020
c   22/08/2020

to have this result

ID  Date        Days Evolved Since Inicial date
a   01/01/2020  1
a   05/01/2020  4
a   08/01/2020  7
a   10/01/2020  9
b   05/05/2020  1
b   08/05/2020  3
b   12/05/2020  7
c   08/08/2020  1
c   22/08/2020  14
rickhtv
  • 31
  • 2
  • Read this: https://stackoverflow.com/questions/35534152/how-to-calculate-days-difference-between-successive-pandas-dataframe-rows-with – Gedas Miksenas Nov 09 '21 at 12:29
  • Where is this data coming from? Your best bet is to have whatever database you're pulling from do the heavy lifting for you in this case. – Jamie_D Nov 09 '21 at 12:29
  • Why is "Days evolved since inicial date" == 1 and not 0 for the first date? – Thomas Weller Nov 09 '21 at 12:30
  • I doubt this can be done without a loop. Either you implement the loop or someone else already did it in a method. – Thomas Weller Nov 09 '21 at 12:31

2 Answers2

2

Use GroupBy.transform with GroupBy.first for first values to new column, so possible subtract. Then if not duplicated datetimes is possible replace 0:

df['new']=df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)

print (df)
  ID       Date  new
0  a 2020-01-01    1
1  a 2020-01-05    4
2  a 2020-01-08    7
3  a 2020-01-10    9
4  b 2020-05-05    1
5  b 2020-05-08    3
6  b 2020-05-12    7
7  c 2020-08-08    1
8  c 2020-08-22   14

Or set 1 for first value of group by Series.where and Series.duplicated:

df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
                       .dt.days.where(df['ID'].duplicated(), 1))

print (df)
  ID       Date  new
0  a 2020-01-01    1
1  a 2020-01-05    4
2  a 2020-01-08    7
3  a 2020-01-10    9
4  b 2020-05-05    1
5  b 2020-05-08    3
6  b 2020-05-12    7
7  c 2020-08-08    1
8  c 2020-08-22   14
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could do something like (df your dataframe):

def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)

Result for the sample:

  ID       Date Days_evolved
0  a 2020-01-01       1 days
1  a 2020-01-05       4 days
2  a 2020-01-08       7 days
3  a 2020-01-10       9 days
4  b 2020-05-05       1 days
5  b 2020-05-08       3 days
6  b 2020-05-12       7 days
7  c 2020-08-08       1 days
8  c 2020-08-22      14 days

If you want int instead of pd.Timedelta then do

df["Days_evolved"] = df["Days_evolved"].dt.days

at the end.

Timus
  • 10,974
  • 5
  • 14
  • 28