
I'm working with time series data, packaged in a long-format dataframe, something like this:

ACCOUNT | VAR1 | VAR2 | DAY

I'm interested in creating a new column DAY_ORD, which gives each row the ordinal rank of its DAY value within its group, where groups are defined by unique (ACCOUNT, VAR1, VAR2) triplets.

Here is a small example of what I want to achieve:

ACCOUNT  VAR1  VAR2  DAY         DAY_ORD
A        X     True  2022-02-03  0
A        X     True  2022-02-04  1
B        X     True  2021-05-18  0
A        X     True  2022-02-05  2
B        X     True  2022-05-20  1
A        Y     True  2022-02-05  0
A        X     True  2022-03-12  3
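
For reference, this example input (without the desired DAY_ORD column) can be reconstructed with something along these lines; the literal values are taken from the table above, and treating VAR2 as a boolean is an assumption:

import pandas as pd

# hypothetical reconstruction of the sample data shown above
df = pd.DataFrame({
    'ACCOUNT': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
    'VAR1':    ['X', 'X', 'X', 'X', 'X', 'Y', 'X'],
    'VAR2':    [True, True, True, True, True, True, True],
    'DAY':     ['2022-02-03', '2022-02-04', '2021-05-18', '2022-02-05',
                '2022-05-20', '2022-02-05', '2022-03-12'],
})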

Here is my current implementation:

#initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = [None for i in range(len(df))]

#iterate over all unique triplets that appear in the data
for _, row in df[['ACCOUNT', 'VAR1', 'VAR2']].drop_duplicates().iterrows():
    acc, v1, v2 = row['ACCOUNT'], row['VAR1'], row['VAR2']

    #find the df slice that matches the considered triplet
    fdf = df.loc[(df.ACCOUNT == acc) & (df.VAR1 == v1) & (df.VAR2 == v2)].sort_values('DAY')

    #assign an ordinal rank to each row of the slice
    fdf['DAY_ORD'] = [i for i in range(len(fdf))]

    #write the DAY_ORD values back to the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]
df['DAY_ORD']

It seems to do the job, but it runs very slowly, at around 8 iterations per second. What is a clean way to make this faster?

Vid Stropnik

1 Answer


Convert the DAY values to datetimes, then use GroupBy.rank with method='dense', subtract 1 and convert to integers:

df['DAY'] = pd.to_datetime(df['DAY'])

df['DAY-ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))

print(df)
  ACCOUNT VAR1  VAR2        DAY  DAY-ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3
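
A possible alternative, not from the original answer: since the ranks within each group start at 0 and increase by 1, GroupBy.cumcount over a frame sorted by DAY gives a similar result. The difference is that, unlike rank('dense'), cumcount assigns distinct ordinals to rows that share the same DAY within a group:

# sketch of a cumcount-based variant (assumes DAY is already datetime)
df['DAY-ORD'] = (df.sort_values('DAY')
                   .groupby(['ACCOUNT', 'VAR1', 'VAR2'])
                   .cumcount())

Index alignment puts the values back in the original row order, so the frame itself does not need to stay sorted.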
jezrael
  • I'd like to follow this up by only selecting the subset of rows, the groups of which have the maximal day-ord >= 14. How would you go about doing this? – Vid Stropnik Mar 08 '23 at 13:08
  • 1
    @VidStropnik - Do you need `df.loc[df['DAY-ORD'].ge(14), ['ACCOUNT', 'VAR1', 'VAR2']]` ? – jezrael Mar 09 '23 at 06:02
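
For the follow-up in the comments, if the goal is to keep every row of the groups whose maximal DAY-ORD is at least 14 (rather than only the unique triplets), one hedged sketch is to broadcast the group maximum with transform; max_ord is just an illustrative name:

# broadcast each group's maximum DAY-ORD to its rows, then filter
max_ord = df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY-ORD'].transform('max')
subset = df[max_ord.ge(14)]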