
I'm working with time series data, packaged in a long-format dataframe, something like this:

ACCOUNT | VAR1 | VAR2 | DAY

I'm interested in creating a new column DAY_ORD, which gives each row the ordinal rank of its DAY value within its group, where groups are defined by unique (ACCOUNT, VAR1, VAR2) triplets.

Here is a small example of what I want to achieve:

ACCOUNT  VAR1  VAR2  DAY         DAY_ORD
A        X     True  2022-02-03  0
A        X     True  2022-02-04  1
B        X     True  2021-05-18  0
A        X     True  2022-02-05  2
B        X     True  2022-05-20  1
A        Y     True  2022-02-05  0
A        X     True  2022-03-12  3
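
For reference, this example input (without the desired DAY_ORD column) can be reconstructed with something along these lines; the literal values are taken from the table above, and treating VAR2 as a boolean is an assumption:

import pandas as pd

# hypothetical reconstruction of the sample data shown above
df = pd.DataFrame({
    'ACCOUNT': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
    'VAR1':    ['X', 'X', 'X', 'X', 'X', 'Y', 'X'],
    'VAR2':    [True, True, True, True, True, True, True],
    'DAY':     ['2022-02-03', '2022-02-04', '2021-05-18', '2022-02-05',
                '2022-05-20', '2022-02-05', '2022-03-12'],
})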

Here is my current implementation:

#initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = [None for i in range(len(df))]

#iterate over all unique triplets that appear in the data
for _, row in df[['ACCOUNT', 'VAR1', 'VAR2']].drop_duplicates().iterrows():
    acc, v1, v2 = row['ACCOUNT'], row['VAR1'], row['VAR2']

    #find the df slice that matches the considered triplet
    fdf = df.loc[(df.ACCOUNT == acc) & (df.VAR1 == v1) & (df.VAR2 == v2)].sort_values('DAY')

    #assign an ordinal rank to each row of the slice
    fdf['DAY_ORD'] = [i for i in range(len(fdf))]

    #write the DAY_ORD values back to the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]
df['DAY_ORD']

It seems to do the job, but it runs very slowly, at around 8 iterations per second. What is a clean way to make this faster?

Vid Stropnik

1 Answer


Convert the DAY values to datetimes, then use GroupBy.rank with method='dense', subtract 1 and convert to integers:

df['DAY'] = pd.to_datetime(df['DAY'])

df['DAY-ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))

print(df)
  ACCOUNT VAR1  VAR2        DAY  DAY-ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3
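
A possible alternative, not from the original answer: since the ranks within each group start at 0 and increase by 1, GroupBy.cumcount over a frame sorted by DAY gives a similar result. The difference is that, unlike rank('dense'), cumcount assigns distinct ordinals to rows that share the same DAY within a group:

# sketch of a cumcount-based variant (assumes DAY is already datetime)
df['DAY-ORD'] = (df.sort_values('DAY')
                   .groupby(['ACCOUNT', 'VAR1', 'VAR2'])
                   .cumcount())

Index alignment puts the values back in the original row order, so the frame itself does not need to stay sorted.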
jezrael
  • I'd like to follow this up by only selecting the subset of rows, the groups of which have the maximal day-ord >= 14. How would you go about doing this? – Vid Stropnik Mar 08 '23 at 13:08
  • 1
    @VidStropnik - Do you need `df.loc[df['DAY-ORD'].ge(14), ['ACCOUNT', 'VAR1', 'VAR2']]` ? – jezrael Mar 09 '23 at 06:02
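
For the follow-up in the comments, if the goal is to keep every row of the groups whose maximal DAY-ORD is at least 14 (rather than only the unique triplets), one hedged sketch is to broadcast the group maximum with transform; max_ord is just an illustrative name:

# broadcast each group's maximum DAY-ORD to its rows, then filter
max_ord = df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY-ORD'].transform('max')
subset = df[max_ord.ge(14)]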