0

Background

I have two dataframes. The first one, df_player_snapshot_master, has snapshot data for individual poker players at different times in a tournament and looks like this:

    snapshotDateTime            playerID    deactivationRecord
0   2021-05-06 09:28:24.987995  1679613     False
1   2021-05-06 09:28:24.987995  2660567     False
2   2021-05-06 09:28:24.987995  2668394     False
3   2021-05-06 09:28:24.987995  2280604     False
4   2021-05-06 09:28:24.987995  2018271     False

The second dataframe, df_tournament_summary_master, summarizes this data at a tournament level by snapshot and looks like this:

    intervalStartDateTime       intervalEndDateTime         playersDeactivatedThisInterval
0   2021-05-06 09:28:24.987995  2021-05-06 09:28:38.605930  NaN
1   2021-05-06 09:28:38.605930  2021-05-06 09:28:47.860595  NaN
2   2021-05-06 09:28:47.860595  2021-05-06 09:28:57.187734  NaN
3   2021-05-06 09:28:57.187734  2021-05-06 09:29:07.187734  NaN

My goal is to populate df_tournament_summary_master['playersDeactivatedThisInterval'] by counting the number of records in df_player_snapshot_master where df_player_snapshot_master['snapshotDateTime'] equals df_tournament_summary_master['intervalStartDateTime'] and df_player_snapshot_master['deactivationRecord'] is True. So essentially a COUNTIFS exercise between two dataframes with multiple conditions.

What I've Tried

I translated the solution in this post to my situation which resulted in this code:

tournament_info = df_tournament_summary_master[['intervalStartDateTime']].values

deacs = []

for iSDT in tournament_info:
    deacs.append(len(df_player_snapshot_master[(df_player_snapshot_master['snapshotDateTime']==iSDT) &
                                  (df_player_snapshot_master['deactivationRecord']==True)])
                          )

df_tournament_summary_master['playersDeactivatedThisInterval'] = deacs

but it keeps throwing an error:

ValueError: ('Lengths must match to compare', (332,), (1,))

The post I modeled my code after does mention the need to convert to datetimes and that seems like it might be the issue, but when I check my types for both dataframes/series they all seem to be datetimes already:

In:
display(tournament_info)

Out:
array([['2021-05-06T09:28:24.987995000'],
       ['2021-05-06T09:28:38.605930000'],
       ['2021-05-06T09:28:47.860595000'],
       ['2021-05-06T09:28:57.187734000']], dtype='datetime64[ns]')

In:
df_player_snapshot_master.dtypes

Out:
snapshotDateTime           datetime64[ns]
playerID                           object
deactivationRecord                   bool
dtype: object

Of course, I've looked at other posts that deal with filtering dataframes by comparing dates and they all seem to use different solutions than the for loop I'd like to use, so I'm not sure how to re-purpose and adapt their advice.

Questions

  1. What is causing the ValueError I'm getting?
  2. How do I populate df_tournament_summary_master['playersDeactivatedThisInterval'] with the conditional count using a date comparison condition?
Fist Pump Cat
  • 55
  • 1
  • 4

1 Answers1

0

You can do this by counting the sum of deactivationRecord after a groupby on snapshotDateTime, then merge the dataframes. The results of your sample data aren't that exciting, but it should work:

time_groups = df_player_snapshot_master.groupby('snapshotDateTime')['deactivationRecord'].sum()
df = pd.merge(df_tournament_summary_master, time_groups, how='left', left_on=['intervalStartDateTime'], right_on=['snapshotDateTime'])
df = df.drop(['playersDeactivatedThisInterval'], axis = 1).rename(columns={"deactivationRecord": "playersDeactivatedThisInterval"})

Result:

intervalStartDateTime intervalEndDateTime playersDeactivatedThisInterval
0 2021-05-06 09:28:24.987995 2021-05-06 09:28:38.605930 0
1 2021-05-06 09:28:38.605930 2021-05-06 09:28:47.860595 nan
2 2021-05-06 09:28:47.860595 2021-05-06 09:28:57.187734 nan
3 2021-05-06 09:28:57.187734 2021-05-06 09:29:07.187734 nan
RJ Adriaansen
  • 9,131
  • 2
  • 12
  • 26
  • Thanks for your answer. This does solve the problem. I have other similar applications though where I'm conducting a conditional count but am unsure how this generalizes to one condition, three conditions, or when the conditions have non-Boolean data types. That's why the `for` loop format in the link I referenced is so attractive. Does this generalize easily to other similar scenarios? – Fist Pump Cat May 13 '21 at 21:22