I have a dataframe that looks like:
|-----------------------------------------------------|
| | category | pct_formation |
|-----------------------------------------------------|
|ts_timestamp | | |
|-----------------------------------------------------|
|2018-10-22 10:13:44.043 | in_petr | 37.07 |
|2018-10-22 10:17:09.527 | in_petr | 36.97 |
|2018-10-22 10:17:43.977 | in_dsh | 36.95 |
|2018-10-22 10:17:43.963 | in_dsh | 36.96 |
|2018-10-22 10:17:09.527 | in_petr | 32.96 |
|2018-10-22 10:19:44.040 | out_petr | 36.89 |
|2018-10-23 10:19:44.043 | out_petr | 36.90 |
|2018-10-23 10:19:37.267 | sync | 33.91 |
|2018-10-23 10:19:44.057 | sync | 36.96 |
|2018-10-23 10:19:16.750 | out_petr | 36.88 |
|2018-10-23 10:20:03.160 | sync | 36.98 |
|2018-10-23 10:20:32.350 | sync | 37.00 |
|2018-10-23 10:23:03.150 | sync | 34.58 |
|2018-10-23 10:22:18.633 | in_dsh | 36.98 |
|2018-10-23 10:25:39.557 | in_dsh | 36.97 |
|-----------------------------------------------------|
The data contains pct_formation
values for various categories collected at different times every day (irregular frequency, unevenly spaced).
I want to compare the average pct_formation of each category for a 10-minute rolling window between 9am and 11am, on each day or average over a week.
The problem is that the data for each category does not always start coming in at 9am. For some, it starts at 9.10am, for some at 9.15am, for some at 10am and so on. Also, the data does not come at regular intervals. How can I get the 10-minute rolling average for each day and each category between 9am and 11am?
Initially, I converted ts_timestamp
column to an index:
df = df.set_index('ts_timestamp')
Then, I can groupby
and use rolling()
as such:
df.groupby('category').rolling('10T').agg({'pct_formation': 'mean'})
However, this does not show me regular 10 minute intervals, but shows the timestamps from the dataframe.
I realize that I would need to create a data range like so to be used as index:
pd.date_range(start=df.index.min().replace(hour=9, minute=0, second=0, microsecond=0),
end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
freq='10T')
#
# or should I use freq='1T' so that rolling() can do 10 minute intervals?
But, how can I align my data frame with this range? How can I average multiple values that occur between the range?
I am new to working with time series data, and would appreciate any help. Please feel free to ask if anything is not clear.