Pandas timeseries: groupby and rolling average of irregularly spaced data over regular 10-minute windows

Question

I have a dataframe that looks like:

|-----------------------------------------------------|
|                        | category   | pct_formation |
|-----------------------------------------------------|
|ts_timestamp            |            |               |
|-----------------------------------------------------|
|2018-10-22 10:13:44.043 | in_petr    | 37.07         |
|2018-10-22 10:17:09.527 | in_petr    | 36.97         |
|2018-10-22 10:17:43.977 | in_dsh     | 36.95         |
|2018-10-22 10:17:43.963 | in_dsh     | 36.96         |
|2018-10-22 10:17:09.527 | in_petr    | 32.96         |
|2018-10-22 10:19:44.040 | out_petr   | 36.89         |
|2018-10-23 10:19:44.043 | out_petr   | 36.90         |
|2018-10-23 10:19:37.267 | sync       | 33.91         |
|2018-10-23 10:19:44.057 | sync       | 36.96         |
|2018-10-23 10:19:16.750 | out_petr   | 36.88         |
|2018-10-23 10:20:03.160 | sync       | 36.98         |
|2018-10-23 10:20:32.350 | sync       | 37.00         |
|2018-10-23 10:23:03.150 | sync       | 34.58         |
|2018-10-23 10:22:18.633 | in_dsh     | 36.98         |
|2018-10-23 10:25:39.557 | in_dsh     | 36.97         |
|-----------------------------------------------------|

The data contains pct_formation values for various categories collected at different times every day (irregular frequency, unevenly spaced).

I want to compare the average pct_formation of each category for a 10-minute rolling window between 9am and 11am, on each day or average over a week.

The problem is that the data for each category does not always start coming in at 9am. For some, it starts at 9.10am, for some at 9.15am, for some at 10am and so on. Also, the data does not come at regular intervals. How can I get the 10-minute rolling average for each day and each category between 9am and 11am?

Initially, I converted ts_timestamp column to an index:

df = df.set_index('ts_timestamp')

Then, I can groupby and use rolling() as such:

df.groupby('category').rolling('10T').agg({'pct_formation': 'mean'})

However, this does not show me regular 10 minute intervals, but shows the timestamps from the dataframe.

I realize that I would need to create a data range like so to be used as index:

pd.date_range(start=df.index.min().replace(hour=9, minute=0, second=0, microsecond=0),
              end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
              freq='10T')
#
# or should I use freq='1T' so that rolling() can do 10 minute intervals?

But, how can I align my data frame with this range? How can I average multiple values that occur between the range?

I am new to working with time series data, and would appreciate any help. Please feel free to ask if anything is not clear.

Time-aware rolling is what you are looking for: https://stackoverflow.com/a/41176540 — toliveira, Oct 07 '20 at 22:23

Vishnu Kunchur · Answer 1 · 2018-10-26T02:01:30.840

Using pd.Grouper:

df.groupby(['category', pd.Grouper(key = 'ts_timestamp', freq = '10Min')]).\ agg({'pct_formation': 'mean'})

Output:

                                    pct
cat      ts                            
in_dsh   2018-10-22 10:10:00  36.955000
in_petr  2018-10-22 10:10:00  35.666667
out_petr 2018-10-22 10:10:00  36.890000
         2018-10-23 10:10:00  36.900000
sync     2018-10-23 10:10:00  35.435000

Pandas timeseries: groupby and rolling average of irregularly spaced data over regular 10-minute windows

1 Answers1