2

I am dealing with a dataset where observations occur between opening and closing hours -- but the service closes on the day after it opens. For example, opening occurs at 7am and closing at 1am, the following day.

This feels like a very common problem -- I've searched around for it and am open to the fact I might just not know the correct terms to search for.

For most of my uses it's enough to do something like:

   open_close = pd.DatetimeIndex(start='2012-01-01 05:00:00', periods = 15, offset='D')

Then I can just do fun little groupbys on the df:
df.groupby(open_close.asof).agg(func).

But I've run into an instance where I need to grab multiple of these open-close periods. What I really want to be able to do is just have an DatetimeIndex where I get to pick when an day starts. So I could just redefine 'day' to be from 5AM to 5AM. The nice thing about this is I can then use things like df[df.index.dayofweek == 6] and get back everything from 5AM on Sunday to 5AM on Monda.

It feels like Periods...or something inside of pandas anticipated this request. Would love help figuring it out.

EDIT:

I've also figured this out via creating another column with the right day
df['shift_day'] = df['datetime'].apply(magicFunctionToFigureOutOpenClose)
-- so this isn't blocking my progress. Just feels like something that could be nicely integrated into the package (or datetime...or somewhere...)

Garrett
  • 47,045
  • 6
  • 61
  • 50
badgley
  • 1,657
  • 1
  • 13
  • 19

1 Answers1

1

Perhaps the base parameter of df.resample() would help:

base : int, default 0
    For frequencies that evenly subdivide 1 day, the "origin" of the
    aggregated intervals. For example, for '5min' frequency, base could
    range from 0 through 4. Defaults to 0

Here's an example:

In [44]: df = pd.DataFrame(np.random.rand(28),
....:           index=pd.DatetimeIndex(start='2012/9/1', periods=28, freq='H'))

In [45]: df
Out[45]: 
                            0
2012-09-01 00:00:00  0.970273
2012-09-01 01:00:00  0.730171
2012-09-01 02:00:00  0.508588
2012-09-01 03:00:00  0.535351
2012-09-01 04:00:00  0.940255
2012-09-01 05:00:00  0.143483
2012-09-01 06:00:00  0.792659
2012-09-01 07:00:00  0.231413
2012-09-01 08:00:00  0.071676
2012-09-01 09:00:00  0.995202
2012-09-01 10:00:00  0.236551
2012-09-01 11:00:00  0.904853
2012-09-01 12:00:00  0.652873
2012-09-01 13:00:00  0.488400
2012-09-01 14:00:00  0.396647
2012-09-01 15:00:00  0.967261
2012-09-01 16:00:00  0.554188
2012-09-01 17:00:00  0.884086
2012-09-01 18:00:00  0.418577
2012-09-01 19:00:00  0.189584
2012-09-01 20:00:00  0.577041
2012-09-01 21:00:00  0.100332
2012-09-01 22:00:00  0.294672
2012-09-01 23:00:00  0.925425
2012-09-02 00:00:00  0.630807
2012-09-02 01:00:00  0.400261
2012-09-02 02:00:00  0.156469
2012-09-02 03:00:00  0.658608

 

In [46]: df.resample("24H", how=sum, label='left', closed='left', base=5)
Out[46]: 
                             0
2012-08-31 05:00:00   3.684638
2012-09-01 05:00:00  11.671068

In [47]: df.ix[:5].sum()
Out[47]: 0    3.684638

In [48]: df.ix[5:].sum()
Out[48]: 0    11.671068
Garrett
  • 47,045
  • 6
  • 61
  • 50
  • hey that's a pretty cool way to groupby! But I can already group fine -- my problem comes when I need to get multiple periods. Say, I want the last three 'Friday periods' that span from [5AM Fri, 5AM Sat). Starting to think I didn't describe my problem well enough -- will edit. – badgley Nov 13 '12 at 21:45
  • No problem, sorry for missing the target. The hard part seemed to be aligning the data. Since that's solved, is your next question: how to grab the last X instances of day Y? Would ``df[df.index.dayofweek == Y].ix[-X:]`` work? – Garrett Nov 15 '12 at 03:31
  • That doesn't work for me -- the problem here addresses my core question. When you filter by dayofweek you make it so you can't reach 'across' days as defined by [12am DayofWeek1, 12am DayofWeek2). – badgley Nov 17 '12 at 01:37
  • Ok, how about shifting the timestamps first with something like ``df.shift(freq='-5H')``? – Garrett Nov 17 '12 at 19:12
  • @crewburn -- base DOES do what i need. I ran across this --http://stackoverflow.com/questions/12579150/resample-hourly-timeseries-with-certain-starting-hour – badgley Sep 09 '13 at 00:59