Pandas groupby training/validation split

Question

I have a daily temperature dataset and I am trying to build a model that operates on one week of data at a time. I've imported it into a pandas DataFrame and grouped it by week (using the resample method). So far so good.

Please note, I do not want to aggregate the weekly data, I just want to group my "flat" dataset into weekly "chunks" that I can feed into the model one at a time.

I was able to accomplish this with the code below, but my question is:

How can I split this grouped DataFrame into training/validation sets?

Here is what I've tried so far (and mostly failed):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D")
)
print("days:", len(daily))

weekly = daily.resample("W")
print("weeks:", len(weekly))

mask = np.random.rand(len(weekly)) < .8
# Both of these give KeyError: 'Columns not found: False, True'
train = weekly[mask]
valid = weekly[~mask]

# This also fails with KeyError: 'Columns not found: 12'
train, valid = train_test_split(weekly, train_size=.8)

UPDATE:

In the meantime, I came up with a pair of generators I can use for training/validation:

def gen_train(df, mask):
    for index, (_, data) in enumerate(df):
        if mask[index]: yield data

def gen_valid(df, mask):
    for index, (_, data) in enumerate(df):
        if not mask[index]: yield data

mask = np.random.rand(len(weekly)) < .8

model.fit(x=gen_train(weekly, mask), validation_data=gen_valid(weekly, mask),
    ...
)

Unfortunately, this doesn't shuffle the data.

Can anyone come up with a better solution?

Does this answer your question? [How to convert DatetimeIndexResampler to DataFrame?](https://stackoverflow.com/questions/39492004/how-to-convert-datetimeindexresampler-to-dataframe) — Dave, May 14 '20 at 18:39
@Dave how does that question apply here?... If you think it does, please post an answer with example. — Super-intelligent Shade, May 14 '20 at 18:49
Just use [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html). — David, May 14 '20 at 19:08

Dave · Answer 1 · 2020-05-14T21:54:02.013

1

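The problem is that resample("W") on its own returns a lazy Resampler object, not a DataFrame, so it can't be indexed with a boolean mask or passed to train_test_split. Apply an aggregation to it and your code works: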

...
weekly = daily.resample("W").mean() # <- Note the call to complete the resample with weekly mean
train, valid = train_test_split(weekly, train_size=.8)

train.shape
# (42, 1)

valid.shape
# (11, 1)

42 / (42 + 11)
# 0.7924528301886793
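
For context on why the original indexing failed, here is a minimal sketch (rebuilding the question's random data) showing that the un-aggregated resampler is only a grouping object, and that iterating over it yields one (label, DataFrame) pair per week. The name `chunks` is just illustrative.

```python
import numpy as np
import pandas as pd

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

weekly = daily.resample("W")      # a Resampler, not a DataFrame
print(type(weekly).__name__)

# Iterating yields (week-ending label, DataFrame holding that week's rows).
chunks = [week for _, week in weekly]
print(len(chunks))                # number of weekly bins
print(chunks[0].shape, chunks[1].shape, chunks[-1].shape)
```

With this date range the first and last bins are partial weeks (6 and 2 days), which matters if the model expects exactly 7 rows per chunk.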

EDIT: If you don't want to aggregate, just loop through the weeks with a groupby:

...
for date, week in daily.groupby(pd.Grouper(freq='W')):
    train, valid = train_test_split(week, train_size=.8)
    print(date)
    print(train.shape)
    print(valid.shape)

2019-01-06 00:00:00
(4, 1)
(2, 1)
2019-01-13 00:00:00
(5, 1)
(2, 1)
2019-01-20 00:00:00
(5, 1)
(2, 1)
2019-01-27 00:00:00
(5, 1)
(2, 1)
2019-02-03 00:00:00
(5, 1)
(2, 1)
...

EDIT: If you want to sample weeks as the unit of observation, you'll want to make a new column for them:

iso = daily.index.isocalendar()  # DatetimeIndex.week was removed in pandas 2.0
daily['week'] = iso.year.astype(str) + '-' + iso.week.astype(str)

                  temp     week
2019-01-01   98.551345   2019-1
2019-01-02  103.880149   2019-1
2019-01-03   48.187819   2019-1
2019-01-04  116.942540   2019-1
2019-01-05   21.342152   2019-1
...                ...      ...

Then train/test split the weeks and select the rows:

train_weeks, test_weeks = train_test_split(daily.week.unique(), train_size=.8)
train = daily[daily.week.isin(train_weeks)]
test = daily[daily.week.isin(test_weeks)]

train.shape
#(288, 2)

test.shape
#(77, 2)
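
If the model needs the weeks one chunk at a time rather than as two flat DataFrames, the same idea can be applied to the group labels. This is a rough sketch, not part of the original answer: it assumes partial weeks should be dropped, and the names `chunks`, `full_weeks`, `train_chunks` and `valid_chunks` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# Map each week-ending label to that week's rows (no aggregation).
chunks = {label: week for label, week in daily.groupby(pd.Grouper(freq="W"))}

# Assumption: only complete 7-day weeks are wanted.
full_weeks = [label for label, week in chunks.items() if len(week) == 7]

# train_test_split shuffles by default, so whole weeks are assigned at random.
train_labels, valid_labels = train_test_split(full_weeks, train_size=0.8)

train_chunks = [chunks[label] for label in train_labels]
valid_chunks = [chunks[label] for label in valid_labels]
print(len(train_chunks), len(valid_chunks))
```

Each element of `train_chunks` and `valid_chunks` is a 7-row DataFrame, already in shuffled order.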

edited May 14 '20 at 21:54

answered May 14 '20 at 18:58

Dave


Dave, but I am not trying to get a mean (or any other agg method). I just want to split my dataset into weekly "chunks", so I can feed them into my model one week at a time – Super-intelligent Shade May 14 '20 at 19:05
Dave, I've added more data to my question to hopefully make it more clear. Your answer doesn't apply to my question, and while I won't downvote it, others might. If I may suggest, please delete it before anyone does. :) – Super-intelligent Shade May 14 '20 at 19:36
I updated my answer to just chunk weeks through test-train. – Dave May 14 '20 at 21:04
Almost, but no cigar. You are splitting each week into 5 and 2 days. I don't want that. I want to randomly split the original dataset into full weeks (7 days) and some of those full weeks to be used for training and some for validation. – Super-intelligent Shade May 14 '20 at 21:14
Now _that_ is a clear explanation of what you want! Take a look at my edit. – Dave May 14 '20 at 21:56
Thanks @Dave. I plus-oned. I ended up converting my data into 2D array and it turned out to be surprisingly simple and I was able to use train_test_split. (I should prolly post my solution in case someone else runs into this in the future.) – Super-intelligent Shade May 22 '20 at 21:15
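
The 2D-array solution mentioned in the last comment was never posted in this thread. As a guess at what it might have looked like, here is a minimal sketch that assumes incomplete weeks are dropped and that a single "temp" column is the only feature; `X` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# Assumption: drop the partial weeks at either end so every row has 7 values.
weeks = [week["temp"].to_numpy() for _, week in daily.resample("W") if len(week) == 7]
X = np.stack(weeks)   # shape: (n_weeks, 7), one row per full week

# Rows are whole weeks, so the split (shuffled by default) never breaks a week apart.
train, valid = train_test_split(X, train_size=0.8)
print(train.shape, valid.shape)
```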

Ehsan · Answer 2 · 2020-05-15T14:48:17.113

1

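Use itertools.compress, which keeps only the items of an iterable whose mask entry is true. Iterating over the resampler yields (label, DataFrame) pairs, so that is what train and valid produce: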

from itertools import compress

train = compress(weekly, mask)
valid = compress(weekly, ~mask)
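
A slightly fuller sketch of the same idea, rebuilding the question's data. The unpacking and the shuffle step are additions for illustration: since compress is lazy and preserves order, the selected weeks are collected into lists here so they can be shuffled.

```python
import random
from itertools import compress

import numpy as np
import pandas as pd

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

pairs = list(daily.resample("W"))          # [(label, DataFrame), ...]
mask = np.random.rand(len(pairs)) < .8     # True -> training week

# compress keeps the pairs whose mask entry is truthy; drop the labels.
train = [week for _, week in compress(pairs, mask)]
valid = [week for _, week in compress(pairs, ~mask)]

random.shuffle(train)                      # lists can be shuffled in place
print(len(train), len(valid))
```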

edited May 15 '20 at 14:48

answered May 14 '20 at 19:21

Ehsan


Thank you @Ehsan. I haven't tried it, but it looks like a good option. – Super-intelligent Shade May 22 '20 at 21:10
