I have a daily temperature dataset and I am trying to build a model that operates on one week of data at a time. I've imported it into a pandas DataFrame and grouped it by week (using the resample method). So far so good.
Please note, I do not want to aggregate the weekly data, I just want to group my "flat" dataset into weekly "chunks" that I can feed into the model one at a time.
I was able to accomplish that with the code below, but my question is:
How can I split this grouped DataFrame into training/validation sets?
Here is what I've tried so far (and mostly failed):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D")
)
print("days:", len(daily))
weekly = daily.resample("W")
print("weeks:", len(weekly))
mask = np.random.rand(len(weekly)) < .8
# Both of these give KeyError: 'Columns not found: False, True'
train = weekly[mask]
valid = weekly[~mask]
# This also fails with KeyError: 'Columns not found: 12'
train, valid = train_test_split(weekly, train_size=.8)
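Both errors seem to have the same cause: indexing a resampler with `[...]` selects columns, not groups, so the mask values (or the row indices that `train_test_split` passes in) are looked up as column names. A minimal sketch of one possible workaround, assuming it is acceptable to materialize the weekly chunks into a plain Python list first. The names `weeks`, `train_weeks`, and `valid_weeks` are made up for illustration, and a NumPy permutation stands in for `train_test_split`:

```python
import numpy as np
import pandas as pd

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# Iterating a resampler yields (bin_label, sub-DataFrame) pairs;
# keep only the un-aggregated weekly chunks.
weeks = [chunk for _, chunk in daily.resample("W")]

# Shuffle the week positions, then cut the shuffled order at 80%.
rng = np.random.default_rng(seed=0)  # fixed seed is an assumption, for repeatability
order = rng.permutation(len(weeks))
n_train = int(0.8 * len(weeks))

train_weeks = [weeks[i] for i in order[:n_train]]
valid_weeks = [weeks[i] for i in order[n_train:]]

print("weeks:", len(weeks), "train:", len(train_weeks), "valid:", len(valid_weeks))
```

Each element of `train_weeks` and `valid_weeks` is still an ordinary DataFrame of up to seven daily rows, so nothing is aggregated.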
UPDATE:
In the meantime, I came up with a pair of generators I can use for training/validation:
def gen_train(df, mask):
    for index, (_, data) in enumerate(df):
        if mask[index]:
            yield data

def gen_valid(df, mask):
    for index, (_, data) in enumerate(df):
        if not mask[index]:
            yield data

mask = np.random.rand(len(weekly)) < .8
model.fit(
    x=gen_train(weekly, mask), validation_data=gen_valid(weekly, mask),
    ...
)
Unfortunately, this doesn't shuffle the data: each generator always yields its weeks in chronological order.
Can anyone come up with a better solution?
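
To make the shuffling requirement concrete, here is a rough sketch of a generator that yields every chunk exactly once in a random order. It assumes the chunks have already been collected into a list, and that a new generator is created for each pass over the data. `gen_shuffled` is a hypothetical helper, and the small NumPy arrays are only stand-ins for the weekly DataFrames:

```python
import numpy as np

def gen_shuffled(chunks, rng):
    """Yield each chunk exactly once, in a random order drawn from rng."""
    for i in rng.permutation(len(chunks)):
        yield chunks[i]

# Stand-in data: five "weeks", each filled with its own week number.
chunks = [np.full(7, week) for week in range(5)]
rng = np.random.default_rng(seed=0)  # fixed seed is an assumption, for repeatability

# Each call builds a fresh generator, so each pass gets its own ordering.
first_pass = [int(chunk[0]) for chunk in gen_shuffled(chunks, rng)]
second_pass = [int(chunk[0]) for chunk in gen_shuffled(chunks, rng)]
print(first_pass, second_pass)
```

Whether a training loop accepts a fresh generator per epoch depends on the framework, so that part is left open here.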