
I have a PySpark dataset that I want to split into train and test sets by a datetime column: the train set should contain the rows whose datetimes are less than the median of that column, and the test set should contain the rest.

I've tried sorting the dataset by the datetime column and selecting the first half. But that only solves the train part; I don't know how to "subtract" the train dataset from the initial dataset in PySpark.

train = data.orderBy('datetime').limit(data.count() // 2)
# test = ?

It would be great if PySpark had an analog of Pandas' tail() function, but it does not.

1 Answer


You can add a column that ranks the datetime and then split the DataFrame on that rank. The percent_rank function gives the relative rank (i.e. the percentile), iirc.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window over the whole DataFrame, ordered by datetime
data_win = Window.partitionBy().orderBy('datetime')

# percent_rank() is 0.0 for the earliest row and 1.0 for the latest
dt_rank = data.withColumn('percent_rank', F.percent_rank().over(data_win))

train = dt_rank.filter(F.col('percent_rank') <= 0.5)
test = dt_rank.filter(F.col('percent_rank') > 0.5)
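Two side notes. First, a window with an empty partitionBy() moves all rows into a single partition, which can be slow on large data. A minimal sketch of an alternative, assuming 'datetime' is a timestamp column: compute the median with approxQuantile (a relative error of 0.0 makes it exact, at some extra cost) and filter on it.

# Median of the datetime column, as seconds since the epoch
median_ts = data.select(F.col('datetime').cast('long').alias('ts')) \
    .approxQuantile('ts', [0.5], 0.0)[0]

train = data.filter(F.col('datetime').cast('long') <= median_ts)
test = data.filter(F.col('datetime').cast('long') > median_ts)

Second, if you specifically want to "subtract" one DataFrame from another as in the question, exceptAll (available since Spark 2.4) removes the rows of one DataFrame from another by value, respecting duplicates, so the original attempt could be completed like this (keeping in mind that orderBy + limit is not guaranteed to break ties deterministically):

train = data.orderBy('datetime').limit(data.count() // 2)
# Remove the train rows from the full DataFrame by value
test = data.exceptAll(train)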
absolutelydevastated