I have a PySpark DataFrame that I want to split into train and test sets by a datetime column: the train set should contain the rows whose datetime is less than the median of that column, and the test set should contain the rest.
I've tried sorting the DataFrame by the datetime column and selecting the first half. But that only solves the train part; I don't know how to "subtract" the train DataFrame from the initial one in PySpark:
# take the first half of the rows after sorting by datetime
train = data.orderBy('datetime').limit(data.count() // 2)
# test = ?
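I did come across DataFrame.exceptAll (available since Spark 2.4), which might be the "subtraction" I'm after, but I'm not sure it's the right tool here. A sketch of what I mean:

# exceptAll is a multiset difference: it removes each train row once,
# unlike subtract(), which also deduplicates the result
test = data.exceptAll(train)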
It would be great if PySpark had an analogue of Pandas' tail(); there is DataFrame.tail(n) in Spark 3.0+, but it collects the rows to the driver as a plain list rather than returning a DataFrame, so it doesn't help here.
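For reference, the closest I've come up with is computing the median explicitly with approxQuantile and filtering on it. This is only a sketch and I'm not sure it's idiomatic; it assumes 'datetime' is a timestamp column (or a string in the default yyyy-MM-dd HH:mm:ss format), since approxQuantile needs a numeric column:

from pyspark.sql import functions as F

# cast the datetime to seconds since the epoch so approxQuantile can work on it
with_ts = data.withColumn('ts', F.unix_timestamp('datetime'))

# relativeError=0.0 makes approxQuantile return the exact median
median_ts = with_ts.approxQuantile('ts', [0.5], 0.0)[0]

# rows strictly below the median go to train, everything else to test
train = with_ts.filter(F.col('ts') < median_ts).drop('ts')
test = with_ts.filter(F.col('ts') >= median_ts).drop('ts')

Is there a cleaner way to do this kind of median split?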