I have a small dataset (140K rows) that I would like to split into training, validation, and test sets, using the target variable and another field to stratify those splits.

Jay Gondin
1 Answer

In PySpark you can use the randomSplit() function to divide a dataset into train, validation, and test parts. It takes two arguments: weights and an optional seed. The seed is fixed so that the split is reproducible. The weights are floating-point numbers specifying what fraction of the data goes into each part; if they don't sum to 1, they are normalized.

Sample Code

train, validation, test = data.randomSplit([0.8, 0.1, 0.1], seed=785)
Sagar Dubey
  • thank you for the reply. It looks good; unfortunately `randomSplit` creates random train, validation, and test parts. I wish I could have the splits stratified by a feature, so each split has the same percentage of each class. Something similar to [Stratified sampling with pyspark](https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark), as @pissall has mentioned, but for a split – Jay Gondin Oct 04 '19 at 22:49
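
For the stratified variant the comment asks about: PySpark has no direct equivalent of scikit-learn's `train_test_split(..., stratify=...)`, but you can approximate one with `sampleBy()`, as in the linked question. Below is a minimal sketch, assuming a DataFrame `data` with distinct rows and a stratification column named `label` (both names are hypothetical):

from pyspark.sql import functions as F

# To stratify on two fields (e.g. the target plus another column),
# concatenate them into a single hypothetical key column first:
# data = data.withColumn("label", F.concat_ws("_", "target", "other_field"))

# One sampling fraction per class: ~80% of each stratum goes to train
fractions = {row["label"]: 0.8
             for row in data.select("label").distinct().collect()}
train = data.sampleBy("label", fractions, seed=785)

# The leftover ~20% keeps roughly the same class proportions.
# subtract() drops duplicates, so add a unique id column if rows can repeat.
rest = data.subtract(train)

# Split the remainder ~50/50 per class into validation and test
halves = {label: 0.5 for label in fractions}
validation = rest.sampleBy("label", halves, seed=785)
test = rest.subtract(validation)

Note that `sampleBy` samples each row independently with the given probability, so the split sizes are approximate rather than exact; on 140K rows the class proportions should still come out close to the requested fractions.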