I have a small dataset (~140K rows) that I would like to split into a training set, a validation set, and a test set, stratifying those splits by the target variable and another field.
- Possible duplicate of [Stratified sampling with pyspark](https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark) – pissall Sep 19 '19 at 15:49
- @pissall A stratified sample and a stratified split are not quite the same. But it is a good start, thanks – Jay Gondin Sep 19 '19 at 20:54
- It's an example of how you can do proportionate allocation using the `groupby` method. Just pick up the logic and adapt it to your use case. – pissall Sep 20 '19 at 04:04
- Check this out: https://stackoverflow.com/a/61016937/8836068 – brainoverflow98 May 27 '20 at 10:19
1 Answer
In PySpark you can use the `randomSplit()` function to divide a dataset into train and test sets. It takes up to two arguments: weights and a seed. The seed is used so that you get the same output every time. The weights are floating-point numbers specifying what proportion of the data goes into the train, validation, and test parts; if they don't sum to 1, they are normalized.
Sample Code

```python
data.randomSplit([0.8, 0.1, 0.1], 785)
```
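A minimal sketch of how the call is typically used (assuming `data` is an existing Spark DataFrame), capturing the three resulting DataFrames:

```python
# ~80% train, ~10% validation, ~10% test; the seed makes the split reproducible
train_df, val_df, test_df = data.randomSplit([0.8, 0.1, 0.1], seed=785)
```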

– Sagar Dubey
- Thank you for the reply. It looks good; unfortunately, `randomSplit` creates random train, validation and test parts. I wish I could have the splits stratified by a feature, so that each split has the same percentage of each class. Something similar to [Stratified sampling with pyspark](https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark), as @pissall has mentioned, but for a split – Jay Gondin Oct 04 '19 at 22:49
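- For anyone looking for the stratified variant asked about above, here is a rough sketch of one way to do it with `sampleBy` plus `exceptAll` (Spark 2.4+). The column names `target` and `other_field` are placeholders for the question's two stratification fields; this is an illustration, not the approach from the linked answer:

```python
from pyspark.sql import functions as F

# Hypothetical stratification key combining the target and the second field
df = data.withColumn("strata", F.concat_ws("_", F.col("target"), F.col("other_field")))

strata = [row["strata"] for row in df.select("strata").distinct().collect()]

# Take roughly 80% of each stratum for training (sampleBy is approximate, not exact)
train = df.sampleBy("strata", fractions={s: 0.8 for s in strata}, seed=42)
rest = df.exceptAll(train)

# Split the remaining ~20% roughly in half, again per stratum, for validation and test
val = rest.sampleBy("strata", fractions={s: 0.5 for s in strata}, seed=42)
test = rest.exceptAll(val)
```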