I have a small dataset (~140K rows) that I would like to split into a training set, a validation set, and a test set, stratifying those splits by the target variable and another field.
- Possible duplicate of [Stratified sampling with pyspark](https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark) – pissall Sep 19 '19 at 15:49
- @pissall A stratified sample and a stratified split are not quite the same. But it is a good start, thanks – Jay Gondin Sep 19 '19 at 20:54
- It's an example of how you can do proportionate allocation using the `groupby` method. Just pick up the logic and adapt it to your use case. – pissall Sep 20 '19 at 04:04
- Check this out: https://stackoverflow.com/a/61016937/8836068 – brainoverflow98 May 27 '20 at 10:19
1 Answer
In PySpark you can use the `randomSplit()` function to divide a dataset into train and test sets. It takes up to two arguments: weights and a seed. The seed is used so that you get the same output every time. The weights are floating-point numbers specifying what proportion of the data goes into the train, validation, and test parts; if they don't sum to 1, they are normalized.
Sample Code

```python
data.randomSplit([0.8, 0.1, 0.1], 785)
```
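A minimal sketch of how the call is typically used (assuming `data` is an existing Spark DataFrame), capturing the three resulting DataFrames:

```python
# ~80% train, ~10% validation, ~10% test; the seed makes the split reproducible
train_df, val_df, test_df = data.randomSplit([0.8, 0.1, 0.1], seed=785)
```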

– Sagar Dubey
- Thank you for the reply. It looks good; unfortunately, `randomSplit` creates random train, validation and test parts. I wish I could have the splits stratified by a feature, so that each split has the same percentage of each class. Something similar to [Stratified sampling with pyspark](https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark), as @pissall has mentioned, but for a split – Jay Gondin Oct 04 '19 at 22:49
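- For anyone looking for the stratified variant asked about above, here is a rough sketch of one way to do it with `sampleBy` plus `exceptAll` (Spark 2.4+). The column names `target` and `other_field` are placeholders for the question's two stratification fields; this is an illustration, not the approach from the linked answer:

```python
from pyspark.sql import functions as F

# Hypothetical stratification key combining the target and the second field
df = data.withColumn("strata", F.concat_ws("_", F.col("target"), F.col("other_field")))

strata = [row["strata"] for row in df.select("strata").distinct().collect()]

# Take roughly 80% of each stratum for training (sampleBy is approximate, not exact)
train = df.sampleBy("strata", fractions={s: 0.8 for s in strata}, seed=42)
rest = df.exceptAll(train)

# Split the remaining ~20% roughly in half, again per stratum, for validation and test
val = rest.sampleBy("strata", fractions={s: 0.5 for s in strata}, seed=42)
test = rest.exceptAll(val)
```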