
I know that there is a `class_weight` parameter in `sklearn.ensemble.RandomForestClassifier` in version 0.17 of scikit-learn.

I cannot install 0.17. How do I access this parameter in version 0.14?

Or, is there another way to deal with imbalanced labels (y values) in a RandomForestClassifier? I have a binary classifier with many more negatives than positives, which naturally skews the results, so I want to set the class weights to offset this.

makansij

1 Answer


Looking at the source, it doesn't look like this is implemented in 0.14. Instead, you can down-sample the negative class to get an even balance:

import numpy as np

# Fake class labels skewed toward the negative class:
real_p = 0.01  # true probability of class 1 (unknown in real cases)
Y = (np.random.rand(10000) < real_p).astype(int)

# Use the observed fraction of positive examples as an estimate of P(Y=1):
p = (Y == 1).mean()

print("Label balance: %.3f pos / %.3f neg" % (p, 1 - p))

# Resample the training set: keep every positive example, and keep each
# negative example with probability p so the classes come out roughly even.
inds = np.zeros(Y.shape[0], dtype=bool)
inds[np.where(Y == 1)] = True
inds[np.where(Y == 0)] = np.random.rand((Y == 0).sum()) < p

resample_p = (Y[inds] == 1).mean()

print("After resampling:")
print("Label balance: %.3f pos / %.3f neg" % (resample_p, 1 - resample_p))

Output:

Label balance: 0.013 pos / 0.987 neg
After resampling:
Label balance: 0.531 pos / 0.469 neg

Note that this is a very simplistic means of down-sampling the negative class. A better approach might be to integrate the down-sampling or weighting into the learning scheme - perhaps a boosting or cascade approach?
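For completeness, here is a rough sketch of how the resampled mask might be used to train the forest. The feature matrix `X` below is made-up stand-in data, since the snippet above only generates labels:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical features -- replace with your real feature matrix:
X = np.random.rand(Y.shape[0], 5)

# Fit on the down-sampled (roughly balanced) subset selected by `inds`:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X[inds], Y[inds])

# Predict on the full, imbalanced data as usual:
pred = clf.predict(X)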

Matt Hancock
  • I was thinking maybe an easier way would be to use the `sample_weight` parameter. By examining my training samples, I could just find out what class each example belongs to, and create a vector of dimension `n_samples x 1` to weight each sample appropriately (a sketch of this approach follows after these comments). Would that work? – makansij Dec 21 '15 at 16:08
  • I posed this question here: http://stackoverflow.com/questions/34389624/what-does-sample-weight-do-to-the-way-a-decisiontreeclassifier-works-in-skle – makansij Dec 21 '15 at 16:09
  • Downsampling like I've shown above is about the most straightforward approach you can take. I suggest giving it a shot, first. I don't believe scaling the feature vectors will work as you suggest. What you really need to do is augment the impurity function used by the algorithm for making node splits. If you really want to get down into the weeds of this, you need a good understanding of CART and impurity measures, which you can find in Breiman's textbook on CART (or around the net). – Matt Hancock Dec 21 '15 at 16:45
  • By the way, [this is the paper cited in sklearn](http://gking.harvard.edu/files/0s.pdf) on which the class weighting scheme for the `DecisionTreeClassifier` is based. – Matt Hancock Dec 21 '15 at 16:48
  • 1
    @Hunle I'm sorry - I misread your comment as still referring to `class_weight`. You're correct about using `sample_weight`. – Matt Hancock Dec 21 '15 at 19:23
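A minimal, self-contained sketch of the `sample_weight` idea discussed in the comments above; `X` and `Y` here are illustrative, not data from the question:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative imbalanced data:
Y = (np.random.rand(10000) < 0.01).astype(int)
X = np.random.rand(Y.shape[0], 5)

# Weight each sample inversely to its class frequency, so the rare positive
# class carries as much total weight as the abundant negative class:
p = (Y == 1).mean()
weights = np.where(Y == 1, 1.0 / p, 1.0 / (1.0 - p))

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, Y, sample_weight=weights)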