
The variable names for my training and test data are X_train, X_test, Y_train, and Y_test.

I have run a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [500],
        'max_features': ['sqrt', None],
        'max_depth': [6],
        'max_leaf_nodes': [8],
        'min_impurity_decrease': [0, 0.02],
        'min_samples_split': [2]
    }

    grid_search = GridSearchCV(
        RandomForestClassifier(
            criterion='gini',
            min_weight_fraction_leaf=0.0,
            bootstrap=True,
            n_jobs=-1,
            random_state=1,
            verbose=0,
            warm_start=False,
            class_weight='balanced',
            ccp_alpha=0.0,
            max_samples=None,
        ),
        param_grid=param_grid,
        scoring='balanced_accuracy',
        cv=2,
        n_jobs=-1,
        verbose=50,
    )
    grid_search.fit(X_train, Y_train)

All the scores that I can see while the grid search is training are in the range of 0.4 to 0.6. The following is the output for the best score:

[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time=  15.4s

My question is: when I manually calculate balanced accuracy with balanced_accuracy_score from sklearn.metrics, by running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train, adjusted=False)), I get a value of about 0.96, which is very different from what GridSearchCV reports during the run. Why is this so? And what does the score in GridSearchCV mean, then? Note that I have passed scoring='balanced_accuracy' to GridSearchCV to make sure they calculate the same thing.

1 Answer


The score you get from GridSearchCV is the validation score: for each fold it is measured on the part of X_train that was not used to train that model.

Your manually calculated score is the training score (you fit the model and evaluate the score on the same data: X_train).

The large difference between the two is a sign of overfitting.
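You can look at both numbers side by side once the search has finished. A minimal sketch, reusing grid_search, X_train, and Y_train from the question (best_score_ is the mean validation score of the best candidate, and balanced_accuracy_score expects y_true first, then y_pred):

    from sklearn.metrics import balanced_accuracy_score

    # Mean validation score of the best parameter combination
    # (this is what GridSearchCV reports during the run)
    print('validation score', grid_search.best_score_)

    # Training score: the refitted best estimator evaluated on the
    # same data it was trained on
    print('training score',
          balanced_accuracy_score(Y_train, grid_search.predict(X_train)))

With your current settings the first number should be around 0.55 and the second around 0.96, which is exactly the gap you observed.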

You can try to change param_grid:

    param_grid = {
        'n_estimators': [100, 200],  # 500 seems high and might take too long for no reason
        'max_features': ['sqrt', 'log2', None],  # Fewer features can reduce overfitting
        'max_depth': [3, 4, 5, 6],  # Lower depth can reduce overfitting
        'max_leaf_nodes': [4, 6, 8],  # Lower max_leaf_nodes can reduce overfitting
        'min_impurity_decrease': [0, 0.02],
        'min_samples_split': [2],
        'min_samples_leaf': [5, 10, 20]  # Higher values can reduce overfitting
    }

Also, using cv=3 or cv=5 in GridSearchCV could help: more folds give a more reliable validation estimate than cv=2.
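Putting it together, a sketch of the tuned search (assuming the revised param_grid above; return_train_score=True makes GridSearchCV also record the training score of each fold, so you can compare it directly with the validation score):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(
        RandomForestClassifier(class_weight='balanced', random_state=1),
        param_grid=param_grid,
        scoring='balanced_accuracy',
        cv=5,                      # more folds than cv=2 -> more reliable estimate
        return_train_score=True,   # also record the training score of each fold
        n_jobs=-1,
    )
    grid_search.fit(X_train, Y_train)

    # Mean training vs validation score of the best candidate
    best = grid_search.best_index_
    print('mean train score     ', grid_search.cv_results_['mean_train_score'][best])
    print('mean validation score', grid_search.cv_results_['mean_test_score'][best])

If the two means stay close to each other, the overfitting is under control.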

See this post about solving Random Forest overfitting.

Mattravel