Variable names for my training and test data are X_train, X_test, Y_train, Y_test
I have ran a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model.
param_grid = {
'n_estimators': [500],
'max_features': ['sqrt', None],
'max_depth': [ 6 ],
'max_leaf_nodes': [8],
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2]
}
grid_search= GridSearchCV(RandomForestClassifier(
criterion='gini',
min_weight_fraction_leaf=0.0,
bootstrap=True,
n_jobs=-1,
random_state=1, verbose=0,
warm_start=False, class_weight='balanced',
ccp_alpha=0.0,
max_samples=None),
param_grid=param_grid,verbose=50,cv=2,n_jobs=-1,scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores that I can see while the gridseach is training are in the range of 0.4-.6 Following is the output of the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My questions is when I am manually calculating balanced_accuracy using
from sklearn.metrics import balanced_accuracy_score,
by running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train,adjusted=False))
, I am getting a value of about 0.96 which is very different from what the output of gridsearchcv is showing during the run. Why is this so? And what does the score in gridsearchcv mean then? Please note I have passed the parameter scoring = 'balanced_accuracy'
in gridsearchcv to make sure they calculate the same thing.