Different scores from cross_val_score() and accuracy_score() in sklearn

Question

I'm working with document classification.
I have about 14,000 (document, category) pairs in total, which I split into 10,000 training samples (x_train and y_train) and 4,000 test samples (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents; it was trained on x_train only (not on x_test).
Here is my code for Doc2Vec():

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils

total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
# sklearn.utils.shuffle returns a shuffled copy; it does not shuffle in place
total_data = utils.shuffle(total_data)
train_data = total_data[:10000]
test_data = total_data[10000:]

d2v = Doc2Vec(dm=0, vector_size=100, window=5,
              alpha=0.025, min_alpha=0.001, min_count=5,
              sample=0, workers=8, hs=0, negative=5)

d2v.build_vocab(train_data)
d2v.train(train_data,
          total_examples=len(train_data),
          epochs=10)

So x_train and x_test are vectors inferred from the trained Doc2Vec() model.
Then I applied SVC from sklearn.svm to them, as shown below.

import numpy as np  # needed for np.mean below
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score

clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))

The result I got:

[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642

I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
My reasoning is in the blockquote below:

When cross_val_score() runs, it performs cross-validation.
For each fold (assuming n_splits=10), 9/10 of the training set is used to train the classifier and the remaining 1/10 is used to validate it.
That means the held-out 1/10 is always new to the classifier, so in terms of newness there should be no difference between that 1/10 of the training set and the test set.
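
To make the fold mechanics I am describing concrete, here is a minimal plain-Python sketch. The helper `k_fold_indices` is hypothetical and only illustrates the idea; it is not how sklearn's KFold is actually implemented, and the sample count of 10000 is simply my training set size.

```python
import random


def k_fold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, valid_idx) pairs. Hypothetical helper, not sklearn's KFold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, like shuffle=True
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]                # held-out 1/10
        train_idx = indices[:start] + indices[start + size:]   # remaining 9/10
        start += size
        yield train_idx, valid_idx


folds = list(k_fold_indices(10000, n_splits=10))
for train_idx, valid_idx in folds:
    # The classifier never sees its validation samples during fitting.
    assert not set(train_idx) & set(valid_idx)
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 9000 1000
```

Note that this only covers the classifier: the sketch says nothing about whether the validation samples are new to an earlier step such as the vectorizer.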

Is there anything wrong with this reasoning?
Based on it, I cannot understand why the scores from cross_val_score() and accuracy_score() are so different.

Thanks in advance!!

EDIT: I realized that when I trained Doc2Vec() on both x_train and x_test, I got better scores, shown below:

[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414

Naturally this is better, but it made me realize that the problem was in the vectorization, not the classification.
As you can see, though, there is still a gap of more than 5 percentage points between validation and test accuracy.
I'm still wondering why this difference occurs, and I'm looking for ways to improve the Doc2Vec() model.

Hi, this probably means your model is overfitting. It is unclear to me where x_test comes from. Could you add this? — amdex, Apr 29 '20 at 09:13
@amdex Sure! I added the information you asked for, plus some other details. Can you check the edited version? Thx!! — sophia, Apr 29 '20 at 12:26
Hey, I saw your edits. I think you are right, and your doc2vec vectors just might not be representative enough to get good train/test transfer. Your better scores when you also fit the doc2vec model on the test data confirm this. — amdex, Apr 29 '20 at 12:29
@amdex I agree. So I tried to pretrain my `Doc2Vec()` following [this method](https://stackoverflow.com/a/39337595/10423945), but I got a poorer score. :( I think the corpus used for pretraining did not fit my data. Do you have any other suggestions to improve this? And do you think I need more data? Thanks for your advice!! — sophia, Apr 29 '20 at 12:43
