
I'm working on document classification.
I have about 14,000 (document + category) pairs in total, and I split them into 10,000 training samples (x_train and y_train) and 4,000 test samples (x_test and y_test).
I used gensim's Doc2Vec() to vectorize the documents, training it on x_train only (not on x_test).
Here is my code applying Doc2Vec():

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils

# prep_texts: list of tokenized documents, labels: their categories
total_data = [TaggedDocument(words=t, tags=[l]) for t, l in zip(prep_texts, labels)]
total_data = utils.shuffle(total_data)   # shuffle returns a new list, so reassign it
train_data = total_data[:10000]
test_data = total_data[10000:]

# PV-DBOW model (dm=0) with negative sampling
d2v = Doc2Vec(dm=0, vector_size=100, window=5,
              alpha=0.025, min_alpha=0.001, min_count=5,
              sample=0, workers=8, hs=0, negative=5)

d2v.build_vocab(train_data)
d2v.train(train_data,
          total_examples=len(train_data),
          epochs=10)

So x_train and x_test are the vectors inferred from the trained Doc2Vec() model.
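The inference step isn't shown above; roughly it looks like the following (a minimal sketch, the exact infer_vector parameters I used may differ):

import numpy as np

# Infer a fixed-size vector for every document with the trained Doc2Vec model,
# and pull the labels back out of the TaggedDocument tags.
x_train = np.array([d2v.infer_vector(d.words) for d in train_data])
y_train = np.array([d.tags[0] for d in train_data])
x_test = np.array([d2v.infer_vector(d.words) for d in test_data])
y_test = np.array([d.tags[0] for d in test_data])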
Then I applied sklearn.svm's SVC to these vectors as below.

from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score

clf = SVC()
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scoring = 'accuracy'
# 10-fold cross-validation on the training vectors only
score = cross_val_score(clf, x_train, y_train, cv=k_fold, n_jobs=8, scoring=scoring)
print(score)
print('Valid acc: {}'.format(round(np.mean(score)*100, 4)))
# fit on the full training set, then evaluate on the held-out test set
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Test acc: {}'.format(accuracy_score(y_test, y_pred)))

The result I got:

[0.916 0.912 0.908 0.923 0.901 0.922 0.921 0.908 0.923 0.924]
Valid acc: 91.58
Test acc: 0.6641691196146642

I am very confused about why I got such different scores from cross_val_score() and accuracy_score().
I'll write down my thinking below:

When running cross_val_score(), it performs cross-validation.
For each fold (with n_splits=10), 9/10 of the train set is used to train the classifier and the remaining 1/10 is used to validate it.
That means the validation 1/10 is always new to the model, so in terms of newness there should be no difference between that 1/10 of the train set and the test set.
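In other words, I understand cross_val_score() to be doing roughly the equivalent of this loop (just a sketch of my understanding, not sklearn's actual internals):

from sklearn.base import clone

fold_scores = []
for train_idx, valid_idx in k_fold.split(x_train):
    fold_clf = clone(clf)                                  # fresh, untrained copy per fold
    fold_clf.fit(x_train[train_idx], y_train[train_idx])   # train on 9/10 of the train set
    # the held-out 1/10 was never seen by fold_clf, just like x_test
    fold_scores.append(fold_clf.score(x_train[valid_idx], y_train[valid_idx]))

In this view, the held-out fold is just as unseen as x_test, which is why the gap surprises me.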

Is there anything wrong with my thinking?
Based on my current understanding, I cannot see why the scores from cross_val_score() and accuracy_score() are so different.

Thanks in advance!!

EDIT: I realized that when I trained Doc2Vec() on not only x_train but also x_test, I got better scores, like below:

[0.905 0.886 0.883 0.91 0.888 0.903 0.904 0.897 0.906 0.905]
Valid acc: 89.87
Test acc: 0.8413165640888414

Yes, it is natural that this gives better results, but it made me realize that the problem was not the classification but the vectorization.
Still, as you can see, there is about a 5% gap between validation and test accuracy.
Now I'm wondering why this gap occurs and looking for ways to improve the Doc2Vec() model.
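For completeness, the "trained with x_test too" setup I mean is roughly this (a sketch; the name d2v_all is just for illustration). Only the raw text of the test documents is used for the unsupervised vectorizer; the SVC still never sees y_test during training:

# Train Doc2Vec on all documents (unsupervised), then infer vectors as before.
d2v_all = Doc2Vec(dm=0, vector_size=100, window=5,
                  alpha=0.025, min_alpha=0.001, min_count=5,
                  sample=0, workers=8, hs=0, negative=5)
d2v_all.build_vocab(total_data)
d2v_all.train(total_data, total_examples=len(total_data), epochs=10)

x_train = np.array([d2v_all.infer_vector(d.words) for d in train_data])
x_test = np.array([d2v_all.infer_vector(d.words) for d in test_data])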

sophia
  • Hi, this probably means your model is overfitting. It is unclear to me where x_test comes from. Could you add this? – amdex Apr 29 '20 at 09:13
  • @amdex Sure! I added the content you mentioned and some more details. Can you check the edited version? Thx!! – sophia Apr 29 '20 at 12:26
  • Hey, I saw your edits. I think you are right, and your doc2vec vectors just might not be representative enough to get good train/test transfer. Your better scores when you fit the doc2vec model confirms this. – amdex Apr 29 '20 at 12:29
  • @amdex I agree. So I tried to pretrain my `Doc2Vec()` according to [this method](https://stackoverflow.com/a/39337595/10423945) but I got a poorer score.. :( I think the documents used for pretraining didn't fit my data. Do you have any other suggestions to improve this? And do you think I need more data? Thanks for your advice!! – sophia Apr 29 '20 at 12:43
