I have a dataset of 6000 observations; a sample of it is the following:

job_id      job_title                                           job_sector
30018141    Secondary Teaching Assistant                        Education
30006499    Legal Sales Assistant / Executive                   Sales
28661197    Private Client Practitioner                         Legal
28585608    Senior hydropower mechanical project manager        Engineering
28583146    Warehouse Stock Checker - Temp / Immediate Start    Transport & Logistics
28542478    Security Architect Contract                         IT & Telecoms

The goal is to predict the job sector of each row based on the job title.

Firstly, I apply some preprocessing on the job_title column:

import re

from nltk.stem import WordNetLemmatizer, PorterStemmer, LancasterStemmer, SnowballStemmer


def preprocess(document):
    # Note: only the Snowball stemmer is actually used below; the lemmatizer
    # and the other two stemmers are instantiated but never applied.
    lemmatizer = WordNetLemmatizer()
    stemmer_1 = PorterStemmer()
    stemmer_2 = LancasterStemmer()
    stemmer_3 = SnowballStemmer(language='english')

    # Remove all the special characters
    document = re.sub(r'\W', ' ', document)

    # remove all single characters
    document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r' +', ' ', document, flags=re.I)

    # Converting to lowercase
    document = document.lower()

    # Tokenisation
    document = document.split()

    # Stemming
    document = [stemmer_3.stem(word) for word in document]

    document = ' '.join(document)

    return document

import pandas as pd

df_first = pd.read_csv('../data.csv', keep_default_na=True)

df_first['job_title'] = df_first['job_title'].apply(preprocess)

Then I do the following with Gensim and Doc2Vec:

from sklearn.model_selection import train_test_split

X = df_first.loc[:, 'job_title'].values
y = df_first.loc[:, 'job_sector'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

model = Doc2Vec(vector_size=5, min_count=2, epochs=30)

# words must be a list of tokens and tags a list of labels; otherwise gensim
# iterates the raw strings character by character
training_set = [TaggedDocument(words=sentence.split(), tags=[tag]) for sentence, tag in zip(X_train.tolist(), y_train.tolist())]

model.build_vocab(training_set)

model.train(training_set, total_examples=model.corpus_count, epochs=model.epochs)   

test_set = [TaggedDocument(words=sentence.split(), tags=[tag]) for sentence, tag in zip(X_test.tolist(), y_test.tolist())]

predictors_train = []
for sentence in X_train.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.01)
    predictors_train.append(predictor.tolist())

predictors_test = []
for sentence in X_test.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.025)
    predictors_test.append(predictor.tolist())

from sklearn.svm import SVC

sv_classifier = SVC(kernel='linear', class_weight='balanced', decision_function_shape='ovr', random_state=0)
sv_classifier.fit(predictors_train, y_train)

score = sv_classifier.score(predictors_test, y_test)
print('accuracy: {}%'.format(round(score*100, 1)))

However, the result I am getting is 22% accuracy.

This makes me very suspicious, especially because by using the TfidfVectorizer instead of Doc2Vec (both with the same classifier) I am getting 88% accuracy (!).
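
For reference, the TF-IDF baseline I am comparing against looks roughly like this (a rough sketch; the exact vectorizer parameters may differ):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same preprocessed titles and the same SVC classifier as above; only the
# document representation changes.
tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)

sv_classifier.fit(tfidf_train, y_train)
print(sv_classifier.score(tfidf_test, y_test))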

Therefore, I guess that I must be doing something wrong in how I apply Gensim's Doc2Vec.

What is it and how can I fix it?

Or is it simply that my dataset is relatively small, while more advanced methods such as word embeddings require far more data?

Outcast

3 Answers


Beyond the 6000 rows, you don't mention the size of your dataset - total words, unique words, or unique classes. Doc2Vec works best with lots of data. Most published work trains on tens-of-thousands to millions of documents, of dozens to thousands of words each. (Your data appears to have only 3-5 words per document.)

Also, published work tends to train on data where every document has a unique-ID. It can sometimes make sense to use known-labels as tags instead of, or in addition to, unique-IDs. But it isn't necessarily a better approach. By using known-labels as the only tags, you're effectively only training one doc-vector per label. (It's essentially similar to concatenating all rows with the same tag into one document.)
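
For example, a quick sketch of what unique-ID tags could look like with this data, keeping the known label as an optional second tag (the row index serves as the unique ID):

# Sketch only: one unique tag per row, so the model learns a distinct
# doc-vector per document; the known label is an optional extra tag.
training_set = [
    TaggedDocument(words=title.split(), tags=[str(i), label])
    for i, (title, label) in enumerate(zip(X_train.tolist(), y_train.tolist()))
]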

You're inexplicably using fewer steps in inference (20) than epochs in training (30) - when in fact these are analogous values. In recent versions of gensim, inference will by default use the same number of epochs as the model was configured to use for training, and it's more common to use more epochs during inference than in training. (Also, you're inexplicably using different starting alpha values when inferring vectors for classifier-training (0.01) and for classifier-testing (0.025).)

But the main problem is likely your choice of tiny size=5 doc vectors. Whereas the TfidfVectorizer summarizes each row as a vector whose width equals the unique-word count - perhaps hundreds or thousands of dimensions - your Doc2Vec model summarizes each document as just 5 values. You've essentially lobotomized Doc2Vec. Usual values here are 100-1000, though if the dataset is tiny, smaller sizes may be required.
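
Putting those parameter points together, a hedged sketch of a less-crippled setup (the values are illustrative, not tuned):

# Illustrative values only: a larger vector_size, inference re-using the
# training epochs, and one starting alpha for both train- and test-time rows.
model = Doc2Vec(vector_size=100, min_count=2, epochs=30)
model.build_vocab(training_set)
model.train(training_set, total_examples=model.corpus_count, epochs=model.epochs)

def infer_all(sentences):
    return [model.infer_vector(doc_words=s.split(), steps=model.epochs, alpha=0.025).tolist()
            for s in sentences]

predictors_train = infer_all(X_train.tolist())
predictors_test = infer_all(X_test.tolist())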

Finally, the lemmatization/stemming may not be strictly necessary and may even be destructive. Lots of Word2Vec/Doc2Vec work doesn't bother to lemmatize/stem - often because there's plentiful data, with many appearances of all word forms.

These steps are most likely to help with smaller data, by making sure rarer word forms are combined with related longer forms to still get value from words that would otherwise be too rare to be retained (or get useful vectors).

But I can see many ways they might hurt in your domain. Manager and Management won't have exactly the same implications in this context, yet both could be stemmed to manag. Similarly, Security and Securities both become secur, and so on for other words. I'd only perform these steps if you can prove through evaluation that they're helping. (Are the words passed to the TfidfVectorizer being lemmatized/stemmed?)
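
You can check those collapses directly with the question's own Snowball stemmer:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language='english')
for word in ['manager', 'management', 'security', 'securities']:
    print(word, '->', stemmer.stem(word))   # both pairs collapse, per the examples above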

gojomo
  • Thanks a lot for your answer here and apologies for my belated response. I had seen that you are one of the primary contributors to questions with the `gensim` tag on StackOverflow and I have been looking forward to your response. Your points are absolutely to the point and very insightful in general about how Doc2Vec should be applied. I will take them into consideration. – Outcast Mar 29 '19 at 11:03
  • If we exclude the possibility that I made any trivial code mistake with the Gensim API (you saw my code and you did not mention anything like that), then I think the primary reason why my Doc2Vec model does not perform well is that I have a rather limited dataset (6000 observations), as you have also said. Apparently, the fact that I messed up the values of the hyperparameters a bit played some role too, but I do not think that this alone would make the accuracy jump from 22% to, let's say, 90%. – Outcast Mar 29 '19 at 11:05
  • I could easily imagine using a `size=5` alone driving accuracy arbitrarily low, compared to what could be achieved with a more appropriate dimensionality. The only way to know whether `Doc2Vec` is competitive, with all the issues I've mentioned addressed, would be to test it. And one of the few benefits of a small dataset is that fixes/parameter-adjustments can be checked quickly! So there's no need for guesswork about what might or might not create enough of an improvement. – gojomo Mar 29 '19 at 15:27
  • Hey, I am having this same issue. However, I noticed that lemmatization/stemming helps a lot for my tfidf models, but on my doc2vec I get better results without them. Does this make sense? In any case, my tfidf model still performs way better than my doc2vec. – Jorge A. Salazar Feb 24 '21 at 01:46
  • It's not surprising that stemming/lemmatization will sometimes help & sometimes hurt, depending on the other algorithms used. In some cases they might make important patterns (many related words) more starkly legible to some downstream classifiers; in others they might discard info (shades of related meaning) that other algorithms, with enough data, can also make use of. It's worth setting up a system where you can evaluate many options against each other. – gojomo Feb 24 '21 at 17:16
  • Also not surprised that there might be problems on which TFIDF outperforms Doc2Vec - I'd especially expect that if the data quantity is small, and/or a few exact keywords serve as individual strong signals that a text fits a certain category. The exact presence of individual keywords is necessarily lost by the Doc2Vec compression-into-a-dense-vector, whereas a 'wide/sparse' TFIDF representation retains more info (at the cost of a much larger representation). If such expansion is acceptable, you could also consider a mixed/concatenated representation, with N dims from Doc2Vec and V dims from TFIDF, as sketched below. – gojomo Feb 24 '21 at 17:20
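
A rough sketch of that mixed representation, assuming `tfidf` is a fitted TfidfVectorizer and `model` is the trained Doc2Vec from the question:

import numpy as np

# N dense dims from Doc2Vec concatenated with V (densified) dims from TF-IDF.
d2v_train = np.array([model.infer_vector(doc_words=s.split()) for s in X_train])
combined_train = np.hstack([d2v_train, tfidf.transform(X_train).toarray()])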

Training doc2vec/word2vec usually requires lots of generalised data (word2vec was trained on some 3 million Wikipedia articles). Since your doc2vec is performing poorly, consider experimenting with a pre-trained doc2vec model.

Or you can try using word2vec and averaging its word vectors over the entire document, since word2vec gives a vector for each word.
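
A minimal sketch of that averaging idea (parameter values are illustrative; older gensim versions use size= instead of vector_size=):

import numpy as np
from gensim.models import Word2Vec

sentences = [title.split() for title in X_train.tolist()]
w2v = Word2Vec(sentences, vector_size=100, min_count=1)

def average_vector(title):
    # Mean of the word vectors found in the vocabulary; zeros if none are.
    words = [w for w in title.split() if w in w2v.wv]
    return np.mean([w2v.wv[w] for w in words], axis=0) if words else np.zeros(100)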

Let me know if this helps.

raghava

The tools you are using are not suitable for classification. I'd suggest you look into something like a char-rnn.

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

This tutorial works on a similar problem, where it classifies names.
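
As a minimal PyTorch sketch of the shape of such a model (layer sizes are arbitrary, not taken from the tutorial):

import torch
import torch.nn as nn

class CharRNNClassifier(nn.Module):
    def __init__(self, n_chars, hidden_size, n_classes):
        super().__init__()
        self.rnn = nn.GRU(n_chars, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, x):        # x: (batch, seq_len, n_chars), one-hot characters
        _, h = self.rnn(x)       # final hidden state: (1, batch, hidden_size)
        return self.out(h[-1])   # class logits

model = CharRNNClassifier(n_chars=128, hidden_size=64, n_classes=28)
logits = model(torch.zeros(1, 20, 128))   # one dummy 20-character title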

shiredude95
  • Thank you for your answer. However, I am not sure about that, since there are quite a few posts on the Internet where people use Doc2Vec for classification. – Outcast Mar 23 '19 at 00:16
  • Is there any reason you are using a support vector classifier over a simple logistic regression classifier? You can do multiclass prediction using logistic regression as well. https://fzr72725.github.io/2018/01/14/genism-guide.html. This article does something similar. But bear in mind, for doc2vec to work, similar to word2vec, you need a decent-sized corpus, or else the vectors simply won't be good enough. You can play around with different dimensional sizes and see if that helps (see the rough sketch after this thread). – shiredude95 Mar 23 '19 at 03:37
  • The reason is that support vector classifiers usually outperform logistic regression, or are at least on a par with it. Your last point about having a relatively small corpus is something that I have in mind too, and it may be the reason why the results are poor in comparison with the `tf-idf` & `SVC` combination which I used. However, to be honest, I did not expect them to be that poor; and this signals to me that I may simply be doing something wrong in the way I am using `Gensim`. In this regard, if you have used `Gensim` and you can spot anything wrong in my code above, then let me know please. – Outcast Mar 23 '19 at 13:20
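
A rough sketch of the logistic-regression swap suggested in the comments, reusing the question's inferred vectors:

from sklearn.linear_model import LogisticRegression

# Drop-in replacement for the SVC; handles multiclass out of the box.
log_classifier = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=0)
log_classifier.fit(predictors_train, y_train)
print(log_classifier.score(predictors_test, y_test))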