
As part of a research project, I want to select the best combination of preprocessing techniques and textual features that optimize the results of a text classification task. For this, I am using Python 3.6.

There are a number of methods to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and test all the different (valid) possibilities using grid search for the ultimate feature combo.

My first step was to build a pipeline that looks like the following:

# Run a vectorizer with a predefined tweet tokenizer and a Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, preprocessor)
}

gs = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)

In this simple example, the vectorizer tokenizes the data using tweet_tokenizer and then tests which preprocessing option (None or a predefined function) yields better results.

This seems like a decent start, but I am now struggling to find a way to test all the different possibilities within the preprocessor function, defined below:

def preprocessor(tweet):
    # Data cleaning
    tweet = URL_remover(tweet) # Removing URLs
    tweet = mentions_remover(tweet) # Removing mentions
    tweet = email_remover(tweet) # Removing emails
    tweet = irrelev_chars_remover(tweet) # Removing invalid chars
    tweet = emojies_converter(tweet) # Translating emojies
    tweet = to_lowercase(tweet) # Converting words to lowercase
    # Others
    tweet = hashtag_decomposer(tweet) # Hashtag decomposition
    # Punctuation may only be removed after hashtag decomposition  
    # because it considers "#" as punctuation
    tweet = punct_remover(tweet) # Punctuation 
    return tweet

A "simple" solution to combine all the different processing techniques would be to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.) and set the grid parameter as follows:

parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...)
}

Although this would most likely work, it isn't a viable or reasonable solution for this task, especially since there are 2^n_features different combinations and, consequently, that many functions.
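To see the blow-up concretely, here is a quick sketch (assuming `procs` stands for the list of cleaning functions used in `preprocessor` above):

from itertools import combinations

procs = [URL_remover, mentions_remover, email_remover, to_lowercase]  # etc.

# One function would be needed per subset of procs: 2**len(procs) in total
all_combos = [c for r in range(len(procs) + 1) for c in combinations(procs, r)]
print(len(all_combos))  # 16 == 2**4; with all 8 cleaning steps, already 256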

The ultimate goal is to combine both preprocessing techniques and features in a pipeline in order to optimize the results of the classification using grid search:

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('feat_extractor', feat_extractor),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...),
    'feat_extractor': (None, func_A, func_B, func_C, ...)
}

Is there a simpler way to achieve this?

GRoutar
  • What is the `('feat_extractor' , feat_extractor)` after the `CountVectorizer` supposed to do? The pipeline will pass the data through `CountVectorizer`, then the new data (count-matrix, not words) will be passed to the `feat_extractor`. Is this what you want? Or do you want the `feat_extractor` to be included in `preprocessor` as you described above? – Vivek Kumar Dec 19 '18 at 06:55
  • @VivekKumar The feat_extractor should only work on the original raw text. I know it takes CountVectorizer's output but it was just a poor attempt at showing what I am trying to do. CountVectorizer only allows for a specific set of features (e.g. n-grams) and I want to perform more feature extraction (e.g. sentiment analysis) – GRoutar Dec 19 '18 at 12:08
  • If you want both of these things (`vectorizer` and `feat_extractor`) on the same data, `FeatureUnion` can help. So now about `func_A`, `func_B` etc.: Are they the same for both `vectorizer__preprocessor` and `feat_extractor`? – Vivek Kumar Dec 19 '18 at 12:23
  • No, they should be different functions. For vectorizer__preprocessor, funcA would combine preprocessing 1 and 2 (e.g. lowercase + emojies converter), while feat_extractor's func_A would combine features 1 and 2 or any other possible combination (e.g. n-grams + sentiment analysis + tweet length). The preprocessing functions would only contain combinations of data cleaning features, while the feature extraction functions would only contain combinations of textual features. – GRoutar Dec 19 '18 at 12:32
  • For the sake of simplicity, I want to combine a set of functions (that either preprocess or extract features from the data). For each possible combination of functions I would have a different pipeline joining the features (functions) together, and a classifier at the end to evaluate the performance of the combination, ultimately getting the combination that outputs the best result. Is this viable / understandable? – GRoutar Dec 19 '18 at 12:40
  • Yes. I will post an answer shortly, roughly describing it. – Vivek Kumar Dec 19 '18 at 12:59
  • I appreciate it. I've read about exhaustive feature selection, but it operates on each individual column, rather than on a feature as a whole. Combining 10k column features would probably take a week's worth of computing time. – GRoutar Dec 19 '18 at 13:03
  • You say that `"The feat_extractor should only work on the original raw text."` How are you planning to do sentiment analysis on raw data? Would this require you to train on sentiments of the data (which are different than the actual classes)? – Vivek Kumar Dec 19 '18 at 14:23
  • I haven't actually thought about that yet. That is a possibility but I will most likely use something like TextBlob and compute the sentiment such as "tweet.sentiment.polarity" – GRoutar Dec 19 '18 at 14:37

1 Answer


This solution is a rough sketch based on your description, and the specifics depend on the type of data used. Before making the pipeline, let's understand how the CountVectorizer works on the raw_documents that are passed to it. Essentially, this is the line that processes the string documents into tokens,

return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

which are then just counted and converted to a count matrix.

So what happens here is:

  1. decode: Just decides how to read the data from a file (if specified). Not of use to us, since we already have the data in a list.
  2. preprocess: Does the following if 'strip_accents' and 'lowercase' are True in CountVectorizer, else nothing:

    strip_accents(x.lower())
    

    Again, of no use to us, because we are moving the lowercase functionality into our own preprocessor and don't need to strip accents, since we already have the data as a list of strings.

  3. tokenize: Removes all punctuation, retains only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (an element of the list):

    lambda doc: token_pattern.findall(doc)
    

    This should be kept in mind. If you want to handle the punctuation and other symbols yourself (keeping some and removing others), then you should also change the default token_pattern='(?u)\b\w\w+\b' of CountVectorizer.

  4. _word_ngrams: This method will first remove the stop words (supplied as a parameter above) from the list of tokens from the previous step, and then calculate the n-grams as defined by the ngram_range param in CountVectorizer. This should also be kept in mind if you want to handle the n-grams your own way.

Note: If the analyzer is set to 'char', then the tokenize step will not be performed and n-grams will be made from characters.
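As a quick illustration of the tokenize step, here is a sketch of what the default pattern keeps and drops:

import re

# Default CountVectorizer token pattern: alphanumeric tokens of length >= 2
token_pattern = re.compile(r'(?u)\b\w\w+\b')
print(token_pattern.findall("Hi! I love #python :-)"))
# ['Hi', 'love', 'python'] - punctuation, 1-char words and emoticons are gone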

So now coming to our pipeline. This is the structure I am thinking can work here:

X --> combined_pipeline, Pipeline
            |
            |  Raw data is passed to Preprocessor
            |
            \/
         Preprocessor 
                 |
                 |  Cleaned data (still raw texts) is passed to FeatureUnion
                 |
                 \/
              FeatureUnion
                      |
                      |  Data is duplicated and passed to both parts
       _______________|__________________
      |                                  |
      |                                  |                         
      \/                                \/
   CountVectorizer                  FeatureExtractor
           |                                  |   
           |   Converts raw to                |   Extracts numerical features
           |   count-matrix                   |   from raw data
           \/________________________________\/
                             |
                             | FeatureUnion combines both the matrices
                             |
                             \/
                          Classifier

Now coming to code. This is what the pipeline looks like:

# Imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion, Pipeline

# Pipeline
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])

Where CustomPreprocessor and CustomFeatureExtractor are defined as:

from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True, 
                 remove_emails=True, remove_invalid_chars=True, 
                 convert_emojis=True, lowercase=True, 
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls=remove_urls
        self.remove_mentions=remove_mentions
        self.remove_emails=remove_emails
        self.remove_invalid_chars=remove_invalid_chars
        self.convert_emojis=convert_emojis
        self.lowercase=lowercase
        self.decompose_hashtags=decompose_hashtags
        self.remove_punctuations=remove_punctuations

    # You need to have all the cleaning functions ready
    # This method works on single tweets
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet) # Removing URLs

        if self.remove_mentions:
            tweet = mentions_remover(tweet) # Removing mentions

        if self.remove_emails:
            tweet = email_remover(tweet) # Removing emails

        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet) # Removing invalid chars

        if self.convert_emojis:
            tweet = emojies_converter(tweet) # Translating emojies

        if self.lowercase:
            tweet = to_lowercase(tweet) # Converting words to lowercase

        if self.decompose_hashtags:
            # Others
            tweet = hashtag_decomposer(tweet) # Hashtag decomposition

        # Punctuation may only be removed after hashtag decomposition  
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet) # Punctuation 

        return tweet

    def fit(self, raw_docs, y=None):
        # No-op - we don't learn anything from the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]

from textblob import TextBlob
import numpy as np
# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis=sentiment_analysis
        self.tweet_length=tweet_length

    # This method works on single tweets
    def extractor(self, tweet):
        features = []

        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)

        if self.tweet_length:
            features.append(len(tweet))

        # Do for other features you want.

        return np.array(features)

    def fit(self, raw_docs, y=None):
        # No-op - Again I am assuming that we don't learn anything from the data.
        # Definitely not for tweet length, and also not for sentiment analysis
        # or any other thing you might have here.
        return self

    def transform(self, raw_docs):
        # I am returning a numpy array so that the FeatureUnion can handle that correctly
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))
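As a quick sanity check, both transformers can be run on their own (the example tweet is made up, and the cleaned output depends on your cleaning functions):

prep = CustomPreprocessor()
cleaned = prep.fit_transform(["Check http://t.co/xyz NOW! #so #cool"])

extractor = CustomFeatureExtractor()
feats = extractor.fit_transform(cleaned)
print(feats.shape)  # (1, 2): one row per tweet, columns = [polarity, length]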

Finally, the parameter grid can now be done easily like:

param_grid = {'preprocessor__remove_urls':[True, False],
              'preprocessor__remove_mentions':[True, False],
              ...
              ...
              # No need to search for lowercase or preprocessor in CountVectorizer
              'features__vectorizer__max_df':[0.1, 0.2, 0.3],
              ...
              ...
              'features__extractor__sentiment_analysis':[True, False],
              'features__extractor__tweet_length':[True, False],
              ...
              ...
              'classifier__C':[0.01, 0.1, 1.0]
             }

The above setup avoids having to "create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just use True/False flags and GridSearchCV will handle the combinations.
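With the pipeline and grid in place, the search itself is the standard GridSearchCV call (a sketch; X_train and y_train stand for your raw tweets and class labels):

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)  # X_train: list of raw tweets, y_train: labels
print(gs.best_params_)    # the best-scoring combination of flags and params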

Update: If you don't want to have the CountVectorizer, then you can remove it from the pipeline and parameter grid, and the new pipeline will be:

pipe = Pipeline([('preprocessor', CustomPreprocessor()), 
                 ("extractor", CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])

Then make sure to implement all the functionalities you want in CustomFeatureExtractor. If that becomes too complex, you can always make simpler extractors and combine them in a FeatureUnion in place of the CountVectorizer, as sketched below.
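For example, a sketch of that last suggestion, where NgramExtractor is a hypothetical transformer you would implement yourself:

pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("ngrams", NgramExtractor()),  # hypothetical
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])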

Vivek Kumar
  • I like the architecture provided, but I fail to see how I could test different sets of features. A pipeline for each different possible iteration? Or tune the parameters of the gridsearch? e.g. 'featureExtractor__ngram': (None, MyNgram) – GRoutar Dec 19 '18 at 15:40
  • Also, instead of using the prebuilt CountVectorizer, I thought about implementing its functions myself for more flexibility. The only CountVectorizer functionalities I need are pretty much n-grams, tokenizer and lowercase, which I have already implemented myself anyway. Perhaps by doing so I would be able to use only the FeatureExtractor in the pipeline? – GRoutar Dec 19 '18 at 15:43
  • Majestic answer, I like it a lot. Would you recommend keeping CountVectorizer versus coding my own functions, aside from complexity issues? – GRoutar Dec 19 '18 at 15:54
  • @Khabz As I described above, the only thing that `CountVectorizer` is handling that you are not doing yourself is the tokenization and n-gram feature generation. You can try your own methods for that. – Vivek Kumar Dec 19 '18 at 15:59
  • @Khabz Also remember that when you do it your own way, you will need to do the learning part in `fit()` because you need to store found features, and then in `transform()` use those features to transform the new data. – Vivek Kumar Dec 19 '18 at 16:13
  • Are the fit+transform equivalent to extracting the features and one hot encoding (or whatever) into a matrix? – GRoutar Dec 19 '18 at 17:09
  • @Khabz `fit()` is only for learning about the data (like the words or n-grams you would like to make into features in text, or finding the max and min of numeric data). You store these kinds of values during `fit`. Then during `transform()` you use those learnt values to change the data meaningfully (like finding the same words or n-grams in the new data and making features from them). – Vivek Kumar Dec 20 '18 at 06:25
  • Could you please provide an example on how you would implement a ngram extractor? – GRoutar Dec 24 '18 at 17:43
  • @Khabz That depends on what you want. Do you want to extract all ngrams from the text of the requested size or only those which were seen during training? What will you do after that? – Vivek Kumar Dec 26 '18 at 09:41
  • My goal would be to collect only the training data n-grams and turn them into features (1 - contains n-gram, 0 - doesn't). – GRoutar Dec 26 '18 at 21:32
  • Well, the CountVectorizer does just that. Check out the `binary` parameter on the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). You can consider my earlier setup containing the `CountVectorizer` for doing it. – Vivek Kumar Dec 27 '18 at 07:01
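For reference, the binary n-gram setup mentioned in the last comment would look roughly like this (a minimal sketch; train_tweets and test_tweets are placeholders):

from sklearn.feature_extraction.text import CountVectorizer

# binary=True yields 1/0 "contains n-gram" features instead of counts;
# the n-gram vocabulary is learnt from the training data only, in fit().
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X_train_ngrams = vectorizer.fit_transform(train_tweets)  # learn vocabulary + transform
X_test_ngrams = vectorizer.transform(test_tweets)        # reuse training vocabulary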