This solution is rough, based on your description, and the specifics depend on the type of data used. Before building the pipeline, let's understand how the CountVectorizer works on the raw_documents that are passed to it. Essentially, this is the line that processes the string documents into tokens,

return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

which are then counted and converted into the count matrix.
So what happens here is:

decode: Just decides how to read the data from a file (if a file input is specified). Of no use to us, since we already have the data in a list.

preprocess: Does the following if 'strip_accents' and 'lowercase' are True in CountVectorizer, else nothing:

strip_accents(x.lower())

Again of no use, because we are moving the lowercase functionality into our own preprocessor and don't need to strip accents, since we already have the data as a list of strings.

tokenize: Removes all punctuation, retains only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (one element of the list):

lambda doc: token_pattern.findall(doc)

This should be kept in mind. If you want to handle punctuation and other symbols yourself (keeping some and removing others), then you should also change the default token_pattern=r'(?u)\b\w\w+\b' of CountVectorizer (a short demonstration follows the note below).

_word_ngrams: First removes the stop words (supplied as a parameter above) from the token list produced by the previous step, then builds the n-grams defined by the ngram_range param of CountVectorizer. Keep this in mind too if you want to handle the n-grams your way.

Note: If the analyzer is set to 'char', the tokenize step is not performed and n-grams are made from characters.
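To see these steps in action, here is a small sketch using build_analyzer(), which chains decode, preprocess, tokenize and _word_ngrams exactly as described above (the example strings are made up):

from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer: lowercase, tokenize with r'(?u)\b\w\w+\b', then build n-grams
analyzer = CountVectorizer().build_analyzer()
print(analyzer("Don't STOP me now!"))
# ['don', 'stop', 'me', 'now']  <- 't' dropped (length < 2), punctuation removed

# Custom token_pattern that also keeps single-character tokens
analyzer = CountVectorizer(token_pattern=r'(?u)\b\w+\b').build_analyzer()
print(analyzer("Don't STOP me now!"))
# ['don', 't', 'stop', 'me', 'now']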
Now, coming to our pipeline. This is the structure I think can work here:
X --> combined_pipeline, Pipeline
|
| Raw data is passed to Preprocessor
|
\/
Preprocessor
|
| Cleaned data (still raw texts) is passed to FeatureUnion
|
\/
FeatureUnion
|
| Data is duplicated and passed to both parts
_______________|__________________
| |
| |
\/ \/
CountVectorizer FeatureExtractor
| |
| Converts raw to | Extracts numerical features
| count-matrix | from raw data
\/________________________________\/
|
| FeatureUnion combines both the matrices
|
\/
Classifier
Now for the code. This is what the pipeline looks like:
# Imports
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# Pipeline
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                           ])),
                 ('classifier', SVC())
                ])
where CustomPreprocessor and CustomFeatureExtractor are defined as:
from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True,
                 remove_emails=True, remove_invalid_chars=True,
                 convert_emojis=True, lowercase=True,
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls = remove_urls
        self.remove_mentions = remove_mentions
        self.remove_emails = remove_emails
        self.remove_invalid_chars = remove_invalid_chars
        self.convert_emojis = convert_emojis
        self.lowercase = lowercase
        self.decompose_hashtags = decompose_hashtags
        self.remove_punctuations = remove_punctuations

    # You need to have all the cleaning functions ready.
    # This method works on a single tweet.
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet)            # Removing URLs
        if self.remove_mentions:
            tweet = mentions_remover(tweet)       # Removing mentions
        if self.remove_emails:
            tweet = email_remover(tweet)          # Removing emails
        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet)  # Removing invalid chars
        if self.convert_emojis:
            tweet = emojies_converter(tweet)      # Translating emojis
        if self.lowercase:
            tweet = to_lowercase(tweet)           # Converting words to lowercase

        # Others
        if self.decompose_hashtags:
            tweet = hashtag_decomposer(tweet)     # Hashtag decomposition
        # Punctuation may only be removed after hashtag decomposition
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet)          # Punctuation
        return tweet

    def fit(self, raw_docs, y=None):
        # No-op - we don't learn anything from the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]
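The cleaning helpers themselves (URL_remover, mentions_remover, etc.) are assumed to exist; as an illustration, a minimal regex-based URL_remover could look like this (the pattern is a simple assumption, not an exhaustive URL matcher):

import re

def URL_remover(tweet):
    # Drop http(s):// and www. links; deliberately simple, not exhaustive
    return re.sub(r'(https?://\S+|www\.\S+)', '', tweet)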
from textblob import TextBlob
import numpy as np

# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis = sentiment_analysis
        self.tweet_length = tweet_length

    # This method works on a single tweet
    def extractor(self, tweet):
        features = []
        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)
        if self.tweet_length:
            features.append(len(tweet))
        # Do the same for other features you want.
        return np.array(features)

    def fit(self, raw_docs, y=None):
        # No-op - again, I am assuming we don't learn anything from the data:
        # definitely not for tweet length, and also not for sentiment analysis
        # or anything else you might have here.
        return self

    def transform(self, raw_docs):
        # Return a 2-D numpy array so that the FeatureUnion can stack it correctly
        return np.vstack([self.extractor(tweet) for tweet in raw_docs])
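As a quick sanity check, the assembled pipe behaves like any other estimator (a sketch with made-up tweets and labels, assuming the cleaning helpers above are implemented):

tweets = ["Loving the new update! #awesome http://t.co/xyz",
          "@support this is the worst service ever :("]
labels = [1, 0]

pipe.fit(tweets, labels)
print(pipe.predict(["Feeling great today #happy"]))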
Finally, the parameter grid can now be set up easily, like:
param_grid = {'preprocessor__remove_urls': [True, False],
              'preprocessor__remove_mentions': [True, False],
              ...
              ...
              # No need to search over lowercase or preprocessor in CountVectorizer
              'features__vectorizer__max_df': [0.1, 0.2, 0.3],
              ...
              ...
              'features__extractor__sentiment_analysis': [True, False],
              'features__extractor__tweet_length': [True, False],
              ...
              ...
              'classifier__C': [0.01, 0.1, 1.0]
             }
The above setup avoids having "to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just use True/False and GridSearchCV will handle the combinations.
Update:
If you don't want to have the CountVectorizer, you can remove it from the pipeline and the parameter grid, and the new pipeline will be:
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('extractor', CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])
Then make sure to implement all the functionality you want in CustomFeatureExtractor. If that becomes too complex, you can always write simpler extractors and combine them in a FeatureUnion in place of the CountVectorizer, for example:
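(A sketch; SentimentExtractor and LengthExtractor are hypothetical single-purpose transformers, each written the same way as CustomFeatureExtractor above.)

pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([('sentiment', SentimentExtractor()),
                                            ('length', LengthExtractor())
                                           ])),
                 ('classifier', SVC())
                ])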