I'm trying to build a data preprocessing pipeline with sklearn Pipeline and ColumnTransformer. The preprocessing steps impute missing values and apply a transformation (power transform, scaling or OHE) to specific columns in parallel. This preprocessing ColumnTransformer works perfectly.
However, after doing some analysis on the preprocessed result, I decided to exclude some columns from the final result. My goal is to have one pipeline starting from the original dataframe that imputes and transforms values, excludes pre-selected columns, and triggers the model fitting, all in one. To be clear, I don't want to drop columns after the pipeline is fitted/transformed. Instead, I want the dropping of columns to be part of the column transformation.
It's easy to remove the numerical columns from the model (by simply not adding them), but how can I exclude the columns created by OHE? I don't want to exclude all columns created by OHE, just some of them. For example, if categorical column "Example" becomes Example_1, Example_2, and Example_3, how can I exclude only Example_2?
Example code:
### Importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Older scikit-learn versions (before 1.1) have no SimpleImputer.get_feature_names_out, so we add it manually.
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
### Dummy dataframe
df_foo = pd.DataFrame({'Num_col': [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example': ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col': range(10, 100, 10)})
### Pipelines
SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])
### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
        ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
        ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)
preprocessor_transformer
### Preprocessing dummy dataframe
df_foo = pd.DataFrame(preprocessor_transformer.fit_transform(df_foo),
                      columns=preprocessor_transformer.get_feature_names_out()
                      )
print(df_foo)
Finally, I've seen this solution out there (Adding Dropping Column instance into a Pipeline), but I didn't manage to make the custom columnDropperTransformer work in my case. Adding the columnDropperTransformer to my pipeline raises a ValueError: A given column is not a column of the dataframe, referring to column "Example" not existing in the dataframe anymore.
class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)
    def fit(self, X, y=None):
        return self
processor = make_pipeline(preprocessor_transformer, columnDropperTransformer([]))
processor.fit_transform(df_foo)
Any suggestions?