I have a lab assignment on data preprocessing, and I am trying to use ColumnTransformer with the pipeline syntax. My code is below.

preprocess = ColumnTransformer(
                    [('imp_mean', SimpleImputer(strategy='mean'), numerics_cols),
                     ('imp_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
                     #('stander', StandardScaler(), fewer_cols_train_X_df.columns)
                    ])

After I run this code and call the pipeline, the result is:

       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],

You can see that the categorical column is still in the result. I tried to drop it, but it is still there. I want to remove the categorical column from this result so that I can run StandardScaler. I don't understand why it doesn't work. Thank you for reading.

Huy Huy
  • Does this answer your question? [Apply multiple preprocessing steps to a column in sklearn pipeline](https://stackoverflow.com/questions/65554163/apply-multiple-preprocessing-steps-to-a-column-in-sklearn-pipeline) See also https://stackoverflow.com/q/67250392/10495893 – Ben Reiniger Dec 22 '21 at 19:32

1 Answer

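ColumnTransformer does not chain its transformers. Each transformer is fitted on the original columns assigned to it, independently of the others, and the outputs are concatenated side by side.

Therefore, in your example, the categorical columns appear twice in the result: once imputed but still as strings (from imp_mode), and once one-hot encoded from the raw, un-imputed values (from onehot). The imputed values are never passed to the encoder.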
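A minimal sketch of the situation in the question, assuming a hypothetical two-column DataFrame (the data and the column names gender and age are made up for illustration, and sparse_threshold=0 is only there to force a dense array). Listing the same column under two transformers yields both outputs next to each other, rather than one feeding into the other:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the data from the question
X = pd.DataFrame({'gender': ['male', 'female', 'male'],
                  'age': [22.0, 38.0, 26.0]})

# The same column is listed under two separate transformers
flat = ColumnTransformer(
    [
        ('imp_mode', SimpleImputer(strategy='most_frequent'), ['gender']),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['gender']),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = flat.fit_transform(X)
# out has shape (3, 3): the imputed string column from imp_mode,
# followed by the two one-hot columns from onehot
print(out)
```

The string column survives because the imputer returns it unchanged, and the column transformer simply stacks that output next to the encoder's output.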

To apply both operations (imputing and then one-hot encoding) to the same columns, put these steps in a Pipeline so that they run sequentially.

The example below illustrates how to apply different preprocessing to numerical and categorical features.

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'gender' : ['male', 'male', 'female'],
                 'A' : [1, 10 , 20],
                 'B' : [1, 150 , 20]})

categorical_preprocessing = Pipeline(
[
    ('imp_mode', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

numerical_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

preprocessing = ColumnTransformer(
                    [
                        ('categorical', categorical_preprocessing,
                         make_column_selector(dtype_include=object)),
                        ('numerical', numerical_preprocessing,
                         make_column_selector(dtype_include=np.number)),
                    ])

preprocessing.fit_transform(X)

Output:

array([[ 0.        ,  1.        , -1.20270298, -0.84570663],
       [ 0.        ,  1.        , -0.04295368,  1.40447708],
       [ 1.        ,  0.        ,  1.24565666, -0.55877045]])
Antoine Dubuis
  • Thank you! It is working for me. Hmm, how can I create the ColumnTransformer before the pipeline? Is it possible? – Huy Huy Dec 22 '21 at 08:52
  • Well, the `ColumnTransformer` is simply defined before the pipeline and then added to it as a step. – Antoine Dubuis Dec 22 '21 at 08:59
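
To illustrate the last comment, here is a minimal sketch of a ColumnTransformer defined first and then added as a step of an outer Pipeline. The labels y, the extra data row, the explicit column lists and the choice of LogisticRegression are assumptions for illustration only; any estimator could take its place:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data and labels
X = pd.DataFrame({'gender': ['male', 'male', 'female', 'female'],
                  'A': [1, 10, 20, 30],
                  'B': [1, 150, 20, 5]})
y = [0, 1, 0, 1]

# 1) Define the column-wise preprocessing first
preprocessing = ColumnTransformer([
    ('categorical', Pipeline([
        ('imp_mode', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['gender']),
    ('numerical', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['A', 'B']),
])

# 2) Then add it as the first step of the outer pipeline
model = Pipeline([
    ('preprocessing', preprocessing),
    ('classifier', LogisticRegression()),  # assumed estimator
])

model.fit(X, y)
pred = model.predict(X)
```

Calling fit on the outer pipeline fits the preprocessing and the estimator together, so the same transformations are reapplied automatically at predict time.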