Apply multiple preprocessing steps to a column in sklearn pipeline

Question

I am trying sklearn pipelines for the first time, using the Titanic dataset. I want to first impute the missing values in Embarked and then one-hot encode it, while for the Sex attribute I only want one-hot encoding. So I have the steps below, two of which apply to Embarked. But it is not working as expected: the original Embarked column remains in the output in addition to its one-hot encoding (the output still has a column containing 'S').

If I do the imputation and the one-hot encoding for Embarked in a single step, it works as expected.

What is the reason behind this, or am I doing something wrong? I couldn't find any information about this.

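import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step 1: fill missing Embarked values with the constant 'S'
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
    ("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
#     ("one_hot", OneHotEncoder(sparse=False))
])
# Step 2: one-hot encode Embarked and Sex
# (scikit-learn >= 1.2 renames sparse to sparse_output)
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
    ("one_hot", OneHotEncoder(sparse=False))
])
preprocesor = ColumnTransformer([
    ("cat_impute", categorical_impute, categorical_cols_impute),
    ("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
    ("preprocessor", preprocesor),
#     ("model", RandomForestClassifier(random_state=0))
])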

score 2 · Accepted Answer · answered Jan 04 '21 at 03:17

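The transformers in a ColumnTransformer are applied in parallel, not sequentially: each one receives its columns from the original input, not from the output of the transformer listed before it. So in your example, Embarked ends up in your transformed data twice: once from the first transformer, imputed but still a string column, and again from the second transformer, this time one-hot encoded (but, as far as I can tell, not imputed first).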
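To see the parallel behaviour on a small scale, here is a minimal sketch. The `toy` DataFrame holds made-up values standing in for the Titanic columns, and `dense_one_hot` is a hypothetical helper, not part of scikit-learn, that only papers over the `sparse` / `sparse_output` parameter rename.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def dense_one_hot():
    # Hypothetical helper: scikit-learn >= 1.2 calls the parameter
    # sparse_output, older releases call it sparse.
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)


# Made-up stand-in for the Titanic columns, with one missing Embarked value.
toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# Embarked is listed under both transformers, as in the question.
overlapping = ColumnTransformer([
    ("cat_impute", SimpleImputer(strategy="constant", fill_value="S"), ["Embarked"]),
    ("cat_one_hot", dense_one_hot(), ["Embarked", "Sex"]),
])

out = overlapping.fit_transform(toy)

# The first output column is the imputed Embarked, still as strings;
# the remaining columns are the one-hot block built from the raw input.
print(out[:, 0])
print(out.shape)
```

The first column comes back as imputed strings, and the one-hot columns are appended next to it rather than replacing it, which matches the leftover 'S' column described in the question.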

So just uncomment the second step in the Embarked pipeline (categorical_impute), and remove Embarked from categorical_cols so that each column is handled by exactly one transformer.
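A minimal sketch of that fix, under a few assumptions: the DataFrame `df` uses made-up values (including an Age column just to exercise `remainder="passthrough"`), the names `embarked_pipe` and `sex_pipe` are mine, and `dense_one_hot` is a hypothetical helper that only handles the `sparse` / `sparse_output` rename between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def dense_one_hot():
    # Hypothetical helper: scikit-learn >= 1.2 calls the parameter
    # sparse_output, older releases call it sparse.
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)


# Made-up stand-in for the Titanic data, with one missing Embarked value.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Sequential steps for one column belong inside a single Pipeline.
embarked_pipe = Pipeline([
    ("mode_impute", SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="S")),
    ("one_hot", dense_one_hot()),
])
sex_pipe = Pipeline([
    ("one_hot", dense_one_hot()),
])

# Each column now appears under exactly one transformer.
preprocessor = ColumnTransformer([
    ("embarked", embarked_pipe, ["Embarked"]),
    ("sex", sex_pipe, ["Sex"]),
], remainder="passthrough")

transformed = preprocessor.fit_transform(df)
print(transformed.shape)
print(transformed)
```

Within a Pipeline the steps do run one after another, so the encoder only ever sees the imputed Embarked values, and no string column is left in the result.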

See also Consistent ColumnTransformer for intersecting lists of columns (but I don't think it's quite a duplicate).

Thanks for including the link. It clears up further doubts. – ggaurav, Jan 04 '21 at 04:46
