I am trying an sklearn pipeline for the first time, using the Titanic dataset. I want to first impute the missing values in Embarked and then one-hot encode it. For the Sex attribute, I only want one-hot encoding. So I have the steps below, two of which apply to Embarked. But this does not work as expected: the Embarked column remains in the result in addition to its one-hot encoding, as shown in the output (the column containing 'S').

If I do the imputation and one-hot encoding for Embarked in a single step, it works as expected.

What is the reason for this, or am I doing something wrong? I couldn't find any information about this.

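import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols_impute = ['Embarked']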
categorical_impute = Pipeline([
    ("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
#     ("one_hot", OneHotEncoder(sparse=False))
])
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
    ("one_hot", OneHotEncoder(sparse=False))
])
preprocesor = ColumnTransformer([
    ("cat_impute", categorical_impute, categorical_cols_impute),
    ("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
    ("preprocessor", preprocesor),
#     ("model", RandomForestClassifier(random_state=0))
])

[Image: transformed output, showing the extra Embarked column containing 'S']

ggaurav

1 Answer

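ColumnTransformer transformers are applied in parallel, not sequentially: each one receives the original input columns, not the output of the transformer listed before it. So in your example, Embarked ends up in your transformed data twice: once from the first transformer, imputed but still a string column, and again from the second transformer, this time one-hot encoded (but, as far as I can tell, computed from the original column, so not imputed first).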
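To make the parallel behaviour concrete, here is a minimal sketch on a made-up one-column frame (the data and the "raw_copy" name are illustrative, not from the question). Both transformers select Embarked; the second is a plain "passthrough", so whatever it emits is exactly what it received.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy stand-in for the Titanic column (made-up values, one missing entry).
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Two transformers that both select Embarked.
parallel = ColumnTransformer([
    ("cat_impute", SimpleImputer(strategy="constant", fill_value="S"), ["Embarked"]),
    ("raw_copy", "passthrough", ["Embarked"]),  # hypothetical name, for illustration
])

out = parallel.fit_transform(df)
print(out)  # column 0: imputed Embarked, column 1: Embarked as it was passed in
```

If the transformers were chained, the second column would already be imputed. Instead the missing value should still be present there, which suggests that the second transformer saw the original data.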

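So just uncomment the "one_hot" step in the Embarked pipeline (categorical_impute), and remove Embarked from categorical_cols. That way imputation and encoding of Embarked run sequentially inside one pipeline, and each column is handled by exactly one transformer.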
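A sketch of that arrangement, assuming a small made-up frame in place of the real Titanic data; the names embarked_pipe, sex_pipe and preprocessor are my own. I pass sparse_threshold=0 to the ColumnTransformer to get a dense array back, instead of relying on the encoder's sparse argument, which has been renamed across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic data (made-up values).
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Embarked: impute, then encode, sequentially inside one Pipeline.
embarked_pipe = Pipeline([
    ("mode_impute", SimpleImputer(strategy="constant", fill_value="S")),
    ("one_hot", OneHotEncoder()),
])
# Sex: encode only.
sex_pipe = Pipeline([("one_hot", OneHotEncoder())])

# Each column appears in exactly one transformer.
preprocessor = ColumnTransformer([
    ("embarked", embarked_pipe, ["Embarked"]),
    ("sex", sex_pipe, ["Sex"]),
], remainder="passthrough", sparse_threshold=0)  # 0 forces a dense result

out = preprocessor.fit_transform(df)
print(out.shape)
```

With this toy data the result has three Embarked indicator columns, two Sex indicator columns and the passed-through Age column, with no leftover string column.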

See also "Consistent ColumnTransformer for intersecting lists of columns" (but I don't think it's quite a duplicate).

Ben Reiniger