I'm going through the Jupyter notebooks from the book Hands-On ML with Scikit-Learn. I'm attempting the Titanic challenge, but using a ColumnTransformer.
I'm building the pre-processing pipeline. For the numerical attributes the ColumnTransformer produces the right output, but for the categorical attributes I'm getting a weird result.
Here's the code:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

cat_attr = ['Sex', 'Embarked', 'Pclass']
cat_pipeline = ColumnTransformer([
    ('cat_fill_missing', SimpleImputer(strategy='most_frequent'), cat_attr),
    ('cat_encoder', OneHotEncoder(sparse=False), cat_attr),
])
cat_pipeline.fit_transform(train_data)
This produces:
array([['male', 'S', 3, ..., 0.0, 0.0, 1.0],
       ['female', 'C', 1, ..., 1.0, 0.0, 0.0],
       ['female', 'S', 3, ..., 0.0, 0.0, 1.0],
       ...,
       ['female', 'S', 3, ..., 0.0, 0.0, 1.0],
       ['male', 'C', 1, ..., 1.0, 0.0, 0.0],
       ['male', 'Q', 3, ..., 0.0, 0.0, 1.0]], dtype=object)
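The first three columns look like the raw imputed values and the remaining columns like the one-hot encoding, as if the two outputs were simply concatenated side by side. A quick check of the shape seems to confirm this (the numbers below assume the standard Kaggle train.csv with 891 rows; the column split is my guess):

out = cat_pipeline.fit_transform(train_data)
print(out.shape)   # (891, 11): 3 imputed columns + 8 one-hot columns (2 Sex + 3 Embarked + 3 Pclass)
print(out[0, :3])  # ['male' 'S' 3] -- the untouched strings from the imputer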
However, if I run the SimpleImputer and the OneHotEncoder one after the other:
imputer = SimpleImputer(strategy='most_frequent')
filled_df = imputer.fit_transform(train_data[cat_attr])  # impute first
onehot = OneHotEncoder(sparse=False)
onehot.fit_transform(filled_df)                          # then encode the imputed result
I get the correct encoding:
array([[0., 1., 0., ..., 0., 0., 1.],
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.]])
What's the reason behind this behaviour? I thought the ColumnTransformer applied its transformers to each column one after the other.
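If the ColumnTransformer really does run its transformers in parallel rather than in sequence, I'm guessing the fix is to chain the two steps in a Pipeline and hand that single pipeline to the ColumnTransformer. A sketch of what I have in mind (same cat_attr and train_data as above; only the Pipeline import is new):

from sklearn.pipeline import Pipeline

cat_pipeline = ColumnTransformer([
    # the Pipeline runs its steps in sequence: impute, then encode
    ('cat', Pipeline([
        ('fill_missing', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(sparse=False)),
    ]), cat_attr),
])
cat_pipeline.fit_transform(train_data)

Is that the intended pattern, or is there a way to make two transformers run in sequence inside a single ColumnTransformer?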