I want to construct a scikit-learn pipeline in which some columns first have missing values imputed, and scaling is then applied to a subset of them. If I put both operations in the same ColumnTransformer, this does not work because the transformers run in parallel, so the scaler still receives the missing values and fails. If I instead build two ColumnTransformers and run them in series, I run into the problem that I cannot select columns by name in the second one, because the output of the first is a NumPy array. What is the correct way to go about this?
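For concreteness, this is roughly what the single-ColumnTransformer attempt looks like (illustrative only, using the same numeric_columns / cat_columns lists defined in the full code below):

# Sketch of the single-ColumnTransformer attempt: all three branches are
# applied to the *original* columns in parallel, so the scaler operates on the
# un-imputed numeric data (and the numeric columns also end up duplicated in
# the output).
broken_preprocessor = ColumnTransformer(
    [('impute_num', SimpleImputer(strategy='mean'), numeric_columns),
     ('scale_num', StandardScaler(), numeric_columns),
     ('impute_cat', SimpleImputer(strategy='most_frequent'), cat_columns)],
    remainder='passthrough'
)

My actual two-ColumnTransformer attempt is below.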
from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import PLSRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Column groups by dtype
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns) + list(X.select_dtypes('int64').columns)
# Imputation
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(strategy='most_frequent')
imputer = ColumnTransformer(
[('Imput_mean', imp_mean, numeric_columns),
('Imput_freq', imp_freq, cat_columns),
], remainder='passthrough'
)
# Scaling
feature_transformer = ColumnTransformer(
    [('num', StandardScaler(), numeric_columns),
], remainder='passthrough'
)
#Hyperparameters
parameters = {'model__n_components':[1,2,3,4,5]}
#Pipeline
pipeline = Pipeline([('imputer', imputer),
('feature_transformer', feature_transformer),
('model', PLSRegression())])
#Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)
# Cross-validate and evaluate
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=10)
scores = cross_val_score(clf, X, y, cv=cv, scoring="r2")
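The closest workaround I can see is to select by integer position in the second ColumnTransformer, relying on the fact that the imputer above emits the mean-imputed numeric columns first, but that feels brittle (untested sketch):

# Possible workaround: since the first ColumnTransformer outputs the imputed
# numeric columns first, select them by position in the second one.
feature_transformer = ColumnTransformer(
    [('num', StandardScaler(), list(range(len(numeric_columns))))],
    remainder='passthrough'
)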