I want to construct a scikit-learn pipeline in which some columns first have missing values imputed, and scaling is then applied to a subset of them. If I put both operations in the same ColumnTransformer, this does not work because the transformers run in parallel, so the scaler still receives the missing values and fails. If I instead build two ColumnTransformers and run them in series, I run into the problem that I cannot select columns by name in the second one, because the output of the first is a NumPy array. What is the correct way to go about this?
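For concreteness, this is roughly what the single-ColumnTransformer attempt looks like (illustrative only, using the same numeric_columns / cat_columns lists defined in the full code below):

# Sketch of the single-ColumnTransformer attempt: all three branches are
# applied to the *original* columns in parallel, so the scaler operates on the
# un-imputed numeric data (and the numeric columns also end up duplicated in
# the output).
broken_preprocessor = ColumnTransformer(
    [('impute_num', SimpleImputer(strategy='mean'), numeric_columns),
     ('scale_num', StandardScaler(), numeric_columns),
     ('impute_cat', SimpleImputer(strategy='most_frequent'), cat_columns)],
    remainder='passthrough'
)

My actual two-ColumnTransformer attempt is below.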
from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import PLSRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Column groups by dtype
numeric_columns = list(X.select_dtypes('float64').columns)
cat_columns = list(X.select_dtypes('object').columns) + list(X.select_dtypes('int64').columns)
# Imputation
imp_mean = SimpleImputer(strategy='mean')
imp_freq = SimpleImputer(strategy='most_frequent')
imputer = ColumnTransformer(
[('Imput_mean', imp_mean, numeric_columns),
('Imput_freq', imp_freq, cat_columns),
], remainder='passthrough'
)
# Scaling
feature_transformer = ColumnTransformer(
    [('num', StandardScaler(), numeric_columns),
], remainder='passthrough'
)
#Hyperparameters
parameters = {'model__n_components':[1,2,3,4,5]}
#Pipeline
pipeline = Pipeline([('imputer', imputer),
('feature_transformer', feature_transformer),
('model', PLSRegression())])
#Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)
# Cross-validate and evaluate
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=10)
scores = cross_val_score(clf, X, y, cv=cv, scoring="r2")
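The closest workaround I can see is to select by integer position in the second ColumnTransformer, relying on the fact that the imputer above emits the mean-imputed numeric columns first, but that feels brittle (untested sketch):

# Possible workaround: since the first ColumnTransformer outputs the imputed
# numeric columns first, select them by position in the second one.
feature_transformer = ColumnTransformer(
    [('num', StandardScaler(), list(range(len(numeric_columns))))],
    remainder='passthrough'
)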