I am trying to build a scikit-learn pipeline in which some data needs to be imputed. This works fine when the steps are done individually, but when they are combined into a pipeline the imputation step does not seem to take effect: I receive the error ValueError: Input X contains NaN. and the scores are all nan. This is the pipeline:
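import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import PLSRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Make dummy data
X = pd.DataFrame(np.random.rand(20, 20))
X.iloc[2, 3] = np.nan
y = X.pop(0)
# Positional indices of the 19 remaining feature columns
numeric_columns = np.arange(19)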
# Imputation and transformation
# (fill_value is only used when strategy='constant', so it has no effect here)
imp = SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=0)
feature_transformer = ColumnTransformer(
    [('Imput_mean', imp, numeric_columns),
     ('num', StandardScaler(), numeric_columns),
    ], remainder='passthrough'
)
# Hyperparameters
parameters = {'model__n_components': [1, 2, 3, 4, 5]}
# Pipeline
pipeline = Pipeline([('feature_transformer', feature_transformer),
                     ('model', PLSRegression())])
# Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)
# Cross validate and evaluate (nested: the grid search runs inside each outer fold)
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=cv)
score = cross_val_score(clf, X, y, cv=cv, scoring="r2")
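
To narrow down where the NaN survives, the transformer can be run on its own and the missing values in its output counted. The following is only a diagnostic sketch under assumptions, not a confirmed fix: it relies on ColumnTransformer applying each listed transformer to the original columns independently and concatenating the results, and on StandardScaler passing NaN through in transform. The names parallel and chained, the nested Pipeline, and the seeded random generator are introduced here for illustration and are not part of the code above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same dummy data as above (seeded here only so the check is reproducible)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((20, 20)))
X.iloc[2, 3] = np.nan
y = X.pop(0)
numeric_columns = np.arange(19)

# Transformer laid out as in the question: the imputer and the scaler
# each receive the original columns, and their outputs are concatenated
parallel = ColumnTransformer(
    [('Imput_mean', SimpleImputer(strategy='mean'), numeric_columns),
     ('num', StandardScaler(), numeric_columns)],
    remainder='passthrough',
)
out_parallel = parallel.fit_transform(X)
print(out_parallel.shape, np.isnan(out_parallel).sum())

# Assumed alternative: chain the two steps so the scaler sees imputed values
chained = ColumnTransformer(
    [('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                       ('scale', StandardScaler())]), numeric_columns)],
    remainder='passthrough',
)
out_chained = chained.fit_transform(X)
print(out_chained.shape, np.isnan(out_chained).sum())
```

If the first output has twice as many columns as the input and still contains a NaN while the second does not, that would point at the layout of the ColumnTransformer rather than at SimpleImputer itself.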
Any advice would be appreciated.