I am trying to construct a scikit-learn pipeline in which some data needs to be imputed. Each step works fine when run individually, but once the steps are combined into a pipeline the imputation step seems to fail: I receive the error `ValueError: Input X contains NaN.` and the scores are all `nan`. This is the pipeline:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.cross_decomposition import PLSRegression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Make dummy data
X = pd.DataFrame(np.random.rand(20, 20))
X.iloc[2, 3] = np.nan
y = X.pop(0)
numeric_columns = np.arange(19)

# Imputation and transformation
imp = SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=0)

feature_transformer = ColumnTransformer(
    [('Imput_mean', imp, numeric_columns),
     ('num', StandardScaler(), numeric_columns),
     ], remainder='passthrough'
)

# Hyperparameters
parameters = {'model__n_components': [1, 2, 3, 4, 5]}

# Pipeline
pipeline = Pipeline([('feature_transformer', feature_transformer),
                     ('model', PLSRegression())])

# Cross validation strategy
cv = KFold(n_splits=10, shuffle=True)

# Cross validate and evaluate
clf = GridSearchCV(pipeline, parameters, scoring="r2", cv=cv)
score = cross_val_score(clf, X, y, cv=cv, scoring="r2")

Any advice is appreciated.

RobMcC
    IMO the issue is in your use of `ColumnTransformer`: be aware that the transformers inside a `ColumnTransformer` act on the data in parallel, not sequentially. Therefore, your `StandardScaler` is transforming the original data (which does contain NaNs) and not the imputed data. I would suggest having a look at https://stackoverflow.com/questions/70527088/columntransformer-pipeline-with-ohe-is-the-ohe-encoded-field-retained-or-rem?noredirect=1&lq=1 and related posts for more details. – amiola Jun 02 '22 at 09:16
