
I have the following (hypothetical) dataset:

numerical_ok  numerical_missing  categorical
210           30                 cat1
180           NaN                cat2
70            19                 cat3

where categorical is a string column and numerical_ok and numerical_missing are both numerical, but the latter has some missing values. I want to perform these three tasks:

  1. OneHotEncode categorical column
  2. Impute NAs in numerical_missing using SimpleImputer or another imputer available in sklearn
  3. Apply KBinsDiscretizer to numerical_missing

Of course this is quite easy to do if I use a mixed pandas/sklearn approach:

df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])

ColumnTransformer([
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

But for reasons of scalability and consistency (I'll later fit a model to this data) I'd like to see how this could work using pipelines. I tried two ways to do that:

WAY 1: Using a single ColumnTransformer.

But since a ColumnTransformer applies all its transformers in parallel to the original input (not sequentially), KBinsDiscretizer still receives the missing data:

ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

Giving this error:

KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

WAY 2: Composing two ColumnTransformers into a Pipeline

Now the output of the first ColumnTransformer is a sparse array, where I can no longer access the original feature names.

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), ["numerical_missing"])]))
]).fit_transform(df)

This gives me:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

How to proceed? Thanks.
  • This answer might help: [Consistent ColumnTransformer for intersecting lists of columns](https://stackoverflow.com/a/62234209/9987623) – AlexK May 27 '22 at 04:35
  • The eventual pandas-out methods in sklearn would make your second approach work with very little change. Keep an eye on https://github.com/scikit-learn/scikit-learn/issues/23001 – Ben Reiniger May 27 '22 at 14:27

1 Answer


In addition to the question linked in the comments and its Linked questions (where I generally suggest defining one pipeline for each combination of transformers you want applied in sequence, plus a single column transformer to run those pipelines in parallel), for this smallish example I'd also like to suggest an index-based fix for your second attempt.

The output of ColumnTransformer has columns in order of its transformers, and the remainder at the end. So in your case, the output will be the now-imputed numerical_missing, followed by some unknown number of one-hot encoded columns, followed by the remainder numerical_ok. Since you only want to bin the (imputed) numerical_missing, you can specify the discretizer column transformer as operating on the 0th column of its input:

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), [0])]))
]).fit_transform(df)

I tend to prefer using column names, so the single-column-transformer-with-separate-pipelines approach may still be preferable, but this isn't such a bad solution either.

OK, I suppose I might as well also include the approach I keep alluding to.

num_mis_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer()),
])
ColumnTransformer([
    ("imp_disc", num_mis_pipe, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)
Ben Reiniger
  • This is a very interesting solution. Could I just check that I'm understanding this correctly, you cannot pass a column name to a pipeline step when using `Pipeline` but you can pass a column name if the `Pipeline` is a `ColumnTransformer` step? – jmich738 Jul 15 '22 at 03:49
  • @jmich738 if you pass a dataframe into a pipeline, the first step will see that frame, so column names will be available. But if the first step converts to an array (as most sklearn transformers will), then later steps in a pipeline will not have access to column names. – Ben Reiniger Jul 15 '22 at 12:18
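That last point can be demonstrated with a small probe (record_type is a hypothetical helper for illustration, not sklearn API): the step before the imputer sees the original DataFrame, while the step after it sees a plain numpy array, which is why later steps can no longer select by column name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

seen_types = []

def record_type(X):
    # record what this step actually receives at transform time
    seen_types.append(type(X).__name__)
    return X

df = pd.DataFrame({"numerical_missing": [30.0, np.nan, 19.0]})

Pipeline([
    ("probe_before", FunctionTransformer(record_type)),
    ("imputer", SimpleImputer()),  # returns a numpy array
    ("probe_after", FunctionTransformer(record_type)),
]).fit_transform(df)

# seen_types is now ["DataFrame", "ndarray"]
```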