
I have the following (hypothetical) dataset:

numerical_ok  numerical_missing  categorical
210           30                 cat1
180           NaN                cat2
70            19                 cat3

where categorical is a string column and numerical_ok and numerical_missing are both numerical, but the latter has some missing values. I want to perform these three tasks:

  1. OneHotEncode categorical column
  2. Impute NAs in numerical_missing using SimpleImputer or another imputer available in sklearn
  3. Apply KBinsDiscretizer to numerical_missing

Of course this is quite easy to do if I use a mixed pandas/sklearn approach:

df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])

ColumnTransformer([
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

But for reasons of scalability and consistency (I'll later fit a model to this data) I'd like to see how this could work using pipelines. I tried two ways to do that:

WAY 1: Using a single ColumnTransformer.

But since a ColumnTransformer applies all its transformers in parallel to the original input (not sequentially), KBinsDiscretizer still receives the missing data:

ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)

Giving this error:

KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

WAY 2: Composing two ColumnTransformers into a Pipeline

Now the output of the first ColumnTransformer is a sparse array, where I can no longer access the original feature names.

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), ["numerical_missing"])]))
]).fit_transform(df)

This gives me:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

How to proceed? Thanks.
  • This answer might help: [Consistent ColumnTransformer for intersecting lists of columns](https://stackoverflow.com/a/62234209/9987623) – AlexK May 27 '22 at 04:35
  • The eventual pandas-out methods in sklearn would make your second approach work with very little change. Keep an eye on https://github.com/scikit-learn/scikit-learn/issues/23001 – Ben Reiniger May 27 '22 at 14:27

1 Answer


In addition to the question linked in the comments and its Linked questions (where I generally suggest defining one pipeline for each combination of transformers you want applied in sequence, plus a single column transformer to run those pipelines in parallel), for this smallish example I'd also like to suggest an index-based fix for your second attempt.

The output of ColumnTransformer has columns in order of its transformers, and the remainder at the end. So in your case, the output will be the now-imputed numerical_missing, followed by some unknown number of one-hot encoded columns, followed by the remainder numerical_ok. Since you only want to bin the (imputed) numerical_missing, you can specify the discretizer column transformer as operating on the 0th column of its input:

Pipeline([
    ("transformer", (
        ColumnTransformer([
            ("imputer", SimpleImputer(), ["numerical_missing"]),
            ("encoder", OneHotEncoder(), ["categorical"])
        ], remainder="passthrough")
    )),
    ("discretizer", ColumnTransformer([("discretizer", KBinsDiscretizer(), [0])]))
]).fit_transform(df)

I tend to prefer using column names, so the single-column-transformer-with-separate-pipelines approach may still be preferable, but this isn't such a bad solution either.

OK, I suppose I might as well also include the approach I keep alluding to.

num_mis_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer()),
])
ColumnTransformer([
    ("imp_disc", num_mis_pipe, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)
Ben Reiniger
  • This is a very interesting solution. Could I just check that I'm understanding this correctly, you cannot pass a column name to a pipeline step when using `Pipeline` but you can pass a column name if the `Pipeline` is a `ColumnTransformer` step? – jmich738 Jul 15 '22 at 03:49
  • @jmich738 if you pass a dataframe into a pipeline, the first step will see that frame, so column names will be available. But if the first step converts to an array (as most sklearn transformers will), then later steps in a pipeline will not have access to column names. – Ben Reiniger Jul 15 '22 at 12:18
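That last point can be demonstrated with a small probe (record_type is a hypothetical helper for illustration, not sklearn API): the step before the imputer sees the original DataFrame, while the step after it sees a plain numpy array, which is why later steps can no longer select by column name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

seen_types = []

def record_type(X):
    # record what this step actually receives at transform time
    seen_types.append(type(X).__name__)
    return X

df = pd.DataFrame({"numerical_missing": [30.0, np.nan, 19.0]})

Pipeline([
    ("probe_before", FunctionTransformer(record_type)),
    ("imputer", SimpleImputer()),  # returns a numpy array
    ("probe_after", FunctionTransformer(record_type)),
]).fit_transform(df)

# seen_types is now ["DataFrame", "ndarray"]
```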