I have the following (hypothetical) dataset:
| numerical_ok | numerical_missing | categorical |
|---|---|---|
| 210 | 30 | cat1 |
| 180 | NaN | cat2 |
| 70 | 19 | cat3 |
Here categorical is a string column, while numerical_ok and numerical_missing are both numerical, the latter with some missing values. I want to perform these three tasks:
- One-hot encode the categorical column with OneHotEncoder
- Impute the NaNs in numerical_missing using SimpleImputer or another imputer from sklearn
- Apply KBinsDiscretizer to numerical_missing
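For reference, this is roughly how the toy frame above can be built (a minimal sketch; values copied from the table, assuming pandas and numpy):

```python
import numpy as np
import pandas as pd

# Toy frame matching the table above; numerical_missing contains one NaN.
df = pd.DataFrame({
    "numerical_ok": [210, 180, 70],
    "numerical_missing": [30, np.nan, 19],
    "categorical": ["cat1", "cat2", "cat3"],
})
```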
Of course this is quite easy to do if I use a mixed pandas/sklearn approach:
df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])
ColumnTransformer([
("encoder", OneHotEncoder(), ["categorical"]),
("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)
But for reasons of scalability and consistency (I'll later fit a model to this data), I'd like to see how this could work using pipelines. I tried two ways to do that:
WAY 1: Using a single ColumnTransformer.
But a ColumnTransformer applies all of its transformers in parallel to the original input rather than sequentially, so KBinsDiscretizer still sees the missing data:
```python
ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"]),
], remainder="passthrough").fit_transform(df)
```
Giving this error:
```
KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
```
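One direction I'm considering (sketched below, not validated) is to nest a small Pipeline inside the ColumnTransformer, so that imputation runs before discretization within that single branch; n_bins=2 is only there because the toy frame has three rows:

```python
from sklearn.pipeline import Pipeline

# Chain imputation and discretization for the NaN column in one branch,
# so KBinsDiscretizer only ever sees imputed values.
numeric_missing_branch = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer(n_bins=2)),  # small n_bins for the 3-row toy frame
])

ColumnTransformer([
    ("num_missing", numeric_missing_branch, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)
```

But I'm not sure whether this nesting is the idiomatic way to express "impute, then discretize the same column".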
WAY 2: Composing two ColumnTransformers into a Pipeline
Now the output of the first ColumnTransformer is a (sparse) NumPy array, so the second ColumnTransformer can no longer select columns by their original names.
```python
Pipeline([
    ("transformer", ColumnTransformer([
        ("imputer", SimpleImputer(), ["numerical_missing"]),
        ("encoder", OneHotEncoder(), ["categorical"]),
    ], remainder="passthrough")),
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), ["numerical_missing"]),
    ])),
]).fit_transform(df)
```
This gives me:
```
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
```
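Along the same lines, I wondered whether keeping pandas DataFrames between the two ColumnTransformers would fix the name lookup. A sketch assuming scikit-learn >= 1.2 (for set_output), with dense encodings since pandas output does not support sparse data, and verbose_feature_names_out=False so the second transformer can still find "numerical_missing" by name:

```python
# Sketch: keep DataFrames between steps so column names survive (sklearn >= 1.2).
first = ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(sparse_output=False), ["categorical"]),
], remainder="passthrough", verbose_feature_names_out=False)

second = ColumnTransformer([
    ("discretizer", KBinsDiscretizer(n_bins=2, encode="ordinal"), ["numerical_missing"]),
], remainder="passthrough")

pipe = Pipeline([("transformer", first), ("discretizer", second)])
pipe.set_output(transform="pandas")  # every step now emits a DataFrame
pipe.fit_transform(df)
```

Again, I don't know whether this is considered the right pattern.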
How should I proceed? Thanks.