Extract feature names after Pipeline usage with ColumnTransformer (sklearn)

Question

I have the following toy code.

I use a pipeline to automatically normalize the numerical variables and apply one-hot encoding to the categorical ones.

I can get the coefficients of the logistic regression model easily using pipe['logisticregression'].coef_, but how can I get all the feature names in the same order as they appear in the coefficient matrix?

import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# data from https://www.kaggle.com/datasets/uciml/adult-census-income
data = pd.read_csv("adult.csv")
data = data.iloc[0:3000,:]

target = "workclass"

y = data[target]
X = data.drop(columns=target)

numerical_columns_selector = make_column_selector(dtype_exclude=object)
categorical_columns_selector = make_column_selector(dtype_include=object)

numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    ('std', StandardScaler(), numerical_columns),
])

model = LogisticRegression(max_iter=500)
pipe = make_pipeline(ct, model)

data_train, data_test, target_train, target_test = train_test_split(
    X, y, random_state=42)

pipe.fit(data_train, target_train)

pipe['logisticregression'].coef_.shape

The feature names are returned in a different order from their order in X because of the transformations applied by the `ColumnTransformer` object, as described in https://stackoverflow.com/questions/68874492/preserve-column-order-after-applying-sklearn-compose-columntransformer/70526434#70526434. As far as I know, you need a custom solution (e.g. manually specifying the columns to pass to the `ColumnTransformer` object) to get a transformed matrix whose features are in the same order as they were at the beginning. — amiola, Sep 26 '22 at 15:54
Thanks for the comment. So this means that, since I define the one-hot encoder first and then the scaler, the feature names will be in the order “transformed categorical” + numerical feature names. Am I missing something here? — seralouk, Sep 26 '22 at 15:59
That's correct. You can see the feature names of the transformed matrix (and their order) by typing `pipe[:-1].get_feature_names_out()`. You'll also see the notation they're using to encode the transformed feature names. — amiola, Sep 26 '22 at 16:04
