
I am studying this snippet and I don't understand how the column addition was constructed.

def column_addition(X):
    return X[:, [0]] + X[:, [1]]

def addition_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_addition))

preprocessing = ColumnTransformer(
    transformers=[("accompany", addition_pipeline(), ["SibSp", "Parch"])], remainder='passthrough')

preprocess = preprocessing.fit_transform(df)

How are the df and ["SibSp", "Parch"] used behind the scenes to produce the addition in the code below?

# How is the df and ["SibSp", "Parch"] implemented here?
# How can I replicate this as a non-function?

X[:, [0]] + X[:, [1]]

When I try to replace X with the DataFrame it throws an error.

Armando Bridena

1 Answer


A couple of premises to understand how this example works:

  • a Pipeline applies its transformations serially. Your pipeline therefore first imputes the selected columns (we'll see below that these are ['sibsp', 'parch']) with the median value, and then applies the column addition to those same columns.
  • a ColumnTransformer applies transformations in parallel, each to the columns you pass alongside it. Here you're applying a single transformation, the pipeline itself (imputation + column addition), to the columns ['sibsp', 'parch'], while leaving all other columns untouched (remainder='passthrough').
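
To see concretely what the pipeline receives from the ColumnTransformer (and hence why indices 0 and 1 work), here is a minimal sketch with a toy frame standing in for the Titanic data (the values are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# toy stand-in for the Titanic data
df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 1, 1], "fare": [7.25, 71.28, 8.05]})

seen = {}

def spy(X):
    # by this point SimpleImputer has already turned the two selected
    # columns into a plain NumPy array, so positions 0 and 1 are all we have
    seen["type"] = type(X).__name__
    seen["shape"] = X.shape
    return X[:, [0]] + X[:, [1]]

pipe = make_pipeline(SimpleImputer(strategy="median"), FunctionTransformer(spy))
ct = ColumnTransformer([("accompany", pipe, ["sibsp", "parch"])],
                       remainder="passthrough")
out = ct.fit_transform(df)

print(seen)       # {'type': 'ndarray', 'shape': (3, 2)}
print(out[:, 0])  # [1. 1. 3.]  -> sibsp + parch
```

Only the two named columns reach the pipeline; everything else ('fare' here) is appended afterwards by the passthrough remainder.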

This said, the reason why the column_addition function references columns by index (columns 0 and 1 are respectively 'sibsp' and 'parch', because the transformer only ever receives the columns you selected) is that, historically, calling .fit_transform() on a Pipeline or a ColumnTransformer instance (and thus applying multiple transformations at once on a DataFrame) lost the DataFrame structure and returned a NumPy array, whose columns can only be referenced positionally. I emphasize historically because recent scikit-learn versions make it possible to keep the DataFrame structure while applying serial or parallel transformations (see the link attached at the end of the answer for further details).

All in all, the way to reproduce the example might be the following:

  • Your version:

     import pandas as pd
     from sklearn.datasets import fetch_openml
     from sklearn.compose import ColumnTransformer
     from sklearn.impute import SimpleImputer
     from sklearn.pipeline import make_pipeline
     from sklearn.preprocessing import FunctionTransformer
    
     df = fetch_openml('titanic', version=1, as_frame=True)['data']
    
     def column_addition(X):
         return X[:, [0]] + X[:, [1]]
    
     addition_pipeline = make_pipeline(
         SimpleImputer(strategy="median"),
         FunctionTransformer(column_addition)
     )
    
     preprocessing = ColumnTransformer(
         transformers=[("accompany", addition_pipeline, ["sibsp", "parch"])], remainder='passthrough')
    
     df_updated = pd.DataFrame(preprocessing.fit_transform(df))
    
  • Replica:

     si = SimpleImputer(strategy="median")
     df_new = df.copy()
     df_new['sibsp_new'] = si.fit_transform(df[['sibsp']])
     df_new['parch_new'] = si.fit_transform(df[['parch']])
     df_new['addition_col'] = df_new['sibsp_new'] + df_new['parch_new']
    
  • Proof that all's working:

     (df_new['addition_col'] == df_updated.iloc[:, 0]).all()   # gives True
    

Also, observe that the Titanic dataset from openml does not require imputation on columns 'sibsp' and 'parch'; however, in principle this might not be the case for you depending on the dataset you're starting from.

df['sibsp'].isna().sum(), df['parch'].isna().sum()   # gives (0, 0)
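
Finally, as to the error you got when replacing X with the DataFrame directly: slicing like X[:, [0]] is NumPy syntax, which a DataFrame's [] does not accept; you need .iloc for positional access, or a conversion with .to_numpy() first. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 1, 1]})

# df[:, [0]] raises an error: a DataFrame's [] is label-based and takes no tuples

added_iloc = df.iloc[:, 0] + df.iloc[:, 1]                 # pandas positional access
added_np = df.to_numpy()[:, [0]] + df.to_numpy()[:, [1]]   # NumPy syntax after conversion

print(added_iloc.tolist())        # [1, 1, 3]
print(added_np.ravel().tolist())  # [1, 1, 3]
```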

I'd also suggest "how to use ColumnTransformer() to return a dataframe?" for a couple of further details on the use of ColumnTransformer instances.

amiola