
I am studying this snippet and I don't understand how the column addition was constructed.

def column_addition(X):
    return X[:, [0]] + X[:, [1]]

def addition_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_addition))

preprocessing = ColumnTransformer(
    transformers=[("accompany", addition_pipeline(), ["SibSp", "Parch"])], remainder='passthrough')

preprocess = preprocessing.fit_transform(df)

How are the df and ["SibSp", "Parch"] used behind the scenes to produce the addition in the code below?

# How is the df and ["SibSp", "Parch"] implemented here?
# How can I replicate this as a non-function?

X[:, [0]] + X[:, [1]]

When I try to replace X with the DataFrame it throws an error.

Armando Bridena

1 Answer


A couple of premises to understand how this example works:

  • a Pipeline applies its transformations serially. Your pipeline therefore first imputes the selected columns (we'll see below that these are ['sibsp', 'parch']) with the median value, and then applies the column addition to those same columns.
  • a ColumnTransformer applies transformations in parallel, each to the columns you pass alongside it. Here you're applying a single transformation, the pipeline itself (imputation + column addition), to the columns ['sibsp', 'parch'], while leaving all other columns untouched (remainder='passthrough').
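
To see concretely what the pipeline receives from the ColumnTransformer (and hence why indices 0 and 1 work), here is a minimal sketch with a toy frame standing in for the Titanic data (the values are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# toy stand-in for the Titanic data
df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 1, 1], "fare": [7.25, 71.28, 8.05]})

seen = {}

def spy(X):
    # by this point SimpleImputer has already turned the two selected
    # columns into a plain NumPy array, so positions 0 and 1 are all we have
    seen["type"] = type(X).__name__
    seen["shape"] = X.shape
    return X[:, [0]] + X[:, [1]]

pipe = make_pipeline(SimpleImputer(strategy="median"), FunctionTransformer(spy))
ct = ColumnTransformer([("accompany", pipe, ["sibsp", "parch"])],
                       remainder="passthrough")
out = ct.fit_transform(df)

print(seen)       # {'type': 'ndarray', 'shape': (3, 2)}
print(out[:, 0])  # [1. 1. 3.]  -> sibsp + parch
```

Only the two named columns reach the pipeline; everything else ('fare' here) is appended afterwards by the passthrough remainder.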

This said, the reason why the column_addition function references columns by index (columns 0 and 1 are respectively 'sibsp' and 'parch', because the transformer only ever receives the columns you selected) is that, historically, calling .fit_transform() on a Pipeline or a ColumnTransformer instance (and thus applying multiple transformations at once on a DataFrame) lost the DataFrame structure and returned a NumPy array, whose columns can only be referenced positionally. I emphasize historically because recent scikit-learn versions make it possible to keep the DataFrame structure while applying serial or parallel transformations (see the link attached at the end of the answer for further details).

All in all, the way to reproduce the example might be the following:

  • Your version:

     import pandas as pd
     from sklearn.datasets import fetch_openml
     from sklearn.compose import ColumnTransformer
     from sklearn.impute import SimpleImputer
     from sklearn.pipeline import make_pipeline
     from sklearn.preprocessing import FunctionTransformer
    
     df = fetch_openml('titanic', version=1, as_frame=True)['data']
    
     def column_addition(X):
         return X[:, [0]] + X[:, [1]]
    
     addition_pipeline = make_pipeline(
         SimpleImputer(strategy="median"),
         FunctionTransformer(column_addition)
     )
    
     preprocessing = ColumnTransformer(
         transformers=[("accompany", addition_pipeline, ["sibsp", "parch"])], remainder='passthrough')
    
     df_updated = pd.DataFrame(preprocessing.fit_transform(df))
    
  • Replica:

     si = SimpleImputer(strategy="median")
     df_new = df.copy()
     df_new['sibsp_new'] = si.fit_transform(df[['sibsp']])
     df_new['parch_new'] = si.fit_transform(df[['parch']])
     df_new['addition_col'] = df_new['sibsp_new'] + df_new['parch_new']
    
  • Proof that all's working:

     (df_new['addition_col'] == df_updated.iloc[:, 0]).all()   # gives True
    

Also, observe that the Titanic dataset from openml does not require imputation on columns 'sibsp' and 'parch'; however, in principle this might not be the case for you depending on the dataset you're starting from.

df['sibsp'].isna().sum(), df['parch'].isna().sum()   # gives (0, 0)
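
Finally, as to the error you got when replacing X with the DataFrame directly: slicing like X[:, [0]] is NumPy syntax, which a DataFrame's [] does not accept; you need .iloc for positional access, or a conversion with .to_numpy() first. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 1, 1]})

# df[:, [0]] raises an error: a DataFrame's [] is label-based and takes no tuples

added_iloc = df.iloc[:, 0] + df.iloc[:, 1]                 # pandas positional access
added_np = df.to_numpy()[:, [0]] + df.to_numpy()[:, [1]]   # NumPy syntax after conversion

print(added_iloc.tolist())        # [1, 1, 3]
print(added_np.ravel().tolist())  # [1, 1, 3]
```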

I'd also suggest "how to use ColumnTransformer() to return a dataframe?" for a couple of further details on the use of ColumnTransformer instances.

amiola