5

In general, we use df.drop('column_name', axis=1) to remove a column from a DataFrame. I want to add this operation as a transformer in a Pipeline.

Example:

numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                                     ('scaler', StandardScaler(with_mean=False))
                                     ])

How can I do it?

Vadim Kotov
Tan Phan

3 Answers

16

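You can write a custom transformer like this:

from sklearn.base import BaseEstimator, TransformerMixin

# Inheriting from the sklearn base classes adds get_params/set_params and fit_transform
class columnDropperTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: dropping columns is stateless
        return self

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)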

And use it in a pipeline:

import pandas as pd
from sklearn.pipeline import Pipeline

# sample dataframe
df = pd.DataFrame({
    "col_1": ["a", "b", "c", "d"],
    "col_2": ["e", "f", "g", "h"],
    "col_3": [1, 2, 3, 4],
    "col_4": [5, 6, 7, 8]
})

# your pipeline
pipeline = Pipeline([
    ("columnDropper", columnDropperTransformer(['col_2', 'col_3']))
])

# apply the pipeline to the dataframe
pipeline.fit_transform(df)

Output:

  col_1  col_4
0     a      5
1     b      6
2     c      7
3     d      8
Meysam Amini
    Your class should inherit from `BaseEstimator` and `TransformerMixin` in order to comply with the `sklearn` API. – Woodly0 Jun 07 '23 at 09:36
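To illustrate the comment above, here is a minimal sketch of a dropper that follows the sklearn estimator API and is chained in front of the imputer and scaler from the question. The class name ColumnDropper, the sample DataFrame and its column names are assumptions made for illustration only. With the base classes in place, get_params and clone work on the whole pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that drops the given DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)


# Illustrative data: "id" is the column we want to remove
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "x": [1.0, None, 3.0],
    "y": [2.0, 4.0, 6.0],
})

pipe = Pipeline(steps=[
    ("dropper", ColumnDropper(["id"])),
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler(with_mean=False)),
])

result = pipe.fit_transform(sample)  # NumPy array with the two remaining columns
params = pipe.get_params()           # includes "dropper__columns"
cloned = clone(pipe)                 # works because get_params is available
```

Because the dropper exposes its columns argument as a parameter, it can also be tuned or swapped through set_params like any other pipeline step.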
4

You can encapsulate your Pipeline into a ColumnTransformer which allows you to select the data that is processed through the pipeline as follows:

import pandas as pd

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

col_to_exclude = 'A'
df = pd.DataFrame({'A': [0]*10, 'B': [1]*10, 'C': [2]*10})

numerical_transformer = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(with_mean=False)
)

# Only the columns matched by the selector go through the pipeline
transform = make_column_transformer(
    (numerical_transformer, make_column_selector(pattern=f'^(?!{col_to_exclude})'))
)

transform.fit_transform(df)

NOTE: Here I am using a regex pattern (a negative lookahead) to exclude column A. Note that it excludes every column whose name starts with A.
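As a quick check of what such a pattern selects, the selector can be called directly on a DataFrame. This is a small sketch; the extra column AB and the variant pattern with `$` are assumptions added to show the difference between excluding a prefix and excluding one exact name.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative frame: "AB" is added only to show the prefix behaviour
frame = pd.DataFrame({"A": [0], "AB": [1], "B": [2], "C": [3]})

# Negative lookahead on a prefix: skips every column starting with "A"
prefix_selector = make_column_selector(pattern="^(?!A)")
prefix_selected = prefix_selector(frame)

# Anchoring with "$" skips only the column named exactly "A"
exact_selector = make_column_selector(pattern="^(?!A$)")
exact_selected = exact_selector(frame)
```

If column names share prefixes, the anchored variant is probably the safer choice.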

Antoine Dubuis
    This works notably because the `ColumnTransformer` has the default parameter `remainder='drop'` – Woodly0 Jun 07 '23 at 14:36
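The effect of remainder can be seen in a short sketch. The sample data and the choice of StandardScaler on column B are assumptions for illustration: with the default, unlisted columns are discarded, while remainder='passthrough' keeps them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative data
sample = pd.DataFrame({
    "A": [0.0, 1.0, 2.0],
    "B": [1.0, 2.0, 3.0],
    "C": [2.0, 4.0, 6.0],
})

# Default remainder='drop': only the listed column "B" survives
dropping = ColumnTransformer([("scale", StandardScaler(), ["B"])])
dropped_shape = dropping.fit_transform(sample).shape

# remainder='passthrough': "A" and "C" are appended unchanged
keeping = ColumnTransformer(
    [("scale", StandardScaler(), ["B"])],
    remainder="passthrough",
)
kept_shape = keeping.fit_transform(sample).shape
```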
0

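The simplest way is to use the special transformer value 'drop' in sklearn.compose.ColumnTransformer, together with remainder='passthrough' so that the remaining columns are kept (the default, remainder='drop', would discard them as well):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Sample data for illustration; the column names are placeholders
df = pd.DataFrame({
    'feature1': [1, 2, 3],
    'feature2': [4, 5, 6],
    'feature3': [7, 8, 9],
})

# Specify columns to drop
columns_to_drop = ['feature1', 'feature3']

# Create a ColumnTransformer that drops the listed columns and keeps the rest
preprocessor = ColumnTransformer(
    transformers=[
        ('column_dropper', 'drop', columns_to_drop),
    ],
    remainder='passthrough',
)

pipeline = Pipeline(
    steps=[
        ('preprocessing', preprocessor),
    ]
)

# Transform the DataFrame using the pipeline
transformed_data = pipeline.fit_transform(df)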
Mr. Duhart