
Good day, I searched for this without luck. It seems like it should be possible, but I might be reading the API wrong. How can I have scikit-learn automatically drop the extra columns of my pandas DataFrame on my test data, instead of having to drop those columns explicitly?


I am running Python 3.6 and scikit-learn 0.24.2 in this environment.

Here's a minimal example:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

import pandas as pd

from random import randint
from random import choice
import random

random.seed(42)

df = pd.DataFrame({
    'cont_A': [randint(1,10) for _ in range(10)], 
    'cont_B': [randint(-20,20) for _ in range(10)],
    'cat_A': [choice('ABC') for _ in range(10)],
    'cat_B': [choice('XYZ') for _ in range(10)],
})

This will create a dataframe with two categorical columns and two continuous columns.

t = [
    ('cat', OneHotEncoder(), ['cat_A', 'cat_B']),
    ('nums', MinMaxScaler(), ['cont_A', 'cont_B'])
]

columnTransformer = ColumnTransformer(t, remainder='drop')
X_train = columnTransformer.fit_transform(df)
X_train

We can fit-transform our columnTransformer on the initial training data. Now suppose we generate our test or input data before we want to run our model:

df_test = pd.DataFrame({
    'cont_A': [randint(2,9) for _ in range(3)], 
    'cont_B': [randint(-19,19) for _ in range(3)],
    'cat_A': [choice('ABC') for _ in range(3)],
    'cat_B': [choice('XYZ') for _ in range(3)],
    'extra_A': [randint(1,5) for _ in range(3)], 
    'extra_B': [randint(1,5) for _ in range(3)], 
    'extra_C': [randint(1,5) for _ in range(3)], 
})

This test DataFrame has 3 extra columns that are of no value to us. I want the columnTransformer to drop them automatically and process the remaining columns (if this is possible), without my having to drop them explicitly.

If I run the columnTransformer on this data:

LST = columnTransformer.transform(df_test)

it raises an error:

ValueError: X has 7 features, but ColumnTransformer is expecting 4 features as input.

However, if I drop those columns explicitly, it runs. I thought remainder='drop' would address this issue, but it does not seem to help:

df_test_dropped = df_test.drop(['extra_A', 'extra_B', 'extra_C'], axis=1)
LST = columnTransformer.transform(df_test_dropped)

How can I (if it's even possible) have columnTransformer automatically drop non-relevant columns (instead of having to explicitly drop them)?

Reily Bourne
  • https://stackoverflow.com/questions/68402691/adding-dropping-column-instance-into-a-pipeline this can help – qaiser Sep 06 '22 at 11:27
  • Hey Thanks! That still requires explicitly identifying the columns to drop. I am asking if it is possible to drop the extras automatically. – Reily Bourne Sep 06 '22 at 11:33
  • 1
    @ReilyBourne check with sklearn version , as mention below by systemsigma_ your issue can be resolved – qaiser Sep 06 '22 at 12:25
  • Thanks. That requires sklearn v1.0+. I am running `python 3.6` in this environment, so I cannot upgrade past `0.24.2`. – Reily Bourne Sep 06 '22 at 14:45
  • just curious, why don't you want to select the columns by `df_test[df.columns]`? – s510 Sep 06 '22 at 15:19

3 Answers

1

remainder='drop' tells the transformer to drop those columns, among the ones seen at fit time, that don't fall under any of the listed transformers. It does not mean "ignore additional columns in a test set", and there is currently no way to accomplish that: all estimators expect to receive inputs in the same format at fitting and at transform/prediction time.
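Since the estimator can't ignore extra columns itself, a common workaround (also suggested in the comments under the question) is to remember the training columns and select them from the test frame before calling transform. A minimal sketch with deterministic toy data, assuming the same two-category/two-numeric layout as the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

df = pd.DataFrame({
    'cont_A': [1, 5, 10],
    'cont_B': [-20, 0, 20],
    'cat_A': ['A', 'B', 'C'],
    'cat_B': ['X', 'Y', 'Z'],
})

ct = ColumnTransformer([
    ('cat', OneHotEncoder(), ['cat_A', 'cat_B']),
    ('nums', MinMaxScaler(), ['cont_A', 'cont_B']),
], remainder='drop')

train_cols = list(df.columns)   # remember the layout seen at fit time
ct.fit(df)

# Test data carrying an extra column the transformer has never seen
df_test = pd.DataFrame({
    'cont_A': [2, 9],
    'cont_B': [-19, 19],
    'cat_A': ['A', 'C'],
    'cat_B': ['X', 'Z'],
    'extra_A': [1, 2],
})

# Selecting the training columns drops the extras before validation
X_test = ct.transform(df_test[train_cols])
print(X_test.shape)  # (2, 8): 6 one-hot columns plus 2 scaled ones
```

The selection works regardless of sklearn version, because the transformer only ever sees the four columns it was fitted on.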

Ben Reiniger
0

It is possible to have a ColumnTransformer automatically drop listed columns that are absent from the input DataFrame, but you need to create a new transformer class.

Example application: I want to encode some categorical columns of the DataFrame with a pipeline, and possibly drop some of those columns beforehand as part of a preprocessing grid search. Since the ColumnTransformer is part of the pipeline, we can't easily change its list of columns to transform from the grid search.

Example data:

ex = pd.DataFrame({'foo': ['A','B','A','A'],'bar':['x','z','y','x'],'foobar':[.1,.2,-.5,0]})
  foo bar  foobar
0   A   x     0.1
1   B   z     0.2
2   A   y    -0.5
3   A   x     0.0

We set up a regular ColumnTransformer to encode the categorical columns like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encode', OneHotEncoder(), ['foo', 'bar'])],
                       remainder='passthrough')
ct.fit_transform(ex)

which results in

array([[ 1. ,  0. ,  1. ,  0. ,  0. ,  0.1],
       [ 0. ,  1. ,  0. ,  0. ,  1. ,  0.2],
       [ 1. ,  0. ,  0. ,  1. ,  0. , -0.5],
       [ 1. ,  0. ,  1. ,  0. ,  0. ,  0. ]])

i.e. the first two columns have been encoded, and the 'foobar' column passes through the transformer unchanged.

As noted by the OP, the following:

ct.fit_transform(ex.drop(['foo'], axis=1))

results in a ValueError, as the column 'foo' is present in the ColumnTransformer specification, but not the data frame that is being transformed.

However, by intersecting each transformer's column list (a constructor parameter) with the columns of X inside the fit method, we can restrict the transformers to the columns that are actually present.

class ColumnTransformerExtraCols(ColumnTransformer):
    """ColumnTransformer that ignores listed columns missing from X."""

    def _keep_existing_cols(self, X):
        # Keep only columns actually present in X, preserving the original
        # order: a plain set intersection would scramble the column order
        # and hence the layout of the transformed output.
        self.transformers = [
            (name, trans, [c for c in cols if c in X.columns])
            for name, trans, cols in self.transformers
        ]

    def fit(self, X, y=None):
        self._keep_existing_cols(X)
        return super().fit(X, y)

    def fit_transform(self, X, y=None):
        self._keep_existing_cols(X)
        return super().fit_transform(X, y)

(We need to override fit_transform as well, not just fit: ColumnTransformer implements fit_transform directly rather than calling fit followed by transform, and pipelines call fit_transform during fitting.)

Now,

ctex = ColumnTransformerExtraCols(transformers=[('encode', OneHotEncoder(), ['foo', 'bar'])],
                                remainder='passthrough')
ctex.fit_transform(ex.drop(['foo'], axis=1))

works as needed.

njp
-1

In this line

columnTransformer = ColumnTransformer(t, remainder='drop')

the keyword argument remainder='drop' already drops the columns not covered by the list of transformers t.
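To illustrate what remainder='drop' does guarantee: columns that are present at fit time but named in no transformer are dropped from the output. A minimal sketch with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'cont_A': [1.0, 5.0, 10.0],
    'unused': [7, 8, 9],   # present at fit time, named in no transformer
})

ct = ColumnTransformer([('nums', MinMaxScaler(), ['cont_A'])],
                       remainder='drop')
X = ct.fit_transform(df)
print(X.shape)  # (3, 1): 'unused' is dropped from the output
```

Whether transform also tolerates extra columns that were never seen at fit time is a separate question, and (per the comment thread below) depends on the sklearn version.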

SystemSigma_
  • which version of sklearn are you using? I am also getting the same error @ReilyBourne mentioned; my sklearn version is '0.24.2' – qaiser Sep 06 '22 at 12:15
  • I am on ``scikit-learn == 1.1.2`` – SystemSigma_ Sep 06 '22 at 12:19
  • Thanks for this insight. I am running `python 3.6` in my environment so I am limited to sklearn `0.24.2`. I will clarify this in my question. This kwarg seems to be a sklearn 1.0+ option only. – Reily Bourne Sep 06 '22 at 14:47
  • This option is present for [``0.24.2``](https://scikit-learn.org/0.24/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer) – SystemSigma_ Sep 06 '22 at 14:57
  • It's still the same issue. Same code, per example, generates same error. I created a new notebook, stripped it down to this example, and C&P the code from the notebook into this post. – Reily Bourne Sep 06 '22 at 15:08