
I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel; the second transformer should run only after the first) on intersecting lists of columns, like this:

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1,2, np.NaN, 4], 'b': [1,np.NaN, 3, 4], 'c': [1 ,2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                         transformers=[
                             ('num', impute.SimpleImputer() , ['a', 'b']),
                             ('log', log_transformer, ['b', 'c']),
                             ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                         ]).fit_transform(df)

So, I want to use SimpleImputer for 'a', 'b', then log for 'b', 'c', and then StandardScaler for 'a', 'b', 'c'.

But:

  1. I get an array of shape (4, 7).
  2. I still get NaN in the a and b columns.

So, how can I use ColumnTransformer for different columns in the manner of Pipeline?
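(For reference, a minimal reproduction of the two problems: the (4, 7) shape is the three transformer outputs concatenated side by side — 2 + 2 + 3 columns — and each transformer receives the raw input, which is why NaN survives in the copies of a and b handled by log and scale:)

```python
import numpy as np
import pandas as pd
from sklearn import compose, impute, preprocessing as p

log_transformer = p.FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})

out = compose.ColumnTransformer(transformers=[
    ('num', impute.SimpleImputer(), ['a', 'b']),     # 2 columns, imputed
    ('log', log_transformer, ['b', 'c']),            # 2 columns, from the raw b (still NaN)
    ('scale', p.StandardScaler(), ['a', 'b', 'c']),  # 3 columns, from the raw a and b
]).fit_transform(df)

print(out.shape)  # (4, 7)
```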

UPD:

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know exactly what cols these arrays contain, so they are not static: 
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understand why I get duplicated columns, NaNs, and unscaled values, but how can I fix this in the correct way when the column lists are not static?

UPD2:

A problem may arise when the columns change their order, so I want to use FunctionTransformer for column selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But I get this error:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?

konstantin_doncov
  • In the update, I don't understand what you mean when you say "I don't know exactly what cols these arrays contain, so they are not static" – Ben Reiniger Jun 07 '20 at 19:35
  • @BenReiniger these columns are created dynamically: e.g. I have a skewness test, so the cols_1 array (for example) contains only skewed columns, which should go to the log transformer. – konstantin_doncov Jun 07 '20 at 19:53
  • The list of columns on which to apply each transformer can be given in different ways; if your skewness test can be encapsulated in a function, that can be used (see the docs, _callable_ for `columns`). – Ben Reiniger Jun 07 '20 at 23:47
  • Re: update2: That's not what FunctionTransformer does. – Ben Reiniger Jun 10 '20 at 00:29

2 Answers


The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)

Second approach: this one actually won't work, because ColumnTransformer will rearrange the columns and forget the names*, so that the later transformers will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific use case now. (*ColumnTransformer does already have a get_feature_names, but the actual data passed through the pipeline doesn't carry that information.)

imp_tfm = ColumnTransformer(
    transformers=[('num', impute.SimpleImputer() , ['a', 'b'])],
    remainder='passthrough'
    )
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough'
    )
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])]
    )
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
)

Third approach: there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature. This would work mostly like the first approach, and might save some coding for larger pipelines, but seems a little hacky. For example, here you can:

from sklearn.base import clone

pipe_c = clone(pipe_b)[1:]                  # log + scale, for column 'c'
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')  # imp + (skipped log) + scale, for column 'a'

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a copy, so pipe_c doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't provide a discontinuous slice, so pipe_a = pipe_b[0,2] doesn't work, but you can set individual steps to "passthrough", as above, to disable them.)

Ben Reiniger
  • 2
    Thanks for your answer. I already thought about something like that. Unfortunately first method not very scalable and autonomous with respect to features - it will be very difficult to use all this manually when there are many features. Also, can you describe in more detail your last sentence "...there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature..."? – konstantin_doncov Jun 06 '20 at 22:25
  • Of course, in the first approach you don't need a separate pipe for every feature; just for each unique list of transformations you want to apply. (E.g., send feature `c` also to `pipe_b`.) – Ben Reiniger Jun 07 '20 at 01:29
  • Please, check my updated question, I tried to explain what I meant. – konstantin_doncov Jun 07 '20 at 13:54
  • 1
    I found out how we can do this, your second option is the starting point. [More info.](https://github.com/scikit-learn/scikit-learn/issues/17514) – konstantin_doncov Jun 08 '20 at 20:38
  • @BenReiniger Following your first approach, can you please let me know what should I do if I had to apply the pipe_a column a,b and if I had to apply the pipe_b to b,c (some the pipes have some columns in common). Thank you – tjt Aug 25 '22 at 00:50
  • 1
    @tjt The point of this "one pipeline per combination of preprocessing steps" approach is that there are no shared columns between pipes. The imputer works on columns `a` and `b`, but because `b` gets log-transformed while `a` doesn't, they get sent to different pipelines. – Ben Reiniger Aug 25 '22 at 12:30
  • @BenReiniger Thanks, that makes sense. I was just asking what the approach should be if I had a use case with different transformations where some of them share columns. – tjt Aug 26 '22 at 02:59
  • The second approach should now be workable, using the new pandas-out functionality in sklearn; you can specify the column names instead of their indices. – Ben Reiniger Feb 09 '23 at 19:56

We can use a little columns_name_to_index hack to convert column names to indices, and then we can pass the dataframe to the pipeline like this:

def columns_name_to_index(arr_of_names, df):
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('imp42', impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    ('imp24', impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])

ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('scl', p.StandardScaler(), columns_name_to_index(cols_3, df)),
])

pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])

pipe.fit_transform(df).T
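If the column lists are computed from the data (e.g. a skewness test), an alternative to converting names to indices is to pass a callable as the columns argument: ColumnTransformer calls it on X at fit time, so the selection keeps working if the columns are reordered. A sketch with a hypothetical NaN-based selector:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})

def cols_with_nan(X):
    # hypothetical dynamic selector: evaluated on the data passed to fit
    return [c for c in X.columns if X[c].isna().any()]

ct = ColumnTransformer(remainder='passthrough', transformers=[
    ('imp', SimpleImputer(strategy='constant', fill_value=42), cols_with_nan),
])
out = ct.fit_transform(df)  # imputes whichever columns contain NaN at fit time
```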
konstantin_doncov
  • 3
    I'm glad this works for your needs; I guess it's the "you may be able to tweak it for your specific usecase" part of my description of my second option. But just a word of caution, to reiterate: "`ColumnTransformer` will rearrange the columns..." so that if your `ct2` didn't operate on the entire frame, and `ct1` had its transformers in a different order (or included a transformer that added/dropped a column), this would fail because your `columns_name_to_index` refers to the index in the original `df`. – Ben Reiniger Jun 08 '20 at 21:00
  • @BenReiniger then I think e.g. `OneHotEncoder` can produce a problem. Can you suggest a fix? – konstantin_doncov Jun 08 '20 at 21:06
  • In all my projects, I only have a handful of different pipelines I would apply, so my first option is the most applicable. If that's not the case for your data, maybe if you provide a more representative example we can hack something together. – Ben Reiniger Jun 08 '20 at 21:25
  • @BenReiniger I tried to use `FunctionTransformer` but get an error, please check the **UPD2** in the question. – konstantin_doncov Jun 09 '20 at 15:22