How to be sure that an sklearn pipeline applies the fit_transform method when using feature selection and an ML model in the pipeline?

Question

Assume that I want to apply several feature selection methods using an sklearn pipeline. An example is provided below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

fs_pipeline = Pipeline([('vt', VarianceThreshold(0.01)),
                        ('kbest', SelectKBest(chi2, k=5)),
                        ])

X_new = fs_pipeline.fit_transform(X_train, y_train)

I get the selected features using the fit_transform method. If I use the fit method on the pipeline instead, I get the fitted pipeline object back.

Now, assume that I want to add an ML model to the pipeline, as below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])


model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

If I use the fit_transform method in the above code (model.fit_transform(X_train, y_train)), I get the error:

AttributeError: 'GradientBoostingClassifier' object has no attribute 'transform'

So, I should use model.fit(X_train, y_train). But how can I be sure that the pipeline applied the fit_transform method for the feature selection steps?

`fit_transform` is inappropriate here, hence the expected error. It is not clear how (or why) you want to "be sure". The first step would be to read (and trust!) the relevant documentation; if not convinced, you should run an experiment, comparing the results of the pipeline with the ones from the individual steps applied "manually", but this is not exactly straightforward - you should opt for transformations and models that do not depend on the random seed, or control the seed *very* carefully in all steps in order to be sure that you compare apples to apples (and not to oranges). — desertnaut, Jul 25 '22 at 09:12

amiola · Accepted Answer · 2022-07-25T10:12:45.047

A pipeline is meant for sequential data transformation (for which it needs multiple calls to .fit_transform()). You can be sure that .fit_transform() is called on the intermediate steps (basically on all steps but the last one) of a pipeline as that's how it works by design.

Namely, when calling .fit() or .fit_transform() on a Pipeline instance, .fit_transform() is called sequentially on every step except the last one, and the output of each call is passed as input to the next step. On the very last step, either .fit() or .fit_transform() is called, depending on the method called on the pipeline itself; indeed, the last step is more commonly a predictor than a transformer (as is the case with your GradientBoostingClassifier).
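If you would rather observe this than take it on trust, one option is to wrap each step in a thin subclass that records the call before deferring to the original implementation. This is only a sketch: the `Spy...` classes and the `calls` list are hypothetical helpers introduced here for illustration, not part of scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

calls = []  # records, in order, which method was invoked on which step


# Hypothetical helper subclasses: each one logs the call and then defers
# to the unchanged scikit-learn implementation.
class SpyVarianceThreshold(VarianceThreshold):
    def fit_transform(self, X, y=None, **fit_params):
        calls.append("vt.fit_transform")
        return super().fit_transform(X, y, **fit_params)


class SpySelectKBest(SelectKBest):
    def fit_transform(self, X, y=None, **fit_params):
        calls.append("kbest.fit_transform")
        return super().fit_transform(X, y, **fit_params)


class SpyGradientBoostingClassifier(GradientBoostingClassifier):
    def fit(self, X, y, **fit_params):
        calls.append("gbc.fit")
        return super().fit(X, y, **fit_params)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Pipeline([
    ("vt", SpyVarianceThreshold(0.01)),
    ("kbest", SpySelectKBest(chi2, k=5)),
    ("gbc", SpyGradientBoostingClassifier(random_state=0)),
])

# Only .fit() is called on the pipeline itself.
model.fit(X_train, y_train)

print(calls)  # ['vt.fit_transform', 'kbest.fit_transform', 'gbc.fit']
```

The recorded order shows the two selectors going through .fit_transform() and the classifier going through .fit(), even though only .fit() was called on the pipeline.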

Whenever the last step is a predictor rather than a transformer, as in your case, you won't be able to call .fit_transform() on the pipeline instance: the pipeline exposes the same methods as its final step, and predictors such as GradientBoostingClassifier expose neither .transform() nor .fit_transform().
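A quick way to see which methods a given pipeline exposes is to probe it with `hasattr`. The sketch below rebuilds the two pipelines from the question (the variable names `fs_pipeline` and `model` are taken from there) and also catches the error raised by .fit_transform(); the exact error message may differ between scikit-learn versions, so it is printed rather than assumed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Pipeline ending in a transformer (feature selection only).
fs_pipeline = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
])

# Pipeline ending in a predictor.
model = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

# The pipeline mirrors the capabilities of its last step.
print(hasattr(fs_pipeline, "transform"), hasattr(fs_pipeline, "predict"))  # True False
print(hasattr(model, "transform"), hasattr(model, "predict"))              # False True

# Calling fit_transform on the predictor pipeline fails with AttributeError.
try:
    model.fit_transform(X, y)
    raised = False
except AttributeError as exc:
    raised = True
    print(f"AttributeError: {exc}")
```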

Summing up,

case with a predictor in the last step (you can call .fit(), but not .fit_transform(), on the pipeline); model.fit(X_train, y_train) means the following, with y_train passed along to every step:

 final_estimator.fit(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)

which in your case becomes

 gbc.fit(kbest.fit_transform(vt.fit_transform(X_train, y_train), y_train), y_train)

case with a transformer in the last step (you can call either .fit() or .fit_transform() on the pipeline, but let's suppose you're calling .fit_transform()); model.fit_transform(X_train, y_train) means the following:
```
 final_transformer.fit_transform(transformer_n.fit_transform(...transformer_0.fit_transform(X_train, y_train)..., y_train), y_train)
```
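The nested calls above can also be checked empirically by running the steps by hand and comparing them with the fitted pipeline. This is a sketch under the assumption that every step is deterministic for a fixed `random_state`, which is why the classifier's seed is set in both places; the names `X_vt`, `X_kb` and the `same_*` flags are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 1) Let the pipeline do everything.
model = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)

# 2) Apply the same steps manually, on fresh estimator instances.
vt = VarianceThreshold(0.01)
kbest = SelectKBest(chi2, k=5)
gbc = GradientBoostingClassifier(random_state=0)

X_vt = vt.fit_transform(X_train, y_train)
X_kb = kbest.fit_transform(X_vt, y_train)
gbc.fit(X_kb, y_train)

# 3) Compare the selected-feature masks of each selector.
same_vt_mask = np.array_equal(
    model.named_steps["vt"].get_support(), vt.get_support()
)
same_kbest_mask = np.array_equal(
    model.named_steps["kbest"].get_support(), kbest.get_support()
)

# 4) Compare transformed test data and predictions.
#    model[:-1] is the sub-pipeline containing only the fitted selectors.
X_test_manual = kbest.transform(vt.transform(X_test))
X_test_pipeline = model[:-1].transform(X_test)
same_features = np.array_equal(X_test_pipeline, X_test_manual)
same_predictions = np.array_equal(
    model.predict(X_test), gbc.predict(X_test_manual)
)

print(same_vt_mask, same_kbest_mask, same_features, same_predictions)
```

If the pipeline did not fit and apply the selectors before training the classifier, the masks or the transformed features would be expected to differ.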

Finally, here's the reference in the source code: https://github.com/scikit-learn/scikit-learn/blob/baf0ea25d6dd034403370fea552b21a6776bef18/sklearn/pipeline.py#L351
