I am using a ColumnTransformer to build a pipeline of two transformers: one that converts a time column into multiple features (day, month, week, etc.), followed by a OneHotEncoder transformer that encodes the categorical columns.

I am using the code below:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

time_col = ['visitStartTime']


class TimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; y is optional so the class works inside a Pipeline
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        for column in X.columns:
            # Interpret the column as Unix timestamps in seconds
            time = pd.to_datetime(X[column], unit='s', origin='unix')
            X['day_of_week'] = time.dt.strftime('%A')
            X['hour'] = time.dt.hour
            X['day'] = time.dt.day
            X['month'] = time.dt.month
            X['year'] = time.dt.year
        return X

# Transformer to handle visitStartTime
time_transformer = Pipeline(steps=[
    ('time', TimeTransformer())
])

# Transformer to encode categorical features
ohe_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder())
])

# Combined transformer
preprocessor = ColumnTransformer(transformers=[
    ('date', time_transformer, time_col),
    ('ohe', ohe_transformer, selector(dtype_include='object'))
], remainder='passthrough', sparse_threshold=0)

j = preprocessor.fit_transform(X_train)

When I check the output j, I see that the categorical columns created by time_transformer (such as day_of_week) have not been one-hot encoded.


How can I correct this?

gkl kmr
  • `ColumnTransformer` applies its transformers in parallel, not in series. See e.g. https://stackoverflow.com/q/65554163/10495893 – Ben Reiniger Apr 04 '23 at 01:10
  • Does this answer your question? [Apply multiple preprocessing steps to a column in sklearn pipeline](https://stackoverflow.com/questions/65554163/apply-multiple-preprocessing-steps-to-a-column-in-sklearn-pipeline) – Ben Reiniger Apr 04 '23 at 12:45
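
To illustrate the point made in the comments, here is a minimal sketch (with made-up data and hypothetical helper functions `add_100` and `double`) in which two transformers are applied to the same column. If ColumnTransformer worked in series, the second output column would be computed from the first transformer's result; instead both are computed from the original column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def add_100(X):
    # Hypothetical first transformation
    return X + 100


def double(X):
    # Hypothetical second transformation
    return X * 2


df = pd.DataFrame({"a": [1, 2]})

ct = ColumnTransformer(transformers=[
    ("plus", FunctionTransformer(add_100), ["a"]),
    ("double", FunctionTransformer(double), ["a"]),
])

out = ct.fit_transform(df)
# Column 0 is a + 100 and column 1 is a * 2; neither is (a + 100) * 2,
# because each transformer receives the original input.
print(out)
```

The same thing happens in the question: the 'ohe' transformer receives the object columns of the original X_train, not the columns produced by the 'date' transformer.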

1 Answer


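OneHotEncoder has categories='auto' as its default setting, which only means that the category values of each column it receives are inferred from the training data. It does not detect which columns need to be encoded: it encodes every column passed to it. In your ColumnTransformer, the 'ohe' transformer is given the object columns of the original X_train, in parallel with time_transformer, so it never sees the columns that time_transformer creates.

There are two things you can do:

  1. Run the steps in series instead of in parallel: apply the time transformer first, then apply the OneHotEncoder to its output (for example with a Pipeline whose second step is a ColumnTransformer), so that day_of_week reaches the encoder.
  2. If you also want to fix the category values explicitly, pass one list of values per encoded column, e.g. OneHotEncoder(categories=[['Monday', 'Tuesday', ...]]). Note that categories takes category values, not column names, so it cannot be used to choose which columns are encoded.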
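
Since ColumnTransformer applies its transformers in parallel (as the comments on the question point out), one way to get the derived day_of_week column encoded is to run the two steps in series. The following is only a sketch under some assumptions: `TimeFeatures` is a hypothetical single-column variant of the question's TimeTransformer that also drops the raw timestamp, `X_demo` and its `channel` column are made-up toy data, and the selector uses dtype_exclude="number" instead of dtype_include="object" so that string columns are picked up regardless of how pandas stores them.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class TimeFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical variant of TimeTransformer for a single timestamp column."""

    def __init__(self, column="visitStartTime"):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Interpret the column as Unix timestamps in seconds
        time = pd.to_datetime(X[self.column], unit="s", origin="unix")
        X["day_of_week"] = time.dt.strftime("%A")
        X["hour"] = time.dt.hour
        X["day"] = time.dt.day
        X["month"] = time.dt.month
        X["year"] = time.dt.year
        # Drop the raw timestamp so only the derived features remain
        return X.drop(columns=[self.column])


# This ColumnTransformer is the second pipeline step, so its selector is
# evaluated on the output of TimeFeatures and therefore sees day_of_week.
encode = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(), selector(dtype_exclude="number"))],
    remainder="passthrough",
    sparse_threshold=0,
)

preprocessor = Pipeline(steps=[("time", TimeFeatures()), ("encode", encode)])

# Toy data, made up for illustration
X_demo = pd.DataFrame({
    "visitStartTime": [0, 86400],
    "channel": ["organic", "paid"],
})

out = preprocessor.fit_transform(X_demo)
print(out.shape)
```

With this layout the one-hot columns for channel and day_of_week come first, followed by the passed-through hour, day, month and year columns.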
Lukas Hestermeyer