Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of

Question

I have created an sklearn pipeline that preprocesses the data and then fits a model on the processed data. The preprocessing step handles missing values, yet fitting still throws the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here is my code:

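import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# outlier_removal (used below) is not defined in this snippet.
def test_sklearn_pipeline(random_state_num):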
    numeric_features = ["x","y"]
    categorical_features = ["wconfid","pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[('ohe',OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ('outlier_remover',outlier_removal,numeric_features),
            ("num", scale_transformer, numeric_features)
        ],remainder='passthrough'
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv('accelerometer_modified.csv')
    df = df.drop(columns=['random'])
    X,y = df.drop(columns=['z']),df.loc[:,'z']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))

Be aware that `ColumnTransformer` applies its transformers in parallel, each on the original input columns. My guess is that the problem comes from `missing_features` and `numeric_features` not being disjoint sets: on `'x'` you are applying the first, third and fourth transformations independently, so the scaler never sees the imputed values. This may help: https://stackoverflow.com/questions/70745198/how-to-execute-both-parallel-and-serial-transformations-with-sklearn-pipeline - amiola, Jan 21 '22 at 10:29
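To make the point in the comment above concrete, here is a minimal sketch on a hypothetical three-row dataframe (the toy values are assumptions, not the asker's data). It shows that when the imputer and the scaler are listed as separate `ColumnTransformer` entries over the same column, the scaler's output still contains NaN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column "x" has one missing value.
X = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, 30.0]})

# Same overlap as in the question: "x" is listed under both transformers.
overlapping = ColumnTransformer(
    transformers=[
        ("miss", SimpleImputer(strategy="mean"), ["x"]),
        ("num", StandardScaler(), ["x", "y"]),
    ]
)
out = overlapping.fit_transform(X)

# Output columns are: imputed x, scaled x, scaled y.
# The scaler received the raw "x", so its NaN is passed through.
print(out.shape)
print(np.isnan(out[:, 0]).any())  # imputed copy of x
print(np.isnan(out[:, 1]).any())  # scaled copy of raw x
```

The imputed copy of `x` is clean, but the scaled copy is computed from the raw column, and that is the one that reaches the estimator with a NaN in it.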

score 0 · Answer 1 · edited Apr 12 '22 at 19:39

The numeric features and the missing features have the column x in common. ColumnTransformer applies each transformer to the original input dataframe, not to the output of another transformer. This means the standard scaler runs on the raw column rather than the imputed one, so the NaN values survive into the output. You probably need transformations that run sequentially: put both steps into a single small Pipeline, like the ones you have already built, that first imputes and then scales.
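A minimal sketch of that suggestion, assuming the column names from the question and a small made-up dataframe in place of accelerometer_modified.csv. The `outlier_removal` step is left out because its definition is not shown in the question, and the names `numeric_transformer` and `regressor` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the real data; "x" contains missing values.
df = pd.DataFrame({
    "wconfid": [1, 2, 3, 1, 2, 3],
    "pctid": [25, 50, 75, 25, 50, 75],
    "x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "y": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    "z": [0.2, 0.4, 0.1, 0.8, 0.6, 0.3],
})
X, y = df.drop(columns=["z"]), df["z"]

# Impute first, then scale, inside one Pipeline so the steps run in sequence.
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Each column now appears under exactly one transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["x", "y"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["wconfid", "pctid"]),
    ],
    remainder="passthrough",
)

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)
clf.fit(X, y)
pred = clf.predict(X)
print(np.isfinite(pred).all())
```

Applying the mean imputer to `y` as well is harmless when `y` has no missing values; if only `x` should be imputed, a separate impute-then-scale pipeline for `x` and a scale-only entry for `y` would keep the column sets disjoint.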

edited Apr 12 '22 at 19:39

marc_s

answered Jan 21 '22 at 10:30

Simon Hawe
