I have created an sklearn pipeline that preprocesses the data and then fits a model on the processed data. The preprocessing step is supposed to take care of missing values, but fitting still throws the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Below is my code:

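import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Note: outlier_removal (used below) is defined elsewhere and is not shown here.
def test_sklearn_pipeline(random_state_num):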
    numeric_features = ["x","y"]
    categorical_features = ["wconfid","pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[('ohe',OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ('outlier_remover',outlier_removal,numeric_features),
            ("num", scale_transformer, numeric_features)
        ],remainder='passthrough'
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv('accelerometer_modified.csv')
    df = df.drop(columns=['random'])
    X,y = df.drop(columns=['z']),df.loc[:,'z']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Be aware that `ColumnTransformer` applies its transformers in parallel, each on the original input columns. My guess is that the problem comes from `missing_features` and `numeric_features` not being disjoint sets: on `'x'` you are applying the first, third and fourth transformers at the same time, so the scaler never sees the imputed values. I'd suggest looking at https://stackoverflow.com/questions/70745198/how-to-execute-both-parallel-and-serial-transformations-with-sklearn-pipeline – amiola Jan 21 '22 at 10:29
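To make the point about overlapping column lists concrete, here is a minimal sketch on a hypothetical three-row DataFrame (the data is made up for illustration and is not the asker's `accelerometer_modified.csv`). It assumes only an imputer and a scaler that both list `x`, and shows that the output contains both an imputed copy and a scaled-but-unimputed copy of `x`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "x" has one missing value, "y" has none.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, 30.0]})

# "x" appears in both column lists, as in the question.
ct = ColumnTransformer(
    transformers=[
        ("miss", SimpleImputer(strategy="mean"), ["x"]),
        ("num", StandardScaler(), ["x", "y"]),
    ]
)
out = ct.fit_transform(df)

# Output columns: [imputed x, scaled raw x, scaled y]
print(out.shape)
print(np.isnan(out[:, 0]).any())  # imputed copy of x
print(np.isnan(out[:, 1]).any())  # scaled copy of the raw x
```

Each transformer receives the original `x`, so the scaler's output still carries the NaN, and that is the value a downstream estimator would reject.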

1 Answer

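`numeric_features` and `missing_features` have the column `x` in common. `ColumnTransformer` applies each transformer to the original input DataFrame, not to the output of the previous transformer. This means the standard scaler runs on the raw `x` column rather than the imputed one, and the NaN values survive into the concatenated output. You need the two steps to run sequentially: put them in a small `Pipeline`, as you have already done for the individual transformers, that imputes first and scales second, and pass that pipeline to the `ColumnTransformer` for the numeric columns.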
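A minimal sketch of that arrangement, under a few assumptions: the toy DataFrame is hypothetical and only borrows the question's column names, the `outlier_removal` step is left out because its definition is not shown, and the mean imputer is applied to both numeric columns (it has no effect on `y` if `y` has no missing values):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["x", "y"]
categorical_features = ["wconfid", "pctid"]

# Impute first, then scale: the steps of a Pipeline run sequentially.
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Each column now belongs to exactly one transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="passthrough",
)
model = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# Hypothetical toy data with missing values in "x".
toy = pd.DataFrame(
    {
        "wconfid": [1, 1, 2, 2, 3, 3],
        "pctid": [20, 40, 20, 40, 20, 40],
        "x": [1.0, np.nan, 0.5, 0.7, np.nan, 0.9],
        "y": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "z": [0.2, 0.1, 0.4, 0.3, 0.6, 0.5],
    }
)
X, y = toy.drop(columns=["z"]), toy["z"]

model.fit(X, y)  # no NaN reaches the regressor
preds = model.predict(X)
```

Because the imputer and scaler live in one pipeline, the scaler only ever sees the imputed values, and `x` is no longer listed under two different transformers.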

marc_s
Simon Hawe