
I have a pandas dataframe where one of the columns is ratedby, whose values are male or female. My goal is to create 2 columns with OneHotEncoder (ratedbymale, ratedbyfemale) holding 1 or 0 as appropriate.

I am using Azure ML Designer with the Execute Python Script component, which takes a dataframe as a parameter and can output 2 dataframes.

The code I entered is:

# The script MUST contain a function named azureml_main
# which is the entry point for this module.

# imports up here can be used to
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The entry point function MUST have two input arguments.
# If the input port is not connected, the corresponding
# dataframe argument will be None.
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):

    # Execution logic goes here
    print(f'Input pandas.DataFrame #1: {dataframe1}')

    # If a zip file is connected to the third input port,
    # it is unzipped under "./Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule

    # Return value must be of a sequence of pandas.DataFrame
    # E.g.
    #   -  Single return value: return dataframe1,
    #   -  Two return values: return dataframe1, dataframe2
    enc = OneHotEncoder(handle_unknown='ignore')  
    onehotencoder_df = pd.DataFrame(enc.fit_transform(dataframe1[['ratedby']]))
    dataframe1.join(onehotencoder_df)
    return dataframe1, onehotencoder_df

However, I am getting this error:

AmlExceptionMessage:User program failed with InvalidDatasetError: Result dataset2 contains invalid data, ('Could not convert   (0, 1)\t1.0 with type csr_matrix: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object').

ModuleExceptionMessage:InvalidDataset: Result dataset2 contains invalid data, ('Could not convert   (0, 1)\t1.0 with type csr_matrix: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object').
Luis Valencia
    You could try explicitly converting the output of `.fit_transform()` on the `OneHotEncoder()` instance to a NumPy array: `enc.fit_transform(dataframe1[['ratedby']]).toarray()`. Alternatively, pass `sparse=False` to the `OneHotEncoder()` constructor (the parameter was renamed `sparse_output` in scikit-learn 1.2). For the first option, you might find some hints in https://stackoverflow.com/questions/70933014/how-to-use-columntransformer-to-return-a-dataframe/70934371#70934371; for the second option, see e.g. https://stackoverflow.com/questions/71308070/onehotencoder-returns-unexpected-result – amiola Mar 01 '22 at 13:40

1 Answer

My guess is that there is some data in df['ratedby'] that cannot be one-hot encoded.

Second, I suggest you convert the encoded data to a NumPy array first and then join, as suggested in the comments.

For the join, you can get the names of the new features with enc_ft = enc.get_feature_names_out(), build a dataframe from the encoded array with those names, and join it to the original:

dataframe1.join(pd.DataFrame(onehot_tr, columns=enc_ft))

Dharman
otaku