I would like to know how should I managed the following situation:
I have a dataset which I need to analyze. It is labeled data and I need to perform over it a classification task. Some features are numerical and others are categorical (non-ordinal), and my problem is I don't know how can I handle the categorical ones.
Before to classify, I usually apply a MinMaxScaler. But I can't do this in this particular dataset because of the categorical features.
I've read about the one-hot encoding
, but I don't understand how can apply it to my case because my dataset have some numerical features and 10 categorical features and the one-hot encoding generates more columns in the dataframe, and I don't know how do I need to prepare the resultant dataframe to sent it to the decision tree classifier.
In order to clarify the situation the code I'm using so far is the following:
y = df.class
X = df.drop(['class'] , axis=1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# call DecisionTree classifier
When the df
has categorical features I get the following error: TypeError: data type not understood
. So, if I apply the one-hot encoding
I get a dataframe with many columns and I don't know if the decisionTree classifier is going to understand the real situation of my data. I mean how can I express to the classifier that a group of columns belongs to a specific feature? Am I understanding the whole situation wrong? Sorry if this a confused question but I am newbie and I fell pretty confused about how to handle this.