I would like to know how I should handle the following situation:

I have a labeled dataset on which I need to perform a classification task. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.

Before classifying, I usually apply a MinMaxScaler, but I can't do that with this dataset because of the categorical features.

I've read about one-hot encoding, but I don't understand how to apply it to my case: my dataset has some numerical features and 10 categorical features, and one-hot encoding generates more columns in the dataframe. I don't know how I need to prepare the resulting dataframe before sending it to the decision tree classifier.

To clarify the situation, the code I'm using so far is the following:

# `class` is a reserved word in Python, so df.class is a syntax error; use bracket access
y = df['class']
X = df.drop(['class'], axis=1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# call DecisionTree classifier

When df has categorical features I get the following error: TypeError: data type not understood. If I apply one-hot encoding, I get a dataframe with many columns, and I don't know whether the DecisionTree classifier is going to understand the real structure of my data. I mean, how can I tell the classifier that a group of columns belongs to a specific feature? Am I understanding the whole situation wrong? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.

mjbsgll
  • You don't really need to scale your data if using DecisionTree classifiers – Quang Hoang Oct 31 '19 at 13:47
  • Before trying one-hot encoding (not always great for Tree classifiers) try looking at some more basic encoding schemes such as ordinal encoding. You can try the sklearn implementation of such an [encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html). Have a go and update your question and then we can help you debug it. – FChm Oct 31 '19 at 13:48

1 Answer

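I don't have enough reputation to comment, but note that decision tree classifiers don't require their input to be scaled. So if you're using a decision tree classifier, you can skip the scaling step. The categorical columns still need to be encoded as numbers, though, because scikit-learn's decision trees don't accept string values.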

If you're using a method that requires feature scaling, then you should probably do one-hot-encoding and feature scaling separately - see this answer: https://stackoverflow.com/a/43798994/9988333
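As a minimal sketch of doing the two steps separately, scikit-learn's ColumnTransformer can apply MinMaxScaler to the numerical columns and OneHotEncoder to the categorical ones. The toy dataframe and its column names (`age`, `income`, `color`) are made up for illustration and are not from the question; only the `class` target name is taken from it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30000.0, 42000.0, 58000.0, 61000.0],
    "color": ["red", "blue", "red", "green"],
    "class": [0, 1, 0, 1],
})

y = df["class"]
X = df.drop(columns=["class"])

# Split the feature names by dtype so each group gets its own transformer.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Scaled numeric columns come first, followed by one column per category.
X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)
```

The classifier never needs to be told which one-hot columns came from the same original feature; it simply treats each of them as a separate 0/1 input.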

Alternatively, you could use a method that handles categorical variables 'out of the box', such as LightGBM (LGBM).

ignoring_gravity
  • Thanks for your answer. But what about sending a dataset with both numerical and categorical data to the decision tree classifier? Does the classifier work well with this kind of dataset? – mjbsgll Oct 31 '19 at 14:42
  • @mjbsgll You could one-hot encode your categorical data before sending it. To see how well the classifier performs, you should be using a validation set (or, ideally, cross-validation) – ignoring_gravity Oct 31 '19 at 15:03
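
The one-hot encoding plus cross-validation suggested in the last comment could look roughly like the sketch below. It assumes a made-up toy dataframe (the `size` and `color` columns and all values are hypothetical) and wraps the encoder and the tree in a Pipeline so the encoding is refit on each training fold.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numerical and one categorical feature.
df = pd.DataFrame({
    "size": [1.0, 2.5, 3.1, 0.7, 2.2, 3.8, 1.4, 2.9, 3.3, 0.9, 2.0, 3.6],
    "color": ["red", "blue", "green"] * 4,
    "class": [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})
X = df.drop(columns=["class"])
y = df["class"]

model = Pipeline([
    # One-hot encode the categorical column; numerical columns pass through
    # unscaled, since decision trees do not need scaling.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough",
    )),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# 3-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(model, X, y, cv=3)
print(scores.mean())
```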