
I have a method that one-hot encodes a list of columns from a pandas DataFrame and drops each original column. While this runs very quickly for some fields, for others the process takes an incredibly long time. For example, I am currently working on a highly categorical dataset (i.e., more than 80 categorical features) where a single feature drives me into over 100,000 dimensions.

I am looking for a more optimized and memory-efficient routine to one-hot encode high-dimensional data.

Below is my current approach:

import gc
import pandas as pd

sizes = {}  # track how many columns each encoded feature adds

# For each column to encode
for col in encode_cols:
    col_name = str(col)
    if col in ('PRICE_AMOUNT', 'CHECKSUM_VALUE'):
        continue

    old_cols = df.shape[1]
    print("Now testing: {}".format(col_name))
    # Use pandas get_dummies to one-hot encode the column
    temp = pd.get_dummies(df[col], prefix=col_name, prefix_sep='_')
    # Drop the original column and append the dummy columns
    df.drop(col, axis=1, inplace=True)
    df = pd.concat([df, temp], axis=1, join='inner')
    print("New Size: {}".format(df.shape))
    sizes[col] = df.shape[1] - old_cols

    del temp
    gc.collect()

In my case, encode_cols contains only about 75 elements, but the feature vector grows from 100 dimensions to roughly 107,000 once encoding completes. How can I optimize this routine?

artemis
  • What is the purpose of the encoding? What modelling approach are you planning to use? Have you tried [sklearn's OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)? – Andre S. Oct 28 '20 at 12:48
  • Naturally, the one-hot encoded data feeds into a model, @AndreS.; the exact technique is something my team and I will figure out once we have results. I am aware of the curse of dimensionality and we have techniques around that, but given our business problem, this is how we need to represent the data. – artemis Oct 28 '20 at 12:56
  • Okay, depending on the model you use, you can also use label encoding to avoid all the extra columns. The drawback is that you cannot use label encoding for linear models, as the categories get ranked. On the other hand, tree-based models work fine with label-encoded features (a minimal sketch follows below). – Andre S. Oct 28 '20 at 13:02
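
A minimal sketch of the label/ordinal-encoding alternative from that comment, assuming a tree-based model downstream; the column names here are placeholders:

from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['feature1', 'feature2']              # placeholder column names
enc = OrdinalEncoder()
df[cat_cols] = enc.fit_transform(df[cat_cols])   # each category becomes an integer code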

2 Answers


Without access to your data I cannot supply fully workable code, but here are my thoughts. When dealing with very sparse, binary features, you can use sparse matrices, which are a clever (and very memory-efficient) way of storing data.
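
As a rough illustration of how much sparse storage saves (a sketch with made-up sizes, not the asker's data):

import numpy as np
from scipy import sparse

# 10,000 x 10,000 one-hot-style matrix with a single 1 per row
dense = np.zeros((10_000, 10_000), dtype=np.uint8)
dense[np.arange(10_000), np.random.randint(0, 10_000, size=10_000)] = 1

sp = sparse.csr_matrix(dense)
print(dense.nbytes)                                            # ~100 MB dense
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)   # on the order of 100 KB sparse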

You can then use OneHotEncoder from sklearn, as explained here, to generate one-hot-encoded sparse categorical features. In your case, the encoder learns each categorical feature's levels during fit (or you can pass them explicitly via the categories parameter) and produces the sparse vectors for you.

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder returns a scipy sparse matrix by default;
# level_ids_data holds the categorical columns to encode
vec = OneHotEncoder(handle_unknown='ignore')
X = vec.fit_transform(level_ids_data)

X.toarray()  # Only to get a "normal" dense ndarray back; avoid this for very wide data

Then you can use hstack, as described here, to merge your dense features (PRICE_AMOUNT, CHECKSUM_VALUE) with your sparse ones.

from scipy.sparse import hstack

X = hstack((sparse_ohe_categorical_features, dense_features), format='csr')

X is now a sparse matrix holding all your data. Choose the format (csr here) according to your use case; for example, LogisticRegression from sklearn works with CSR sparse input for its fit method (other sparse formats get converted to CSR internally).
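
As a hedged illustration of that last point (LogisticRegression and the target vector y are assumptions here, not part of the question), scikit-learn estimators accept the sparse matrix directly:

from sklearn.linear_model import LogisticRegression

# X is the CSR matrix built above; y is a hypothetical target vector
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                   # fits on sparse input without densifying
probs = clf.predict_proba(X)    # prediction also accepts sparse input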

Marcus

I would suggest using the OneHotEncoder tool from scikit-learn.

from sklearn.preprocessing import OneHotEncoder

features_to_one_hot = ['feature1', 'feature2']
to_one_hot_df = df.loc[:, features_to_one_hot]

cat_encoder = OneHotEncoder()
new_one_hot = cat_encoder.fit_transform(to_one_hot_df)  # returns a sparse matrix

If you want the encoder to perform more specific actions, scikit-learn uses duck typing, which means you can implement your own transformer class. Here is how you could make one that one-hot encodes the columns and drops the originals:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class OneHotAndDrop(BaseEstimator, TransformerMixin):
    def __init__(self, operate=True):
        self.operate = operate

    def fit(self, X, y=None):
        if self.operate:
            # Learn the categories of every column during fit
            self.encoder_ = OneHotEncoder(handle_unknown='ignore')
            self.encoder_.fit(X)
        return self

    def transform(self, X):
        if self.operate:
            # The encoded output contains only the new dummy columns,
            # so the original columns are dropped automatically
            X = self.encoder_.transform(X)
        return X


one_hot_custom = OneHotAndDrop()
new_one_hot = one_hot_custom.fit_transform(to_one_hot_df)

You can then use this class as in the first example. Because it works with sparse matrices, it will most likely be far more efficient than your original function, and the fitted encoder keeps track of the generated category names for you.
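
If you need the generated column names back (for example to rebuild a labelled DataFrame), here is a sketch assuming scikit-learn >= 1.0, pandas >= 1.0, and the encoder_ attribute stored by the class above:

import pandas as pd

# Recover the dummy column names from the fitted encoder
feature_names = one_hot_custom.encoder_.get_feature_names_out(to_one_hot_df.columns)

# Keep the data sparse while still getting labelled columns
encoded_df = pd.DataFrame.sparse.from_spmatrix(new_one_hot, columns=feature_names)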

Also, one-hot encoding these features might not be the best idea, depending on why you need to encode them. If it is for machine learning, this will create far too many features and will likely lead to overfitting. I would recommend grouping rare categories first, and only then encoding, to reduce the number of new features (a minimal sketch follows below).
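
A minimal sketch of that grouping idea, with a made-up frequency threshold:

# Collapse rare levels into a single 'OTHER' bucket before encoding;
# the 100-row cutoff is only an example value
threshold = 100
for col in features_to_one_hot:
    counts = df[col].value_counts()
    rare_levels = counts[counts < threshold].index
    df[col] = df[col].where(~df[col].isin(rare_levels), 'OTHER')

Recent scikit-learn releases (>= 1.1) can also do this grouping for you via OneHotEncoder's min_frequency parameter.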

HotMailRob
  • Thanks for your answer. The encoding is related to an ML project, but the business requirement is what it is, even at the cost of performance. That, however, is unrelated to the question. Additionally, this disrupts the naming scheme provided by the `prefix` value, and I am also getting errors when using it with some of our models. – artemis Oct 28 '20 at 15:06