I have a method that one-hot encodes a list of columns from a pandas
dataframe and drops the original columns. While this runs very quickly for some fields, for others the process takes an incredibly long time. For example, I am currently working on a highly categorical dataset (more than 80 categorical features) where a single feature expands into over 100,000
dimensions.
I am looking for a more optimized and memory-efficient routine to one-hot encode high-dimensional data.
Below is my current approach:
import gc
import pandas as pd

sizes = {}  # number of columns each encoded feature adds

# For each column to encode
for col in encode_cols:
    col_name = str(col)
    if col in ('PRICE_AMOUNT', 'CHECKSUM_VALUE'):
        continue
    old_cols = df.shape[1]
    print("Now testing: {}".format(col_name))
    # Use pandas get_dummies to build the indicator columns
    temp = pd.get_dummies(df[col], prefix=col_name, prefix_sep='_')
    # Drop the original column and append the dummy columns
    df.drop(col, axis=1, inplace=True)
    df = pd.concat([df, temp], axis=1, join='inner')
    print("New Size: {}".format(df.shape))
    sizes[col] = df.shape[1] - old_cols
    # Free the temporary frame before the next iteration
    del temp
    gc.collect()
In my case, encode_cols is only about 75 elements, but the dataframe goes from 100 columns to roughly 107,000 once the encoding is complete. How can I optimize this routine?
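For reference, one direction I am wondering about (untested) is encoding all of the columns in a single pd.get_dummies call with sparse=True, so the indicator columns are backed by SparseArrays rather than dense arrays. A minimal sketch with made-up column names:

import pandas as pd

# Toy frame standing in for the real data; column names are made up
df = pd.DataFrame({
    'PRICE_AMOUNT': [10.0, 12.5, 9.9],
    'CATEGORY_A': ['x', 'y', 'x'],
    'CATEGORY_B': ['p', 'q', 'r'],
})
encode_cols = ['CATEGORY_A', 'CATEGORY_B']

# Encode every column in one call; get_dummies drops the originals itself,
# and sparse=True stores the indicator columns as pandas SparseArrays,
# which should keep memory down when cardinality is very high
df = pd.get_dummies(df, columns=encode_cols, prefix_sep='_', sparse=True)

print(df.shape)
print(df.dtypes)

The appeal is that a single call avoids the repeated drop/concat passes over the frame, but I do not know whether the sparse-backed columns behave well downstream, or whether this actually solves the memory problem at 100,000+ columns.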