Shall I treat Industry Classification codes as double data type in K-means clustering?

Question

Since K-means cannot handle categorical variables directly, I want to know if it is correct to convert International Standard Industrial Classification of All Economic Activities or ISIC into double data types to cluster it using K-means along with other financial and transactional data? Or shall I try other techniques such as one hot encoding?

The biggest assumption is that ISIC codes are categorical not numeric variables since code “2930” refers to “Manufacture of parts and accessories for motor vehicles” and not money, kilos, feet, etc., but there is a sort of pattern in such codes since they are not assigned randomly and have a hierarchy for instance 2930 belongs to Section C “Manufacturing” and Division 29 “Manufacture of motor vehicles, trailers and semi-trailers”.

OmG · Answer 1 · 2019-04-20T01:05:06.297

As you want to use standard K-Means, you need your data has a geometric meaning. Hence, if your mapping of the codes into the geometric space is linear, you will not get any proper clustering result. As the distance of the code does not project in their value. For example code 2930 is as close to code 2931 as code 2929. Therefore, you need a nonlinear mapping for the categorical space to the geometric space to using the standard k-mean clustering.

One solution is using from machine learning techniques similar to word-to-vec (for vectorizing words) if you have enough data for co-occurrences of these codes.

score 0 · Answer 2 · answered Apr 20 '19 at 02:01

Clustering is all about distance measurement.

Discretizing numeric variable to categorical is a partial solution. As earlier highlighted, the underlying question is how to measure the distance for a discretized variable with other discretized variable and numeric variable?

In literature, there are several unsupervised algorithms for treating mixed data. Take a look at the k-prototypes algorithm and the Gower distance.

The k-prototypes in R is given in clustMixType package. The Gower distance in R is given in the function daisy in the cluster package. If using Python, you can look at this post

Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. Paper presented at the Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining,(PAKDD).
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 857-871.

score 0 · Answer 3 · answered Apr 20 '19 at 10:07

K-means is designed to minimize the sum of squares.

Does minimizing the sum of squares make sense for your problem? Probably not!

While 29, 2903 and 2930 are supposedly all related 2899 likely is not very much related to 2900. Hence, a least squares approach will produce undesired results.

The method is really designed for continuous variables of the same type and scale. One-hot encoded variables cause more problems than they solve - these are a naive hack to make the function "run", but the results are statistically questionable.

Try to figure out what he right thing to do is. It's probably not least squares here.

Shall I treat Industry Classification codes as double data type in K-means clustering?

3 Answers3