Cluster Analysis in R with missing data

Question

So I spent a good amount of time trying to find the answer on how to do this. The only answer I have found so far is here: How to perform clustering without removing rows where NA is present in R

Unfortunately, this is not working for me.

So here is an example of my data (d in this example):

Q9Y6X2           NA -6.350055943 -5.78314068
Q9Y6X3           NA           NA -5.78314068
Q9Y6X6  0.831273549  4.875151493  0.78671493
Q9Y6Y8  4.831273549  0.457298979  5.59406985
Q9Y6Z4  4.831273549  4.875151493          NA

Here is what I tried:

> dist <- daisy(d,metric = "gower")
> hc <- hclust(dist)
Error in hclust(dist) : NA/NaN/Inf in foreign function call (arg 11)

From my understanding daisy should be able to handle NA values, but I am still receiving an error when trying to cluster my results.

Thanks.

I am using the following libraries: gplots,cluster. Daisy is an algorithm that computes a distance matrix, that allows for missing data. — akvallejos, Nov 12 '14 at 21:04

score 2 · Answer 1 · answered Mar 22 '18 at 13:14

Mixture models permit clustering of data sets with missing values, by assuming that the values are missing completely at random (MCAR). Moreover, information criteria (like BIC or ICL) make it possible to select the number of clusters. You can use the R package VarSelLCM to cluster these data (there is a Shiny application to interpret the results). A tutorial for this package is available here

score 1 · Accepted Answer · edited Aug 29 '18 at 23:38

If you look at the dist matrix, you will see that it contains an NA, because samples Q9Y6X3 and Q9Y6Z4 have no overlap: there is no column in which both have a value. hclust cannot handle that NA. You could force the NAs to 0 or some other value, but I am not sure whether that would introduce statistical bias.
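To see where the NA comes from, here is a minimal sketch that rebuilds the five example rows and inspects the daisy output. The column names a, b and c are made up, since the question does not show any. The last two lines show one possible workaround: replacing the non-comparable pair with the largest observed dissimilarity rather than 0. That choice is an assumption for illustration only, and the bias concern applies to it as well.

```r
library(cluster)

# The example rows from the question; column names a, b, c are hypothetical.
d <- data.frame(
  a = c(NA, NA, 0.831273549, 4.831273549, 4.831273549),
  b = c(-6.350055943, NA, 4.875151493, 0.457298979, 4.875151493),
  c = c(-5.78314068, -5.78314068, 0.78671493, 5.59406985, NA),
  row.names = c("Q9Y6X2", "Q9Y6X3", "Q9Y6X6", "Q9Y6Y8", "Q9Y6Z4")
)

# Number of variables observed in both rows, for every pair of rows.
obs <- !is.na(as.matrix(d))
shared <- tcrossprod(obs + 0)
shared["Q9Y6X3", "Q9Y6Z4"]   # 0: these two rows have nothing to compare

# daisy returns NA for exactly that pair.
dd <- daisy(d, metric = "gower")
m <- as.matrix(dd)
na_pairs <- which(is.na(m), arr.ind = TRUE)
na_pairs

# One possible workaround (an assumption, not a recommendation):
# treat non-comparable pairs as maximally dissimilar, then cluster.
m[is.na(m)] <- max(m, na.rm = TRUE)
hc <- hclust(as.dist(m))
```

Checking `shared` before calling daisy tells you in advance which pairs will produce an NA, so you can decide whether to drop one of the rows instead of patching the distance.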

answered Nov 12 '14 at 19:53

darwin

score 0 · Answer 3 · edited May 23 '17 at 11:59

The second answer to the following post, How to perform clustering without removing rows where NA is present in R, reported a bug in the "daisy" function. The function formerly contained this code:

if (any(ina <- is.na(type3))) 
stop(gettextf("invalid type %s for column numbers %s", 
    type2[ina], pColl(which(is.na))))

The intended error message was not printed, as which(is.na) was wrongly used instead of which(ina).
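The difference between the two calls can be reproduced outside of daisy with a small sketch. The vector type3 below is a hypothetical stand-in for the internal variable, not the real one from the package.

```r
# Hypothetical stand-in for daisy's internal type vector.
type3 <- c(1L, NA, 2L)
ina <- is.na(type3)

# Intended call: positions of the NA entries.
which(ina)

# Buggy call: is.na is a function, not a logical vector, so which()
# stops with its own error before the intended message can be built.
msg <- tryCatch(which(is.na), error = function(e) conditionMessage(e))
msg
```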

The author of this function, which is part of the "cluster" package, acknowledged the issue and fixed the code back in June 2015. See http://svn.r-project.org/R-packages/trunk/cluster/R/daisy.q

answered Jan 26 '16 at 19:02

Ana Maria Mendes-Pereira

Please try to include the relevant part of the link in your answer. It is good practice, since it avoids answers with dead links. – iled Jan 26 '16 at 19:30

score 0 · Answer 4 · answered Jun 19 '21 at 16:06

You should start with some descriptive statistics, such as the frequency of NAs per variable and histograms split by whether a specific variable is missing or not (with many variables containing missings this will hardly be feasible, though).
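A minimal sketch of these descriptive checks in base R. It uses the built-in airquality data set purely as a stand-in, because the example in the question is too small to plot.

```r
data(airquality)

# Frequency and share of NAs per variable.
na_count <- colSums(is.na(airquality))
na_share <- round(colMeans(is.na(airquality)), 3)
na_count
na_share

# Split histograms: distribution of Temp depending on whether Ozone is missing.
miss <- is.na(airquality$Ozone)
op <- par(mfrow = c(1, 2))
hist(airquality$Temp[!miss], main = "Ozone observed", xlab = "Temp")
hist(airquality$Temp[miss], main = "Ozone missing", xlab = "Temp")
par(op)
```

If the two histograms look clearly different, the missingness in Ozone probably depends on Temp.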

If you have very few missings (say, <1%) you might do a simple imputation by mean or median of the available values, or a random imputation (e.g. drawing randomly among available values). If you find variables with more missings but the histograms show no difference, then the data are probably missing completely at random, so a random imputation is fine as well.
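Both simple imputations can be sketched in a few lines of base R. The helper names impute_median and impute_random are made up for this example, and airquality again serves only as a stand-in data set.

```r
# Replace NAs by the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace NAs by values drawn at random (with replacement) from the observed ones.
impute_random <- function(x) {
  miss <- is.na(x)
  observed <- x[!miss]
  # Sample indices rather than values, so a single observed value works too.
  x[miss] <- observed[sample.int(length(observed), sum(miss), replace = TRUE)]
  x
}

data(airquality)
aq_med <- as.data.frame(lapply(airquality, impute_median))

set.seed(1)  # random imputation is non-deterministic, so fix the seed
aq_rnd <- as.data.frame(lapply(airquality, impute_random))
```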

In most cases missings depend on other variables in your data set, or even on unobserved information. In this case a multiple imputation is usually best. A very good book about this is https://stefvanbuuren.name/fimd/, written by the author of the mice package. There are other great imputation packages out there; for example, missRanger uses a fast Random Forest implementation to impute (i.e., estimate) missing values.
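A rough sketch of multiple imputation followed by clustering, assuming the mice package is installed (hence the guard). The data set, the number of imputations and the use of hclust on scaled Euclidean distances are all choices made for illustration.

```r
data(airquality)

if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Five imputed versions of the data; the seed makes the run repeatable.
  imp <- mice(airquality, m = 5, seed = 1, printFlag = FALSE)

  # Extract each completed data set and cluster it separately.
  completed <- lapply(seq_len(imp$m), function(i) mice::complete(imp, i))
  trees <- lapply(completed, function(df) hclust(dist(scale(df))))
}
```

Each completed data set gives its own clustering, so the results can then be compared across imputations.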

In any case, you should test out various approaches (and iterate non-deterministic ones) for their impact on the clustering result. FeatureImpCluster (which I authored) provides a global feature importance measure per variable. If a variable is, independently of the imputation result, rather irrelevant, you might not have to worry about the imputation technique you use.
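One simple way to check the impact of a non-deterministic imputation is to repeat it with different seeds and measure how much the resulting partitions agree. The sketch below does this with base R only; it is not the FeatureImpCluster approach. The helpers impute_random and rand_index are written here for illustration, and k-means with 3 clusters on airquality is an arbitrary choice.

```r
# Random imputation: draw replacements from the observed values.
impute_random <- function(x) {
  miss <- is.na(x)
  observed <- x[!miss]
  x[miss] <- observed[sample.int(length(observed), sum(miss), replace = TRUE)]
  x
}

# Rand index: share of object pairs on which two partitions agree
# (both put the pair in the same cluster, or both separate it).
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  ut <- upper.tri(same_a)
  mean(same_a[ut] == same_b[ut])
}

data(airquality)

# Impute and cluster ten times, each with its own seed.
partitions <- lapply(1:10, function(s) {
  set.seed(s)
  x <- scale(as.data.frame(lapply(airquality, impute_random)))
  kmeans(x, centers = 3, nstart = 20)$cluster
})

# Agreement of runs 2..10 with the first run.
agreement <- sapply(partitions[-1], rand_index, b = partitions[[1]])
summary(agreement)
```

Values close to 1 suggest that the clustering barely depends on the imputed values; clearly lower values mean the imputation method deserves more care.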

Finally, specifically for missing values in k-means clustering, I have written the ClustImpute package, which does not require you to impute the NAs beforehand.
