Clustering using daisy and pam in R

Question

I'm trying to perform a pretty straightforward clustering analysis but can't get the results right. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k=2) because I know the desired result, but in reality I explore several values for k.

Does anyone know what I'm doing wrong here?

library(cluster)
library(dplyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))


gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k)  # performs cluster analysis
pam_results <- dat %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
head(pam_results$the_summary)

Ric S · Accepted Answer · 2020-03-02T15:35:01.183

The format in which you pass the dataset to the clustering algorithm does not match your objective. You want to group diseases that are reported together, but because the IDs are included in the data, they also contribute to the dissimilarity matrix. You do not want that, since your objective concerns only the diseases.

Hence, we need to build a dataset in which each row is a patient with all the diseases they reported, and then construct the dissimilarity matrix on the numeric features only. For this, I add a column presence with value 1 for every disease a patient reported; pivot_wider then fills the patient/disease combinations that were not reported with 0 through its values_fill argument.

Here is the code I used. I think it gives what you wanted; please tell me if it does.

library(cluster)
library(dplyr)
library(tidyr)

dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
                  PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
                  presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
    dat,
    id_cols = ID,
    names_from = PTName,
    values_from = presence,
    values_fill = list(presence = 0)
)

# in the dissimilarity matrix construction, we leave out the column ID
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2

set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k) 
pam_results <- dat_wider %>%
    mutate(cluster = pam_fit$clustering) %>%
    group_by(cluster) %>%
    do(the_summary = summary(.))
head(pam_results$the_summary)
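
To read the result as "which diseases go together", one option is to look at how often each disease occurs within each cluster. The following is a minimal sketch, not part of the code above: it rebuilds the same toy patient-by-disease table in base R so it runs on its own, and the names x, fit and cluster_profile (and the column name abd_pain) are only illustrative.

```r
library(cluster)

# Toy patient-by-disease table (1 = reported); same content as dat_wider without ID
x <- data.frame(
  headache  = c(1, 0, 1, 0, 1),
  dizziness = c(1, 0, 1, 0, 1),
  nausea    = c(0, 1, 0, 1, 0),
  abd_pain  = c(0, 1, 0, 1, 0),
  row.names = paste0("id", 1:5)
)

# warnBin = FALSE only silences daisy's note about 2-valued numeric columns
d <- daisy(x, metric = "gower", warnBin = FALSE)
fit <- pam(d, k = 2, diss = TRUE)

# Share of patients in each cluster who reported each disease
cluster_profile <- aggregate(x, by = list(cluster = fit$clustering), FUN = mean)
print(cluster_profile)
```

Diseases with a high share in the same cluster are the ones that tend to be reported together by the patients grouped there.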

Furthermore, since you are working only with binary data, instead of Gower's distance you can consider using the Simple Matching or Jaccard distance if they suit your data better. In R you can employ them using

p <- ncol(dat_wider) - 1  # number of binary disease columns (every column except ID)
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan")/p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")

respectively, where p is the number of binary variables you want to consider.
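
To see what the two distances measure, here is a small hand-checkable sketch on two made-up patients (the matrix x and the names sm and jac are only for illustration). Simple matching counts every disagreement over all p variables, while dist(..., method = "binary") ignores the variables on which both patients are 0.

```r
# Two toy patients over p = 4 binary disease indicators
x <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 0, 1, 0))
p <- ncol(x)

# Simple matching: share of variables on which the two rows disagree
sm <- as.numeric(dist(x, method = "manhattan")) / p

# Jaccard: disagreements divided by the variables where at least one row is 1
jac <- as.numeric(dist(x, method = "binary"))

sm   # 2 mismatches out of 4 variables
jac  # 2 mismatches out of the 3 variables with at least one 1
```

With many rarely reported diseases, shared absences dominate the simple matching distance, which is one reason the Jaccard version may suit sparse data better.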

Thanks for your help. It works for the example, but for my real data I get the following warning. Can you explain this? Warning message: In daisy(dat_pt_wide %>% select(-ID), metric = "gower") : binary variable(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, [... truncated] — Joep_S, Mar 02 '20 at 08:10
In the `daisy` function docs the authors say "Note that daisy signals a warning when 2-valued numerical variables do not have an explicit type specified, because the reference authors recommend to consider using "asymm"; the warning may be silenced by warnBin = FALSE". I think this is the reason for the warning, but I cannot add much beyond suggesting that you read the documentation carefully — Ric S, Mar 02 '20 at 15:35
Anyway, I've added in the answer a suggestion on alternative distances, if you want to check them out — Ric S, Mar 02 '20 at 15:36
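
Regarding the warning discussed in the comments, the documentation quote points to two ways of handling it: give the 0/1 columns an explicit type through daisy's type argument, or keep the default treatment and pass warnBin = FALSE. Below is a minimal sketch on made-up data (the data frame x and its columns d1 to d3 are hypothetical, not the real dataset); which option is appropriate depends on whether shared absences should count as similarity.

```r
library(cluster)

# Hypothetical 0/1 data: 10 patients, 3 diseases, no patient without any disease
x <- data.frame(
  d1 = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
  d2 = c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1),
  d3 = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0)
)

# Option 1: declare every column as asymmetric binary,
# so variables that are 0 for both patients are left out of the comparison
d_asymm <- daisy(x, metric = "gower", type = list(asymm = names(x)))

# Option 2: keep treating the columns as numeric and only silence the warning
d_num <- daisy(x, metric = "gower", warnBin = FALSE)

# Patients 3 and 4 are (0, 1, 1) and (0, 0, 1): the two options disagree
as.matrix(d_asymm)[3, 4]
as.matrix(d_num)[3, 4]
```

Note that with asymmetric binary variables, two patients who report none of the diseases share no comparable variable, so such pairs can end up with a missing dissimilarity; it is worth checking for NA values before passing the result to pam.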
