I am trying to convert a data.frame with numeric, nominal, and NA values to a dissimilarity matrix using the daisy function from the cluster package in R. My goal is to create a dissimilarity matrix before applying k-means clustering for customer segmentation. The data.frame has 133,153 rows and 36 columns. Here is my setup:

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

How can I fix the daisy error?

Since the Windows computer has 3 GB of RAM, I increased the virtual memory to 100 GB, hoping that would be enough to create the matrix. It didn't work; I still got memory errors. I've looked into other R packages for working around the memory problem, but they don't help here. I cannot use the bigmemory package with biganalytics because it only accepts numeric matrices. The clara and ff packages also accept only numeric matrices.
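For scale, here is my rough back-of-the-envelope estimate of what a full dissimilarity object for this data would need (my own arithmetic, not something the daisy documentation states):

n <- 133153
n * (n - 1) / 2              # ~8.9e9 entries: more than 2^31 - 1, the limit for vectors passed to .C
n * (n - 1) / 2 * 8 / 2^30   # ~66 GiB of RAM for the lower triangle alone (8-byte doubles)

That element count seems to be what triggers the "long vectors are not supported in .C" error shown further down, regardless of how much virtual memory is available.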

CRAN's cluster package suggests the Gower similarity coefficient as a distance measure before applying k-means. The Gower coefficient handles numeric, nominal, and NA values.
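As a sanity check, here is a tiny illustration (made-up toy data, not my real columns) that daisy with the Gower metric does accept mixed numeric/factor columns and NAs:

library(cluster)
toy <- data.frame(
  spend  = c(120.5, 80.0, NA, 60.2),                   # numeric
  region = factor(c("east", "west", "east", "south")), # nominal
  visits = c(3, 5, 2, NA)                              # numeric with an NA
)
daisy(toy, metric = "gower")  # returns a 4-observation dissimilarity object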

# Client1.csv has no header row
Store1 <- read.csv("/Users/scdavis6/Documents/Work/Client1.csv", header = FALSE)
df <- as.data.frame(Store1)
save(df, file = "df.Rda")
library(cluster)
daisy1 <- daisy(df, metric = "gower", type = list(ordratio = c(1:35)))
#Error in daisy(df, metric = "gower", type = list(ordratio = c(1:35))) :
#  long vectors (argument 11) are not supported in .C

**EDIT:** I have RStudio linked to Amazon Web Services' (AWS) r3.8xlarge instance with 244 GB of memory and 32 vCPUs. I tried creating the daisy matrix on my own computer, but did not have enough RAM.

**EDIT 2:** I used the clara function for clustering the dataset.

#50 samples
clara2 <- clara(df, 3, metric = "euclidean", stand = FALSE, samples = 50,
                rngR = FALSE, pamLike = TRUE)
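
Once this runs, the per-observation assignments and the medoid rows can be read off the returned object (a sketch; clustering and medoids are documented components of a clara result):

head(clara2$clustering)   # cluster index for every row of df
table(clara2$clustering)  # how many observations fall into each cluster
clara2$medoids            # the representative (medoid) observations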

1 Answer

If you have a lot of data, use algorithms that do not require O(n^2) memory. Swapping to disk will kill performance; it is not a sensible option.

Instead, either reduce your data set size or use index acceleration to avoid the O(n^2) memory cost. (And it's not only O(n^2) memory, but also O(n^2) distance computations, which will take a long time!)
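One possible way to act on the sampling route in R, as a minimal sketch assuming the df and Gower metric from the question (the sample size, k = 3, and chunk size are arbitrary placeholders):

library(cluster)

set.seed(42)
idx <- sample(nrow(df), 5000)                  # sample small enough for an in-memory daisy matrix

d_sample <- daisy(df[idx, ], metric = "gower") # Gower handles the mixed types and NAs
pam_fit  <- pam(d_sample, k = 3)               # k-medoids on the sample only
medoids  <- df[idx[pam_fit$id.med], ]          # the medoid rows in the original data

# Assign every row to its nearest medoid in chunks, so no n x n matrix is ever built
assign_chunk <- function(chunk) {
  d <- as.matrix(daisy(rbind(medoids, chunk), metric = "gower"))
  apply(d[-(1:nrow(medoids)), 1:nrow(medoids), drop = FALSE], 1, which.min)
}
chunks      <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / 1000))
assignments <- unlist(lapply(chunks, function(i) assign_chunk(df[i, ])))

This is essentially the CLARA idea done by hand; note that Gower rescales numeric columns by the range of the rows passed in, so the per-chunk distances are only approximately comparable.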

Has QUIT--Anony-Mousse
  • How would I do index acceleration? Load the data into MySQL first before creating the daisy matrix? There's an RMySQL package; I'll look into that. – Scott Davis Jun 24 '14 at 20:09
  • No, don't even think of touching MySQL. I haven't used daisy, so I don't know *if* it can be index accelerated. You can't just throw magic index acceleration at everything. There is no "RmakeMagicIndexAcceleration" package for this very reason. The methods and implementations must be *designed* to be accelerated this way, and it looks as if this daisy implementation is designed to compute an upper diagonal matrix. – Has QUIT--Anony-Mousse Jun 25 '14 at 09:31
  • ELKI has a lot of index accelerated algorithms, but I did not see daisy on this list. And it's not R. – Has QUIT--Anony-Mousse Jun 25 '14 at 09:31
  • Sorry for the late reply. I decided to use Amazon Web Services because of the RAM requirement. Thank you for showing me ELKI. – Scott Davis Jul 05 '14 at 18:43
  • Even with the AWS server, I still got an error about the dataset being too large. Maybe I will reduce the dataset size with random sampling and cross-validate the result? – Scott Davis Jul 18 '14 at 18:17
  • I'm not surprised. Yes, with many algorithms the result on a large enough sample will be essentially the same. But other than that, don't use R. Much of the stuff in R is designed to work on matrices, and that will need O(n^2) memory; you cannot afford to buy quadratic amounts of memory in the long run. Also, working on a sample first allows you to figure out if your approach works at all, before scaling it up. – Has QUIT--Anony-Mousse Jul 19 '14 at 11:33
  • I tested by putting a different dataset with 15k observations (dataset above had 50k) in a daisy matrix and did k-medoid clustering. I did not have any errors, but the two steps took over an hour. I agree that testing on a smaller dataset is important to check for errors, but I need to cluster every customer in the dataset. What options do I have left? – Scott Davis Jul 19 '14 at 21:15
  • You may be able to generalize the results in O(n) to the complete data set. See e.g. CLARA clustering: it runs k-medoids on a sample only, then assigns the full data set to these medoids. It's O(n) if you can afford to choose a small enough sampling rate, whereas PAM probably is O(n^3) or so. Oh, and you can try to use *index acceleration*, which for some algorithms can yield massive speedups. I've clustered 100k points in a few minutes with ELKI; indexes helped tremendously. – Has QUIT--Anony-Mousse Jul 20 '14 at 14:26
  • @Anony-Mousse, I looked at the CLARA function and it requires a numerical matrix. It seems like ELKI is the only option left. – Scott Davis Jul 22 '14 at 03:02
  • CLARA should not require a full distance matrix; its documentation asks for a `data matrix or data frame, each row corresponds to an observation`. That is not an n^2 distance matrix, just the data.frame you already have. – Has QUIT--Anony-Mousse Jul 22 '14 at 10:18
  • I tried using CLARA and it worked without AWS. Now the problem is finding the command to see all observations assigned to each medoid. – Scott Davis Jul 25 '14 at 22:21