
I'm trying to compute a dissimilarity matrix from a big data frame with both numerical and categorical features. When I run the `daisy` function from the `cluster` package, I get the error message:

Error: cannot allocate vector of size X.

In my case X is about 800 GB. Any idea how I can deal with this problem? Additionally, it would also be great if someone could help me run the function on parallel cores. Below is the code that computes the dissimilarity matrix on the iris dataset:

require(cluster)
d <- daisy(iris)  # fine for iris, but my real data frame triggers the allocation error
  • Do you mind adding more detail on why this is not a dupe? From a quick look it does look like a duplicate post, but with extra info. – zx8754 Dec 01 '17 at 09:39
  • I fail to understand how you expect parallelization to help if you don't have sufficient memory. Have you estimated how large the resulting matrix would be? – Roland Dec 01 '17 at 09:42
  • I've closed your previous question. – Roland Dec 01 '17 at 09:44
  • I actually need to speed up the ```daisy``` function (https://stackoverflow.com/questions/47570984/compute-dissimilarity-matrix-on-parallel-cores) and solve the memory issues. It's good to separate these questions because these are two different problems. – Codutie Dec 01 '17 at 09:44
  • No, they are interconnected issues. Don't separate them. You can often solve memory problems at the expense of speed. – Roland Dec 01 '17 at 09:45
  • The dissimilarity matrix `d` has roughly `nrow(iris)^2/2` elements. How many rows does your dataset have? – Roland Dec 01 '17 at 09:59
  • These are my dimensions: ```[1] 465171 32``` – Codutie Dec 01 '17 at 10:08
  • And if I try to run the ```daisy``` function, it tries to allocate 800 GB of memory. – Codutie Dec 01 '17 at 10:10
  • Yes, you can reproduce the error with `x <- seq_len(465171^2/2)`. You don't have sufficient memory to hold the result of this operation. You could create the result in chunks ... But I'd question why you are doing this. (See the size check after these comments.) – Roland Dec 01 '17 at 10:19
  • The main goal would be to apply a cluster algorithm that accepts dissimilarity matrices as input, such as ```pam()```. The reason why I need ```daisy``` is that I have both numerical and categorical features. – Codutie Dec 01 '17 at 10:32
  • You need a smarter approach than brute force. – Roland Dec 01 '17 at 11:13
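
As a quick sanity check of the arithmetic in the comments above (a minimal sketch using the 465171-row figure reported there, not code from the original post): the lower triangle of the dissimilarity matrix holds n*(n-1)/2 values, each an 8-byte double, which is why daisy() asks for roughly 800 GB.

n <- 465171                      # number of rows reported in the comments
n * (n - 1) / 2 * 8 / 1024^3     # size of the lower triangle in GiB: about 806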

1 Answer


I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.

I ended up using the k-means algorithm in the h2o package, which parallelizes and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before passing it to h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large nominal differences (since it is minimizing a distance calculation). I used the scale() function.
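
For example, a minimal sketch of that preprocessing step (assuming your data frame is called df, which is just a placeholder name here): standardize only the numeric columns and leave the factor columns for h2o's one-hot encoding.

num_cols <- sapply(df, is.numeric)                                       # locate the numeric columns
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(scale(x)))   # center to mean 0, sd 1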

After installing h2o:

library(h2o)

h2o.init(nthreads = 16, min_mem_size = '150G')  # start a local H2O cluster; tune threads/memory to your machine
h2o.df <- as.h2o(df)                            # copy the (scaled) data frame into H2O
# 'vars' is a character vector of the column names to cluster on
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)
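
To pull the cluster assignment for each row back into R afterwards, something like the following should work (a sketch; as far as I recall, h2o.predict on a k-means model returns one cluster label per row, but check the docs for your h2o version):

assignments <- h2o.predict(h2o_kmeans, h2o.df)   # one cluster label per row
head(as.data.frame(assignments))                 # bring a sample back as a plain data frame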