2

I want to find the rank correlation of various columns in a data.frame using dplyr.

I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.

For the following df:

df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))

I want to get the rank correlations between all the .x and the .y combinations. My problem in the function below where you see ????

cor <- df %>% group_by(Universe) %>% 
summarize_each_(funs(cor(.,method = 'spearman',use = "pairwise.complete.obs")),????)

I want cor to just include the correlation pairs: AA.x.AA.y , AA.x,BB.y, ... for each Universe.

Please help!

Nick
  • 3,262
  • 30
  • 44
  • 2
    A base solution could be `lapply(split(df[-1], df$Universe), function(x) cor(x[grepl("\\.x", names(x))], x[grepl("\\.y", names(x))], method = 'spearman', use = "pairwise.complete.obs"))` – David Arenburg Nov 18 '15 at 10:23
  • That works neatly, thanks. But I have a very large data set and using the dplyr methods are definitely preferable. My own base R solution is taking ages.. – Nick Nov 18 '15 at 10:44
  • 1
    Related [question](http://stackoverflow.com/questions/33487705/fast-crosstabs-and-stats-on-all-pairs-of-variables) with `data.table` solution, if you're open to a different library – C8H10N4O2 Nov 18 '15 at 14:39
  • what do you want the output to look like? – C8H10N4O2 Nov 18 '15 at 15:39

3 Answers3

5

An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:

#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
              do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%  
              data.frame(row=rep(colnames(.)[-1], n_groups(.))) 
#
# calculate correlations and display in column format
#
num_col=ncol(df[,-1])
out_indx <-  which(upper.tri(diag(num_col))) 
cor_cols <- df %>% group_by(Universe) %>%
            do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])
WaltS
  • 5,410
  • 2
  • 18
  • 24
  • This is the fastest workaround for my problem. Brilliant. Using do in this example works well, even if it is far less intuitive for me to use! – Nick Nov 19 '15 at 08:10
3

So here follows the winning (time-wise) solution to my problem:

d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>% 
       summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>% 
       unite(Pair, R1, R2, sep="_")

Albeit 0.005 milliseconds in this example, adding data adds time.

noamt
  • 7,397
  • 2
  • 37
  • 59
Nick
  • 3,262
  • 30
  • 44
2

Try this:

library(data.table)                                           # needed for fast melt
setDT(df)                                                     # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))

mdf %>% 
  mutate(obs_set = substr(variable, 4, 4) ) %>%               # ".x" or ".y" subgroup
  full_join(.,., by=c('Universe', 'obs_set', 'id')) %>%       # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y, 
                   method='spearman', use='pairwise.complete.obs'))

Produces:

   Universe variable.x variable.y rank_corr
     (fctr)     (fctr)     (fctr)     (dbl)
1         A       AA.x       BB.x      -0.9
2         A       AA.x       CC.x      -0.9
3         A       BB.x       AA.x      -0.9
4         A       BB.x       CC.x       0.8
5         A       CC.x       AA.x      -0.9
6         A       CC.x       BB.x       0.8
7         A       AA.y       BB.y      -0.3
8         A       AA.y       CC.y       0.2
9         A       BB.y       AA.y      -0.3
10        A       BB.y       CC.y      -0.3
..      ...        ...        ...       ...

Explanation:

  1. Melt: converts table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency. Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr like you wanted.

  2. Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).

  3. Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.

  4. Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.

Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • Your solution produces the desired result thanks. The way I figured it out eventually was slightly faster. I did it as follows: d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>% summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>% unite(Pair, R1, R2, sep="_") – Nick Nov 19 '15 at 08:09
  • @Nick Interesting, I'm surprised that 2x `gather` proved faster in your example. Thanks for the response. If runtime is your #1 concern, you should definitely consider `data.table` or the parallelized uses of `plyr`. Coming up with the fastest `dplyr` solution might be like finding the world's shortest giant. – C8H10N4O2 Nov 19 '15 at 13:40