Why do group_by and group_by_ give different answers when summarizing by two variables?

Question

In the following example, I want to create a summary statistic grouped by two variables. When I do it with dplyr::group_by, I get the correct answer, but when I do it with dplyr::group_by_, it collapses one more level than I want it to.

library(dplyr)
set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

# Gives correct answer
df %>%
  group_by(a, b) %>%
  summarize(total = sum(x))

# Source: local data frame [4 x 3]
# Groups: a [?]
# 
#       a     b     total
#   <dbl> <dbl>     <dbl>
# 1     1     3 1.5214746
# 2     1     4 0.7150204
# 3     2     4 0.1234555
# 4     2     5 0.8208454

# Wrong answer -- too many levels summarized
df %>%
  group_by_(c("a", "b")) %>%
  summarize(total = sum(x))
# # A tibble: 2 × 2
#       a     total
#   <dbl>     <dbl>
# 1     1 2.2364950
# 2     2 0.9443009

What's going on?

Might help: http://stackoverflow.com/questions/28667059/dplyr-whats-the-difference-between-group-by-and-group-by-functions — wbrugato, Nov 08 '16 at 19:55
Thanks @wbrugato. I did see that. It explains how the inputs to the functions are different (quoted vs. unquoted strings), but it doesn't explain why the functions would give different outputs from the same inputs (but please let me know if I'm missing something!). — Jake Fisher, Nov 08 '16 at 20:00
You need `group_by_(.dots = c("a", "b"))` or `group_by_("a", "b")`. — Psidom, Nov 08 '16 at 20:01
@Psidom that did the trick, thanks! If you add that as an answer, I'll accept it. — Jake Fisher, Nov 08 '16 at 20:03

score 4 · Accepted Answer · answered Nov 08 '16 at 20:07

If you want to use a vector of variable names, pass it to the .dots parameter:

df %>%
  group_by_(.dots = c("a", "b")) %>%
  summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454

Or you can pass each name as a separate string, the same way you would list the columns in the NSE version:

df %>%
  group_by_("a", "b") %>%
  summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454
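
For readers on a newer dplyr: group_by_ and the other underscore verbs were deprecated in dplyr 0.7.0, so the same "group by a character vector of names" task can be sketched without them. The block below is a minimal sketch, assuming dplyr 1.0.0 or later (for across() and the .groups argument); the variable name group_cols is made up for illustration and does not appear in the thread.

```r
library(dplyr)

# Same toy data as in the question
set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

# Hypothetical name: a character vector holding the grouping columns
group_cols <- c("a", "b")

# all_of() selects the columns named in the vector;
# across() hands that selection to group_by()
result <- df %>%
  group_by(across(all_of(group_cols))) %>%
  summarize(total = sum(x), .groups = "drop")

result
```

This should give one row per (a, b) combination, matching the group_by(a, b) result in the question. Setting .groups = "drop" returns an ungrouped tibble; leave it out if you want to keep the default grouping behaviour of summarize().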
