How can I use summarise_at to apply different functions to different columns?

Question

I have a dataframe with the following columns:

> colnames(my.dataframe)
 [1] "id"                              "firstName"                       "lastName"                       
 [4] "position"                        "jerseyNumber"                    "currentTeamId"                  
 [7] "currentTeamAbbreviation"         "currentRosterStatus"             "height"                         
[10] "weight"                          "birthDate"                       "age"                            
[13] "birthCity"                       "birthCountry"                    "rookie"                         
[16] "handednessShoots"                "college"                         "twitter"                        
[19] "currentInjuryDescription"        "currentInjuryPlayingProbability" "teamId"                         
[22] "teamAbbreviation"                "fg2PtAtt"                        "fg3PtAtt"                       
[25] "fg2PtMade"                       "fg3PtMade"                       "ftMade"                         
[28] "fg2PtPct"                        "fg3PtPct"                        "ftPct"                          
[31] "ast"                             "tov"                             "offReb"                         
[34] "foulsDrawn"                      "blkAgainst"                      "plusMinus"                      
[37] "minSeconds"

And here is my code that isn't working:

my.dataframe %>%
  dplyr::group_by(id) %>%
  dplyr::summarise_at(vars(firstName:currentInjuryPlayingProbability), funs(min), na.rm = TRUE) %>%
  dplyr::summarise_at(vars(fg2PtAtt:minSeconds), funs(sum), na.rm = TRUE) %>% 
  dplyr::summarise(teamId = paste(teamId), teamAbbreviation = paste(teamAbbreviation))

First I group by id (which is not a unique column in my dataframe, despite it being called id). For the next 19 columns up until currentInjuryPlayingProbability, these columns are always the same when grouped_by the ID, and so I use the min function to summarise / grab the value.

Next, I want to summarise all columns from fg2PtAtt to the end with the mean value (these columns are all numeric / integer).

Lastly, for the columns teamId and teamAbbreviation (which are not the same when grouped_by id), I want to paste them into a single string each with summarise.

My approach doesn't work because I don't think I can call summarise_at, followed by another summarise_at, followed by a summarise. By the time the second summarise_at is called, the columns it tries to summarise have already been removed by the first summarise_at.

Any help with this is appreciated! I will update shortly with a subset of my dataframe that the code can be tested on.

EDIT:

dput(my.dataframe)
structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", 
"Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", 
"Babbitt"), currentInjuryPlayingProbability = c(NA_character_, 
NA_character_, NA_character_, NA_character_), teamId = c(96L, 
84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"
), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 
111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

my.dataframe
     id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                            <int> <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                                96 OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                                84 BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                                91 ATL                    57      111      34210
4  9360 Luke      Babbitt  <NA>                                92 MIA                     2       45       8676

Here is a shortened example with only 9 columns, but with enough data to highlight the problem. The resulting dataframe should look like this:

    id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                            <chr>     <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                                96      OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                                84      BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                            91, 92 ATL, MIA                     57      156      42886

After the first summarise_at, you get a summarised output and the second one is calling on that summarised one. I think a better option would be to call `mutate_at` first and then use that also in the group_by before doing the summarise_at second. Anyway, a small reproducible example would help — akrun, Jul 31 '18 at 17:27
yeah i noticed that was happening with summarise_at - just not sure how to fix it — Canovice, Jul 31 '18 at 17:31
looks like mutate_at(), followed by removing the duplicate columns, might do the trick — Canovice, Jul 31 '18 at 17:32
I think you might want to use `mutate_at` instead since `summarize` is destructive. From `?summarize`: `Each summary call removes one grouping level (since that group is now just a single row)` and `Note that with data frames, newly created summaries immediately overwrite existing variables` — divibisan, Jul 31 '18 at 17:37
Try with `my.dataframe %>% dplyr::group_by(id) %>% dplyr::mutate_at(vars(firstName:currentInjuryPlayingProbability), funs(min(., na.rm = TRUE))) %>% group_by_at(names(my.dataframe)[2:4], add = TRUE) %>% dplyr::mutate_at(vars(fg2PtAtt:minSeconds), sum, na.rm = TRUE)` and afterwards do the `summarise`. — akrun, Jul 31 '18 at 17:39
i've never seen group_by_at used before, but will give this a try — Canovice, Jul 31 '18 at 18:06

Calum You · Answer 1 · 2018-07-31T20:50:13.273

This is what I think is the simplest way for this particular task, at least compared to some similar map2/reduce solutions I've seen.

First point is that if you are using min to grab a value because you think it should be the same for every value of your grouping variable, just add it to the grouping. Then it is automatically preserved.

Second is that you can use {} to override the automatic placement of the LHS of %>% into the first argument of the RHS. This lets you in a single step apply different transformations and recombine them. Usually you don't need this because the placeholder . will do it for you, but if the placeholder is not a naked argument to the RHS you sometimes need it. (I am sure I read some resource that describes the exact rules but I can't find it right now).
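To illustrate the brace behaviour described above, here is a minimal sketch using the magrittr pipe. The vector `x` and the result names are hypothetical and only there for the demonstration. When the placeholder appears only inside nested calls, the LHS is still inserted as the first argument, and wrapping the RHS in braces switches that off.

```r
library(magrittr)

x <- c(1, 2, 3)  # hypothetical toy input

# The dot appears only inside nested calls, so x is ALSO passed as the
# first argument: this evaluates c(x, sum(x), length(x)).
without_braces <- x %>% c(sum(.), length(.))

# Braces suppress the first-argument insertion: this evaluates
# c(sum(x), length(x)) and nothing else.
with_braces <- x %>% { c(sum(.), length(.)) }

print(without_braces)
print(with_braces)
```

This is why the left_join call in the pipeline is wrapped in braces: each summarise_at receives the grouped data through `.`, and left_join itself does not get it as an extra first argument.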

Third is that because you know that summarise will drop columns you didn't select except the grouping variables, left_join will automatically use the shared column names to join on.
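As a small sketch of that third point: when `by` is not supplied, left_join joins on every column name the two tables share and prints a message naming them. The two tibbles below are hypothetical stand-ins for the two summarised pieces.

```r
library(dplyr)

# Hypothetical stand-ins for two summaries that share only the key column.
teams <- tibble(id = c(9360L, 9466L), teamAbbreviation = c("ATL,MIA", "BRO"))
stats <- tibble(id = c(9360L, 9466L), fg2PtAtt = c(29.5, 73))

# No `by` given: dplyr joins on the shared column `id` and reports it.
joined <- left_join(teams, stats)

print(joined)
```

Passing `by` explicitly silences the message and guards against joining on a column that happens to share a name by accident.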

This means that we can do the following, which I think is pretty clean. If the transformations start to get particularly complex, though (for example, if there are pipes inside the left_join), I would recommend giving each piece of the final output its own assignment and name, to be clearer. You also need to be careful if you want more than one summary of the same column (like both mean and standard deviation), because as written the names will collide.

library(tidyverse)

my_dataframe <- structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", "Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"), currentInjuryPlayingProbability = c(NA_character_, NA_character_, NA_character_, NA_character_), teamId = c(96L, 84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

my_dataframe %>%
  group_by_at(.vars = vars(id:lastName)) %>%
  {left_join(
    summarise_at(., vars(teamId:teamAbbreviation), ~ str_c(., collapse = ",")),
    summarise_at(., vars(fg2PtAtt:minSeconds), mean)
  )}
#> Joining, by = c("id", "firstName", "lastName")
#> # A tibble: 3 x 8
#> # Groups:   id, firstName [?]
#>      id firstName lastName teamId teamAbbreviation fg2PtAtt fg3PtAtt
#>   <int> <chr>     <chr>    <chr>  <chr>               <dbl>    <dbl>
#> 1  9360 Luke      Babbitt  91,92  ATL,MIA              29.5       78
#> 2  9466 Quincy    Acy      84     BRO                  73        292
#> 3 10138 Alex      Abrines  96     OKL                  70        221
#> # ... with 1 more variable: minSeconds <dbl>

Created on 2018-07-31 by the reprex package (v0.2.0).
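The suggestion above about giving each piece its own assignment could look roughly like the sketch below. It rebuilds the sample data without the all-NA column, uses base `paste` in place of `str_c` so that only dplyr is needed, and the intermediate names `grouped`, `teams`, `stats` and `result` are hypothetical.

```r
library(dplyr)

my_dataframe <- tibble(
  id = c(10138L, 9466L, 9360L, 9360L),
  firstName = c("Alex", "Quincy", "Luke", "Luke"),
  lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"),
  teamId = c(96L, 84L, 91L, 92L),
  teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"),
  fg2PtAtt = c(70L, 73L, 57L, 2L),
  fg3PtAtt = c(221L, 292L, 111L, 45L),
  minSeconds = c(67637L, 81555L, 34210L, 8676L)
)

# Columns that are constant within an id go into the grouping so they survive.
grouped <- my_dataframe %>% group_by(id, firstName, lastName)

# Piece 1: collapse the team columns into one string per player.
teams <- grouped %>%
  summarise_at(vars(teamId:teamAbbreviation), ~ paste(., collapse = ","))

# Piece 2: average the numeric stat columns per player.
stats <- grouped %>%
  summarise_at(vars(fg2PtAtt:minSeconds), mean)

# Recombine on the grouping columns, stated explicitly here.
result <- left_join(teams, stats, by = c("id", "firstName", "lastName"))

print(result)
```

Naming the pieces costs a few extra lines but makes each transformation inspectable on its own before the join.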

score 0 · Answer 2 · answered Jan 31 '21 at 23:57

Update after dplyr 1.0.2. You can use across:

my_dataframe %>%
  group_by(id, firstName, lastName) %>%
  summarise(across(teamId:teamAbbreviation, ~ str_c(., collapse = ",")),
            across(fg2PtAtt:minSeconds, mean))
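For a version that runs end to end, here is a sketch of the across approach on the sample data from the question. The grouping columns, the use of base `paste` instead of `str_c`, the `.names` pattern and the name `result` are assumptions added for illustration. Supplying a named list of functions is one way to avoid the name collision mentioned in the first answer when a column needs more than one summary.

```r
library(dplyr)

my_dataframe <- tibble(
  id = c(10138L, 9466L, 9360L, 9360L),
  firstName = c("Alex", "Quincy", "Luke", "Luke"),
  lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"),
  teamId = c(96L, 84L, 91L, 92L),
  teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"),
  fg2PtAtt = c(70L, 73L, 57L, 2L),
  fg3PtAtt = c(221L, 292L, 111L, 45L),
  minSeconds = c(67637L, 81555L, 34210L, 8676L)
)

result <- my_dataframe %>%
  group_by(id, firstName, lastName) %>%
  summarise(
    # One string per player for the team columns.
    across(teamId:teamAbbreviation, ~ paste(.x, collapse = ",")),
    # Two summaries per stat column, named e.g. fg2PtAtt_mean and fg2PtAtt_sum.
    across(fg2PtAtt:minSeconds, list(mean = mean, sum = sum),
           .names = "{.col}_{.fn}"),
    .groups = "drop"
  )

print(result)
```

A single summarise call with several across() selections replaces the left_join of two summarise_at results, since each across() can carry its own function.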

Julio
