which.max not functioning as expected

Question

I am trying to create a table that includes the value of y for when x is equal to or less than a certain value, by group. Below is my code using the iris data set.

For "<=2.5", I expect to get 4.5, 5.0, or 5.8 for the virginica group, since these are the values of Petal.Length associated with a Sepal.Width of 2.5 for virginica. But instead, I get 6.0. Any ideas of where I went wrong? (My actual data set does not have duplicates of the variable analogous to Sepal.Width for the same group, so choosing among those is not an issue for me.)

library(dplyr)
data(iris)

my.table <- iris %>%
  group_by(Species) %>%
  summarise("<=2.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=2.5])],
            "<=3" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3])],
            "<=3.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3.5])],
            "<=4" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=4])])

This is related to the question "Create a table with values from ecdf graph".

score 4 · Accepted Answer · answered Mar 17 '20 at 18:32

The problem is that you are first subsetting the Sepal.Width. Consequently, the index returned by which.max applies to that sub-vector, and no longer corresponds to the indices of the whole Petal.Length vector.
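A small illustration of the index mismatch, using toy vectors (the values are made up for this sketch and are not taken from iris):

```r
# Toy vectors (hypothetical values)
x <- c(3.0, 2.2, 2.5, 3.4)
y <- c(10, 20, 30, 40)

keep <- x <= 2.5                 # FALSE TRUE TRUE FALSE
sub_idx <- which.max(x[keep])    # position within x[keep], i.e. within c(2.2, 2.5)
sub_idx                          # 2

wrong <- y[sub_idx]              # 20: a sub-vector position applied to the full vector
right <- y[keep][sub_idx]        # 30: y subset the same way as x before indexing
```

The largest `x` not exceeding 2.5 sits at position 3 of the full vector, but `which.max` reports position 2 because it only ever sees the two-element sub-vector.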

To fix this, you also need to subset Petal.Length correspondingly, e.g.

…
`<=2.5` = Petal.Length[Sepal.Width <= 2.5][which.max(Sepal.Width[Sepal.Width <= 2.5])],
…

… of course this gets rather verbose and repetitive. It might be better to perform the subsetting in a separate step. However, this means creating new columns for every threshold value.
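One way to cut the repetition without creating extra columns is to wrap the subsetting in a small helper. This is only a sketch: the name `y_at_max_x` is hypothetical, and returning `NA_real_` when no value falls at or below the cutoff is an assumption about what you would want for numeric data.

```r
library(dplyr)

# Hypothetical helper: value of y at the largest x that does not exceed cutoff.
y_at_max_x <- function(y, x, cutoff) {
  keep <- !is.na(x) & x <= cutoff        # subset once, reuse for both vectors
  if (!any(keep)) return(NA_real_)       # assumption: NA when nothing qualifies
  y[keep][which.max(x[keep])]            # index is now relative to the same subset
}

my.table <- iris %>%
  group_by(Species) %>%
  summarise(`<=2.5` = y_at_max_x(Petal.Length, Sepal.Width, 2.5),
            `<=3`   = y_at_max_x(Petal.Length, Sepal.Width, 3),
            `<=3.5` = y_at_max_x(Petal.Length, Sepal.Width, 3.5),
            `<=4`   = y_at_max_x(Petal.Length, Sepal.Width, 4))
```

As with plain `which.max`, ties are resolved by taking the first matching element.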

Incidentally, this is unrelated to dplyr.

score 3 · Answer 2 · answered Mar 17 '20 at 20:41

To make it more scalable, use a double loop over the species and the cut-off values:

myCuts <- c(2.5, 3, 3.5, 4)

res <- sapply(split(iris, iris$Species), function(i)
  sapply(myCuts, function(j){
    x <- i[ i$Sepal.Width <= j, ]
    x$Petal.Length[ which.max(x$Sepal.Width) ]
  }))

rownames(res) <- paste0("<=", myCuts)
res
#       setosa versicolor virginica
# <=2.5    1.3        3.9       4.5
# <=3      1.4        4.2       5.9
# <=3.5    1.4        4.5       5.6
# <=4      1.2        4.5       6.7

IceCreamToucan · Answer 3 · 2020-03-17T20:44:00.190

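Here's another way to get similar data. Create a grouping variable that bins the Sepal.Width values, then select one extreme row within each Species and bin. The result is in a different "shape", but you can always pivot_wider if you want the values as columns instead of rows. Note that, as written, top_n(1, -Sepal.Width) keeps the rows with the smallest Sepal.Width in each bin (which is what the output below shows), and the code selects Petal.Width; to match the question, use top_n(1, Sepal.Width) and select Petal.Length instead. Also, the bins here are disjoint rather than cumulative.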

iris %>%
  group_by(Species,
           Sepal.Width_grp = case_when(Sepal.Width <= 2.5 ~ '<=2.5',
                                       Sepal.Width <= 3 ~ '<=3',
                                       Sepal.Width <= 3.5 ~ '<=3.5',
                                       Sepal.Width <= 4 ~ '<=4',
                                       TRUE ~ '> 4')) %>%
  top_n(1, -Sepal.Width) %>% 
  select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups:   Species, Sepal.Width_grp [12]
#    Species Sepal.Width_grp Top.Sepal.Width Petal.Width
#    <fct>   <chr>                     <dbl>       <dbl>
#  1 setosa  <=3.5                       3.1         0.2
#  2 setosa  <=4                         3.6         0.2
#  3 setosa  <=3                         2.9         0.2
#  4 setosa  <=3.5                       3.1         0.1
#  5 setosa  <=4                         3.6         0.2
#  6 setosa  <=3.5                       3.1         0.2
#  7 setosa  > 4                         4.1         0.1
#  8 setosa  <=3.5                       3.1         0.2
#  9 setosa  <=4                         3.6         0.1
# 10 setosa  <=2.5                       2.3         0.3
# # ... with 15 more rows

Edit: This is a little simpler if you use cut, whose intervals are closed on the right by default:

iris %>%
  group_by(Species,
           Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>% 
  top_n(1, -Sepal.Width) %>% 
  select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups:   Species, Sepal.Width_grp [12]
#    Species Sepal.Width_grp Top.Sepal.Width Petal.Width
#    <fct>   <fct>                     <dbl>       <dbl>
#  1 setosa  (3,3.5]                     3.1         0.2
#  2 setosa  (3.5,4]                     3.6         0.2
#  3 setosa  (2.5,3]                     2.9         0.2
#  4 setosa  (3,3.5]                     3.1         0.1
#  5 setosa  (3.5,4]                     3.6         0.2
#  6 setosa  (3,3.5]                     3.1         0.2
#  7 setosa  (4,Inf]                     4.1         0.1
#  8 setosa  (3,3.5]                     3.1         0.2
#  9 setosa  (3.5,4]                     3.6         0.1
# 10 setosa  (0,2.5]                     2.3         0.3
# # ... with 15 more rows
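For the pivot_wider step mentioned above, here is a hedged sketch that goes to the wide layout. It swaps in `slice_max(..., with_ties = FALSE)` (available in dplyr 1.0.0 and later) for `top_n`, so that each Species and bin contributes exactly one row, and it reports Petal.Length as in the question; both choices are assumptions about the intended result rather than part of the original answer. The name `wide` is just for illustration.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species,
           Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>%
  # keep the single row with the largest Sepal.Width in each Species/bin
  slice_max(Sepal.Width, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Species, Sepal.Width_grp, Petal.Length) %>%
  # one column per bin, one row per Species
  pivot_wider(names_from = Sepal.Width_grp, values_from = Petal.Length)

wide
```

Because the bins are disjoint, a column such as `(2.5,3]` matches the cumulative "<=3" value only when that species has at least one observation inside the bin; an empty bin shows up as NA.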
