
Let's say one has the following simplified data frame, composed of a single variable 'Date' that contains a set of dates:

Date
2022-01-01
2022-01-01
2022-01-02
2022-01-04
2022-01-04
2022-01-06

I need to compute an ID counter variable that equals 1 for the first record and accumulates over the remaining ones. Specifically, if two dates are consecutive (or the same) then the ID counter stays equal to the previous one; otherwise it increases by 1.

Given that, R should return the following table as output:

Date ID.Counter
2022-01-01 1
2022-01-01 1
2022-01-02 1
2022-01-04 2
2022-01-04 2
2022-01-06 3

Could someone help me?

I tried to solve this simple problem by running:

library(dplyr)
what_i_have <- c('2022-12-31','2022-12-31','2022-12-31','2022-01-01','2022-01-01','2022-01-02','2022-01-03','2022-01-03')
what_i_have <- as.data.frame(what_i_have)
what_i_have <- what_i_have %>%
  mutate(ID = 1,
         ID = if_else(Date == lag(Date), ID = lag(ID), ID + 1))

but R returns an error in the mutate() statement; specifically, the error is:

Error in `mutate()`:
In argument: `ID = if_else(Date == lag(Date), ID = lag(ID), ID + 1)`.
Caused by error in `if_else()`:
! `...` must be empty.
Problematic argument:
• ID = lag(ID)
Run `rlang::last_trace()` to see where the error occurred.

1 Answer


Update: change in logic. Previous answer far below.

First, we need Date-class:

quux$Date <- as.Date(quux$Date)

From there,

cumsum(c(1, diff(quux$Date, units = "day") > 1))
# [1] 1 1 1 2 2 3

This can be placed directly within an ungrouped dplyr::mutate (without the quux$).
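For example, a minimal sketch of that mutate call (the column name ID.Counter is just taken from the desired output in the question):

library(dplyr)
quux %>%
  mutate(ID.Counter = cumsum(c(1, diff(Date, units = "day") > 1)))
# adds ID.Counter = 1 1 1 2 2 3, matching the desired output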

Data

quux <- structure(list(Date = c("2022-01-01", "2022-01-01", "2022-01-02", "2022-01-04", "2022-01-04", "2022-01-06")), class = "data.frame", row.names = c(NA, -6L))


(previous answer)

I recommend against what I call "Excel grouping". In Microsoft Excel, the most common way to do grouped calculations (counting, aggregation, etc.) is to compare this row's grouping variable(s) with the previous row's to see whether anything has changed, and then carry the running value forward or reset it as appropriate. This method is fragile, as it relies wholly on the order of the data. (There are some ways around this, but it's a common approach.)

Instead, in R I strongly recommend the logic of grouped operations. In base R, this usually involves ave, aggregate, and tapply (others exist); in dplyr, this is often simpler by using group_by(.) earlier in the chain (or, with dplyr_1.1 or newer, .by= within the mutate/summarize/other-verb function call). (In data.table, the by=.(...) argument does the same.)

The benefit of using these grouping methods is that the function being used (whether a named function like sum or an anonymous function such as function(z) sum(z[z>5], na.rm=TRUE)) only sees one group's data at a time, regardless of whether the rows are contiguous.
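Purely as an illustration of that point (the vectors and threshold here are made up), base R's ave applies the function to each group's values separately and returns the results aligned with the original rows:

# per-group sum of values greater than 5; each group is processed on its own,
# regardless of how the rows are ordered
ave(c(1, 7, 9, 2, 8), c("a", "a", "a", "b", "b"),
    FUN = function(z) sum(z[z > 5], na.rm = TRUE))
# [1] 16 16 16  8  8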

In this case, the group calcs are not complex: we can use match in base R, for instance. But I'll extend the dplyr code a little to demonstrate the grouping methodology.

base R

match(quux$Date, unique(quux$Date))
# [1] 1 1 2 3 3 4

dplyr

(dplyr_1.1 or newer for the .by=)

library(dplyr)
## dplyr_1.1 or newer
quux %>%
  mutate(id = cur_group_id(), .by = Date)
#         Date id
# 1 2022-01-01  1
# 2 2022-01-01  1
# 3 2022-01-02  2
# 4 2022-01-04  3
# 5 2022-01-04  3
# 6 2022-01-06  4
## works in dplyr before 1.1 (as well as 1.1 and newer)
quux %>%
  group_by(Date) %>%
  mutate(id = cur_group_id())
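
As a general dplyr note (not specific to this answer), the group_by() form leaves the result grouped by Date; a common pattern is to drop the grouping once the per-group work is done:

quux %>%
  group_by(Date) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()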

or we can use base::match as above:

quux %>%
  mutate(id = match(Date, unique(Date)))
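
For reference, this should return the same id values as the cur_group_id() version above (1 1 2 3 3 4, per the match() output shown earlier).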
r2evans
  • Thanks for your answer. Very clear. Unfortunately, I made a mistake when compiling the table in the example: I would like the ID variable to also take the value 1 for the date '2022-01-02', because '2022-01-01' and '2022-01-02' are consecutive dates, and so on. Your answer works well, but it only differentiates on the date. By the way, I have edited the question to fix the ID in the what_i_want table. – basiliscus Jun 09 '23 at 13:28
  • see my edit, @basiliscus – r2evans Jun 09 '23 at 13:38