Given a data.table, for each sub-group on each column, select the first non-NA

Question

Suppose we have extensive data.table containing multiple columns, some numeric and other characters. For each sub-group by and each column, find the first non-NA value: For example, if two rows represent one sub-group:

Group V1 V2 V3 V4 V5 V6
1      3 NA  5 NA  NA ab 
1      7 fn  0 2  NA NA

The expected result is:

Group V1 V2 V3 V4 V5 V6
1      3 fn  5 2  NA ab

Suppose we have data.table with about 40 million rows with 10 million groups and 60 columns. The expected result will contain 10 million (one record for each sub-group) and 60 columns. Other solutions like this assume only one column with missing values or only numeric columns with NA's. Using R data.table function nafill except only double and integer data types and na.locf nor na.locf0 from package zoo can run hours before completing.

For larger datasets (40M rows and 10M groups), you will still have performance issues: there is no way faster (currently implemented) to summarize by group to the first non-`NA` value. You may try using `Rcpp` or `cpp11` to write your own compiled/efficient function that handles the grouping itself, and does not try to `is.na` all values, just the values needed to find one non-NA value; having said that, doing the grouping part is not always the easiest to do fast and efficiently, especially when the groups must be discovered at runtime. — r2evans, Jun 28 '21 at 16:39

score 0 · Answer 1 · answered Jun 28 '21 at 16:35

library(data.table)
DT[, lapply(.SD, function(z) na.omit(z)[1]), by = Group]
#    Group    V1     V2    V3    V4     V5     V6
#    <int> <int> <char> <int> <int> <lgcl> <char>
# 1:     1     3     fn     5     2     NA     ab

Data

DT <- setDT(structure(list(Group = c(1L, 1L), V1 = c(3L, 7L), V2 = c(NA, "fn"), V3 = c(5L, 0L), V4 = c(NA, 2L), V5 = c(NA, NA), V6 = c("ab", NA)), class = c("data.table", "data.frame"), row.names = c(NA, -2L)))

Given a data.table, for each sub-group on each column, select the first non-NA

1 Answers1