-1

Suppose we have extensive data.table containing multiple columns, some numeric and other characters. For each sub-group by and each column, find the first non-NA value: For example, if two rows represent one sub-group:

Group V1 V2 V3 V4 V5 V6
1      3 NA  5 NA  NA ab 
1      7 fn  0 2  NA NA 

The expected result is:

Group V1 V2 V3 V4 V5 V6
1      3 fn  5 2  NA ab 

Suppose we have data.table with about 40 million rows with 10 million groups and 60 columns. The expected result will contain 10 million (one record for each sub-group) and 60 columns. Other solutions like this assume only one column with missing values or only numeric columns with NA's. Using R data.table function nafill except only double and integer data types and na.locf nor na.locf0 from package zoo can run hours before completing.

Alon Hazan
  • 191
  • 1
  • 7
  • For larger datasets (40M rows and 10M groups), you will still have performance issues: there is no way faster (currently implemented) to summarize by group to the first non-`NA` value. You may try using `Rcpp` or `cpp11` to write your own compiled/efficient function that handles the grouping itself, and does not try to `is.na` all values, just the values needed to find one non-NA value; having said that, doing the grouping part is not always the easiest to do fast and efficiently, especially when the groups must be discovered at runtime. – r2evans Jun 28 '21 at 16:39

1 Answers1

0
library(data.table)
DT[, lapply(.SD, function(z) na.omit(z)[1]), by = Group]
#    Group    V1     V2    V3    V4     V5     V6
#    <int> <int> <char> <int> <int> <lgcl> <char>
# 1:     1     3     fn     5     2     NA     ab

Data

DT <- setDT(structure(list(Group = c(1L, 1L), V1 = c(3L, 7L), V2 = c(NA, "fn"), V3 = c(5L, 0L), V4 = c(NA, 2L), V5 = c(NA, NA), V6 = c("ab", NA)), class = c("data.table", "data.frame"), row.names = c(NA, -2L)))
r2evans
  • 141,215
  • 6
  • 77
  • 149