0

Suppose we have extensive data.table containing multiple columns, some numeric and other characters. For each sub-group by and each column, find the first non-NA value: For example, if two rows represent one sub-group:

Group V1 V2 V3 V4 V5 V6
1      3 NA  5 NA  NA ab 
1      7 fn  0 2  NA NA 

The expected result is:

Group V1 V2 V3 V4 V5 V6
1      3 fn  5 2  NA ab 

Suppose we have data.table with about 40 million rows with 10 million groups and 60 columns. The expected result will contain 10 million (one record for each sub-group) and 60 columns. Other solutions assume only one column with missing values or only numeric columns with NA's. Using R data.table function nafill except only double and integer data types and na.locf nor na.locf0 from package zoo can run hours before completing.

Alon Hazan
  • 191
  • 1
  • 7

1 Answers1

1

You may try na.omit -

library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), Type, .SDcols = x1:x5]

#   Type x1 x2 x3 x4 x5
#1:    A  4  0  5  2  a
#2:    B  7  2  5 NA  d

Or fcoalesce :

setDT(df)[, lapply(.SD,function(x) fcoalesce(as.list(x))), Type, .SDcols = x1:x5]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I tried both options; it is slow. The data is large, ~40M and 20 columns. I'm looking for a fast solution. – Alon Hazan Jun 27 '21 at 10:48
  • The only way I can think of that *may* speed that up (and it's really marginal) is to use `x[!is.na(x)][1]` instead of `na.omit(x)[1]`. The amount of speed-up is oddly related to the ratio of `NA`s in each column, so it's not a guarantee that it'll be better. Since you're grouping it by `Type`, I suspect that the slowdown is primarily due to the number of groups you have, in which case I don't think there's much you can do. How long is it taking? – r2evans Jun 27 '21 at 12:25
  • (To be clear, depending on the size and the ratio of `NA`s, `na.omit` may be faster. So not only is it a marginal speed-up, it's not guaranteed to be faster at all. But the speed difference is still marginal.) – r2evans Jun 27 '21 at 12:26
  • Maybe the best solution is to write a `c` code for speed. – Alon Hazan Jun 28 '21 at 16:06