R: find first non-NA observation in data.table column by group

Question

I have a data.table with many missing values and I want a variable which gives me a 1 for the first non-missin value in each group.

Say I have such a data.table:

library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]

which now has missings in the beginning, at the end and in between. I have tried two versions, one is:

DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]

but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is the case because data.table first subsets by i, then sorts by group, correct? So it will only work with the row that is the global minimum of which(!is.na(Petal.Width)) which is the first non-NA value.

A second attempt with the test in j:

DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]

which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.

my goal is this:

DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]

but in reality I have hundreds of groups. Any help would be appreciated!

Edit: this question does come close but is not targeted at NA's and does not solve the issue here if I understand it correctly. I tried:

DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']

Possible duplicate of [R: first observation by group using data.table & self-join](http://stackoverflow.com/questions/15776064/r-first-observation-by-group-using-data-table-self-join) — mtoto, Jun 09 '16 at 11:19

score 9 · Accepted Answer · answered Jun 09 '16 at 11:43

9

Here's one way:

DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]

answered Jun 09 '16 at 11:43

Arun

116,683
26
284
387

nice, this one also preserves NA's in between, that might be handy – Jakob Jun 09 '16 at 11:45
Hey looks, good, can you explain this part of your code `seq_len(.N)` – Sowmya S. Manian Jun 09 '16 at 12:01
2

`.N` is a special symbol that holds the number of observations for each group. And `seq_len` constructs a sequence from 1 to .N. See `?data.table` for `.N` and other special symbols, and `?seq_len` for more. – Arun Jun 09 '16 at 12:03

akrun · Answer 2 · 2016-06-09T11:39:23.730

2

We can try

DT[DT[, .I[which.max(!is.na(Petal.Width))] , Species]$V1, 
     first_available := 1][is.na(first_available), first_available := 0]

Or a slightly more compact option is

DT[, first_available := as.integer(1:nrow(DT) %in% 
      DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]

edited Jun 09 '16 at 11:39

answered Jun 09 '16 at 11:24

akrun

874,273
37
540
662

1

great, this does what I'm looking for. I'll catch up on .I and 1L – Jakob Jun 09 '16 at 11:46

Sowmya S. Manian · Answer 3 · 2016-06-09T11:34:00.530

-1

  > DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
  #      Petal.Width    Species first_available
  #   1:         0.2     setosa               1
  #   2:         1.8 versicolor               1
  #   3:         1.4  virginica               1

  > rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
  # [1] "11"  "71"  "135"

  > rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
  # [1] "12"  "13"  "14"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
  # [13] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
  # [25] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "72"  "73"  "74"  "75" 
  # [37] "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87" 
  # [49] "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99" 
  # [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
  # [73] "147" "148" "149" "150"

edited Jun 09 '16 at 11:34

answered Jun 09 '16 at 11:27

Sowmya S. Manian

3,723
3
18
30

1

but this assumes I already have the answer, no? `first_available` is what I'm trying to get, I just built it manually at the end to show what I'm aiming for. – Jakob Jun 09 '16 at 11:36
1

also, is it not considered bad style to mix `data.table` and `data.frame` syntax? I do it once in a while for simplicity as well so I'm not sure. What do you think? – Jakob Jun 09 '16 at 11:37
1

Oops, just checked. I was trying your first_available and then it was there in DT, My stupidity. I am doing by dataframe way. Editing my answer in a while. – Sowmya S. Manian Jun 09 '16 at 11:40

R: find first non-NA observation in data.table column by group

3 Answers3

Linked

Related