6

I have a data.table with many missing values and I want a variable which gives me a 1 for the first non-missin value in each group.

Say I have such a data.table:

library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]

which now has missings in the beginning, at the end and in between. I have tried two versions, one is:

DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]

but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is the case because data.table first subsets by i, then sorts by group, correct? So it will only work with the row that is the global minimum of which(!is.na(Petal.Width)) which is the first non-NA value.

A second attempt with the test in j:

DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]

which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.

my goal is this:

DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]

but in reality I have hundreds of groups. Any help would be appreciated!

Edit: this question does come close but is not targeted at NA's and does not solve the issue here if I understand it correctly. I tried:

DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']
Community
  • 1
  • 1
Jakob
  • 1,325
  • 15
  • 31
  • 2
    Possible duplicate of [R: first observation by group using data.table & self-join](http://stackoverflow.com/questions/15776064/r-first-observation-by-group-using-data-table-self-join) – mtoto Jun 09 '16 at 11:19
  • see edit, I'm not sure – Jakob Jun 09 '16 at 11:43

3 Answers3

9

Here's one way:

DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]
Arun
  • 116,683
  • 26
  • 284
  • 387
  • nice, this one also preserves NA's in between, that might be handy – Jakob Jun 09 '16 at 11:45
  • Hey looks, good, can you explain this part of your code `seq_len(.N)` – Sowmya S. Manian Jun 09 '16 at 12:01
  • 2
    `.N` is a special symbol that holds the number of observations for each group. And `seq_len` constructs a sequence from 1 to .N. See `?data.table` for `.N` and other special symbols, and `?seq_len` for more. – Arun Jun 09 '16 at 12:03
2

We can try

DT[DT[, .I[which.max(!is.na(Petal.Width))] , Species]$V1, 
     first_available := 1][is.na(first_available), first_available := 0]

Or a slightly more compact option is

DT[, first_available := as.integer(1:nrow(DT) %in% 
      DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]
akrun
  • 874,273
  • 37
  • 540
  • 662
-1
  > DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
  #      Petal.Width    Species first_available
  #   1:         0.2     setosa               1
  #   2:         1.8 versicolor               1
  #   3:         1.4  virginica               1

  > rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
  # [1] "11"  "71"  "135"

  > rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
  # [1] "12"  "13"  "14"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
  # [13] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
  # [25] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "72"  "73"  "74"  "75" 
  # [37] "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87" 
  # [49] "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99" 
  # [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
  # [73] "147" "148" "149" "150"
Sowmya S. Manian
  • 3,723
  • 3
  • 18
  • 30
  • 1
    but this assumes I already have the answer, no? `first_available` is what I'm trying to get, I just built it manually at the end to show what I'm aiming for. – Jakob Jun 09 '16 at 11:36
  • 1
    also, is it not considered bad style to mix `data.table` and `data.frame` syntax? I do it once in a while for simplicity as well so I'm not sure. What do you think? – Jakob Jun 09 '16 at 11:37
  • 1
    Oops, just checked. I was trying your first_available and then it was there in DT, My stupidity. I am doing by dataframe way. Editing my answer in a while. – Sowmya S. Manian Jun 09 '16 at 11:40