See the minimal example:

library(data.table)
DT <- data.table(x = 2, y = 3, z = 4)

DT[, c(1:2)]  # first way
#    x y
# 1: 2 3

DT[, (1:2)]  # second way
# [1] 1 2

DT[, 1:2]  # third way
#    x y
# 1: 2 3

As described in this post, subsetting columns with numeric indices is now possible. However, I would like to know why, in the second way, the indices are evaluated as a plain vector rather than as column indices.

In addition, I updated data.table just now:

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.2

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4    yaml_2.1.17
Henrik
mt1022
  • I'm guessing it is because the `j` part in data.table isn't only used for selecting columns; hence, when you wrap it in parentheses, you are telling data.table that you want to evaluate an expression rather than select specific columns. I guess that is why running `DT[, (1:2), verbose = TRUE]` tells you that you haven't selected any columns in `j` – David Arenburg May 13 '18 at 08:26
  • @DavidArenburg, very likely. But if `(1:2)` is evaluated as an expression, `c(1:2)` should also be evaluated as an expression, right? – mt1022 May 13 '18 at 08:41
  • Don't think so. `c()` isn't evaluating anything, it's just returning a vector. I'm guessing that with the new changes, data.table expects to receive the columns either in the form of a list (NSE), e.g. `DT[, .(x)]`, or in the form of a vector of positions or quoted column names (SE), `DT[, 1]` or `DT[, c("x")]` (or just `DT[, "x"]`) respectively. I haven't looked at the source code though, so I can't tell for sure. – David Arenburg May 13 '18 at 08:47
  • @DavidArenburg, thanks for the explanation. I am also trying to figure out the reason by examining the source code, but it is so long that I am having a hard time tracking the evaluation of `j`. :( – mt1022 May 13 '18 at 08:54
  • I think you can look [here](https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L278) – David Arenburg May 13 '18 at 08:54
  • Related: [Select a sequence of columns: `:` works but not `seq`](https://stackoverflow.com/questions/41775462/select-a-sequence-of-columns-works-but-not-seq) – Henrik Oct 28 '19 at 14:07

1 Answer

By looking at the source code, we can simulate data.table's behaviour for different inputs:

if (!missing(j)) {
    jsub = replace_dot_alias(substitute(j))
    root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
    if (root == ":" ||
        (root %chin% c("-","!") && is.call(jsub[[2L]]) && jsub[[2L]][[1L]]=="(" && is.call(jsub[[2L]][[2L]]) && jsub[[2L]][[2L]][[1L]]==":") ||
        ( (!length(av<-all.vars(jsub)) || all(substring(av,1L,2L)=="..")) &&
          root %chin% c("","c","paste","paste0","-","!") &&
          missing(by) )) {   # test 763. TODO: likely that !missing(by) iff with==TRUE (so, with can be removed)
      # When no variable names (i.e. symbols) occur in j, scope doesn't matter because there are no symbols to find.
      # If variable names do occur, but they are all prefixed with .., then that means look up in calling scope.
      # Automatically set with=FALSE in this case so that DT[,1], DT[,2:3], DT[,"someCol"] and DT[,c("colB","colD")]
      # work as expected.  As before, a vector will never be returned, but a single column data.table
      # for type consistency with >1 cases. To return a single vector use DT[["someCol"]] or DT[[3]].
      # The root==":" is to allow DT[,colC:colH] even though that contains two variable names.
      # root == "-" or "!" is for tests 1504.11 and 1504.13 (a : with a ! or - modifier root)
      # We don't want to evaluate j at all in making this decision because i) evaluating could itself
      # increment some variable and not intended to be evaluated a 2nd time later on and ii) we don't
      # want decisions like this to depend on the data or vector lengths since that can introduce
      # inconistency reminiscent of drop=TRUE in [.data.frame that we seek to avoid.
      with=FALSE

Basically, "[.data.table" captures the unevaluated expression passed to j and decides how to treat it based on a few predefined rules. If one of the rules is satisfied, it sets with=FALSE, which means that j is treated as column names or positions using standard evaluation, so columns are selected instead of j being evaluated as an expression.
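
To see what that switch does in practice, here is a minimal sketch that reuses the one-row table from the question. It assumes a data.table version at least as recent as the 1.11.2 reported there; `res_expr` and `res_cols` are names introduced here for illustration only.

```r
library(data.table)
DT <- data.table(x = 2, y = 3, z = 4)

# Parenthesised j: none of the rules match, so j is evaluated as an
# ordinary expression and the integer vector itself comes back.
res_expr <- DT[, (1:2)]

# Passing with = FALSE explicitly makes data.table treat the same j
# as column positions instead.
res_cols <- DT[, (1:2), with = FALSE]

print(res_expr)
print(names(res_cols))
```

So the parentheses do not change the value of the expression; they only change whether the automatic with=FALSE detection fires.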

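The rules are (roughly) as follows:

  1. Set with=FALSE,

    1.1. if the j expression is a call and the call is : (e.g. 1:2 or colC:colH), or

    1.2. if the call is - or ! applied to a parenthesised : call (e.g. -(1:2)), or

    1.3. if j contains no variable names (only literal values such as character or numeric constants), or every variable name in it is prefixed with .., and the root of the expression is one of c("","c","paste","paste0","-","!") (where "" means j is not a call at all), and there is no by argument.

Otherwise, with stays TRUE and j is evaluated as an expression.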
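
The "root" these rules refer to can be inspected with base R alone. The following is a small sketch, not data.table's own code: `root_of` is a hypothetical helper that mirrors the `root = ...` line of the excerpt, applied to already-quoted expressions.

```r
# Hypothetical helper: name of the outermost call, or "" if expr is not a call.
root_of <- function(expr) {
  if (is.call(expr)) as.character(expr[[1L]])[1L] else ""
}

roots <- c(
  colon  = root_of(quote(1:2)),      # the call is `:`
  c_call = root_of(quote(c(1:2))),   # the call is `c`
  paren  = root_of(quote((1:2))),    # the call is `(`
  string = root_of(quote("x"))       # a constant, not a call
)
print(roots)
#  colon c_call  paren string
#    ":"    "c"    "("     ""
```

Only the parenthesised form has a root, "(", that appears in none of the three rules.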

So we can convert this into a function and see whether any of the conditions is satisfied. (I've skipped the conversion of . to list and the missing(by) check, as they are irrelevant here; we will just test with list directly.)

library(data.table)  # needed for %chin%

is_satisfied <- function(j) {
  jsub <- substitute(j)  # capture j without evaluating it
  # root is the name of the outermost function call, or "" if j is not a call
  root <- if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
  root == ":" ||                                   # rule 1.1
    (root %chin% c("-", "!") &&                    # rule 1.2
       is.call(jsub[[2L]]) &&
       jsub[[2L]][[1L]] == "(" &&
       is.call(jsub[[2L]][[2L]]) &&
       jsub[[2L]][[2L]][[1L]] == ":") ||
    ((!length(av <- all.vars(jsub)) || all(substring(av, 1L, 2L) == "..")) &&  # rule 1.3
       root %chin% c("", "c", "paste", "paste0", "-", "!"))
}

is_satisfied("x")
# [1] TRUE
is_satisfied(c("x", "y"))
# [1] TRUE
is_satisfied(..x)
# [1] TRUE
is_satisfied(1:2)
# [1] TRUE
is_satisfied(c(1:2))
# [1] TRUE
is_satisfied((1:2))
# [1] FALSE
is_satisfied(y)
# [1] FALSE
is_satisfied(list(x, y))
# [1] FALSE
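
The `..` case above is the one that matters when column positions are stored in a variable. A minimal sketch, assuming the same table as in the question and a data.table version that supports the `..` prefix; `cols`, `by_prefix` and `by_with` are names introduced here for illustration.

```r
library(data.table)
DT <- data.table(x = 2, y = 3, z = 4)

cols <- 1:2  # hypothetical variable holding column positions

# The .. prefix tells data.table to look cols up in the calling scope,
# which satisfies rule 1.3 and selects columns.
by_prefix <- DT[, ..cols]

# Equivalent selection with with = FALSE set explicitly.
by_with <- DT[, cols, with = FALSE]

print(names(by_prefix))
print(names(by_with))
```

A bare `DT[, cols]` does not match any rule, because `cols` is a variable name without the `..` prefix.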
David Arenburg
  • Very clear. In `is_satisfied`, the `root` for `c(1:2)`, `(1:2)`, and `1:2` is `c`, `(`, and `:`, respectively. `(` is not among the special cases, so `is_satisfied` returns `FALSE`. – mt1022 May 13 '18 at 11:32