
How can I efficiently sample one row for each unique value in a column of a data.table in R? For example, given the data.table:

library(data.table)
set.seed(1)

dt <- data.table( 
                   A = sample(c("A", "B", "C", "D", "E"), 100, replace = T),
                   B = sample(1:100, 100, replace = T),
                   C = sample(101:200, 100, replace = T) 
                 )

I need to sample one row for each unique value in column A. For example:

out <- list()
vals <- unique(dt$A)
for (i in seq_along(vals)) {
  # pick one random row index from the rows where A matches vals[i]
  out[[i]] <- dt[sample(dt[, .I[A == vals[i]]], 1)]
}
out <- do.call("rbind", out)

However, the data.table I am applying this to is very large. Is there a data.table method I can use to improve performance?

Powege
1 Answer


You can use `sample` on `.N` for each group to select one random row per group:

library(data.table)
set.seed(123)
dt[, .SD[sample(.N, 1)], A]

#   A   B   C
#1: A  31 143
#2: D  16 175
#3: B 100 165
#4: E  27 190
#5: C  90 197
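For very large tables, a common variant avoids materialising `.SD` for every group: compute one sampled row index per group with `.I`, then subset the table once. A minimal sketch, rebuilding a `dt` like the one in the question:

```r
library(data.table)
set.seed(1)
dt <- data.table(A = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
                 B = sample(1:100, 100, replace = TRUE),
                 C = sample(101:200, 100, replace = TRUE))

# Within each group of A, .I holds the global row numbers of that group;
# sample(.N, 1) picks one position within the group, so V1 (the default
# column name) contains one random row index per group.
idx <- dt[, .I[sample(.N, 1)], by = A]$V1
dt[idx]
```

This does a single subset of `dt` at the end instead of building a sub-table per group.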

dplyr has the `slice_sample()` function (previously `sample_n()`) for this:

library(dplyr)
dt %>% group_by(A) %>% slice_sample(n = 1)
Ronak Shah
  • If this is done repeatedly it would make sense to start by `setkey(dt, A)` to improve performance. – s_baldur Sep 07 '20 at 10:05
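A minimal sketch of the comment's suggestion, assuming a `dt` like the one in the question: `setkey` sorts and indexes the table by `A` once up front, which can speed up repeated grouped operations.

```r
library(data.table)
set.seed(1)
dt <- data.table(A = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
                 B = sample(1:100, 100, replace = TRUE))

setkey(dt, A)  # sort and index by A once, before repeated sampling
sampled <- dt[, .SD[sample(.N, 1)], by = A]
```

The keying cost is paid once; each subsequent grouped sample then works on an already-sorted table.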