
Sorry for another "vectorize this for loop" question, but I have not been able to work out how to do this. The function I'm trying to write is simple:

For each row in enroll.in, use the output of the hasMedClaims logistic model as the probability of a response.

Generate a uniform random number and compare it with that probability to decide whether a response should be modeled.

If yes, model the response. If no, record a 0. Repeat this nsim times for each row of enroll.in.

simMedClaims.loop<-function(hasMedClaims.in, MedClaims.in,  enroll.in, nsim = 100){

  set.seed(100)
  #dataframe to hold results
  results<-matrix(0, ncol = nsim, nrow = nrow(enroll.in))
  results<-data.frame(results)

  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  for(ii in 1:nrow(enroll.in))
  {
    for(jj in 1:nsim){
      unif.rand<-runif(1)
      results[ii,jj]<-ifelse(unif.rand < hasclaims[ii], exp(rnorm(1,mean = means[ii], sd = sqrt(MedClaims.in$sig2))), 0)
    }

  }

  return(results)
}

set.seed(100)
dummy<-data.frame(hasresponse = rbinom(100000, 1, .5), response = rnorm(100000, mean = 5, sd = 1), x1 = runif(100000, 0, 60), x2 = as.factor(rbinom(100000, 1, .5)+1))
dummy$response<-dummy$hasresponse*dummy$response
hasresponse_gam<-mgcv::gam(hasresponse ~ s(x1,bs="ps", by=x2)+x2, data=dummy, family = binomial(link="logit"), method="REML")
response<-mgcv::gam(response ~ s(x1,bs="ps", by=x2)+x2, data=dummy[dummy$hasresponse==1,])
dummyEnroll<-data.frame(x1 = runif(10, 20, 50), x2 = as.factor(rbinom(10, 1, .5)+1))
system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 1000))

user  system elapsed 
38.66    0.00   39.35 

I've tried lots of different ideas but I get different problems with each one.

Both hasMedClaims.in and MedClaims.in are GAMs fit using the mgcv gam function.

Clarification on why I'm asking this: As the output shows, it takes a few seconds per subject to run 1000 simulations. I'll be using this on datasets with tens of thousands of subjects, and I want to run at least 50,000 simulations. My current code works but it is just way too slow. My goal is to optimize my function to run much faster.

Attempt at @Parfait's func2

simMedClaims2<-function(hasMedClaims.in, MedClaims.in,  enroll.in, nsim = 100){
  set.seed(100)
  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  results<-data.frame(t(vapply(seq(nrow(enroll.in)), function(ii, jj){
    ifelse(runif(jj) < hasclaims[ii],1,0)*exp(rnorm(nsim,mean = means[ii], sd = sqrt(MedClaims.in$sig2)))
  },numeric(nsim),seq(nsim))))
  return(results)
}

The results look reasonable, though I have not fully vetted them yet. I also edited my original loop function to calculate the means outside of the loop, which made it much faster:

> system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed 
   0.06    0.00    0.13
> system.time(result2<-simMedClaims2(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed 
   0.02    0.00    0.02

However, running all.equal(result, result2) shows that the outputs are not equivalent. I can't figure out why that is.


1 Answer

Consider passing two vector arguments to sapply or vapply, which avoids both the nested for loop and the need to initialize the results data frame. Whether the apply family counts as truly vectorized is still debatable:

simMedClaims.loop <- function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){

  hasclaims <- predict(hasMedClaims.in, newdata = enroll.in, type = "response")

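  # "..." stands for OP's response expression; jj is seq(nsim), so runif(jj) draws nsim uniforms
  results <- data.frame(t(vapply(seq(nrow(enroll.in)), function(ii,jj) { 
                                      unif.rand <- runif(jj) 
                                      ifelse(unif.rand < hasclaims[ii], ..., 0)
                                  }, numeric(nsim), seq(nsim))))
  return(results)
}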

Alternatively, consider an expand.grid() approach that builds the data in long format and reshapes it at the end into the required multi-column layout. Apart from that final reshaping step, this is vectorized (no R-level loops, though there are loops in the underlying C code).

simMedClaims.loop <- function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){

  hasclaims <- predict(hasMedClaims.in, newdata = enroll.in, type = "response")

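  # LONG FORMAT: sim varies fastest, so each run of nsim rows belongs to one subject
  df <- expand.grid(sim = 1:nsim, row = 1:nrow(enroll.in))
  df$unif.rand <- runif(nrow(df))
  # "..." stands for OP's response expression
  df$val <- ifelse(df$unif.rand < hasclaims[df$row], ..., 0)

  # WIDE FORMAT: one row per subject, one column per simulation
  results <- data.frame(t(sapply(seq(1, nrow(df), by=nsim), function(i) 
                                 df$val[i:(i+(nsim-1))])))
  return(results)
}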
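
For comparison, here is a minimal sketch of a variant with no apply call at all. It assumes the fitted claim probabilities and log-scale means are already available as plain numeric vectors. The names sim_claims_matrix, p_claim, mu and sigma are hypothetical stand-ins (for the two predict() results and sqrt(MedClaims.in$sig2)), not part of the original post. It relies on R filling matrices column by column, so a vector with one value per subject recycles down each column.

```r
# Hypothetical helper: simulate nsim draws per subject without R-level loops.
# p_claim: probability of any claim per subject (stand-in for the first predict() call)
# mu:      log-scale mean per subject (stand-in for the second predict() call)
# sigma:   log-scale standard deviation (stand-in for sqrt(MedClaims.in$sig2))
sim_claims_matrix <- function(p_claim, mu, sigma, nsim = 100) {
  n <- length(p_claim)
  # n x nsim matrices are filled column by column, so a length-n vector
  # recycles down each column: row i always pairs with p_claim[i] and mu[i].
  has_claim <- matrix(runif(n * nsim), nrow = n) < p_claim
  amount <- exp(matrix(rnorm(n * nsim, mean = mu, sd = sigma), nrow = n))
  has_claim * amount   # logical * numeric gives 0 where there is no claim
}

# Made-up inputs for illustration only
set.seed(100)
p_claim <- c(0, 0.5, 1)
mu <- c(1, 5, 9)
sim <- sim_claims_matrix(p_claim, mu, sigma = 1, nsim = 2000)
dim(sim)             # subjects in rows, simulations in columns
rowMeans(sim > 0)    # share of nonzero draws per subject, should track p_claim
```

With real models, p_claim and mu would come from the predict() calls shown above. The draws will not match the loop version value for value, because the random numbers are consumed in a different order.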

The methods above were tested with random data and return the same results as the nested for loops (OP's predict and ifelse calls are left out because there was no reproducible example):

Data

enroll.in <- sapply(1:5, function(i) rnorm(15))
nsim <- 100

Methods

func1 <- function() {      
  set.seed(98)
  results1<-matrix(0, ncol = nsim, nrow = nrow(enroll.in))
  results1<-data.frame(results1)

  for(ii in 1:nrow(enroll.in))
  {
   for(jj in 1:nsim){

     results1[ii,jj] <- runif(1)
   }
  }
  return(results1)
}

func2 <- function() {
  set.seed(98)
  results2 <- data.frame(t(vapply(seq(nrow(enroll.in)), function(ii,jj) 
                                       runif(jj), 
                                  numeric(nsim), seq(nsim))))
}

func3 <- function() {
  set.seed(98)
  df <- expand.grid(1:nrow(enroll.in), 1:nsim)
  df$random_num <- runif(nrow(df))

  results3 <- data.frame(t(sapply(seq(1, nrow(df), by=nsim), function(i) 
                                  df$random_num[i:(i+(nsim-1))])))
}

Outcome

all.equal(func1(), func2())
# [1] TRUE
all.equal(func2(), func3())
# [1] TRUE
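
The equality here depends on every method consuming the random number stream in the same order. The sketch below uses made-up values (p, mu and sigma are illustrative, not taken from OP's models) to show what happens once a second, conditional draw is interleaved inside the loop: the individual values no longer line up even with the same seed, although both versions sample from the same distribution.

```r
# Illustrative values only; p, mu, and sigma are not taken from the original post.
p <- 0.5; mu <- 5; sigma <- 1; nsim <- 10000

# Loop order: one uniform, then (only when there is a claim) one normal.
set.seed(100)
loop_draws <- numeric(nsim)
for (jj in seq_len(nsim)) {
  u <- runif(1)
  loop_draws[jj] <- if (u < p) exp(rnorm(1, mean = mu, sd = sigma)) else 0
}

# Vectorized order: all uniforms first, then all normals.
set.seed(100)
u_all <- runif(nsim)
z_all <- rnorm(nsim, mean = mu, sd = sigma)
vec_draws <- (u_all < p) * exp(z_all)

isTRUE(all.equal(loop_draws, vec_draws))       # element-wise comparison of the two runs
c(mean(loop_draws > 0), mean(vec_draws > 0))   # share of nonzero draws in each run
```

When the draws are interleaved like this, comparing summaries such as the share of nonzero values or quantiles is a more meaningful check than all.equal().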

Benchmarks suggest that, at least for small data, there is little difference between the methods. NOTE: `microbenchmark(func1)` is written without parentheses, so the nanosecond figures below measure evaluating the function name, not running the function, and should not be read as the cost of each method. The old adage may still hold: there's nothing wrong with for loops:

library(microbenchmark)

microbenchmark(func1)
# Unit: nanoseconds
#   expr min lq  mean median uq max neval
#  func1  30 32 37.07     32 33 461   100

microbenchmark(func2)
# Unit: nanoseconds
#   expr min lq  mean median uq max neval
#  func2  29 31 39.41     32 33 729   100

microbenchmark(func3)
# Unit: nanoseconds
#   expr min lq mean median uq max neval
#  func3  30 31 35.6     32 33 370   100
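
To time the methods themselves, the benchmarked expression has to be a call, for example `microbenchmark(func1(), func2(), func3())`. The sketch below shows the same idea with base R's system.time() so that it runs without extra packages. fill_loop and fill_vapply are hypothetical stand-ins for the loop and vapply methods, the sizes are illustrative, and no particular timing result is claimed.

```r
# Hypothetical stand-ins for the loop and vapply methods; sizes are illustrative.
n <- 15; nsim <- 100

fill_loop <- function() {
  out <- matrix(0, nrow = n, ncol = nsim)
  for (ii in seq_len(n)) for (jj in seq_len(nsim)) out[ii, jj] <- runif(1)
  out
}

fill_vapply <- function() {
  # vapply returns an nsim x n matrix, so transpose to get one row per subject
  t(vapply(seq_len(n), function(ii) runif(nsim), numeric(nsim)))
}

# Note the parentheses: the expression being timed is the call, not the name.
t_loop   <- system.time(for (k in 1:50) fill_loop())
t_vapply <- system.time(for (k in 1:50) fill_vapply())
c(loop = t_loop[["elapsed"]], vapply = t_vapply[["elapsed"]])
```

Timings from a run like this depend on the machine and on n and nsim, so they are best measured at sizes close to the real workload.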
  • I've made some additions to improve my question – gabagool Sep 04 '17 at 22:23
  • Did you try this solution with your example data? – Parfait Sep 04 '17 at 23:42
  • Then what is your remaining issue? – Parfait Sep 05 '17 at 16:51
  • Only remaining issue is that the loop function and vectorized functions do not return equivalent results. I'm not sure if it's just a randomization issue or if there's something else wrong with the vectorized function – gabagool Sep 05 '17 at 17:35
  • You need to `set.seed()` with each random draw. Your `ifelse()` has a `rnorm()` call. To reproduce for both `loop` and `vapply`, add a `set.seed()` with same number just after `unif.rand <- runif(...)` but before `ifelse()`. Also, is the end structure the same (nrows and ncols)? – Parfait Sep 05 '17 at 18:08
  • Actually thinking about it even `set.seed()` may not work as `for` loop will seed with each one pick but `vapply` by each vector (i.e., multiple values). If it wasn't for your random calls, the two should be equivalent as I also show. – Parfait Sep 05 '17 at 18:12
  • In that case, I will be marking this answered. Thanks for all the help – gabagool Sep 05 '17 at 18:18