Sorry for another "vectorize a for loop" question, but I have not been able to figure out how to do this. The function I'm trying to write is simple:
For each row in enroll.in, first use the hasMedClaims logistic model output as the probability of a response.
Generate a uniform random number and use it to decide whether a response should be modeled.
If yes, model the response. If no, just record a 0. Repeat this nsim times for each row of enroll.in.
simMedClaims.loop<-function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){
  set.seed(100)
  #dataframe to hold results: one row per subject, one column per simulation
  results<-matrix(0, ncol = nsim, nrow = nrow(enroll.in))
  results<-data.frame(results)
  #predicted response probability and predicted mean for each subject, computed once outside the loops
  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  for(ii in 1:nrow(enroll.in))
  {
    for(jj in 1:nsim){
      unif.rand<-runif(1)
      #model a response only when the uniform draw falls below the response probability
      results[ii,jj]<-ifelse(unif.rand < hasclaims[ii], exp(rnorm(1,mean = means[ii], sd = sqrt(MedClaims.in$sig2))), 0)
    }
  }
  return(results)
}
set.seed(100)
dummy<-data.frame(hasresponse = rbinom(100000, 1, .5), response = rnorm(100000, mean = 5, sd = 1), x1 = runif(100000, 0, 60), x2 = as.factor(rbinom(100000, 1, .5)+1))
dummy$response<-dummy$hasresponse*dummy$response
hasresponse_gam<-mgcv::gam(hasresponse ~ s(x1,bs="ps", by=x2)+x2, data=dummy, family = binomial(link="logit"), method="REML")
response<-mgcv::gam(response ~ s(x1,bs="ps", by=x2)+x2, data=dummy[dummy$hasresponse==1,])
dummyEnroll<-data.frame(x1 = runif(10, 20, 50), x2 = as.factor(rbinom(10, 1, .5)+1))
system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 1000))
   user  system elapsed
  38.66    0.00   39.35
I've tried lots of different ideas but I get different problems with each one.
Both hasMedClaims.in and MedClaims.in are GAMs fit using the mgcv gam function.
Clarification on why I'm asking this: As the output shows, it takes a few seconds per subject to run 1000 simulations. I'll be using this on datasets with tens of thousands of subjects, and I want to run at least 50,000 simulations. My current code works but it is just way too slow. My goal is to optimize my function to run much faster.
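For reference, here is a minimal sketch of what I mean by removing both loops, written with base R only. The function name sim_claims_matrix and the inputs p_claim, mu and sigma are hypothetical stand-ins for hasclaims, means and sqrt(MedClaims.in$sig2); the numbers are made up and nothing here uses the fitted GAMs. It relies on R filling matrices column by column, so a vector with one entry per subject recycles down each column.

```r
# Hypothetical stand-ins (not taken from the fitted models above):
#   p_claim = per-subject probability of a response
#   mu      = per-subject mean on the log scale
#   sigma   = standard deviation on the log scale
sim_claims_matrix <- function(p_claim, mu, sigma, nsim = 100) {
  n <- length(p_claim)
  # Matrices are filled column by column, so a length-n vector recycles
  # down each column: row i is always paired with p_claim[i] and mu[i].
  has_claim <- matrix(runif(n * nsim), nrow = n) < p_claim
  amounts   <- exp(matrix(rnorm(n * nsim, mean = mu, sd = sigma), nrow = n))
  # Logical TRUE/FALSE acts as 1/0, zeroing out the rows' non-responses.
  has_claim * amounts
}

set.seed(100)
out <- sim_claims_matrix(p_claim = c(0.2, 0.5, 0.9),
                         mu = c(5, 5.5, 6), sigma = 1, nsim = 1000)
dim(out)           # subjects in rows, simulations in columns
rowMeans(out > 0)  # share of nonzero draws per subject, for comparison with p_claim
```

This draws all random numbers in two calls, so it should scale better than element-wise assignment into a data frame, but I have not benchmarked it on the real data.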
Attempt at @Parfait's func2
simMedClaims2<-function(hasMedClaims.in, MedClaims.in, enroll.in, nsim = 100){
  set.seed(100)
  hasclaims<-predict(hasMedClaims.in, newdata = enroll.in, type = "response")
  means<-predict(MedClaims.in, newdata = enroll.in, type="response")
  #for each subject: nsim 0/1 response indicators times nsim simulated responses
  results<-data.frame(t(vapply(seq_len(nrow(enroll.in)), function(ii){
    (runif(nsim) < hasclaims[ii])*exp(rnorm(nsim,mean = means[ii], sd = sqrt(MedClaims.in$sig2)))
  },numeric(nsim))))
  return(results)
}
The results look reasonable, though I have not fully vetted them yet. I also edited my original loop function to calculate the means outside of the loop. Both versions are now much faster:
> system.time(result<-simMedClaims.loop(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed
   0.06    0.00    0.13
> system.time(result2<-simMedClaims2(hasresponse_gam, response, dummyEnroll, 100))
   user  system elapsed
   0.02    0.00    0.02
However, running all.equal(result, result2) shows that the outputs are not equivalent. I can't figure out why that is.
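One possible explanation, which I have not confirmed on the GAM objects themselves, is that the two functions consume the random number stream in a different order even though both start from set.seed(100). The loop draws one uniform and then a normal only when a response occurs, while the vapply version draws a whole batch of uniforms followed by a whole batch of normals. The sketch below reproduces that pattern in base R with made-up values (p, mu, sigma and n are hypothetical, not model output):

```r
# Made-up values for illustration only
p <- 0.5; mu <- 5; sigma <- 1; n <- 10000

# Loop-style: one uniform draw, then a normal draw only when there is a response.
set.seed(100)
loop_draws <- numeric(n)
for (k in seq_len(n)) {
  u <- runif(1)
  loop_draws[k] <- if (u < p) exp(rnorm(1, mean = mu, sd = sigma)) else 0
}

# Vectorized: all uniforms first, then all normals.
set.seed(100)
vec_draws <- (runif(n) < p) * exp(rnorm(n, mean = mu, sd = sigma))

# Element-wise comparison of the two sets of draws
isTRUE(all.equal(loop_draws, vec_draws))

# Share of nonzero draws under each scheme, for a distribution-level comparison
c(loop = mean(loop_draws > 0), vectorized = mean(vec_draws > 0))
```

If this is what is happening, the two results would not be expected to match element by element, and comparing summaries such as row means or quantiles would be a more meaningful check than all.equal.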