parallelize call to old R projects using source

Question

I am in the process of updating old R projects. In this process I need executed old projects and new projects.

I have manage to run old projects sequentially (one after another) and I am trying to do the same thing in parallel.

These scripts do not return any value because they write into a database.

This is the working code

execute_script <- function(script_name, folder_path, required_data){
  #' This function prepares the input for a folder and a given script
  #' @param script_name: Contains the name of the executable .R script to be executed
  #' @param folder_path: Contains the folder where the script to executed is stored
  #' @param required_data: This contains the data required to run the script. 
  #' @return when finished returns TRUE to ensure it has ended 

  # I need to do this to join path on server with filename (it depends on user selection)
  script_to_execute <- paste0(folder_path,"\\", script_name)
  script_to_execute <- paste0((unlist(strsplit(script_to_execute, split='\\', fixed = TRUE))), collapse = "/")

  tryCatch( 
    expr = {    
      source(script_to_execute, local = TRUE, echo = TRUE, print.eval = TRUE, spaced = TRUE, chdir = TRUE)
    }, 
    error = function(e){

      print(paste(Sys.time(),"Excepcion execute_script: ",script_name))
      print(paste(Sys.time(),"Excepcion execute_script: ",e))
      return(FALSE)      
    }
  )
  return(TRUE) # TODO chech
}

 model_execution_sequence <- function(script_required_data, selected_scripts){
  #' Executes the scripts received in selected_projects using script_required_data
  #' @param script_required_data contains a dataframe with the input
  #' 
  #' @param selected_scripts is a datatable that contains all the information required to execute a script
  #'                    
  #' @return a boolean information to determine if execution was success or not 


  TYPE_1_MODEL <- selected_scripts[APPROACH == '1']
  if (nrow(TYPE_1_MODEL) > 0) {


    my_iters <- 1:dim(TYPE_1_MODEL)[1]
    for (x in my_iters) {
      cat("\n\n")
      a <- execute_script(TYPE_1_MODEL$FILE[x],
                            TYPE_1_MODEL$PATH[x],
                            script_required_data)

      print(paste(Sys.time(),"Executed",TYPE_1_MODEL$FILE[x],"result:",a))
    }

  }else{
    print(paste("Error: Unexpected approach named: ",selected_scripts," "))

  }
  return(TRUE)
}

This is the code I have tested with parallelization

model_execution_foreach <- function(script_required_data, selected_models){
  #' Executes the scripts received in selected_projects using script_required_data
  #' @param script_required_data contains a dataframe with the input
  #' 
  #' @param selected_scripts is a datatable that contains all the information required to execute a script
  #'                    
  #' @return a boolean information to determine if execution was success or not 

  TYPE_1_MODEL <- selected_scripts[TYPE == '1']

  if (nrow(TYPE_1_MODEL) > 0) {
    TYPE_1_MODEL <- distinct(TYPE_1_MODEL)


    no_cores <- detectCores() - 1
    cl <- makeCluster(no_cores, type = "PSOCK")

    foreach (i = 1:dim(TYPE_1_MODEL)[1] ) %dopar%{
      ml<- data.frame(execute_script(TYPE_1_MODEL$FILE[i], TYPE_1_MODEL$PATH[i], script_required_data))
    }
    stopCluster(cl)

  }else{
    print(paste("Error: Unexpected TYPE named: ",selected_scripts))

  }
  return(TRUE)
}

Reason to say it does not work

The script doesn't break (I put it inside a tryCatch)
Each script should write data into the database and they are not

I have followed links like: parallelize function

Parallelism in R

Thanks in advance

You are asking “what am I missing?” but you didn’t tell us what the problem is: what isn’t working? What is happening, and what are you expecting to happen? — Konrad Rudolph, Oct 23 '19 at 09:45
in what way doesn't it work? do you get an error? does it do anything? we need more info here — morgan121, Oct 23 '19 at 10:40
Add some logging in `execute_script` (you can keep using "print" but I would recommend the `logging` package) to see what is being run when. You probably have to write the logs to a file since you are running your jobs in parallel. — asachet, Oct 23 '19 at 10:59
By the way, if you think it is not working because you cannot see the usual print-out, keep in mind that jobs running in parallel do not write to stdout. See [this question](https://stackoverflow.com/questions/10903787/how-can-i-print-when-using-dopar). — asachet, Oct 23 '19 at 11:01
I have omitted the prints included in the job. I say it does not work because the scripts should write information on databases that are not populated. — jalazbe, Oct 23 '19 at 12:53

jalazbe · Answer 1 · 2019-11-06T17:18:28.087

I have managed to solve part of the problem.

Instead of using foreach I have used parLapply like this:

models_names<- list("TYPE_1", "TYPE_2", "TYPE_3", "TYPE_4" )
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)
clusterExport(cl, varlist =  vars_list <-
                as.vector(ls()), envir = environment())
input <- as.data.table(data)


tryCatch({
  RESULTS <- as.data.frame(parLapply(cl, models_names, function(x) {
    source(
      BU_available_paths[[x]],
      local = TRUE,
      echo = TRUE,
      print.eval = TRUE,
      spaced = TRUE,
      chdir = TRUE
    )
    return(TRUE)
  }))
},
error = function(e) {
  stopCluster(cl)




},
finally = {
  stopCluster(cl)
  sink(log_path_main, type = "output", append = TRUE)
  print("Exit from Parallelization call ")
})


    stopCluster(cl)

parallelize call to old R projects using source

This is the working code

This is the code I have tested with parallelization

Reason to say it does not work

1 Answers1