
I'm in my first week programming in R, and while I've made much progress on solving specific issues, I'm in need of advice on a larger scale.

I have a directory full of data files in CSV format. The file names specifically identify the data source. I need to import the data, condition the data through various calculations, and keep the results of each file's conditioning for analysis and review. I have successfully learned to open and do extensive conditioning of the data on an individual file basis. The conditioning results in multiple calculation outputs. I need to automate this process and dynamically name the results based on the respective file name.

Since the data conditioning is the same for each file, I've written a function that can be called for each file. I understand functions operate in their own environment, which disappears after the function runs. I can dynamically name variables using `paste` to build names and `assign` to assign results to those names, but those assignments are then lost when the function closes.

I'm not certain of the optimal way to step through all the files and keep all the individual calculation results available in the workspace. I know I'm "supposed to" write the function output to a single list which I can later index. However, I will have hundreds of calculation results, and later indexing will be complicated. Let's say two of the files contain air temperature measurements at different locations. Since I dynamically name my calculation results based on the descriptive file names, I can have results stored as Temperature.Air.Location1 and Temperature.Air.Location2. I much prefer the ability to later calculate a temperature delta by simply typing Temperature.Air.Location1 - Temperature.Air.Location2 instead of having to look up the corresponding indices of a large list.
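For what it's worth, a named list gives almost the same convenience: if each element is keyed by a descriptive name, `$` indexes by name rather than position. A minimal sketch, with made-up temperature values:

```r
# Store each file's result in a list keyed by a descriptive name
results <- list()
results[["Temperature.Air.Location1"]] <- c(20.1, 20.5, 19.8)
results[["Temperature.Air.Location2"]] <- c(18.0, 18.4, 18.2)

# Indexing by name is as direct as typing a variable name
delta <- results$Temperature.Air.Location1 - results$Temperature.Air.Location2
```

No lookup of numeric indices is needed, and the whole collection can still be iterated over with `lapply`.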

I'm certain there is an elegant way of achieving this that's staring me in the face, but I'm afraid I've gotten so wrapped up in learning about functions, interpolation, and plotting in R that I've lost sight of the big picture. Any advice is much appreciated.

EDIT TO ADD EXAMPLE CODE: In this portion of the function, I'm converting a table to x, y, z coordinates as well as interpolating the values.

# Note: Path.Folder is assumed to exist in the global environment,
# and interp() comes from an interpolation package such as akima
CalibrationImport.Table <- function(filename, parametername, xmin, xmax, ymin, ymax){
  Path.File <- paste0(Path.Folder, filename)
  assign(parametername, read.csv(Path.File, header = FALSE))

  # Extract x coordinates from original table
  assign(paste0(parametername,".x"), get(parametername)[1, ])
  assign(paste0(parametername,".x"), unlist(get(paste0(parametername,".x"))[-1], use.names=FALSE))
  assign(paste0(parametername,".x"), c(t(replicate(nrow(get(parametername))-1, get(paste0(parametername,".x"))))))

  # Extract y coordinates from original table
  assign(paste0(parametername,".y"), get(parametername)[ ,1])
  assign(paste0(parametername,".y"), unlist(get(paste0(parametername,".y"))[-1], use.names=FALSE))
  assign(paste0(parametername,".y"), c(replicate(ncol(get(parametername))-1, get(paste0(parametername,".y")))))

  # Extract data for original table
  assign(paste0(parametername,".z"), unlist(get(parametername)[-1, -1], use.names=FALSE))

  # Interpolate 100x100 surface
  assign(paste0(parametername,".i"), interp(get(paste0(parametername,".x")), get(paste0(parametername,".y")), get(paste0(parametername,".z")),
                                        xo=seq(xmin, xmax, length=100), yo=seq(ymin, ymax, length=100)))
}
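For comparison, the same extraction can be written without any `assign`/`get` calls by using ordinary local variables and returning everything in one named list. This is a sketch, not the original author's code; it assumes `interp` is `akima::interp` and that the table layout is as described (x coordinates in row 1, y coordinates in column 1, z values in the body):

```r
# Flatten a calibration table into x/y/z vectors.
# unlist() walks the z block column by column, so y must vary fastest.
table_to_xyz <- function(tbl) {
  x <- unlist(tbl[1, -1], use.names = FALSE)
  y <- unlist(tbl[-1, 1], use.names = FALSE)
  grid <- expand.grid(y = y, x = x)  # y varies fastest, matching unlist()
  list(x = grid$x, y = grid$y,
       z = unlist(tbl[-1, -1], use.names = FALSE))
}

# Read, flatten, and interpolate a 100x100 surface, returning one list
CalibrationImport.Table2 <- function(path.file, xmin, xmax, ymin, ymax) {
  tbl <- read.csv(path.file, header = FALSE)
  xyz <- table_to_xyz(tbl)
  xyz$i <- akima::interp(xyz$x, xyz$y, xyz$z,
                         xo = seq(xmin, xmax, length = 100),
                         yo = seq(ymin, ymax, length = 100))
  xyz
}
```

The caller then decides what name the returned list gets, which is exactly the separation the answers below recommend.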
DSG
    It always helps to provide a reproducible example, including a sample of your data. http://stackoverflow.com/q/5963269 – Maxim.K May 03 '13 at 10:54

2 Answers


Don't use `assign` inside the function; use it outside to assign the result of the function, i.e.

 `assign( "name1" , myfunc(x) )`

If you are applying it to your directory of CSV files, you can do something akin to this:

fl <- list.files( "path/to/my/directory" , pattern = "\\.csv$" )

for( i in seq_along(fl) ){
  assign( paste0( "file." , i ) , myfunc( fl[i] ) )
}

This is one of the classic uses of a for loop: applying it for its side-effects.

However, you have hundreds of files so an lapply might be better, which will return results in a list, and is syntactically very simple:

myresults <- lapply( fl , myfunc )

However, you may need to rewrite parts of your function so it doesn't assign anything, but instead returns the values you want to keep. Use assignment (i.e. <- ) to put the return values in an object in the workspace. Without a reproducible example this can only be a rough sketch.

If you want to retain the names of the files, sapply might be better, and it returns your results as a vector and can keep the names:

sapply( fl , myfunc , USE.NAMES = TRUE )
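If you really do want named objects in the workspace, you can also name the list after the files and push it into the global environment in one step with `list2env`. A sketch, with a trivial stand-in for the real conditioning function and invented file names:

```r
fl <- c("Temperature.Air.Location1.csv", "Temperature.Air.Location2.csv")
myfunc <- function(f) nchar(f)  # stand-in for the real conditioning function

# Run the function over every file, keeping results in one named list
results <- lapply(fl, myfunc)
names(results) <- tools::file_path_sans_ext(basename(fl))

# Copy every list element into the workspace as a named object
list2env(results, envir = globalenv())
# Temperature.Air.Location1 etc. now exist as ordinary objects
```

This keeps the function itself side-effect free; the one `list2env` call is the only place results leak into the workspace.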
Simon O'Hanlon

In general the workflow that works well for me is to use lapply. For example:

file_names = list.files(pattern = "\\.csv$")
data_list = lapply(file_names, read.csv)

perform_interpolation = function(dataset) {
   # Perform interpolation on dataset
   return(interpolated_dataset)
}
interpolated_data_list = lapply(data_list, perform_interpolation)

Here I have lists of objects which I transform using functions (i.e. functional programming). The crux is to have simple functions that take a few inputs, and generate one output object.
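Naming the input list makes this workflow especially convenient, because `lapply` preserves names through every transformation step. A small illustration, with invented data and a trivial stand-in for the conditioning step:

```r
data_list <- list(Location1 = c(20.1, 20.5),
                  Location2 = c(18.0, 18.4))

# Stand-in for the real conditioning function: center each series
condition <- function(x) x - mean(x)

# lapply carries the names through, so results stay addressable by name
conditioned <- lapply(data_list, condition)
delta <- conditioned$Location1 - conditioned$Location2
```

Every derived list can be indexed the same way (`conditioned$Location1`), so the descriptive names only need to be set once, at read time.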

Without more specifics from you, it is hard to provide more detailed advice.

Paul Hiemstra
  • +1 this is probably a cleaner workflow by reading in the data separately. – Simon O'Hanlon May 03 '13 at 11:19
  • It depends a bit on how large the script is, but especially if you want to perform several transformations on the same data it makes sense to separate the data reading and the calculation step. – Paul Hiemstra May 03 '13 at 11:22
  • Thank you, Simon and Paul. I'll experiment with lapply this afternoon. I've added example code if you want to further critique my work. – DSG May 03 '13 at 18:58
  • @DSG In general, using assign in this way is a really, really, really bad idea. Using lists has so many advantages. Also, how does your stuff end up in the global environment? The intermediate results aren't really needed, but at some stage the function needs to return something. Functions need to have a few inputs and one output object (e.g. an interpolation result), and no side effects, i.e. the only way a function influences the rest of the program is through its interface. – Paul Hiemstra May 03 '13 at 19:34
  • @Paul Hiemstra Therein lies the root of my question - how does the output end up in the global environment? This oversimplified example doesn't make the best case, but I WILL need access to many of the intermediate results of my calculations, therefore I want them passed to the global environment. I also want the ability to reference them by name rather than by index in an effort to make off-the-cuff calculations or analysis simpler. I admit I haven't yet had the opportunity to explore `lapply` today but appreciate your advice. – DSG May 04 '13 at 00:08
  • I would just use the approach I outline in my answer above. For each step for which you need intermediate results create a function, apply it accordingly. Then you have a series of lists of objects with the stuff you need. How to exactly organize this takes a lot of practice. – Paul Hiemstra May 04 '13 at 04:11