
Hi, I have a lot of CSV files to process. Each file is generated by a run of an algorithm. My data always has one key column and one value column, like this:

csv1:

      index value
1     1     1
2     2     1
3     3     1
4     4     1
5     5     1

csv2:

      index value
1     4     3
2     5     3
3     6     3
4     7     3
5     8     3

Now I want to aggregate the data from these CSV files, like this:

When both files contain an identical key e.g. 5, the resulting row should contain the key both files share (5) and the mean of both values ((1+3)/2 = 2). If only one file contains a key (e.g. 2), this row is just added to the result table (key = 2, value = 1).

Something like this:

      index value
1     1     1
2     2     1
3     3     1
4     4     2 (as (1+3)/2 = 2)
5     5     2 (as (1+3)/2 = 2)
6     6     3
7     7     3
8     8     3

At first I thought rbind() would do the job, but it does not aggregate the values, it only concatenates the data (see the sketch below). How can I achieve this in R?
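To make that concrete, here is a minimal sketch of what rbind alone gives (data1 and data2 are just illustrative names for the two tables above):

data1 <- data.frame(index = 1:5, value = 1)
data2 <- data.frame(index = 4:8, value = 3)

rbind(data1, data2)
#    index value
# 1      1     1
# 2      2     1
# 3      3     1
# 4      4     1
# 5      5     1
# 6      4     3
# 7      5     3
# 8      6     3
# 9      7     3
# 10     8     3

The keys 4 and 5 appear twice instead of being averaged.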

Matthias B
  • `rbind`, then `aggregate` – James Mar 21 '12 at 15:44
  • You should probably just do this in two separate steps: 1) combine the values into one structure (which you did using rbind) and 2) aggregate based on index. You can easily do step 2 using something like ddply (from the plyr package), tapply, or aggregate. – Dason Mar 21 '12 at 15:46
  • Using `aggregate(. ~ index, data = rbind(data1,data2), mean)` does what I want, thanks! (See the sketch after these comments.) – Matthias B Mar 21 '12 at 15:59
  • Feel free to add it as the answer to your question, and if nobody else gives a better answer, feel free to accept it. This will help others with a similar problem find an answer more easily. – Dason Mar 21 '12 at 16:07
  • As this is a comment, I cannot accept it as the answer, as far as I know. – Matthias B Mar 21 '12 at 16:48
  • @elton: what Dason was saying is that you can write the answer to your own question (for example by posting the code from your third comment) and accept it ;) – digEmAll Mar 21 '12 at 17:48
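That one-liner can be demonstrated on the sample data, reusing data1 and data2 from the sketch in the question:

aggregate(. ~ index, data = rbind(data1, data2), mean)
#   index value
# 1     1     1
# 2     2     1
# 3     3     1
# 4     4     2
# 5     5     2
# 6     6     3
# 7     7     3
# 8     8     3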

2 Answers


Here is a solution. I am following all the excellent comments so far, and hopefully adding value by showing you how to handle any number of files. I am assuming you have all your csv files in the same directory (my.csv.dir below).

# locate the files
files <- list.files(my.csv.dir, full.names = TRUE)

# read the files into a list of data.frames
data.list <- lapply(files, read.csv)

# concatenate into one big data.frame
data.cat <- do.call(rbind, data.list)

# aggregate
data.agg <- aggregate(value ~ index, data.cat, mean)
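As a quick sanity check (assuming my.csv.dir points at a folder containing just the two sample files from the question):

nrow(data.cat)   # 10: five rows from each file, keys 4 and 5 appearing twice
nrow(data.agg)   # 8: one row per distinct key
data.agg$value   # 1 1 1 2 2 3 3 3, i.e. the duplicated keys are averaged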

Edit: to handle your updated question in your comment below:

files     <- list.files(my.csv.dir, full.names = TRUE)
algo.name <- sub("-.*", "", basename(files))                   # algorithm name from the file name
data.list <- lapply(files, read.csv)
data.list <- Map(transform, data.list, algorithm = algo.name)  # tag each data.frame with its algorithm
data.cat  <- do.call(rbind, data.list)
data.agg  <- aggregate(value ~ algorithm + index, data.cat, mean)
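To illustrate the filename parsing (the file names below come from the comment underneath, and sub() is applied to the base names only):

sub("-.*", "", c("algorithm1-values-run1.csv", "algorithm2-values-run2.csv"))
# [1] "algorithm1" "algorithm2"

data.agg then contains one row per (algorithm, index) pair, so the runs of each algorithm are averaged separately.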
flodel
  • I have an addition to this problem. I have multiple algorithms producing these CSV files, so my filenames look like this: `algorithm1-values-run1.csv, algorithm1-values-run2.csv, algorithm2-values-run1.csv, algorithm2-values-run2.csv`. Now I want to read all CSV files but get them aggregated by algorithm, so I would produce a list of lists or something like that. It should also preserve the algorithm names. – Matthias B Apr 03 '12 at 15:49
  • I know it's been two years but if you are still interested I handled your extra request. Please consider accepting my answer. – flodel May 18 '14 at 11:06

What I have understood from the question is that you want to read the CSV (or text) files into a list of data.frames and then aggregate them.

Create a directory, drop all your CSV files into it, and make it your working directory. Now run the following to collect the file names:

l <- list.files(pattern = "\\.csv$")

This l object will contain the names of the CSV files.

m <- Map(read.csv, l)

This Map call applies the read.csv function to every file name, so m is a list containing one data.frame per CSV file.

dat <- do.call(rbind, m)

Now load the plyr library:

library(plyr)

res <- ddply(dat, ~index, summarize, value = mean(value))

This res object will contain the aggregated values.
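Putting the steps together (a minimal sketch, assuming the working directory holds only the two sample files from the question):

library(plyr)

l   <- list.files(pattern = "\\.csv$")   # CSV file names in the working directory
m   <- Map(read.csv, l)                  # one data.frame per file
dat <- do.call(rbind, m)                 # stack them into a single data.frame
res <- ddply(dat, ~index, summarize, value = mean(value))
res   # same table as in the question: keys 1-8, with the shared keys 4 and 5 averaged to 2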

I hope this helps you get your desired result.

furianpandit