0

I'm trying to find the mean of values in certain columns, and append that into a new column.

I have a variable containing a list of character elements, and these elements correspond to certain column names. For example,

qcs <- list.files(qcDirectory)
str(qcs)
chr [1:5] "201701" "201702" "201703" "201704" 

Here's an example data table:

        201701  201702  201703  201704
row1    12      20      50      54
row2    15      23      45      35
row3    16      20      52      58
row4    17      13      47      48

For example, I'd like to get the average of the values contained in the columns named "201701" and "201702", so the expected output would be:

        201701  201702  201703  201704  mean
row1    12      20      50      54      16
row2    15      23      45      35      19
row3    16      20      52      58      18
row4    17      13      47      48      15

I can't just apply it to the first x number of columns, because they could appear anywhere in the table. I'm thinking lapply might be the way to go, but I'm not sure how I would get this to work. In plain English, I basically want the function to be, "for all column names that are contained in this variable [qcs], find the mean of those values." Not sure whether I need some sort of a loop, or what. I'm sure the answer is already out there, but I'm a noob and having trouble trying to apply similar posts like this and this. I'm thinking something vaguely like what I have below, which can successfully identify the right columns to operate on, but that's just averaging the values in each column and outputting a list, rather than averaging the values across the rows of the relevant columns and putting them into a new column. Any ideas are appreciated.

sapply(colnames(datamatrix), function(c) {
  if (c %in% qcs) datamatrix$mean = mean(test1[, c])
}) 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
heds1
  • 3,203
  • 2
  • 17
  • 32

3 Answers3

1

This is what you can do. Store the columns of interest in a vector col as below. Then use apply.

col=c('201701','201702') df$mean=apply(df[,col],1,mean)

Yogesh
  • 1,384
  • 1
  • 12
  • 16
1

This should be a strategy that will fit well in your situation, since (as I understand) you might want to compute this for several different combinations of columns.

Illustrating with the iris dataset: say you want to add a new column which for each row is the mean of the values in the "Sepal.Width" and "Petal.Width" columns at the corresponding row. Here is a vector holding the relevant column names (here they're selected by a pattern in the column name, as in your case as I understand it).

data(iris)
cname_pattern <- "Width"
target_cols <- grep(cname_pattern, names(iris), value=TRUE)

Then you can just say

iris[[paste0(cname_pattern, "_mean")]] <- rowMeans(iris[, target_cols])

And you'll get the mean of the relevant columns at each row in the new column. This should be flexible enough to fit the case at hand. Might want to use na.rm=TRUE in rowMeans() if it's possible there'll be missing values anywhere.

lefft
  • 2,065
  • 13
  • 20
1

Using dplyr and assuming df as your data table, you can do:

select(df,c(201701,201702)) %>% summarise_all(mean)

dplyr gives you added flexibility for selecting groups columns as you can use helpers like matches, starts_with,contains etc.

hvollmeier
  • 2,956
  • 1
  • 12
  • 17