How to apply a function to a subset of columns in R?

Question

I am using by to apply a function to a range of columns of a data frame, grouped by a factor. Everything works if I use mean() as the function, but if I use median() I get the error "Error in median.default(x) : need numeric data", even though there are no NAs in the data frame.

The line that works using mean():

by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))

> by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))
iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length 
       5.006        3.428        1.462 
------------------------------------------------------------ 
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length 
       5.936        2.770        4.260 
------------------------------------------------------------ 
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length 
       6.588        2.974        5.552 
Warning messages:
1: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
2: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
3: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead.

But if I use median() (note the na.rm=T option):

> by(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))
Error in median.default(x, na.rm = T) : need numeric data

However if instead of choosing the range [,1:3] of columns I choose only one of the columns it works:

> by(iris[,1], iris$Species, function(x) median(x,na.rm=T))
iris$Species: setosa
[1] 5
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.9
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.5

How can I achieve this behaviour while selecting a range of columns?

The warning messages you get when you use `mean` should be a strong clue that, in fact, everything doesn't work "just fine". This recent [answer](http://stackoverflow.com/a/9424510/324364) of mine might shed some light on this for you. — joran, Mar 01 '12 at 16:35

score 4 · Accepted Answer · answered Mar 01 '12 at 16:31

When you use by, you are using a split-apply strategy. The objects passed to the function are data frames, so you get the error because median.data.frame does not exist, and the warning because mean.data.frame is about to be removed. It might work better if you used aggregate:

> aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa        5.006       3.428        1.462
2 versicolor        5.936       2.770        4.260
3  virginica        6.588       2.974        5.552
> aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa          5.0         3.4         1.50
2 versicolor          5.9         2.8         4.35
3  virginica          6.5         3.0         5.55

aggregate works on the column vectors individually and then tabulates the results.
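
If you would rather keep by, one option is to apply the function to each column of the per-group data frame yourself. This is a minimal sketch, assuming the same iris columns as in the question; the names `med_by_species` and `mean_by_species` are only illustrative:

```r
# Each piece that by() passes to the function is a data frame, so
# apply median() to its columns one at a time with sapply().
med_by_species <- by(iris[, 1:3], iris$Species,
                     function(df) sapply(df, median, na.rm = TRUE))
med_by_species

# For means, colMeans() is the replacement the deprecation warning suggests.
mean_by_species <- by(iris[, 1:3], iris$Species, colMeans, na.rm = TRUE)
mean_by_species
```

The result is still a by object with one named vector per species, so it prints in the same layout as the mean() output in the question.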

IRTFM


Thanks. It works now. My remaining doubt is what the difference is between `aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))` and `aggregate(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))`. The second one returns this error: `Error in aggregate.data.frame(iris[, 1:3], iris$Species, function(x) median(x, : 'by' must be a list` – pedrosaurio Mar 01 '12 at 16:53

@pedrosaurio The error message says it all. `iris["Species"]` is a list (a data frame, actually), whereas `iris$Species` is not. You can verify this using `str()`. – joran Mar 01 '12 at 17:03

I thought of adding a note saying that you were using `$Species` which is equivalent to `[["Species"]]` which returns an atomic vector and that I was using `["Species"]` which returns a list. I guess that I should have done so. – IRTFM Mar 01 '12 at 17:31
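
The difference discussed in these comments can be checked directly. The sketch below assumes only base R and the built-in iris data; `agg_list` and `agg_df` are illustrative names, not part of the original answer:

```r
# `[` keeps the data frame (which is a list); `$` extracts the factor itself.
is.list(iris["Species"])   # TRUE
is.list(iris$Species)      # FALSE

# Wrapping the factor in a named list satisfies aggregate()'s `by` argument.
agg_list <- aggregate(iris[, 1:3], list(Species = iris$Species),
                      function(x) median(x, na.rm = TRUE))

# The one-column data frame form used in the answer gives the same table.
agg_df <- aggregate(iris[, 1:3], iris["Species"],
                    function(x) median(x, na.rm = TRUE))
```

Naming the list element (`Species = ...`) keeps a readable grouping column name in the result.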

Plantaloons · Answer 2 · 2017-08-14T23:10:20.037

The original question is answered. If, however, the range happens to be (instead) all columns except those specified as the independent variable in the formula, the dot formula notation works, and represents a nifty alternative:

> aggregate(. ~ Species, data = iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

> aggregate(. ~ Species, data = iris, median)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0
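
When only some of the columns are wanted, the formula interface can, as far as I know, also take them explicitly through cbind() on the left-hand side. A minimal sketch for the three columns from the question, with `med_subset` as an illustrative name:

```r
# cbind() on the left-hand side selects the columns to summarise;
# Species on the right-hand side is the grouping variable.
med_subset <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
                        data = iris, FUN = median)
med_subset
```

This should give the same table as the aggregate call with `iris[,1:3]` in the accepted answer.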
