
I am using by to apply a function to a range of columns of a data frame, grouped by a factor. Everything appears to work if I use mean() as the function, but if I use median() I get the error "Error in median.default(x) : need numeric data", even though there are no NAs in the data frame.

The line that works using mean():

by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))

> by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))
iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length 
       5.006        3.428        1.462 
------------------------------------------------------------ 
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length 
       5.936        2.770        4.260 
------------------------------------------------------------ 
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length 
       6.588        2.974        5.552 
Warning messages:
1: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
2: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 
3: mean(<data.frame>) is deprecated.
 Use colMeans() or sapply(*, mean) instead. 

But if I use median() (note the na.rm=T option):

> by(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))
Error in median.default(x, na.rm = T) : need numeric data

However, if I select a single column instead of the range [,1:3], it works:

> by(iris[,1], iris$Species, function(x) median(x,na.rm=T))
iris$Species: setosa
[1] 5
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.9
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.5

How can I achieve this behaviour while selecting a range of columns?

zx8754
pedrosaurio
  • The warning messages you get when you use `mean` should be a strong clue that, in fact, everything doesn't work "just fine". This recent [answer](http://stackoverflow.com/a/9424510/324364) of mine might shed some light on this for you. – joran Mar 01 '12 at 16:35

2 Answers


You are using a split-apply strategy when you use by. The objects passed to the function are data frames, one per group, not column vectors. You get the error because there is no median.data.frame method, and the warning because mean.data.frame is deprecated and will soon not exist either. It might work better if you used aggregate:

> aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa        5.006       3.428        1.462
2 versicolor        5.936       2.770        4.260
3  virginica        6.588       2.974        5.552
> aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
     Species Sepal.Length Sepal.Width Petal.Length
1     setosa          5.0         3.4         1.50
2 versicolor          5.9         2.8         4.35
3  virginica          6.5         3.0         5.55

aggregate works on the column vectors individually and then tabulates the results.
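If the by interface is still preferred, one possible workaround is to apply median to each column of the per-group data frame with sapply. This is a minimal sketch rather than part of the original answer; the names res and medians are only illustrative.

```r
# by() hands each group to the function as a data frame, so apply
# median() column by column instead of to the data frame as a whole.
res <- by(iris[, 1:3], iris$Species,
          function(x) sapply(x, median, na.rm = TRUE))

# Optionally stack the per-group vectors into one matrix
# (one row per species, one column per variable).
medians <- do.call(rbind, res)
print(medians)
```

The same pattern works for mean and avoids the deprecation warning, because the function only ever sees numeric vectors.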

IRTFM
  • Thanks, it works now. One remaining question: what is the difference between `aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))` and `aggregate(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))`? The second one returns this error: `Error in aggregate.data.frame(iris[, 1:3], iris$Species, function(x) median(x, : 'by' must be a list` – pedrosaurio Mar 01 '12 at 16:53
  • @pedrosaurio The error message says it all. `iris["Species"]` is a list (a data frame, actually), whereas `iris$Species` is not. You can verify this using `str()`. – joran Mar 01 '12 at 17:03
  • I thought of adding a note saying that you were using `$Species`, which is equivalent to `[["Species"]]` and returns an atomic vector, whereas I was using `["Species"]`, which returns a list. I guess I should have done so. – IRTFM Mar 01 '12 at 17:31
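The distinction discussed in these comments can be checked directly. The sketch below uses illustrative variable names (grp_df, grp_vec, res) and shows one way to pass the factor to aggregate by wrapping it in a named list.

```r
grp_df  <- iris["Species"]   # single-bracket indexing keeps a data frame (a list)
grp_vec <- iris$Species      # $ extracts the column itself (a factor)

is.list(grp_df)    # TRUE
is.list(grp_vec)   # FALSE

# Wrapping the factor in a named list satisfies the 'by' argument
# and also names the grouping column in the result.
res <- aggregate(iris[, 1:3], list(Species = iris$Species), median)
print(res)
```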

The original question is answered. If, however, the range happens to be all columns except the grouping variable named on the right-hand side of the formula, the dot formula notation works and makes a nifty alternative:

> aggregate(. ~ Species, data = iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

> aggregate(. ~ Species, data = iris, median)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0
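When only some of the columns are wanted, as in the original question, the formula interface can still be used by binding the chosen columns together with cbind on the left-hand side. This is a sketch under that assumption, and the name res_subset is illustrative.

```r
# Select a subset of response columns in the formula with cbind();
# Species on the right-hand side is the grouping variable.
res_subset <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
                        data = iris, FUN = median)
print(res_subset)
```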