1

I am trying to reproduce the analysis given in this blog post for the by() function. When I paste the code into R, however, I get warnings and NA results instead of the nice table of summarised iris data shown in the blog post.

attach(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

So the data frame's there and all is well.

Pasting in the by() call from the blog gives me NA results and warnings:

by(iris[, 1:4], Species, mean)
Species: setosa
[1] NA
---------------------------------------------------------------------------------- 
Species: versicolor
[1] NA
---------------------------------------------------------------------------------- 
Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA

I really can't see what's wrong here. I've tried it with other data frames and so on and the problem seems to be with the 1:4 sequence in the indexing for the data frame. If I just specify one column it gives me the means no problem. I can't work out why it's spitting its dummy when given more than one column. Any suggestions?

Alexander Farber
user3359624
  • 2
    Found the answer: the example would have worked with mean() in the past, but nowadays it only works with colMeans(). Sod's law operated in the traditional way, and I found this after posting despite having searched extensively before asking the question. http://stackoverflow.com/questions/21233448/mean-within-by-throwing-warning-in-r?rq=1 – user3359624 Feb 27 '14 at 09:30
  • 1
    As it says on my blog's "about" page: "in general, I don’t go back and revise old posts...be aware that [...] posts are out of date and probably no longer relevant" :) Thank you for noting that this example no longer works. – neilfws Feb 27 '14 at 11:13
  • 1
    Maybe just delete the post then? – user3359624 Feb 27 '14 at 11:14
  • 1
    Well, the rest of the examples are fine (I hope!) I've updated with a link to this question, noting that the example doesn't work. Thanks again. – neilfws Feb 27 '14 at 11:17
  • See the NEWS for R 3.0.0: "mean() for data frames and sd() for data frames and matrices are defunct." – Roland Feb 28 '14 at 10:13
  • Also linked: http://stackoverflow.com/questions/9519976/how-to-apply-a-function-to-a-subset-of-columns-in-r – Dave X Feb 28 '14 at 10:24

3 Answers

4

I am not sure how old the blog post is, but according to the documentation of by, the function behaves differently from what the blog post describes.

by splits the input data into subsetted data frames, but you cannot take the mean of a data frame!

mean(iris[,1:4])
[1] NA
Warning message:
In mean.default(iris[, 1:4]) :
  argument is not numeric or logical: returning NA

You can use by if you want the mean of the values in a single column:

by(iris[,1], iris$Species, mean)
iris$Species: setosa
[1] 5.006
--------------------------------------------------------------------------------------------- 
iris$Species: versicolor
[1] 5.936
--------------------------------------------------------------------------------------------- 
iris$Species: virginica
[1] 6.588

To get the means of all columns, use aggregate, as suggested by @Thomas.
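As the comments on the question note, by() itself still works on several columns if the function it calls accepts a data frame. A minimal sketch, assuming R 3.0.0 or later; the names `res` and `means` are only illustrative:

```r
# by() hands each per-species subset to the function as a data frame;
# colMeans() accepts a numeric data frame, whereas mean() no longer does.
res <- by(iris[, 1:4], iris$Species, colMeans)
res

# Optionally stack the per-species vectors into one matrix (one row per species).
means <- do.call(rbind, res)
means
```

Referring to the grouping variable as iris$Species avoids relying on attach(iris).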

Zbynek
  • I never knew that you couldn't calculate a mean of a subset of data.frame, even when the subset is all `numeric`. Yet, `sum(iris[, 1:4])` works just fine. Is there a sensible reason why `mean(iris[, 1:4])` doesn't work? – jbaums Feb 27 '14 at 09:44
  • I would guess that a `data.frame` is meant to hold a different variable, with a different meaning, in every column, so calculating a mean across those would not make sense (compared with calculating the mean of a `matrix`, where all values have the same meaning). But I am only guessing. – Zbynek Feb 27 '14 at 09:48
  • 1
    I agree, @Zbynek, which is why I'm surprised `sum` would be allowed. – jbaums Feb 27 '14 at 09:49
  • 1
    Any function can be called by 'by()', but some functions deal with a data frame better than others. sum() is implemented by "Summary" {?sum; methods(Summary,'data.frame')}, which coerces its argument into a matrix {getAnywhere(Summary.data.frame)}. You could do the same coercion with mean: by(iris[, 1:4], Species, function(x){mean(as.matrix(x))}) but, like sum, it pools all the columns together. "mean" won't work column-wise as it once did (https://stat.ethz.ch/pipermail/r-help/2006-December/122810.html) – Dave X Feb 28 '14 at 10:23
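The difference between sum() and mean() discussed in these comments can be checked directly. This is only a sketch; the variable names (`x`, `total`, `grand_mean`, `col_means`) are made up for illustration:

```r
x <- iris[, 1:4]

# sum() goes through the Summary group generic, whose data.frame method
# coerces to a matrix, so it returns one grand total over all four columns.
total <- sum(x)

# mean() has no data.frame method since R 3.0.0, so coerce explicitly.
# Like sum(), this pools every cell into a single number.
grand_mean <- mean(as.matrix(x))

# Column-wise means need colMeans() (or sapply(x, mean)).
col_means <- colMeans(x)
```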
1

I'm not sure how that blog post got that answer, because R produces the same output for me as it does for you. Consider aggregate instead:

> aggregate(. ~ Species, iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
Thomas
1

The warning tells you that 'mean.default' is where the problem arises. If you want to know why mean.default does what it does, you can look at the source:

> mean.default
function (x, trim = 0, na.rm = FALSE, ...) 
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)}
...

'by()' does what it is supposed to, but 'mean()' fails because the data frame passed to it fails the is.numeric() test.
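The type check can be tried by hand. A small sketch (the variable names are only for illustration):

```r
# A data frame is a list of columns, so it fails the check in mean.default().
df_ok  <- is.numeric(iris[, 1:4])             # FALSE

# A single column drops to a plain numeric vector, which passes.
col_ok <- is.numeric(iris[, 1])               # TRUE

# Coercing the numeric columns to a matrix also passes.
mat_ok <- is.numeric(as.matrix(iris[, 1:4]))  # TRUE

c(data.frame = df_ok, column = col_ok, matrix = mat_ok)
```

This matches what the question observed: by() with one column works, while by() with several columns passes a data frame to mean() and gets NA.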

Dave X