
This question may sound similar to others, but I hope it is different enough. I want to take a specific list of values and count how often they appear in another list of values, with non-occurring values returned as '0'.

I have a data frame (df1) with the following values:

Items <- c('Carrots','Plums','Pineapple','Turkey')
df1 <- data.frame(Items)

>df1
Items
1   Carrots
2     Plums
3 Pineapple
4    Turkey

And a second data frame (df2) that contains a column called 'Thing':

> head(df2,n=10)
  ID       Date     Thing
1  58150 2012-09-12  Potatoes
2  12357 2012-09-28   Turnips
3  50788 2012-10-04   Oranges
4  66038 2012-10-11  Potatoes
5  18119 2012-10-11   Oranges
6  48349 2012-10-14   Carrots
7  23328 2012-10-16   Peppers
8  66038 2012-10-26 Pineapple
9  32717 2012-10-28   Turnips
10 11345 2012-11-08   Oranges

I know the word 'Turkey' only appears in df1, NOT in df2. I want to return a frequency table or count of the items in df1 that appear in df2, and return '0' for the count of Turkey.

How can I summarize the values of one data frame column using the values from another? The closest I got was:

df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)

But this returns a list of items filtered to those present in both df1 and df2, so 'Turkey' gets excluded. So close!

> df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)
# A tibble: 3 x 2
      Thing     n
     <fctr> <int>
1   Carrots    30
2 Pineapple    30
3     Plums    38

I want my output to look like this:

1   Carrots    30
2 Pineapple    30
3     Plums    38
4    Turkey     0

I am newish to R and completely new to dplyr.


4 Answers


I use this sort of thing all the time. I'm sure there's a more savvy way to code it, but it's what I got:

item <- vector()
count <- vector()
items <- list(unique(df1$Items))

for (i in 1:length(items)){
    item[i] <- items[i]
    count[i] <- sum(df2$Thing == item)
}

df3 <- data.frame(cbind(item, count))

Hope this helps!


Stephen's solution worked with a slight modification: adding the [i] to item at the end of the count[i] line. See below:

item <- vector()
count <- vector()

for (i in 1:length(unique(Items))){
    item[i] <- Items[i]
    count[i] <- sum(df2$Thing == item[i])
}

df3 <- data.frame(cbind(item, count))

> df3
       item count
1   Carrots    30
2     Plums    38
3 Pineapple    30
4    Turkey     0
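
For comparison, I believe the same counts can be produced without an explicit loop using base R's table() with the factor levels taken from df1. This is only a sketch, assuming df1 and df2 as defined in the question:

# Items not present in df2$Thing keep a count of 0 because they are
# still levels of the factor passed to table()
counts <- table(factor(df2$Thing, levels = unique(as.character(df1$Items))))
df3 <- data.frame(item = names(counts), count = as.integer(counts))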
gzrcm

dplyr drops zero-count rows, and you have the added complication that the possible categories of Thing differ between your two datasets.

If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add 0 count rows.

I'm adding the factor levels from df1 to df2 using a convenience function from package forcats called fct_expand.

library(dplyr)
library(tidyr)
library(forcats)

df2 %>%
    mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>%
    count(Thing) %>%
    complete(Thing, fill = list(n = 0)) %>%
    filter(Thing %in% df1$Items)
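
As a side note, if I recall correctly dplyr 0.8 and later can keep empty factor levels directly via the .drop argument of count(), which would make the complete() step unnecessary. A sketch under that assumption:

df2 %>%
    mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>%
    count(Thing, .drop = FALSE) %>%   # .drop = FALSE keeps zero-count levels
    filter(Thing %in% df1$Items)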
aosmith

A different approach is to aggregate df2 first, then right-join with df1 (to pick up all rows of df1), and finally replace NA with zero.

library(dplyr)
df2 %>% 
  count(Thing) %>% 
  right_join(unique(df1), by = c("Thing" = "Items")) %>% 
  mutate(n = coalesce(n, 0L))
# A tibble: 4 x 2
      Thing     n
      <chr> <int>
1   Carrots     1
2     Plums     0
3 Pineapple     1
4    Turkey     0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
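
(coalesce(n, 0L) returns n wherever it is not NA and 0L otherwise, which is what fills in the zero counts for items such as Turkey that have no match in df2.)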

The same approach in data.table:

library(data.table)
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]
       Thing N
1:   Carrots 1
2:     Plums 0
3: Pineapple 1
4:    Turkey 0

Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
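
To illustrate with a hypothetical df1_dup that contains a duplicated item, the right join would repeat that row in the result:

df1_dup <- rbind(df1, data.frame(Items = "Carrots"))  # hypothetical duplicate
df2 %>%
  count(Thing) %>%
  right_join(df1_dup, by = c("Thing" = "Items"))
# 'Carrots' now appears twice in the output; unique(df1) avoids this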

Edit 2019-06-22: With development version 1.12.3, data.table has gained a coalesce() function, so the statement above can be written as

setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][, N := coalesce(N, 0L)][]

If df2 is large and df1 contains only a few Items, it might be more efficient to join first and then aggregate:

library(dplyr)
df2 %>% 
  right_join(unique(df1), by = c("Thing" = "Items")) %>% 
  group_by(Thing) %>% 
  summarise(n = sum(!is.na(ID)))
# A tibble: 4 x 2
      Thing     n
      <chr> <int>
1   Carrots     1
2 Pineapple     1
3     Plums     0
4    Turkey     0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
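
(Here sum(!is.na(ID)) counts only the rows that actually matched something in df2; for Turkey the join produces a single all-NA row, so the sum is 0.)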

The same in data.table syntax:

library(data.table)
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]
       Thing N
1:   Carrots 1
2:     Plums 0
3: Pineapple 1
4:    Turkey 0

Edit 2019-06-22: The above can be written more concisely by aggregating in a join:

setDT(df2)[setDT(df1), on = .(Thing = Items), .N, by = .EACHI]
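
(If I read the data.table docs correctly, .N under by = .EACHI is the number of rows of df2 matched by each row of df1, so items without a match, such as Turkey, come out as 0.)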
Uwe