R: Extract controlled vocabulary from character vector

Question

Suppose text data looks like this:

txt <- c("peter likes red", "mary likes green", "bob likes blue")

I want to reduce those strings to the words that appear in this controlled vocabulary:

voc <- c("peter", "mary", "bob", "red", "green", "blue")

The result should be a vector:

c("peter red", "mary green", "bob blue")

I can use the tm package, but that only gives me a document-term matrix:

library(tm)
foo <- VCorpus(VectorSource(txt))
inspect(DocumentTermMatrix(foo, list(dictionary = voc)))
Non-/sparse entries: 6/12
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

    Terms
Docs blue bob green mary peter red
   1    0   0     0    0     1   1
   2    0   0     1    1     0   0
   3    1   1     0    0     0   0

How can I get the vector solution with one string per vector element?

The solution should be fast. I'm also a big fan of base R.

EDIT: Comparison of solutions so far

On my data, James' solution is about four times faster than Sotos', but it runs out of memory when I go from length(txt) of 1k to 10k. Sotos' solution still runs at 10k.

Given that my data has length(txt) ~1M and length(voc) ~5k, I estimate that Sotos' solution will take 18 hours to finish, provided it does not run out of memory.

Isn't there anything faster?

Would a non regex approach suffice for your case? E.g. something like `sapply(strsplit(txt, " ", TRUE), function(x) paste(collapse = " ", x[x %in% voc]))` — alexis_laz, Jan 24 '17 at 17:25
@alexis_laz you win! Your solution finishes in 10 minutes instead of 18 hours. Would you like to create a dedicated answer so I can mark it as the solution? — hyco, Jan 24 '17 at 18:09
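
The one-liner in the comment above can be unpacked as follows. This is only a sketch of that tokenise-and-filter idea, not a dedicated answer from the thread; the helper name `keep_vocab` is made up for illustration. It assumes words are separated by single spaces and carry no punctuation. It avoids regular expressions entirely: each string is split once, and membership is tested with `%in%`, so the cost does not involve running one pattern per vocabulary term.

```r
txt <- c("peter likes red", "mary likes green", "bob likes blue")
voc <- c("peter", "mary", "bob", "red", "green", "blue")

# Split every string on a literal space; fixed = TRUE bypasses the regex engine.
tokens <- strsplit(txt, " ", fixed = TRUE)

# keep_vocab is a hypothetical helper: keep the tokens found in voc, re-join them.
keep_vocab <- function(x) paste(x[x %in% voc], collapse = " ")

# vapply() guarantees a character vector with one string per input element.
res <- vapply(tokens, keep_vocab, character(1))

res
# [1] "peter red"  "mary green" "bob blue"
```

Because whole tokens are compared, `peters` is not matched by `peter`, and a string without any vocabulary word comes back as `""`.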

James · Answer 1 · 2017-01-24T14:09:42.830

3

A base-only method is:

apply(sapply(paste0("\\b",voc,"\\b"), function(x) grepl(x,txt)), 1, function(x) paste(voc[x],collapse=" "))
[1] "peter red"  "mary green" "bob blue"

The sapply part recreates the membership matrix you used the tm package for, while the apply iterates over its rows to pull out the relevant terms from the vocabulary to paste together.
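
The two steps described above can be written out separately, which makes the intermediate matrix visible. This is a sketch of the same one-liner split into named pieces; the variable names `patterns`, `member` and `res` are introduced here for illustration.

```r
txt <- c("peter likes red", "mary likes green", "bob likes blue")
voc <- c("peter", "mary", "bob", "red", "green", "blue")

# One regex per vocabulary term, wrapped in word boundaries.
patterns <- paste0("\\b", voc, "\\b")

# Logical membership matrix: one row per string in txt, one column per term.
member <- sapply(patterns, function(p) grepl(p, txt))
dim(member)
# [1] 3 6

# For each row, paste together the vocabulary terms whose column is TRUE.
res <- apply(member, 1, function(hit) paste(voc[hit], collapse = " "))
res
# [1] "peter red"  "mary green" "bob blue"
```

Two things follow from this layout. The matrix is dense, with length(txt) * length(voc) logical entries, so at roughly 1M strings and 5k terms it would need about 20 GB at 4 bytes per entry, which may help explain the memory problems reported in the question. Also, the terms in each result appear in vocabulary order, not in the order they occur in the text.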

edited Jan 24 '17 at 14:09

answered Jan 24 '17 at 13:29

James


@Sotos I've fixed it to add word boundaries, so it should work now. – James Jan 24 '17 at 14:10

Sotos · Answer 2 · 2017-01-24T13:50:06.473

2

You can use stringi

library(stringi)
sapply(stri_extract_all_regex(txt, paste0('\\b', voc, collapse = '|', '\\b')), paste, collapse = ' ')
#[1] "peter red"  "mary green" "bob blue"

or, using only stringi functions:

stri_paste_list(stri_extract_all_regex(txt, paste0('\\b', voc, collapse = '|', '\\b')), sep = ' ')
#[1] "peter red"  "mary green" "bob blue"

edited Jan 24 '17 at 13:50

answered Jan 24 '17 at 13:15

Sotos


I'm checking your solutions. Note that the result of `txt <- c("peters like reds", "marys like greens", "bobs like blues")` should be empty. – hyco Jan 24 '17 at 13:23
What do you mean "It should be empty"? – Sotos Jan 24 '17 at 13:29
I mean that your first solution should not extract `peter` from `peters`. In other words, `peters` is not in the controlled vocabulary, `peter` is. – hyco Jan 24 '17 at 13:36
The problem with your first solution is that it produces the same output for `txt1 <- c("peter likes red", "mary likes green", "bob likes blue")` and for `txt2 <- c("peters like reds", "marys like greens", "bobs like blues")`. For `txt2` the result should be empty because no word is in the vocabulary. For your second solution I may have to update the `stringi` library because the `stri_paste_list` function can't be found. – hyco Jan 24 '17 at 13:42
Do you see what I mean? – hyco Jan 24 '17 at 13:45
Ok check now. I added boundaries to get exact match so it should work – Sotos Jan 24 '17 at 13:50
`stringi` is really fast actually from my experience – Sotos Jan 24 '17 at 14:08
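
The `txt1` / `txt2` check discussed in these comments can be reproduced without stringi. The following is a base R stand-in for the alternation approach, not the code from the answer: it uses `gregexpr()` and `regmatches()` in place of `stri_extract_all_regex()`, and the function name `extract_vocab` is hypothetical.

```r
voc  <- c("peter", "mary", "bob", "red", "green", "blue")
txt1 <- c("peter likes red", "mary likes green", "bob likes blue")
txt2 <- c("peters like reds", "marys like greens", "bobs like blues")

# Single alternation pattern; every term is wrapped in \\b on both sides
# so that only whole words match.
pattern <- paste0("\\b", voc, "\\b", collapse = "|")

# extract_vocab is a hypothetical helper: pull all matches, re-join per string.
extract_vocab <- function(x) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = " ")
}

extract_vocab(txt1)
# [1] "peter red"  "mary green" "bob blue"
extract_vocab(txt2)
# [1] "" "" ""
```

With the boundaries in place, `peters` no longer yields `peter`, so every element of the `txt2` result is an empty string.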
