Suppose text data looks like this:
txt <- c("peter likes red", "mary likes green", "bob likes blue")
I want to reduce those strings to words from this controlled vocabulary:
voc <- c("peter", "mary", "bob", "red", "green", "blue")
The result should be a vector:
c("peter red", "mary green", "bob blue")
One can use the tm library, but that only gives me a dense document-term matrix:
library(tm)
foo <- VCorpus(VectorSource(txt))
inspect(DocumentTermMatrix(foo, list(dictionary = voc)))
<<DocumentTermMatrix (documents: 3, terms: 6)>>
Non-/sparse entries: 6/12
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

    Terms
Docs blue bob green mary peter red
   1    0   0     0    0     1   1
   2    0   0     1    1     0   0
   3    1   1     0    0     0   0
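For completeness, strings can be squeezed back out of that matrix, but this materializes the full dense matrix and orders words alphabetically by term rather than by their position in the text (document 2 comes out as "green mary"); a sketch:
dtm <- DocumentTermMatrix(foo, list(dictionary = voc))
m <- as.matrix(dtm)  # dense: this is what will not scale
unname(apply(m, 1, function(x) paste(names(x)[x > 0], collapse = " ")))
# [1] "peter red"  "green mary" "blue bob"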
How can I get the result as a character vector, with one string per element?
The solution should be fast. I'm also a big fan of base R.
EDIT: Comparison of solutions so far
On my data, James' solution is about four times faster than Sotos'. But it runs out of memory when I go from length(txt) = 1k to 10k; Sotos' solution still runs at 10k.
Given that my data has length(txt) ~ 1M and length(voc) ~ 5k, I estimate that Sotos' solution would take about 18 hours to finish, assuming it does not run out of memory.
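For anyone who wants to reproduce the comparison, a hypothetical generator for data of roughly this shape (not my actual data):
# Hypothetical test data: ~5k random 5-letter vocabulary words,
# 10k documents of 10 vocabulary words each
set.seed(42)
voc_big <- unique(replicate(5000, paste(sample(letters, 5, TRUE), collapse = "")))
txt_big <- replicate(10000, paste(sample(voc_big, 10), collapse = " "))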
Isn't there anything faster?