6

I am a beginner in R and am currently working on a project. I have a huge Document Term Matrix (DTM) that I would like to convert into a data frame. However, because of memory limits in the functions I am using, I am not able to do so.

The method that I have been using is to first convert it into a matrix, and then convert that to a data frame.

DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)

This worked perfectly with smaller DTMs. However, when the DTM is too large, I cannot convert it to a matrix, and I get the error shown below:

Error: cannot allocate vector of size 2409.3 Gb

I have been looking online for a few days but have not been able to find a solution. I would be really thankful if anyone could suggest the best way to convert a DTM into a data frame (especially when dealing with a large DTM).

Jaap
Jeffrey
  • Possible duplicate of [R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"](http://stackoverflow.com/questions/10917532/r-memory-allocation-error-cannot-allocate-vector-of-size-75-1-mb) – Rilcon42 May 17 '17 at 01:39
  • Probably not, the authors are different and the desired memory allocation here is very large. DTMs tend to be sparse, so it can be dangerous to try to naively convert them to (non-sparse) matrices. – beigel May 17 '17 at 04:17

2 Answers

8

In the tidytext package there is actually a function to do just that. Try using the `tidy` function, which will return a tibble (basically a fancy data frame that prints nicely). The nice thing about the `tidy` function is that it takes care of the pesky `stringsAsFactors=FALSE` issue by not converting strings to factors, and it deals nicely with the sparsity of your DTM.

`as.matrix` is trying to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is causing your memory usage to balloon. `tidy` will convert it into a data frame where each document only has the counts for the terms found in it.
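To see why the dense conversion blows up, a rough back-of-the-envelope sketch may help. The dimensions and the non-zero count below are made-up placeholders, not the ones from the question. The sketch assumes 8 bytes per numeric cell in the dense matrix, and two 4-byte integer indices plus one 8-byte value per non-zero entry in the sparse triplet form.

```r
# Hypothetical sizes for illustration only (not taken from the question).
n_docs    <- 100000   # rows: documents
n_terms   <- 50000    # columns: distinct terms
n_nonzero <- 5e6      # (document, term) pairs that actually occur

# Dense numeric matrix: one 8-byte double per cell, zeros included.
dense_bytes <- n_docs * n_terms * 8

# Triplet form: row index + column index + value for each non-zero entry only.
sparse_bytes <- n_nonzero * (4 + 4 + 8)

cat(sprintf("dense:  %.1f GiB\n", dense_bytes / 2^30))
cat(sprintf("sparse: %.3f GiB\n", sparse_bytes / 2^30))
```

The dense size grows with documents times terms, while the sparse size grows only with the number of non-zero counts.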

In your example here you'd run

library(tidytext)
DF <- tidy(DTM)
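
For intuition about what the long format contains, here is a base R sketch on a toy stand-in for a DTM. It assumes the DTM comes from the tm package, where it is stored as a triplet list with components `i`, `j`, `v` and dimnames named `Docs` and `Terms`. The object `toy_dtm` is a hypothetical hand-built list, not a real DocumentTermMatrix, and the column names simply mirror the document/term/count layout that `tidy` returns.

```r
# Toy stand-in mimicking the triplet layout of a tm DocumentTermMatrix
# (hypothetical data; a real DTM would come from tm::DocumentTermMatrix()).
toy_dtm <- list(
  i = c(1L, 1L, 2L, 3L),   # row (document) index of each non-zero entry
  j = c(1L, 3L, 2L, 3L),   # column (term) index of each non-zero entry
  v = c(2, 1, 4, 1),       # the counts themselves
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs  = c("doc1", "doc2", "doc3"),
                  Terms = c("apple", "banana", "cherry"))
)

# One row per non-zero (document, term) pair; zero cells are never materialised.
long_df <- data.frame(
  document = toy_dtm$dimnames$Docs[toy_dtm$i],
  term     = toy_dtm$dimnames$Terms[toy_dtm$j],
  count    = toy_dtm$v,
  stringsAsFactors = FALSE
)

print(long_df)
```

The result has as many rows as there are non-zero entries, rather than documents times terms.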

There's even a vignette on how to use the tidytext package (meant to work in the tidyverse) here.

beigel
  • Thank you for the suggestion. However, I want to keep the data frame as it is, since I will be using it for further processing, so I would like to keep the sparse data frame instead. – Jeffrey May 17 '17 at 14:14
  • How do you mean keep the dataframe as is? Not sure if this helps, but `tidy` returns something that is basically a dataframe. – beigel May 17 '17 at 16:38
  • Not sure if this is what OP is looking for, but when I imagine a data frame for a document-term matrix, I imagine one row per document and one column per word, with the records showing the frequency of the given word in that document. This allows comparison of frequencies between documents, and is not at all what tidy produces. – Sarah Messer Aug 07 '19 at 00:41
1

It's possible that `as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE)` instead of `data.frame(as.matrix(DTM), stringsAsFactors=FALSE)` might do the trick.

The API documentation notes that as.data.frame() simply coerces a matrix into a dataframe, whereas data.frame() creates a new data frame from the input.

as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html

data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
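
A quick way to compare the two calls is to run both on a small matrix. The toy matrix `m` below is a made-up stand-in for the output of `as.matrix(DTM)`. Both calls start from that dense matrix, so this sketch only shows how the results differ, not how much memory either one needs.

```r
# Toy dense matrix standing in for as.matrix(DTM) (hypothetical data).
m <- matrix(c(2, 0, 0, 4, 1, 1), nrow = 3,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("apple", "ice cream")))

df_a <- as.data.frame(m, stringsAsFactors = FALSE)  # coerces the matrix
df_b <- data.frame(m, stringsAsFactors = FALSE)     # builds a new data frame

# Same values either way; the column names differ because
# data.frame() defaults to check.names = TRUE.
print(names(df_a))
print(names(df_b))
```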

Glitch253