r - Creating a sparse matrix from a TermDocumentMatrix

I've created a TermDocumentMatrix with the tm library in R. It looks something like this:

> inspect(freq.terms)

A document-term matrix (19 documents, 214 terms)

Non-/sparse entries: 256/3810
Sparsity           : 94%
Maximal term length: 19
Weighting          : term frequency (tf)

Terms
Docs abundant acid active adhesion aeropyrum alternative
  1         0    0      1        0         0           0
  2         0    0      0        0         0           0
  3         0    0      0        1         0           0
  4         0    0      0        0         0           0
  5         0    0      0        0         0           0
  6         0    1      0        0         0           0
  7         0    0      0        0         0           0
  8         0    0      0        0         0           0
  9         0    0      0        0         0           0
  10        0    0      0        0         1           0
  11        0    0      1        0         0           0
  12        0    0      0        0         0           0
  13        0    0      0        0         0           0
  14        0    0      0        0         0           0
  15        1    0      0        0         0           0
  16        0    0      0        0         0           0
  17        0    0      0        0         0           0
  18        0    0      0        0         0           0
  19        0    0      0        0         0           1

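This is just a small sample of the matrix; I'm actually working with 214 terms. At this small scale, that's fine. To convert the TermDocumentMatrix into an ordinary matrix, I can do: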

data.matrix <- as.matrix(freq.terms)

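However, the data shown above is only a subset of my overall data, which probably contains at least 10,000 terms. When I try to create a TDM from the overall data, I get an error: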

> Error cannot allocate vector of size n Kb

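So, from here on, I'm looking for alternative ways to allocate memory efficiently for my tdm.

I tried converting my tdm into a sparse matrix from the Matrix library, but ran into the same problem.

What are my options at this point? I feel like I should be investigating one of the following:

The bigmemory / ff packages, as discussed here (although the bigmemory package doesn't currently seem to be available for Windows)
The irlba package, for computing a partial SVD of my tdm, as mentioned here

I've experimented with functions from both libraries but haven't arrived at anything substantial. Does anyone know what the best way forward is? I've spent so long fiddling with this that I thought I'd ask people with more experience of large data sets than I have, before I waste even more time going in the wrong direction.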
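A possible reason for hitting the same error is a conversion that passes through as.matrix first, which builds the full dense matrix before anything sparse exists. A tm TermDocumentMatrix already stores its data in triplet form (fields i, j, v, nrow, ncol, dimnames), so those fields can be handed straight to Matrix::sparseMatrix. The sketch below uses a small hypothetical list (tdm_like) as a stand-in for a real TermDocumentMatrix so that it runs with only the Matrix package; to_sparse is an illustrative helper name, not part of tm or Matrix.

```r
library(Matrix)

# Hypothetical stand-in carrying the same fields as a tm TermDocumentMatrix
tdm_like <- list(
  i = c(1L, 2L, 2L, 3L),   # term (row) index of each non-zero cell
  j = c(1L, 1L, 3L, 2L),   # document (column) index of each non-zero cell
  v = c(1, 2, 1, 4),       # the counts themselves
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("acid", "active", "adhesion"),
                  Docs  = c("1", "2", "3"))
)

# Illustrative helper: build the sparse matrix directly from the triplets,
# so no dense intermediate is ever allocated
to_sparse <- function(x) {
  sparseMatrix(i = x$i, j = x$j, x = x$v,
               dims = c(x$nrow, x$ncol), dimnames = x$dimnames)
}

sp <- to_sparse(tdm_like)
print(sp)
```

With a real TermDocumentMatrix, the same helper would be called as to_sparse(freq.terms), assuming the object exposes those fields as described.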
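For the irlba option, a minimal sketch of a partial SVD on a sparse matrix might look like the following. The matrix A is random stand-in data generated with Matrix::rsparsematrix, not the real tdm, and the choice of 5 singular vectors is arbitrary.

```r
library(Matrix)
library(irlba)

set.seed(1)
# Stand-in for a large, mostly empty term-document matrix (2000 terms, 60 docs)
A <- rsparsematrix(2000, 60, density = 0.05)

# Compute only the 5 largest singular values and vectors; A stays sparse
fit <- irlba(A, nv = 5)

fit$d        # the 5 approximate leading singular values
dim(fit$u)   # left singular vectors, one row per term
dim(fit$v)   # right singular vectors, one row per document
```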

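Edit: changed 10,00 to 10,000. Thanks @nograpes.

Best answer

The qdap package seems able to handle a problem of this size. The first part recreates a data set that matches the OP's problem, followed by the solution. As of qdap version 1.1.0, it is compatible with the tm package: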

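library(qdapDictionaries)

# Build one random "document" of 100 to 9,100 words drawn from the qdap dictionary
FUN <- function() {
   paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by=1000), 1, TRUE)), collapse=" ")
}

library(qdap)
# Assemble 15 such documents into a tm corpus
mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15), function(i) FUN())))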

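This gives a similar corpus...

Now the qdap approach. You first have to convert the Corpus to a data frame (tm_corpus2df) and then use the tdm function to create the TermDocumentMatrix.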

out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)

## A term-document matrix (19914 terms, 15 documents)
##
## Non-/sparse entries: 80235/218475
## Sparsity           : 73%
## Maximal term length: 19
## Weighting          : term frequency (tf)
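Once the matrix exists, summaries can be computed without calling as.matrix on it, because a TermDocumentMatrix is stored in slam's triplet form. The sketch below assumes the slam package is installed and uses a small hypothetical matrix (toy) in place of out; with the real object, the same row_sums call should apply.

```r
library(slam)

# Hypothetical toy term-document matrix in triplet form
toy <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),   # term (row) indices
  j = c(1L, 1L, 3L, 2L),   # document (column) indices
  v = c(1, 2, 1, 4),       # counts
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("acid", "active", "adhesion"),
                  Docs  = c("1", "2", "3"))
)

# Total frequency of each term across all documents, computed on the triplets
term_totals <- row_sums(toy)

# Most frequent terms first
top_terms <- sort(term_totals, decreasing = TRUE)
print(top_terms)
```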