Problem description
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix, such that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)

GetCorpus <- function(textVector)
{
  # Build a corpus from a character vector and apply the usual cleanup steps
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = FALSE)

corp <- GetCorpus(data[, 1])
inspect(corp)

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
Output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
               Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.
I have heard that I could use data.table, but I am not sure how.
I have also looked at the qdap package, but it gives me an error when trying to load the package, and I don't even know if it will work.
Recommended answer
I think you may want to consider a more regex-focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
  text = c("Let the big dogs hunt",
           "No holds barred",
           "My child is an honor student"),
  stringsAsFactors = FALSE)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]

library(stringi)
library(SnowballC)

## stem, lowercase and tokenize in one pass
## (in old package versions it was named 'stri_extract_words')
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
names(out) <- paste0("doc", 1:length(out))

## tabulate term counts per document against the full vocabulary
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

## drop stopwords, then convert to a sparse triplet matrix that tm understands
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting = weightTf)
dtm
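Since the end goal is a Bayesian classifier, here is a purely illustrative hand-off, not part of the timing above: the class labels are made up, e1071::naiveBayes is just one possible choice of classifier, and densifying with as.matrix is only reasonable here because the toy vocabulary has only a handful of terms.

library(e1071)

## docs in rows, terms in columns; fine to densify because the vocabulary is tiny
x <- t(as.matrix(tdm))

## made-up class labels purely for illustration; substitute your real labels
set.seed(1)
labels <- factor(sample(c("spam", "ham"), nrow(x), replace = TRUE))

fit <- naiveBayes(x, labels)
predict(fit, x[1:5, , drop = FALSE])

For the real 4M-row data with a large vocabulary you would want to keep the sparse representation and use a classifier that accepts it rather than calling as.matrix.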