Problem description
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix, such that I can pass it to a Bayesian classifier.
Consider the following code:
library(tm)

GetCorpus <- function(textVector)
{
  # Build a corpus from a character vector and apply the usual cleanup steps
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}
data <- data.frame(
  c("Let the big dogs hunt", "No holds barred", "My child is an honor student"),
  stringsAsFactors = FALSE)

corp <- GetCorpus(data[, 1])
inspect(corp)

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
Output:
> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt
[[2]]
<<PlainTextDocument (metadata: 7)>>
holds bar
[[3]]
<<PlainTextDocument (metadata: 7)>>
child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity : 67%
Maximal term length: 5
Weighting : term frequency (tf)
               Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1
My question is, what can I use to create a corpus and DTM faster? It seems to be extremely slow if I use over 300k rows.
I have heard that I could use data.table, but I am not sure how.
I have also looked at the qdap package, but it gives me an error when trying to load the package, and I don't even know if it will work.
Recommended answer
I think you may want to consider a more regex-focused solution. These are some of the problems/thinking I'm wrestling with as a developer. I'm currently looking heavily at the stringi package for development, as it has some consistently named functions that are wicked fast for string manipulation.
In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.
data <- data.frame(
  text = c("Let the big dogs hunt",
           "No holds barred",
           "My child is an honor student"),
  stringsAsFactors = FALSE)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop = FALSE]

library(stringi)
library(SnowballC)

## stem, lowercase and tokenize in one pass
## (in old package versions it was named 'stri_extract_words')
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
names(out) <- paste0("doc", 1:length(out))

## tabulate term counts per document against the full vocabulary
lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

## drop stopwords, then convert to a sparse triplet matrix that tm understands
library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting = weightTf)
dtm
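Since the end goal is a Bayesian classifier, here is a purely illustrative hand-off, not part of the timing above: the class labels are made up, e1071::naiveBayes is just one possible choice of classifier, and densifying with as.matrix is only reasonable here because the toy vocabulary has only a handful of terms.

library(e1071)

## docs in rows, terms in columns; fine to densify because the vocabulary is tiny
x <- t(as.matrix(tdm))

## made-up class labels purely for illustration; substitute your real labels
set.seed(1)
labels <- factor(sample(c("spam", "ham"), nrow(x), replace = TRUE))

fit <- naiveBayes(x, labels)
predict(fit, x[1:5, , drop = FALSE])

For the real 4M-row data with a large vocabulary you would want to keep the sparse representation and use a classifier that accepts it rather than calling as.matrix.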