Looping over a tm corpus without losing the corpus structure

Question

I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop removes each word in the list from the corpus, one word after another.

Some reproducible data:

library(tm)
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"),
             c(1, 2, 3))
tm_corpus <- Corpus(VectorSource(m[,1]))
words <- as.list(c("Apple", "yellow", "two"))

tm_corpus is now a corpus object consisting of 3 documents:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

words is a list of 3 words:

[[1]]
[1] "Apple"

[[2]]
[1] "yellow"

[[3]]
[1] "two"

I have tried three different loops. The first one is:

tm_corpusClean <- tm_corpus
for (i in seq_along(tm_corpusClean)) {
  for (u in seq_along(words)) {
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]])
  }
}

This returns the following error 7 times (numbered 1-7):

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
In addition: Warning messages:
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
  number of items to replace is not a multiple of replacement length
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
  number of items to replace is not a multiple of replacement length
[...]

The second one is:

tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  for (u in seq_along(tm_corpusClean)) {
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]])
  }
}

This returns the error:

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions

The last loop is:

tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}

This does return an object named tm_corpusClean, but inspecting it shows only the first document instead of all three original documents:

inspect(tm_corpusClean[[1]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 6

 blue

Where am I going wrong?

Recommended answer

Before moving on to sequential removal, test whether tm_map works on your example:

obj1 <- tm_map(tm_corpus, removeWords, unlist(words))
sapply(obj1, `[`, "content")

$`1.content`
[1] " blue "

$`2.content`
[1] "Pear  five"

$`3.content`
[1] "Banana  "
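As a quick cross-check of the output above, the same removal can be run on the raw strings without building a corpus. This is only a sketch: it assumes the tm package is installed, and the names docs and stripped are introduced here for illustration.

```r
library(tm)

# The three example documents as a plain character vector
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")

# removeWords() also works directly on character vectors;
# each matched word is deleted, the surrounding spaces are kept
stripped <- removeWords(docs, c("Apple", "yellow", "two"))
print(stripped)
```

The leftover spaces in the result explain the leading, trailing, and doubled blanks seen in the corpus output.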

Next, use lapply to remove one word at a time ("Apple", then "yellow", then "two"), each from the original corpus:

obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word))
sapply(obj2, function(x) sapply(x, `[`, "content"))

          [,1]                [,2]             [,3]
1.content " blue two"         "Apple blue two" "Apple blue "
2.content "Pear yellow five"  "Pear  five"     "Pear yellow five"
3.content "Banana yellow two" "Banana  two"    "Banana yellow "

Note that the result is a list of corpora, one per removed word, which is why two nested sapply calls are needed to view the content.
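If the goal is a single corpus with all words removed cumulatively, one possible sketch is to feed the result of each removal into the next with base R's Reduce. It assumes the same example data as the question; the names clean and texts are introduced here for illustration. It may also be worth noting that `tm_corpusClean[[1]]` selects only the first document, which would explain why the inspect call in the question shows a single document even though the corpus may still hold all three.

```r
library(tm)

docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# Start from the full corpus and apply one removal per word,
# passing the updated corpus on to the next step
clean <- Reduce(
  function(corpus, word) tm_map(corpus, removeWords, word),
  x = words,
  init = tm_corpus
)

# The corpus still holds all three documents
print(length(clean))

# Extract the text of every document, not just the first one
texts <- sapply(seq_len(length(clean)),
                function(i) as.character(clean[[i]]))
print(texts)
```

Depending on the installed tm version, tm_map may emit a warning on a SimpleCorpus here; the document count printed above is the thing to check.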

