本文介绍了语料库参数上的 DocumentTermMatrix 错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下代码:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

当我运行 DocumentTermMatrix() 方法时,它给了我这个错误:

When I run the DocumentTermMatrix() method, it gives me this error:

错误:inherits(doc, "TextDocument") is not TRUE

为什么会出现这个错误?我的行不是文本文档吗?

Why do I get this error? Are my rows not text documents?

这是检查 corpus_clean 时的输出:

Here is the output upon inspecting corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

这是我的数据:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods

我检索的内容:

news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

这是回溯():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

当我评估 inherits(corpus_clean, "TextDocument") 时,结果为 FALSE.

When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.

推荐答案

这似乎在 tm 0.5.10 中效果很好,但在 tm 0.6.0 中发生了变化> 好像弄坏了.问题是函数 tolowertrim 不一定返回 TextDocuments(看起来旧版本可能已经自动完成了转换).相反,它们返回字符,而 DocumentTermMatrix 不确定如何处理字符语料库.

It seems this would have worked just fine in tm 0.5.10 but changes in tm 0.6.0 seems to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have automatically done the conversion). They instead return characters and the DocumentTermMatrix isn't sure how to handle a corpus of characters.

所以你可以改成

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

或者你可以运行

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

在您完成所有非标准转换(那些不在 getTransformations() 中的转换)之后,就在您创建 DocumentTermMatrix 之前.这应该确保您的所有数据都在 PlainTextDocument 中,并且应该让 DocumentTermMatrix 满意.

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.

这篇关于语料库参数上的 DocumentTermMatrix 错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-12 16:22