Problem description
I am using the tm package for text analysis of repair data. I read the data into a data frame, converted it to a Corpus object, and applied various cleaning steps such as tolower, stripWhitespace, stop word removal, and so on.
I took a backup copy of the Corpus object to use with stemCompletion.
I ran stemDocument using the tm_map function. The words in my object were stemmed, and I got the expected results.
When I run stemCompletion using the tm_map function, it does not work and I get the error below:
Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"
I ran traceback() and got the following steps:
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)
How can I resolve this error?
Recommended answer
I received the same error when using tm v0.6. I suspect this occurs because stemCompletion is not among the default transformations for this version of the tm package:
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
The tolower function has the same problem, but it can be made to work by wrapping it in the content_transformer function. I tried a similar approach for stemCompletion but was not successful.
Note that even though stemCompletion isn't a default transformation, it still works when fed stemmed words manually:
> stemCompletion("compani",dictCorpus)
compani
"companies"
So that I could continue with my work, I manually split each document in the corpus on single spaces, fed the resulting words through stemCompletion, and concatenated them back together with the following (clunky and not graceful!) function:
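stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into stems, complete each one, then rebuild the text
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}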
where dictCorpus is just a copy of the cleaned corpus, taken before it was stemmed. The extra stripWhitespace is specific to my corpus, but is likely benign for a general corpus. You may want to change the type option from "shortest" as needed.
For a full example, let's set up a dummy corpus using the crude data in the tm package:
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem-completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
latest line us oil companies cut contract posted price last two day cited weak oil market reuter
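Note: this example is odd because the misspelled word "copany" gets mapped through the process as "copany" -> "copani" -> NA. I am not sure how to correct this...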
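One possible way to handle the NA case is to fall back to the stem itself whenever no completion is found. The sketch below is only an illustration: complete_or_keep is a hypothetical helper name, the dictionary is a toy character vector rather than a corpus, and treating an empty string as "unmatched" is a precaution, not documented behaviour.

```r
library(tm)

# Hypothetical helper: complete each stem, but keep the original stem
# whenever stemCompletion() finds no match in the dictionary.
complete_or_keep <- function(stems, dictionary, type = "shortest") {
  completed <- unname(stemCompletion(stems, dictionary = dictionary, type = type))
  # Unmatched stems come back as NA; also treat "" as unmatched, to be safe.
  unmatched <- is.na(completed) | !nzchar(completed)
  completed[unmatched] <- stems[unmatched]
  completed
}

# Toy dictionary of unstemmed words (assumed for illustration only).
dictionary <- c("companies", "reduction", "posted")
stems <- c("compani", "copani", "reduct")

result <- complete_or_keep(stems, dictionary)
print(result)
```

With this fallback the misspelled word would stay as "copani" instead of becoming NA, which at least keeps the token in the document.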
To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply with the snow package).
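As an alternative to sapply, the same split, complete, and paste idea might also be applied through tm_map by wrapping a plain-text helper in content_transformer. This is a sketch under assumptions: complete_text is a hypothetical helper, the two texts are made-up stand-ins for a cleaned corpus, the dictionary is passed as a character vector of unstemmed words, and it has not been checked against tm v0.6 specifically.

```r
library(tm)

# Toy texts standing in for the cleaned, unstemmed corpus (assumed data).
texts <- c("oil companies cut prices", "the company posted lower prices")
dict_words <- unique(unlist(strsplit(texts, " ", fixed = TRUE)))

docs <- VCorpus(VectorSource(texts))
docs <- tm_map(docs, stemDocument)  # stemming relies on the SnowballC package

# Hypothetical helper: works on a document's raw text, not on the document object.
complete_text <- function(text, dictionary) {
  stems <- unlist(strsplit(paste(text, collapse = " "), " ", fixed = TRUE))
  stems <- stems[nzchar(stems)]  # drop empty tokens left by repeated spaces
  completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
  paste(unname(completed), collapse = " ")
}

# content_transformer() applies the helper to each document's content;
# the extra dictionary argument is forwarded to the helper by tm_map().
completed_docs <- tm_map(docs, content_transformer(complete_text),
                         dictionary = dict_words)
print(content(completed_docs[[1]]))
```

Passing the dictionary as a character vector sidesteps the words() lookup on a corpus, which is where the traceback in the question fails.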
Perhaps someone with more experience than me could suggest a simpler modification to get stemCompletion to work in v0.6 of the tm package.