Problem description
I have already read this and this question, but I still don't understand how to use stemDocument in tm_map. Let's follow this example:
library(tm)
q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt",
                                    load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
It does work! But if I use:
> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
It doesn't work. Why is that?
Recommended answer
Unfortunately, you stumbled on a bug. stemDocument works if you pass the language when you call it directly on a character vector:
stemDocument(x = c("poder", "pode"), language = "pt")
[1] "pod" "pod"
But when you use it in tm_map, the call starts off in stemDocument.PlainTextDocument. This function checks the language of the corpus against the language you supply, and that part works correctly. At the end of the function, however, everything is passed on to stemDocument.character without the language argument. In stemDocument.character the default language is English, so within the tm_map call (or DocumentTermMatrix) the language you supplied reverts to English and the stemming doesn't work correctly.
A workaround could be to use the quanteda package:
library(quanteda)
# Tokenize first: newer quanteda releases expect a tokens object in dfm()
my_dfm <- dfm(tokens(c("poder", "pode")))
my_dfm <- dfm_wordstem(my_dfm, language = "pt")
my_dfm
Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
2 x 1 sparse Matrix of class "dfm"
       features
docs    pod
  text1   1
  text2   1
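If you would rather stay within tm, one possible approach is to wrap the character-level call in content_transformer, so that the language reaches stemDocument explicitly instead of going through the document method. This is a sketch that assumes the tm and SnowballC packages are installed; stem_pt is a hypothetical helper name, not part of tm.

```r
library(tm)  # stemDocument() relies on the SnowballC package being installed

q17 <- VCorpus(VectorSource(c("poder", "pode")),
               readerControl = list(language = "pt"))

# Hypothetical helper: applies stemDocument() to the plain character
# content of each document, passing `language` explicitly.
stem_pt <- content_transformer(function(x) {
  stemDocument(x, language = "portuguese")
})

q17_stemmed <- tm_map(q17, stem_pt)

# Collect the stemmed content of every document into a character vector.
stems <- unname(unlist(lapply(q17_stemmed, content)))
print(stems)  # expected to match the direct stemDocument() calls, i.e. "pod" "pod"
```

Because the wrapper operates on character content, it goes through the same code path as the direct stemDocument("poder", language = "portuguese") call that already works.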
Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.