Problem description
Some words are used sometimes as a verb and sometimes as another part of speech.
Example
A sentence where the word is used as a verb:
I blame myself for what happened
And a sentence where the word is used as a noun:
For what happened the blame is yours
The word I want to detect is known to me; in the example above it is "blame". I would like to detect it and remove it as a stopword only when it is used as a verb.
Is there an easy way to do this?
Recommended answer
You can install TreeTagger and then use the koRpus package to call it from R. Install TreeTagger in a location such as C:\Treetagger.
I will first show how TreeTagger works so that you understand what is going on in the actual solution further down in this answer:
library(koRpus)
your_sentences <- c("I blame myself for what happened",
                    "For what happened the blame is yours")
text.tagged <- treetag(file="I blame myself for what happened",
                       format="obj", treetagger="manual", lang="en",
                       TT.options=list(path="C:\\Treetagger", preset="en"))
text.tagged@TT.res[, 1:2]
# token tag
#1 I PP
#2 blame VVP
#3 myself PP
#4 for IN
#5 what WP
#6 happened VVD
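The verb tags in this output (VVP, VVD) both begin with "V", which is what the solution further down relies on. As a minimal sketch, the selection can be tried without TreeTagger installed by using a hand-built data frame in place of the real tagger result. The data frame `tagged` is an assumption that simply copies the printed output above.

```r
# Hand-built stand-in for text.tagged@TT.res[, 1:2];
# the values are copied from the printed output above, not produced by TreeTagger.
tagged <- data.frame(
  token = c("I", "blame", "myself", "for", "what", "happened"),
  tag   = c("PP", "VVP", "PP", "IN", "WP", "VVD"),
  stringsAsFactors = FALSE
)

# Mark rows whose tag begins with "V", i.e. the verb tags seen in this output.
is_verb <- startsWith(tagged$tag, "V")

# Tokens tagged as verbs.
verb_tokens <- tagged$token[is_verb]
print(verb_tokens)
```

Here `verb_tokens` contains "blame" and "happened", the two rows tagged VVP and VVD.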
The sentence has now been analysed, and the "only thing left" is to remove those occurrences of "blame" that are verbs.
I'll do this sentence by sentence by creating a function that first tags the sentence, then checks for "bad words" like "blame" that are also verbs, and finally removes them from the sentence:
remove_words <- function(sentence, badword="blame"){
  # Tag the sentence with TreeTagger (the path must match your installation):
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en",
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for bad words AND verb (verb tags start with "V"):
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    # Note: this assumes the tagger's tokens line up with a split on spaces.
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}
lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"
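One caveat: the function removes words by position after splitting on spaces, while the tagger's token positions can differ once a sentence contains punctuation (a comma attached to a word is one piece after a space split but may be a separate token for the tagger). A possible variant, sketched here under stated assumptions, filters the tagged tokens directly and rebuilds the sentence from them. The helper name `drop_verb_tokens` is hypothetical, and both data frames are hand-written stand-ins; the tags in `noun_case` in particular are illustrative and were not produced by TreeTagger.

```r
# Hypothetical helper: drop tokens that equal `badword` and carry a verb tag,
# then rebuild the sentence from the remaining tokens.
drop_verb_tokens <- function(tagged, badword = "blame") {
  is_bad  <- tolower(tagged$token) == tolower(badword)
  is_verb <- startsWith(tagged$tag, "V")
  paste(tagged$token[!(is_bad & is_verb)], collapse = " ")
}

# Hand-built stand-ins for tagger output (assumed, not real TreeTagger results).
verb_case <- data.frame(
  token = c("I", "blame", "myself", "for", "what", "happened"),
  tag   = c("PP", "VVP", "PP", "IN", "WP", "VVD"),
  stringsAsFactors = FALSE
)
noun_case <- data.frame(
  token = c("For", "what", "happened", "the", "blame", "is", "yours"),
  tag   = c("IN", "WP", "VVD", "DT", "NN", "VBZ", "PP$"),
  stringsAsFactors = FALSE
)

res_verb <- drop_verb_tokens(verb_case)
res_noun <- drop_verb_tokens(noun_case)
print(res_verb)
print(res_noun)
```

The trade-off is that a sentence rebuilt from tokens has a space before each punctuation mark, so it may need some cleanup afterwards.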