Problem description
I'm learning text mining in R and have had pretty good success, but I'm stuck on how to deal with plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word as well.
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'
Recommended answer
One possible solution. Here I use the pacman package to make the solution self-contained:
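# Install pacman if it is missing, then load it
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')  # pluralize (from GitHub) provides singularize()
p_load(quanteda)                 # quanteda provides the tokenizer
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# tokenize() is the quanteda call used when this answer was written; newer quanteda releases replace it with tokens()
singularize(unlist(tokenize(x)))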
## [1] "\"" "nation" "\"" "and" "\"" "nation" "\""
## [8] "to" "be" "counted" "a" "the" "same" "word"
## [15] "and" "ideally" "\"" "dictionary" "\"" "and" "\""
## [22] "dictionary" "\""
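To get from singular forms to the counts the question asks for, the singularized tokens still need to be tabulated. The sketch below shows that step in base R only, so it runs without installing anything. Note that `naive_singularize` is a hypothetical helper written for this illustration, not part of pluralize or quanteda, and its two regex rules are a rough stand-in for what `singularize()` does.

```r
# Hypothetical helper: a deliberately naive, rule-based singularizer.
# It only rewrites "-ies" to "-y" and strips a single trailing "s";
# irregular plurals are not handled, and short words such as "as" or
# "bus" are mangled.
naive_singularize <- function(words) {
  words <- sub("ies$", "y", words)   # dictionaries -> dictionary
  sub("([^s])s$", "\\1", words)      # nations -> nation; "class" is left alone
}

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Lower-case, split on anything that is not a letter, drop empty strings.
toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
toks <- toks[nzchar(toks)]

# Tabulate the singular forms so plural and singular share one count.
counts <- table(naive_singularize(toks))
counts[c("nation", "dictionary")]
```

With this input, "nation" and "dictionary" each end up with a count of 2. As in the output above, where "as" came back as "a", stripping plural endings can damage short function words, so removing stopwords before singularizing may be worth considering.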