R text mining: dealing with plurals

Problem description

I'm learning text mining in R and have had pretty good success, but I am stuck on how to deal with plurals. That is, I want "nation" and "nations" to be counted as the same word and, ideally, "dictionary" and "dictionaries" to be counted as the same word too. Sample input:

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

Recommended answer

One possible solution is to tokenize the text with quanteda and pass the tokens to singularize() from the pluralize package. Here I use the pacman package to make the solution self-contained:

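# Install pacman if it is missing, then load it
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')  # installed from GitHub; provides singularize()
p_load(quanteda)                 # provides the tokenizer

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Split the text into tokens, then reduce each token to its singular form.
# Note: newer quanteda releases replace tokenize() with tokens();
# there, singularize(as.character(tokens(x))) should be the equivalent call.
singularize(unlist(tokenize(x)))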

##  [1] "\""         "nation"     "\""         "and"        "\""         "nation"     "\""
##  [8] "to"         "be"         "counted"    "a"          "the"        "same"       "word"
## [15] "and"        "ideally"    "\""         "dictionary" "\""         "and"        "\""
## [22] "dictionary" "\""
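
Note that in the output above "as" was also reduced to "a", so a singularizer can over-apply its rules to short function words. As a rough illustration of the idea, and of how such words might be protected, here is a minimal base-R sketch. It is not the pluralize algorithm: `naive_singularize()` and its `keep` list are hypothetical names invented for this example, and the two suffix rules cover only regular plurals such as those in the question.

```r
# Hypothetical helper: a naive rule-based singularizer (NOT what pluralize does).
# `keep` is an assumed exception list of words that must never be altered.
naive_singularize <- function(words,
                              keep = c("as", "is", "was", "has", "this", "its", "us")) {
  lower  <- tolower(words)
  change <- !(lower %in% keep)       # leave protected words untouched
  w <- lower[change]
  w <- sub("ies$", "y", w)           # dictionaries -> dictionary
  w <- sub("([^s])s$", "\\1", w)     # nations -> nation; words ending in "ss" are kept
  lower[change] <- w
  lower
}

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Base-R tokenization: keep runs of letters only, which drops the quote marks
tokens <- regmatches(x, gregexpr("[[:alpha:]]+", x))[[1]]

# Count words after merging singular and plural forms
counts <- table(naive_singularize(tokens))
counts[["nation"]]      # 2
counts[["dictionary"]]  # 2
```

Irregular plurals ("children", "mice") and words such as "series" are not handled by these two rules, which is the main reason to prefer a dedicated package for real text.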

