(R) About stopwords in DocumentTermMatrix

Problem description

I have some questions about DocumentTermMatrix() and how it handles stopwords. I typed the code below, but couldn't get the results that I wanted.

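library(tm)       # VCorpus(), VectorSource(), DocumentTermMatrix()
library(stringr)  # str_extract_all(), boundary(); also re-exports %>%

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control = list(stopwords = F))

# Word counts from a word-boundary tokenizer, for comparison
lapply(mycorpus, function(x) {str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
## .
## also  but  his   is   my text
##    1    1    1    1    1    3

# Term counts according to the document-term matrix
apply(mydtm, 2, sum)
##  also   but   his  text text.
##     1     1     1     2     1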

First, even though I used stopwords=F, the DTM still removed some stopwords such as "is". However, it didn't remove "his", although that word is listed in both stopwords("en") and stopwords("SMART"). So I really don't understand which stopwords the DTM uses and why stopwords=F doesn't work. What should I do to make it work?

Recommended answer

You could try an alternative package: quanteda. It allows you to remove stopwords after tokenizing, or after creating the document-feature matrix. Below, I used pad = TRUE simply to show the slots from which the tokens matching stopwords have been removed.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
##     View

text <- "text is my text but also his text."

tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" ""     ""     "text" ""     "also" ""     "text" "."

Alternatively:

dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1

dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1
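
Note that the calls above reflect quanteda 1.4.1, the version shown in the startup message. Newer releases (3.0 onward, as far as I know) expect dfm() to receive a tokens object rather than raw text, so a sketch of the same result in that style might look like the following. The object names toks, toks_nostop and mydfm are only illustrative.

```r
library(quanteda)

text <- "text is my text but also his text."

# Tokenize first, dropping punctuation at the tokenization step
toks <- tokens(text, remove_punct = TRUE)

# Remove English stopwords from the tokens (no padding this time)
toks_nostop <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix from the cleaned tokens
mydfm <- dfm(toks_nostop)

featnames(mydfm)  # the surviving features
colSums(mydfm)    # their counts across the single document
```

Removing stopwords at the tokens stage rather than from the finished matrix is a design choice: either order should leave the same features here, but working on tokens keeps the option of padding, as shown earlier.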

The list of English stopwords is just a character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en") except the tm package includes "will". (If you want the SMART list, it's stopwords("en", source = "smart").)
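
If you want to check how the two lists differ on your own installation, a small comparison along these lines should do it. It assumes both the tm and stopwords packages are installed, and it uses :: so that neither package's stopwords() masks the other. The variable names tm_en and sw_en are made up for the example, and which list carries an extra entry may depend on the package versions you have.

```r
# English stopword list shipped with tm
tm_en <- tm::stopwords("en")

# Default (Snowball) English list from the stopwords package
sw_en <- stopwords::stopwords("en", source = "snowball")

# Entries present in one list but not in the other
setdiff(tm_en, sw_en)
setdiff(sw_en, tm_en)

# Both words from the question appear in both lists
c("is", "his") %in% tm_en
c("is", "his") %in% sw_en
```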

stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  [21] "herself"    "it"         "its"        "itself"     "they"
##  [26] "them"       "their"      "theirs"     "themselves" "what"
##  [31] "which"      "who"        "whom"       "this"       "that"
##  [36] "these"      "those"      "am"         "is"         "are"
##  [41] "was"        "were"       "be"         "been"       "being"
##  [46] "have"       "has"        "had"        "having"     "do"
##  [51] "does"       "did"        "doing"      "would"      "should"
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"
## [101] "who's"      "what's"     "here's"     "there's"    "when's"
## [106] "where's"    "why's"      "how's"      "a"          "an"
## [111] "the"        "and"        "but"        "if"         "or"
## [116] "because"    "as"         "until"      "while"      "of"
## [121] "at"         "by"         "for"        "with"       "about"
## [126] "against"    "between"    "into"       "through"    "during"
## [131] "before"     "after"      "above"      "below"      "to"
## [136] "from"       "up"         "down"       "in"         "out"
## [141] "on"         "off"        "over"       "under"      "again"
## [146] "further"    "then"       "once"       "here"       "there"
## [151] "when"       "where"      "why"        "how"        "all"
## [156] "any"        "both"       "each"       "few"        "more"
## [161] "most"       "other"      "some"       "such"       "no"
## [166] "nor"        "not"        "only"       "own"        "same"
## [171] "so"         "than"       "too"        "very"       "will"
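
As a side note on the original tm output: the missing "is" and "my" are most likely not a stopword effect at all. The control list of tm's termFreq() has a wordLengths option whose documented default is c(3, Inf), so terms shorter than three characters are dropped regardless of the stopwords setting, while "his" (three characters) survives. Likewise, "text." stays separate from "text" because punctuation is not stripped by default. A minimal sketch that stays within tm, assuming those defaults are what caused the result in the question:

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,         # keep every stopword
    removePunctuation = TRUE,  # so "text." is counted as "text"
    wordLengths = c(1, Inf)    # keep short terms such as "is" and "my"
  )
)

# Column sums give the per-term counts for the single document
counts <- apply(mydtm, 2, sum)
counts
```

With these settings the counts should line up with the str_extract_all() table from the question. To remove stopwords instead, stopwords = TRUE or a character vector such as stopwords("en") can be passed in the same control list.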
