Problem description
I have some questions about DocumentTermMatrix() and its handling of stopwords. I ran the code below, but couldn't get the results I wanted.
library(tm)
library(stringr)  # str_extract_all(), boundary(); also re-exports %>%
text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control=list(stopwords=F))
lapply(mycorpus, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
## .
## also  but  his   is   my text
##    1    1    1    1    1    3
apply(mydtm, 2, sum)
##  also   but   his  text text.
##     1     1     1     2     1
First, even though I used stopwords=F, the DTM still removed some stopwords such as "is". However, it didn't remove "his", although it is listed in both stopwords("en") and stopwords("SMART"). So I really don't understand which stopwords the DTM uses and why stopwords=F doesn't work. What should I do to make it work?
Recommended answer
You could try an alternative package: quanteda. It lets you remove stopwords after tokenizing, or after creating the document-feature matrix. Below, I used pad = TRUE simply to show the slots where the tokens matching stopwords have been removed.
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
text <- "text is my text but also his text."
tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" "" "" "text" "" "also" "" "text" "."
Or:
dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs text is my but also his .
## text1 3 1 1 1 1 1 1
dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs text also
## text1 3 1
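The calls above were run with quanteda 1.4.1. In more recent quanteda releases, as far as I know, dfm() no longer accepts a character vector directly and expects a tokens object instead. The following is a minimal sketch of the same workflow under that assumption; the variable names toks, dfmat and dfmat_nostop are only illustrative.

```r
library(quanteda)

text <- "text is my text but also his text."

# Tokenize first, dropping punctuation at the tokens stage
toks <- tokens(text, remove_punct = TRUE)

# Build the document-feature matrix from the tokens object
dfmat <- dfm(toks)

# Remove English stopwords after the matrix has been created
dfmat_nostop <- dfm_remove(dfmat, stopwords("en"))

# Features that survive stopword removal
featnames(dfmat_nostop)
```

Splitting the steps this way also makes it easy to inspect the tokens before any stopwords are dropped.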
The list of English stopwords is just a character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en") except the tm package includes "will". (If you want the SMART list, it's stopwords("en", source = "smart").)
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"
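As a side note on the tm behaviour in the question: a likely explanation, based on tm's documented defaults rather than on anything stated in the answer, is that "is" and "my" are dropped by the wordLengths control option (which defaults to c(3, Inf)) and not by stopword removal, and that "text." is counted separately because punctuation is kept by default. A minimal sketch under that assumption:

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Keep terms of any length and strip punctuation; stopwords are left in place
mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,
    wordLengths = c(1, Inf),   # assumed fix: lower the minimum term length from 3 to 1
    removePunctuation = TRUE   # merge "text." into "text"
  )
)

# Term counts for the single document
tf <- colSums(as.matrix(mydtm))
tf
```

If this explanation holds, "is", "my" and "his" should all appear among the terms, which would confirm that stopwords = FALSE was already being respected.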