r - Filtering text from numbers and stop words in R (not for tdm)

I have a text corpus.

mytextdata = read.csv("path to texts.csv")
Mystopwords = read.csv("path to mystopwords.txt")

How can I filter this text? I need to:

1) remove all numbers

2) remove the stop words

3) remove the brackets

I will not be using a dtm; I only need to clean the numbers and stop words out of this text data.

Sample data:

112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715

Jura and the are stop words.

My expected output:

  Tablet for cleaning hydraulic system

Best answer

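Since the question currently contains only one string, I created some sample data myself; I hope it is close to your actual data. As Nate suggested, using the tidytext package is one way to do this. Here, I first removed the numbers, the punctuation, and the brackets together with their contents. Then I split each string into words with unnest_tokens() and removed the stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() step. Finally, grouping the data by id, I combined the words back into strings in summarise(). Note that I used jura rather than Jura, because unnest_tokens() converts uppercase letters to lowercase.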

mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

library(dplyr)
library(tidytext)

data(stop_words)

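# Strip digits, punctuation, and bracketed content, tokenise into
# lowercase words, drop stop words, then rebuild one string per id.
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))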

#     id                              text
#  <int>                             <chr>
#1     1  tablet cleaning hydraulic system
#2     2 tablet cleaning mambojumbo system
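
The "create your own dictionary" idea mentioned above could look roughly like the following. This is only a sketch: the names my_stop_words and all_stop_words and the "custom" lexicon label are made up for illustration, and the custom words are lowercased because unnest_tokens() lowercases the tokens by default. The bracketed content is removed in a separate gsub() call here so that the result does not depend on how the regex alternation is resolved.

```r
library(dplyr)
library(tidytext)

data(stop_words)

mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

# Hypothetical custom dictionary; the words could also be read from a file.
my_stop_words <- tibble(word = tolower(c("Jura")), lexicon = "custom")

# Built-in stop words plus the custom ones.
all_stop_words <- bind_rows(stop_words, my_stop_words)

cleaned <- mydata %>%
  mutate(text = gsub("\\([^)]*\\)", " ", text),          # brackets and their contents
         text = gsub("[0-9]+|[[:punct:]]", " ", text)) %>% # digits and punctuation
  unnest_tokens(output = word, input = text) %>%
  anti_join(all_stop_words, by = "word") %>%               # drop every stop word
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))

cleaned
```

Compared with listing extra words inside filter(), keeping them in a data frame makes it easier to grow the list over time.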

Another approach is shown below. In this case I did not use unnest_tokens().

library(magrittr)
library(stringi)
library(tidytext)

data(stop_words)

# Clean each string, split it on spaces, drop stop words (matching is
# case-sensitive here, hence "Jura"), and paste the words back together.
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x) paste(x[!x %in% c(stop_words$word, "Jura")], collapse = " ")) %>%
  unlist()

#[1] "Tablet cleaning hydraulic system"  "Tablet cleaning mambojumbo system"
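
If no extra packages are wanted, the same cleaning can probably be done in base R alone. The sketch below is an assumption-laden illustration rather than part of the original answer: clean_text() is a hypothetical helper, and it uses only the two stop words named in the question (Jura and the), matched case-insensitively, so "for" is kept as in the asker's expected output.

```r
# Hypothetical helper: remove bracketed text, digits, punctuation,
# and the given stop words from each element of a character vector.
clean_text <- function(x, stopwords) {
  x <- gsub("\\([^)]*\\)", " ", x)         # "(...)" including its contents
  x <- gsub("[0-9]+|[[:punct:]]", " ", x)  # digits and punctuation
  words <- strsplit(trimws(x), "[[:space:]]+")
  vapply(words, function(w) {
    keep <- !(tolower(w) %in% tolower(stopwords))  # case-insensitive match
    paste(w[keep], collapse = " ")
  }, character(1))
}

texts <- c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
           "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321")
my_stopwords <- c("Jura", "the")

result <- clean_text(texts, my_stopwords)
result
```

The original capitalisation of the remaining words is preserved, because only the comparison against the stop-word list is lowercased.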

Regarding r - Filtering text from numbers and stop words in R (not for tdm), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47596065/