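I have a dataset in R that lists a bunch of company names, and as part of cleaning it up I want to remove words such as "Inc", "Company", "LLC", and so on. I have the following sample data:

Sample data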

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm LLC
3 Miami, FL            Smith & Co.


Words I do not want in the output:

stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")


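I built the following function to split each string into words, drop the stop words, and paste the remaining words back together, but it does not work through the dataset row by row.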

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(sampleData$Company,stopwords)


The output of the function above looks like this:

[1] "XYZ Company Consulting Firm Smith"


The output should be:

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm
3 Miami, FL            Smith


Any help would be appreciated.

Best answer

We can use the "tm" package:

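library(tm)

# df is your data frame (sampleData in the question)
stopwords <- readLines('stopwords.txt')   # Your stop words file, one word per line
x <- df$Company                           # Company column data (column names are case-sensitive)
x <- tm::removeWords(x, stopwords)        # Remove stop words; tm:: avoids calling a user-defined removeWords

df$company_new <- x                       # Add the cleaned vector as a new column and check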
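
As for why the function in the question returns a single string: unlist(strsplit(str, " ")) flattens the words of all rows into one vector, so paste() collapses everything together. Below is a minimal base R sketch that keeps the rows separate instead. It assumes the sample data and stop word list from the question; the helper name dropStopwords is hypothetical, chosen so it does not mask tm's removeWords.

```r
# Sample data from the question
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)

stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper: strsplit() returns one word vector per input string,
# and vapply() filters and re-pastes each of them separately, so the result
# has one cleaned string per row.
dropStopwords <- function(str, stopwords) {
  vapply(strsplit(str, " ", fixed = TRUE), function(words) {
    paste(words[!words %in% stopwords], collapse = " ")
  }, character(1))
}

sampleData$Company <- dropStopwords(sampleData$Company, stopwords)
print(sampleData)
```

With the sample data, the Company column becomes "XYZ Company", "Consulting Firm" and "Smith". Matching is on whole space-separated tokens and is case-sensitive, so a token such as "LLC," (with a trailing comma) would be kept unless it is added to the list.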

Regarding "r - Removing certain words from strings in a column of a data frame in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40901100/

10-12 17:36