Problem description
I have a file 'check_text.txt' that contains "said say says make made". I'd like to stem it to get "say say say make make". I tried stemDocument in the tm package, as shown below, but I only get "said say say make made". Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!
library(tm)
### Read the raw text file
filename <- 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
### Build a corpus and stem every document
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument, language = "english")
### Inspect the stemmed text
as.data.frame(text_corpus)$text
EDIT: I also tried wordStem in the SnowballC package:
> library(SnowballC)
> wordStem(c("said", "say", "says", "make", "made"))
[1] "said" "sai" "sai" "make" "made"
Recommended answer
If a package shipped a data set of irregular English verbs, this task would be easy. I do not know of any package with such data, so I chose to build my own database by scraping. I am not sure whether this website covers all irregular verbs; if necessary, you may want to look for a better source to build your own database. Once you have the database, you can work on your task.
First, I used stemDocument() to clean up present-tense forms ending in -s. Then I collected the past forms found in words (i.e., past) and their infinitive forms (i.e., inf1), identified the order of the past forms in temp, and located the positions of the past forms in temp. Finally, I replaced the past forms with their infinitive forms. I repeated the same procedure for past participles.
library(tm)
library(rvest)
library(dplyr)
library(splitstackshape)
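### Create a database of irregular verbs by scraping
x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")
x %>%
  html_table(header = TRUE) %>%
  bind_rows %>%
  rename(Past = `Simple Past`, PP = `Past Participle`) %>%
  ### Drop the single-letter rows that head each alphabetical section
  filter(!Infinitive %in% LETTERS) %>%
  ### One row per alternative form listed as "a / b"
  cSplit(splitCols = c("Past", "PP"),
         sep = " / ", direction = "long") %>%
  filter(complete.cases(.)) %>%
  ### Strip trailing "(...)" notes and "[?]" markers from every column
  mutate(across(everything(),
                ~ gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
                       replacement = "",
                       x = .x))) -> mydic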
### Work on the task
words <- c("said", "drawn", "say", "says", "make", "made", "done")
### Stem first, so that "says" becomes "say"
temp <- stemDocument(words)
### Convert past forms to infinitive forms
### Collect past forms
past <- mydic$Past[which(mydic$Past %in% temp)]
### Collect infinitive forms of past forms
inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]
### Identify the order of past forms in temp
ind <- match(temp, past)
ind <- ind[is.na(ind) == FALSE]
### Where are the past forms in temp?
position <- which(temp %in% past)
temp[position] <- inf1[ind]
### Check
temp
#[1] "say" "drawn" "say" "say" "make" "make" "done"
### PP forms to infinitive forms (same as past forms)
pp <- mydic$PP[which(mydic$PP %in% temp)]
inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
ind <- match(temp, pp)
ind <- ind[is.na(ind) == FALSE]
position <- which(temp %in% pp)
temp[position] <- inf2[ind]
### Check
temp
#[1] "say" "draw" "say" "say" "make" "make" "do"
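The two replacement passes above can also be folded into a single lookup. The following is only a minimal sketch of that idea, not the code from the answer: `toy_dic` is a hypothetical hand-written stand-in for the scraped `mydic` (same column names, a handful of illustrative rows), and `to_infinitive()` is a hypothetical helper name. It assumes the -s forms have already been handled by stemming, so "says" is left out of the example input.

```r
# Hypothetical hand-written lookup table standing in for the scraped `mydic`;
# the entries are illustrative, not exhaustive.
toy_dic <- data.frame(
  Infinitive = c("say", "make", "draw", "do"),
  Past       = c("said", "made", "drew", "did"),
  PP         = c("said", "made", "drawn", "done"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace any word found among the past or
# past-participle forms with its infinitive; leave other words untouched.
to_infinitive <- function(words, dic) {
  # Named vector: names are inflected forms, values are infinitives.
  forms  <- c(dic$Past, dic$PP)
  lookup <- setNames(rep(dic$Infinitive, times = 2), forms)
  # Keep only the first mapping when a form appears more than once.
  lookup <- lookup[!duplicated(names(lookup))]
  idx <- match(words, names(lookup))
  ifelse(is.na(idx), words, unname(lookup[idx]))
}

words  <- c("said", "drawn", "say", "make", "made", "done")
result <- to_infinitive(words, toy_dic)
print(result)
# [1] "say"  "draw" "say"  "make" "make" "do"
```

Because the lookup is a plain string match, a word that is both an irregular verb form and something else (for example "saw") would also be rewritten, so it may be worth restricting the replacement to tokens tagged as verbs if that matters for your data.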