Problem Description
I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus. Now I would like to search for the number of times these phrases have appeared in another corpus.
Is there a way to do this in the tm package? (Or another related package?)
For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple thousand sub-texts. I want to find out how many times each phrase in tags has appeared in CorpusB.
As always, I appreciate all your help!
Recommended Answer
Ain't perfect but this should get you started.
#User Defined Functions
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)  #helper to strip leading/trailing whitespace

strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
    strp <- function(x, digit.remove, apostrophe.remove){
        #lowercase and drop punctuation (apostrophes optionally kept)
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
        ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
        apostrophe.remove = apostrophe.remove))))
}
#==================================================================
#Create 2 'corpus' documents (you'd have to actually do all this in tm)
corpus1 <- 'I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus.
Now I would like to search for the number of times these phrases have appeared in another corpus.
Is there a way to do this in TM package? (Or another related package)
For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of
couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
As always, I appreciate all your help!'
corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that
language into R? – Eric Strom 2 hours ago
I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
Could you provide some example? or show what you have in mind for input and output? or a pseudo code?
As it is I find the question a bit too general. As it sounds I think you could use regular expressions
with grep to find your 'tags'. – AndresT 15 mins ago"
#=======================================================
#Clean up the text
corpus1 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus1))
corpus2 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus2))
corpus1.wrds <- as.vector(unlist(strsplit(strip(corpus1), " ")))
corpus2.wrds <- as.vector(unlist(strsplit(strip(corpus2), " ")))
#create frequency tables for each corpus
corpus1.Freq <- data.frame(table(corpus1.wrds))
corpus1.Freq$corpus1.wrds <- as.character(corpus1.Freq$corpus1.wrds)
corpus1.Freq <- corpus1.Freq[order(-corpus1.Freq$Freq), ]
rownames(corpus1.Freq) <- 1:nrow(corpus1.Freq)
key.terms <- corpus1.Freq[corpus1.Freq$Freq>2, 'corpus1.wrds'] #key words to match on corpus 2
corpus2.Freq <- data.frame(table(corpus2.wrds))
corpus2.Freq$corpus2.wrds <- as.character(corpus2.Freq$corpus2.wrds)
corpus2.Freq <- corpus2.Freq[order(-corpus2.Freq$Freq), ]
rownames(corpus2.Freq) <- 1:nrow(corpus2.Freq)
#Match key words to the words in corpus 2
corpus2.Freq[corpus2.Freq$corpus2.wrds %in% key.terms, ]
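Since the question asks about the tm package specifically, here is a minimal sketch of the same word-level counting done with tm's DocumentTermMatrix and a dictionary. The tags vector is a placeholder standing in for the terms you would pull out of CorpusA, and corpus1/corpus2 are the character strings defined above.
library(tm)

#Hypothetical tag set standing in for the terms obtained from CorpusA
tags <- c("corpus", "phrases", "package")

#Build 'CorpusB' from the raw character strings defined above
corpusB <- Corpus(VectorSource(c(corpus1, corpus2)))
corpusB <- tm_map(corpusB, content_transformer(tolower))
corpusB <- tm_map(corpusB, removePunctuation)

#Restrict the document-term matrix to just the tag terms
dtm <- DocumentTermMatrix(corpusB, control = list(dictionary = tags))

#Counts per document and totals across CorpusB
inspect(dtm)
colSums(as.matrix(dtm))
Note that DocumentTermMatrix tokenizes on single words by default, so this only covers one-word tags.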
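For true multi-word phrases, a rough base-R sketch along these lines should do it; again, the tags values are placeholders and corpus2 is the raw text from above.
#Hypothetical multi-word tags
tags <- c("text mining", "in another language")

#Count fixed-string occurrences of one phrase in a text
count_phrase <- function(phrase, text) {
    hits <- gregexpr(phrase, tolower(text), fixed = TRUE)[[1]]
    if (hits[1] == -1) 0L else length(hits)  #gregexpr returns -1 when there is no match
}

sapply(tags, count_phrase, text = corpus2)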