This post explains how to scrape web-page content and then count word frequencies in R. The question and accepted answer below walk through a complete solution.

Problem description

Here is my code:

library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog   <- getURL(url.link)
blog   <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)             ## titles

traverse_each_page <- function(x){
  tmp <- htmlParse(x)
  xpathApply(tmp, '//div[@id="mainContent"]')
}
pages <- lapply(titles[2:3], traverse_each_page)

Here is the pseudocode:

  1. Take an xml document: http://www.jamesaltucher.com/sitemap.xml
  2. Go to each link
  3. Parse the html content of each link
  4. Extract the div with id="mainContent"
  5. Get the text inside it
  6. Count the frequency of each word that appears across all the articles, case-insensitive.

I have completed steps 1-4. I need some help with step 5.

Basically, if the word "the" appears twice in article 1 and five times in article 2, I want to know that "the" appears a total of seven times across the 2 articles.
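As a toy illustration of the aggregation I am after (hypothetical data, just to pin down the goal):

article1 <- c("the", "cat", "sat", "on", "the", "mat")           # "the" twice
article2 <- c("the", "the", "dog", "the", "ate", "the", "the")   # "the" five times
sum(c(article1, article2) == "the")                              # 7 in total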

Also, I do not know how to view the contents I have extracted into pages. I want to learn how to view them, which will make it easier for me to debug.

Recommended answer

Here you go, start to finish. I changed your web-scraping code so that it picks up less non-text content, and then down at the bottom are the word counts.

Here's your code for downloading the URLs...

library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog   <- getURL(url.link)
blog   <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)             ## titles

I've changed your function to extract the text from each page...

traverse_each_page <- function(x){
  # download the page, parse it, and pull the text out of the mainContent div
  tmp <- htmlParse(getURI(x))
  xpathSApply(tmp, '//div[@id="mainContent"]', xmlValue)
}
pages <- sapply(titles[2:3], traverse_each_page)

Let's remove newlines and other non-text characters...

# replace newlines, tabs and carriage returns with spaces
nont  <- c("\n", "\t", "\r")
pages <- gsub(paste(nont, collapse = "|"), " ", pages)
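For reference, the collapsed pattern is simply an alternation over those three characters:

paste(nont, collapse = "|")
#> [1] "\n|\t|\r"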

Regarding your second question: to inspect the contents of pages, just type its name at the console:

pages
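A few other handy base-R ways to peek at the result (a small sketch; these are standard base functions, not part of the original answer):

str(pages)                     # overall structure: a named character vector
substring(pages[[1]], 1, 200)  # the first 200 characters of the first article
nchar(pages)                   # how much text each page yielded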

Now let's do your step 5, 'Count the frequencies of each word that appears across all the articles, case-insensitive.'

require(tm)
# convert list into corpus
mycorpus <- Corpus(VectorSource(pages))
# prepare to remove stopwords, ie. common words like 'the'
skipWords <- function(x) removeWords(x, stopwords("english"))
# prepare to remove other bits we usually don't care about
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
# do it
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)
# make document term matrix
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10)))
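One caveat: the tm API has changed since this answer was written. In tm 0.6 and later, plain base functions like tolower have to be wrapped in content_transformer() when passed to tm_map, so the tm_reduce call above may error on a current release. A rough equivalent of the same pipeline under that assumption (a sketch, not the original answer's code):

mycorpus <- Corpus(VectorSource(pages))
mycorpus <- tm_map(mycorpus, content_transformer(tolower))    # lower-case everything
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mydtm <- DocumentTermMatrix(mycorpus, control = list(wordLengths = c(3, 10)))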

Here's where you can see the count of each word per document:

inspect(mydtm)
# you can assign it to a data frame for more convenient viewing
my_df <- inspect(mydtm)
my_df
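Note that in recent versions of tm, inspect() just prints a preview rather than returning the counts, so the assignment above may not work as shown; converting the matrix explicitly is more robust (a sketch, assuming a current tm release):

my_df <- as.data.frame(as.matrix(mydtm))
my_df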

Here's how you count the total frequency of each word across all the articles, case-insensitive...

apply(mydtm, 2, sum)
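If you want the words ranked by frequency, one way (a base-R sketch, not part of the original answer) is to sum the columns of the plain matrix and sort:

word_freqs <- sort(colSums(as.matrix(mydtm)), decreasing = TRUE)
head(word_freqs, 10)   # the ten most frequent words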

Does that answer your question? I guess you're probably really only interested in the most frequent words (as @buruzaemon's answer details), or a certain subset of words, but that's another question...
