Question
I'm trying to scrape some tables (election data) using the XML package. Browsing SO, I found out how to scrape a single URL using:
library(XML)
url <- "http://www.elecciones2011.gob.ar/paginas/paginas/dat99/DPR99999A.htm"
total <- readHTMLTable(url)
n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
df<-as.data.frame(total[[which.max(n.rows)]])
With the above code I get a good enough result. I'm also able (with the readLines function and some tweaking) to get a vector with all the URLs I want to scrape. Like this:
base_url <- "http://www.elecciones2011.gob.ar/paginas/paginas/"
urls <- paste(
base_url,
c(
"dat02/DPR02999A",
"dat03/DPR03999A",
"dat04/DPR04999A",
"dat05/DPR05999A",
"dat06/DPR06999A",
"dat07/DPR07999A",
"dat08/DPR08999A",
"dat09/DPR09999A",
"dat10/DPR10999A",
"dat11/DPR11999A",
"dat12/DPR12999A",
"dat13/DPR13999A",
"dat14/DPR14999A",
"dat15/DPR15999A",
"dat16/DPR16999A",
"dat17/DPR17999A",
"dat18/DPR18999A",
"dat19/DPR19999A",
"dat20/DPR20999A",
"dat21/DPR21999A",
"dat22/DPR22999A",
"dat23/DPR23999A",
"dat24/DPR24999A"
),
".htm",
sep = ""
)
What I'd like to do is create a function that runs readHTMLTable on all the URLs and stores the results in a vector or data frame (in one or many, whichever is easier). I'm quite new to R, and I'm particularly bad at functions. I tried something like...
tabla<- for (i in urls){
readHTMLTable(urls)
}
...but it's not even close.
Answer
The most basic approach uses a loop. This just wraps the code you supplied inside a for loop and stores each resulting data frame as an element of a list.
tabla <- list()
for(i in seq_along(urls))
{
total <- readHTMLTable(urls[i])
n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
tabla[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}
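When scraping many pages, a single unreachable or malformed page would stop the loop above partway through. A minimal sketch of one way to guard against that, assuming it is acceptable to record a failed page as NULL and carry on: the helper names largest_table and scrape_all are hypothetical, and the reader argument is an illustrative addition so the functions can be tried without network access.

```r
# Hypothetical helper: returns the largest table found at `url`,
# or NULL (with a warning) if the page cannot be read or has no tables.
# `reader` defaults to XML::readHTMLTable; the parameter is an assumption
# added here so a stand-in function can be supplied for testing.
largest_table <- function(url, reader = XML::readHTMLTable) {
  tryCatch({
    total <- reader(url)
    # Row count per table; tables that came back as NULL count as 0 rows
    n.rows <- vapply(total,
                     function(x) if (is.null(x)) 0L else nrow(x),
                     integer(1))
    as.data.frame(total[[which.max(n.rows)]])
  }, error = function(e) {
    warning("Could not read ", url, ": ", conditionMessage(e))
    NULL
  })
}

# Hypothetical wrapper: the same for loop as above, with a preallocated list.
scrape_all <- function(urls, reader = XML::readHTMLTable) {
  tabla <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # tabla[i] <- list(x) keeps a NULL result in place instead of dropping it
    tabla[i] <- list(largest_table(urls[i], reader))
  }
  names(tabla) <- urls
  tabla
}
```

With the XML package installed, calling scrape_all(urls) should give a list with one element per URL, where failed pages show up as NULL entries rather than aborting the run.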
A more elegant approach uses lapply. The code you supplied now goes inside a function, which is called once for each URL, and lapply collects the results in a list.
tabla <- lapply(urls, function(url) {
total <- readHTMLTable(url)
n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
as.data.frame(total[[which.max(n.rows)]])
})
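Either approach leaves you with a list of data frames. If you would rather have a single data frame, one option is to tag each table with its source page and stack them. This is only a sketch: it assumes every table has the same columns, which has not been checked against the real pages, and example_tabla and example_urls are made-up stand-ins for the scraped list and the URL vector.

```r
# Made-up stand-ins for the scraped list and its URLs (illustration only)
example_urls <- c("dat02/DPR02999A.htm", "dat03/DPR03999A.htm")
example_tabla <- list(
  data.frame(party = c("A", "B"), votes = c(10, 20)),
  data.frame(party = c("A", "B"), votes = c(5, 7))
)

# Name each list element after the file it came from
names(example_tabla) <- basename(example_urls)

# Prepend a `source` column so rows stay traceable after stacking
tagged <- Map(function(df, src) cbind(source = src, df),
              example_tabla, names(example_tabla))

# Stack all tables; rbind needs identical column names in every table
combined <- do.call(rbind, unname(tagged))
rownames(combined) <- NULL
```

If the tables turn out to have different columns, keeping them as a named list is probably the simpler choice.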