Problem description
I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:
rm(list=ls())
library(XML)
page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page",
page_numbers,
sep = "="
)
tables <-(for i in page_numbers){
readHTMLTable(urls)
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
times <- tables[[which.max(n.rows)]]
How can I use this code to scrape all 21 pages and get the complete results? Should I use a for() loop, an lapply function, or something else? I'm a bit lost here.
Thanks!
Recommended answer
Add the page number to each URL.
page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?pid=list&page",
page_numbers,
sep = "="
)
Now loop over each page, scraping each one. It doesn't matter too much whether you use a for loop or an *apply function. See, e.g., Circle 4 of the R Inferno (pdf) for a discussion of the difference between 'for' loops and 'lapply'.
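As a rough sketch of the lapply route, something like the following might work. The helper names build_urls and scrape_pages are made up for illustration, and the sketch assumes that the results table is the largest table on each page (as in the question's which.max step) and that every page's table has the same columns, so the pieces can be stacked with rbind. Neither assumption has been checked against the live site.

```r
# Hypothetical helper: build one URL per results page, as in the answer above.
build_urls <- function(page_numbers) {
  paste(
    "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page",
    page_numbers,
    sep = "="
  )
}

# Hypothetical helper: read every page, keep the table with the most rows
# on each page, and stack the per-page tables into one data frame.
# read_page defaults to XML::readHTMLTable; it is a parameter only so the
# function can be tried out without network access.
scrape_pages <- function(urls, read_page = XML::readHTMLTable) {
  per_page <- lapply(urls, function(u) {
    tables <- read_page(u)
    if (length(tables) == 0) return(NULL)
    # readHTMLTable can return NULL entries; count those as zero rows
    n_rows <- vapply(
      tables,
      function(t) if (is.null(t)) 0L else nrow(t),
      integer(1)
    )
    tables[[which.max(n_rows)]]
  })
  # Drop pages that yielded nothing, then stack (assumes matching columns)
  per_page <- Filter(Negate(is.null), per_page)
  do.call(rbind, per_page)
}
```

With the XML package installed, usage would look roughly like `results <- scrape_pages(build_urls(page_numbers))` followed by `write.csv(results, "results.csv", row.names = FALSE)`, where the output file name is only a placeholder. A for loop that fills a pre-allocated list and is followed by the same do.call(rbind, ...) step would do the same job.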