How do I scrape multiple pages with XML and readHTMLTable?

Question

I'm using the XML package to scrape results from the Chicago marathon into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I've written so far works for the first page:

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page",
page_numbers,
sep = "="
)

tables <-(for i in page_numbers){
readHTMLTable(urls)
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

How can I use this code to scrape all 21 pages and get the complete results? Should I use a for() loop, an lapply function, or something else? I'm a bit lost here.

Thanks!

Recommended answer

Add the page number to each URL.

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page",
  page_numbers,
  sep = "="
)

Now loop over each page, scraping each one. It doesn't matter too much whether you use a for loop or an *apply function. See, e.g., Circle 4 of the R Inferno (pdf) for a discussion of the difference between 'for' loops and 'lapply'.
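
As a minimal sketch of that loop, the lapply version might look like the following. It assumes that the results on each page are the largest table readHTMLTable returns (the same heuristic as which.max(n.rows) in the question) and that every page yields a table with the same columns, so the pieces can be stacked with rbind. The helper names scrape_page and scrape_pages, the read_table parameter, and the output file name are hypothetical and are not part of the XML package.

```r
# Sketch only: scrape_page and scrape_pages are hypothetical helper names.
# read_table defaults to XML::readHTMLTable; it is a parameter so the
# helpers can be tried with a stand-in reader and no network access.

# Return the largest table found on one page
# (assumed here to be the results table).
scrape_page <- function(url, read_table = XML::readHTMLTable) {
  tables <- read_table(url, stringsAsFactors = FALSE)
  # Count rows per table, treating tables that could not be parsed (NULL) as empty.
  n_rows <- vapply(
    tables,
    function(tbl) if (is.null(tbl)) 0L else nrow(tbl),
    integer(1)
  )
  tables[[which.max(n_rows)]]
}

# Scrape every URL and stack the per-page tables into one data frame.
# Assumes all pages share the same column layout.
scrape_pages <- function(urls, read_table = XML::readHTMLTable) {
  pages <- lapply(urls, scrape_page, read_table = read_table)
  do.call(rbind, pages)
}

# Possible usage with the urls vector built above (file name is illustrative):
# results <- scrape_pages(urls)
# write.csv(results, "chicago_marathon_2011.csv", row.names = FALSE)
```

A for loop that appends each page to a preallocated list and calls do.call(rbind, ...) at the end would do the same job. What is worth avoiding is growing a data frame with rbind inside the loop, which is the pattern Circle 2 of the R Inferno warns against.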
