Navigating and scraping a webpage with drop-down HTML forms using R

Question

I'm attempting to scrape data from http://www.footballoutsiders.com/stats/snapcounts, but I can't change the fields in the drop-down boxes on the site ("team", "week", "position", and "year"). My attempt to scrape the table associated with team = "ALL", week = "1", pos = "All", and year = "2015" using rvest is below.

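library(rvest)

url <- "http://www.footballoutsiders.com/stats/snapcounts"

# Open a session and take the third form on the page
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[3]]

# Fill in the drop-down fields
filled_form <- set_values(pgform,
                          "team" = "ALL",
                          "week" = "1",
                          "pos"  = "ALL",
                          "year" = "2015")

# Submit the form (the returned session is not stored)
submit_form(session = pgsession, form = filled_form, POST = url)

# Read the page at the original URL again and extract the second table
y <- read_html("http://www.footballoutsiders.com/stats/snapcounts")

y <- y %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table(header = TRUE)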

This code returns the table associated with the default values in the drop-down boxes, which are team = "ALL", week = "20", pos = "QB", and year = "2015". The result is a data frame that contains only 11 observations. If the fields had actually been changed, it would have returned a data frame with 1,695 observations.

Recommended answer

You can capture the session that submit_form returns when the form is submitted and use that session as the input to html_nodes:

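# Keep the session returned by the submission
d <- submit_form(session = pgsession, form = filled_form)

# Extract the second table from the submitted page, not from the original URL
y <- d %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table(header = TRUE)

dim(y)
#[1] 1695   11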

Otherwise, if you use read_html(url), you are reading the original page with its default drop-down values, not the page produced by submitting the form.
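
For readers on rvest 1.0.0 or later, where html_session, set_values, submit_form, and html_nodes were superseded by session, html_form_set, session_submit, and html_elements, the same pattern might look like the sketch below. The inline HTML is a made-up stand-in for the real page so that the form-filling step can be tried offline, and scrape_with_form is a hypothetical helper name, not part of rvest. The form and table positions (3 and 2) come from the question and may no longer match the live site.

```r
library(rvest)

# Offline stand-in for the page: a form with four drop-downs.
# This markup is an assumption for illustration, not the real site's HTML.
page <- read_html('
  <form method="post" action="/stats/snapcounts">
    <select name="team"><option value="ALL" selected>ALL</option></select>
    <select name="week">
      <option value="1">1</option>
      <option value="20" selected>20</option>
    </select>
    <select name="pos">
      <option value="ALL">ALL</option>
      <option value="QB" selected>QB</option>
    </select>
    <select name="year"><option value="2015" selected>2015</option></select>
  </form>')

form <- html_form(page)[[1]]

# html_form_set returns a modified copy; `form` itself is left unchanged.
filled <- html_form_set(form, team = "ALL", week = "1", pos = "ALL", year = "2015")
print(filled$fields$week$value)

# Hypothetical helper for the live flow: submit the form, then parse the
# page that the submission returns instead of re-reading the original URL.
scrape_with_form <- function(url, form_index, table_index, ...) {
  s <- session(url)
  filled_form <- html_form_set(html_form(s)[[form_index]], ...)
  result <- session_submit(s, filled_form)   # keep the returned session
  tables <- html_elements(read_html(result), "table")
  html_table(tables[[table_index]], header = TRUE)
}

# Example call (needs network access; indices taken from the question):
# y <- scrape_with_form("http://www.footballoutsiders.com/stats/snapcounts",
#                       form_index = 3, table_index = 2,
#                       team = "ALL", week = "1", pos = "ALL", year = "2015")
```

The key design point is unchanged from the answer above: the table is read from the object returned by the submit call, so the selected drop-down values are the ones reflected in the result.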

