This post covers how to scrape an interactive (scrolling) table in R with rvest, as a question and accepted answer.

Problem description

I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm

I'm using rvest but am having trouble finding the correct xpath for the table. My current code is as follows:

library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)
table <- gis_data_html %>%
  html_node(xpath = '//span') %>%
  html_table()

Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing""

Anyone know what I would need to change to scrape the interactive table in the link?

Accepted answer

So the problem you're facing is that rvest will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see

<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0"
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)"
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)"
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>

but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.

You have a couple of options to deal with this. The 'easy' one is to use RSelenium to load the page in an actual browser and then grab the element after it's loaded. The other option is to read through the javascript, see where it gets its data from, and then tap into that source directly rather than scraping the table.
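The RSelenium route described above can be sketched roughly as follows. This is a sketch only, not a tested implementation: it assumes a working RSelenium setup with a compatible local browser driver, and the 5-second wait is an arbitrary guess at how long the javascript needs to build the table.

```r
# Sketch: render the page in a real browser so the javascript runs,
# then hand the rendered DOM to rvest. Requires RSelenium and a
# local browser driver (assumption: chrome is available).
library(RSelenium)
library(rvest)

rD    <- rsDriver(browser = "chrome", verbose = FALSE)
remDr <- rD$client

remDr$navigate("http://proximityone.com/cd114_2013_2014.htm")
Sys.sleep(5)  # crude wait for the javascript to build the table

# getPageSource() returns the DOM *after* scripts ran,
# so elements like "aw52-box-focus" now exist
rendered <- remDr$getPageSource()[[1]]
page     <- read_html(rendered)

remDr$close()
rD$server$stop()
```

From here `page` can be queried with `html_node()`/`html_table()` as usual, since the dynamically created nodes are present in the rendered source.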

Update

Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv

The .csv doesn't have column headers, but those are in the <script> as well

var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013",
"Population<br>2014",
"PopCh<br>2013-14",
"%PopCh<br>2013-14",
"MHI<br>2013",
"MHI<br>2014",
"MHI<br>Change<br>2013-14",
"%MHI<br>Change<br>2013-14",
"MFI<br>2013",
"MFI<br>2014",
"MFI<br>Change<br>2013-14",
"%MFI<br>Change<br>2013-14",
"MHV<br>2013",
"MHV<br>2014",
"MHV<br>Change<br>2013-14",
"%MHV<br>Change<br>2013-14",

]
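Those header strings embed literal `<br>` tags, which the table widget uses for two-line column headers; before attaching them to the CSV you'd probably want to flatten them. A small sketch (the vector here is an abbreviated sample of the list above, not the full set):

```r
# Abbreviated sample of the header strings quoted above
columns <- c("FirstNnme", "LastName", "Party",
             "State<br>CD", "Population<br>2013", "%PopCh<br>2013-14")

# Replace the literal "<br>" with a space to get usable column names,
# e.g. "State<br>CD" becomes "State CD"
clean_names <- gsub("<br>", " ", columns, fixed = TRUE)
```

Once the CSV is read in with `header = FALSE`, the cleaned vector can be applied with `names(data) <- clean_names` (assuming its length matches the number of columns).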

Programmatic solution

Instead of digging through the javascript by hand (useful in case there are several such pages on this website you want), you can attempt this programmatically too. We read the page, get the <script> nodes, get the text (the script itself), and look for references to a CSV file. Then we expand the relative URL and read it in. This doesn't help with the column names, but those shouldn't be too hard to extract either.

library(rvest)

page <- read_html("http://proximityone.com/cd114_2013_2014.htm")

# Pull the text of every <script> node and keep only those mentioning a .csv
scripts <- page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)

# Extract the relative path ("../...csv") and expand it to a full URL
relCSV  <- stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV <- gsub("\\.\\.", "http://proximityone.com", relCSV)

data <- read.csv(fullCSV, header = FALSE)
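The path extraction and URL expansion at the heart of that pipeline can be checked in isolation, without hitting the network. The script line below is a made-up stand-in for whatever line in the page's javascript actually references the CSV:

```r
library(stringr)

# Made-up stand-in for the <script> line that references the CSV
script_text <- 'dataSrc = "../countytrends/cd114_acs2014utf8_hl.csv";'

# Grab the relative path: "../" followed (non-greedily) by anything up to "csv"
relCSV <- str_extract(script_text, "\\.\\./.*?csv")

# Swap the ".." prefix for the site root to get an absolute URL
fullCSV <- gsub("\\.\\.", "http://proximityone.com", relCSV)
fullCSV
# "http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv"
```

The non-greedy `.*?csv` stops at the first "csv", which is fine here since the extension is the only occurrence in the path.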
