Problem description
I am trying to scrape a website using XPath. When I open the browser's developer tools, I see lines like this:
<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">
I would like to scrape all values of data-abc. Say each element on the site is a movie; I want to extract the data-abc value for every movie on the page.
I would like to do this with the rvest package in R. Below are two different attempts, neither of which worked:
website %>% html_nodes("js-bestRate-show") %>% html_text()

website %>%
  html_nodes(xpath = "js-bestRate-show") %>%
  html_nodes(xpath = "//div") %>%
  html_nodes(xpath = "//span") %>%
  html_nodes(xpath = "//data-abc")
Does anyone know how html_nodes and rvest work?
Recommended answer
The node is a span with class js-bestRate-show. Everything else is an attribute. So you want something like:
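library(rvest)

# Sample markup copied from the question
h <- '<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">'

h %>%
  read_html() %>%
  # CSS selector: tag name "span" followed by ".class"
  html_nodes("span.js-bestRate-show") %>%
  # data-abc is an attribute, so read it with html_attr(), not html_text()
  html_attr("data-abc")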
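
To relate this back to the two attempts in the question: a bare "js-bestRate-show" selector is read as a tag name rather than a class, and "//data-abc" looks for an element called data-abc, while data-abc is an attribute of the span. The sketch below is only an illustration; it assumes a made-up page fragment with two matching spans (the second data-abc value and the "other" span are invented for the example) and shows the CSS selector next to what should be an equivalent XPath expression.

```r
library(rvest)

# Hypothetical fragment: two "movies" plus one span that should be ignored.
# The markup and the second data-abc value are invented for illustration.
page <- read_html('
  <div>
    <span class="js-bestRate-show" data-id="1" data-abc="0602524361510"></span>
    <span class="js-bestRate-show" data-id="2" data-abc="0602524361527"></span>
    <span class="other" data-abc="ignored"></span>
  </div>')

# CSS selector: tag name, then ".class"
css_values <- page %>%
  html_nodes("span.js-bestRate-show") %>%
  html_attr("data-abc")

# XPath version: the class is matched with an attribute predicate "[@class=...]"
xpath_values <- page %>%
  html_nodes(xpath = "//span[@class='js-bestRate-show']") %>%
  html_attr("data-abc")

print(css_values)
print(identical(css_values, xpath_values))
```

Note that the XPath predicate above matches the class attribute exactly, so it would miss a span carrying additional classes; the CSS form does not have that limitation.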