Rvest html_nodes across div and XPath

Problem description

I am trying to scrape a website using XPath. When I open the browser's developer tools, I see lines like this:

<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">

I would like to scrape all values of data-abc. Let's say each element on the site is a movie, so I would like to collect the data-abc value for every movie on the page.

I would like to do this with the rvest package in R. Below are two different attempts, neither of which worked:

website %>% html_nodes("js-bestRate-show") %>% html_text()

website %>%
  html_nodes(xpath = "js-bestRate-show") %>%
  html_nodes(xpath = "//div") %>%
  html_nodes(xpath = "//span") %>%
  html_nodes(xpath = "//data-abc")

Does anyone know how html_nodes and rvest work?

Recommended answer

The node is a span with class js-bestRate-show. Everything else, including data-abc, is an attribute of that node. So you select the node and then read the attribute, something like:

library(rvest)
h <- '<span class="js-bestRate-show" data-crid="11232895" data-id="928723" data-abc="0602524361510" data-referecenceta="44205406" data-catalog="1">'

h %>% 
  read_html() %>% 
  html_nodes("span.js-bestRate-show") %>% 
  html_attr("data-abc")
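
The two attempts in the question fail for related reasons: the CSS selector "js-bestRate-show" has no leading dot, so it looks for an element with that tag name rather than that class, and the XPath "//data-abc" looks for an element named data-abc, whereas attributes are addressed with @ in XPath. As a minimal sketch of the same idea on a page with several movies, here is a CSS version next to an XPath version. The page fragment is hypothetical: the wrapping div elements, the "movie" class, and the second span's values are made up for illustration.

```r
library(rvest)

# Hypothetical page fragment with two "movie" entries.
# Only the first data-abc value comes from the question; the rest is invented.
page <- read_html('
<div class="movie">
  <span class="js-bestRate-show" data-id="928723" data-abc="0602524361510"></span>
</div>
<div class="movie">
  <span class="js-bestRate-show" data-id="1" data-abc="0000000000000"></span>
</div>
')

# CSS selector: "span" is the tag, ".js-bestRate-show" selects by class.
css_values <- page %>%
  html_nodes("span.js-bestRate-show") %>%
  html_attr("data-abc")

# XPath equivalent: the predicate tests the class attribute via @class.
# This form assumes the class attribute is exactly "js-bestRate-show".
xpath_values <- page %>%
  html_nodes(xpath = "//span[@class='js-bestRate-show']") %>%
  html_attr("data-abc")

print(css_values)
print(xpath_values)
```

Both calls return one character value per matching span, so leading zeros in data-abc are kept. If the real spans carry more than one class, the CSS form is the safer choice, since the XPath predicate above compares the whole class string.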


10-21 14:32