R: Using the rvest package instead of the XML package to get links from a URL

Question

I use the XML package to get the links from this URL.

library(XML)

# Parse the HTML page; v1URL holds the address of the page to scrape
v1WebParse <- htmlParse(v1URL)
# Read the links and get the quotes of the companies from the href attributes
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

While this method is very efficient, I've used rvest and it seems faster than XML at parsing a web page. I tried html_nodes and html_attrs but I can't get it to work.

Recommended answer

Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first, since the site sets the content-type of that file to text/plain, and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"

That further illustrates rvest's XML package underpinnings.
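
One possible reason the attempt in the question failed is the difference between html_attr() and html_attrs(): the first extracts a single named attribute as a character vector, while the second returns every attribute of every node as a list. The sketch below illustrates this on a small inline HTML snippet. The snippet is hypothetical and only stands in for the real page, so no network access is needed.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real page (an assumption,
# not the actual content served by the site)
html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html" class="x">A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">B</a>
</body></html>'

pg <- read_html(html)
links <- pg %>% html_nodes("a")

# html_attr() extracts one named attribute:
# a character vector with one element per matched node
hrefs <- links %>% html_attr("href")

# html_attrs() returns all attributes of each node:
# a list of named character vectors, one per matched node
all_attrs <- links %>% html_attrs()
```

If only the href values are needed, html_attr("href") is the closer match to xmlGetAttr(..., 'href') from the XML version.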

Update

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
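
To end up with a data frame like t1Links from the question, the extracted hrefs can be wrapped in data.frame(). The following is a minimal sketch in base R. It uses two hrefs copied from the output above rather than fetching the live page, and the rule for pulling the quote code out of each href (the text between the last underscore and ".html") is an assumption based on those sample values, not something the site documents.

```r
# Sample hrefs copied from the output shown earlier; on the live page they
# would come from: pg %>% html_nodes("a") %>% html_attr("href")
hrefs <- c("/inf_corporativa71050_JAIME1CP1A.html",
           "/inf_corporativa98959_ZNC.html")

# One row per link, keeping the hrefs as plain character strings
t1Links <- data.frame(href = hrefs, stringsAsFactors = FALSE)

# Assumption: the quote code is the text between the last "_" and ".html"
t1Links$quote <- sub("^.*_([^_]+)\\.html$", "\\1", t1Links$href)
```

Keeping the raw href next to the derived column makes it easy to check the extraction rule against the full list of links.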

05-29 21:47