xml - R: Using the rvest package instead of the XML package to get links from a URL

I am using the XML package to get the links from this URL.

library(XML)

# Parse the HTML at the URL
v1WebParse <- htmlParse(v1URL)
# Read every link and collect the company quote paths from each href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

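Even though this method is very efficient, I have used rvest and it seems faster than XML at parsing a web page. I tried html_nodes and html_attrs but could not get them to work.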

Best answer

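Despite my comment, here is how to do it with rvest. Note that we first need to read the page with htmlParse, because the site serves that file with the content type set to text/plain, which trips up rvest.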

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"
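
The question mentions html_attrs, while the pipeline above uses html_attr; the two return different shapes, which may be why the earlier attempt did not work. The following is a minimal sketch of that difference, assuming a recent rvest that re-exports read_html() and accepts a literal HTML string. The sample_html snippet is a hypothetical stand-in for the real page, not its actual content.

```r
library(rvest)

# Hypothetical stand-in for the real page (two links only)
sample_html <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html" class="emp">JAIME1CP1A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html" class="emp">INTEGRC1</a>
</body></html>'

links <- read_html(sample_html) %>% html_nodes("a")

# html_attr() takes an attribute name and returns a character vector,
# one value per matched node
hrefs <- links %>% html_attr("href")

# html_attrs() takes no attribute name and returns a list with one
# named character vector (all attributes) per matched node
all_attrs <- links %>% html_attrs()

print(hrefs)
print(all_attrs[[1]])
```

If only the href values are needed, html_attr("href") gives them directly; html_attrs() would need an extra step such as sapply(all_attrs, `[[`, "href").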

This further illustrates rvest's XML package underpinnings.

UPDATE
rvest::read_html() can now handle this directly:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
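
To get back to the one-column data frame built in the question, the rvest result can be wrapped in data.frame(). This is only a sketch: sample_html is a hypothetical stand-in so the example runs without network access, and the column name href is an assumption, not something from the original code. It also shows that the question's XPath expression can be reused through the xpath argument of html_nodes().

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
sample_html <- '<html><body>
  <a href="/inf_corporativa58501_YURABC11.html">YURABC11</a>
  <a href="/inf_corporativa98959_ZNC.html">ZNC</a>
</body></html>'

pg <- read_html(sample_html)

# Same '//a' XPath as the XML-package version in the question
hrefs <- pg %>% html_nodes(xpath = "//a") %>% html_attr("href")

# One row per link, mirroring t1Links from the question
t1Links <- data.frame(href = hrefs, stringsAsFactors = FALSE)
print(t1Links)
```

With the real page, replacing sample_html with the URL in the read_html() call should be the only change needed.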

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/27297484/