Problem description
I use the XML package to get the links from this URL.
library(XML)
# Parse the HTML at the URL
v1WebParse <- htmlParse(v1URL)
# Read the links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))
While this method is very efficient, I have also used rvest, which seems faster than XML at parsing a web page. I tried html_nodes and html_attrs, but I can't get them to work.
Recommended answer
Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first, since the site serves that file with the content type set to text/plain, and that tosses rvest into a tizzy.
library(rvest)
library(XML)
pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
## [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"
## [3] "/inf_corporativa66100_ACESEGC1.html" "/inf_corporativa71300_ADCOMEC1.html"
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html" "/inf_corporativa58501_YURABC11.html"
## [275] "/inf_corporativa98959_ZNC.html"
That further illustrates rvest's XML package underpinnings.
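As a side note on the two functions mentioned in the question, html_attr() and html_attrs() return different shapes, which can matter when only the href values are wanted. The sketch below is a minimal illustration on a made-up inline HTML snippet (the markup, the class names, and the variable names are assumptions for the example, not taken from the site), and it assumes a version of rvest in which read_html() is available.

```r
library(rvest)

# Hypothetical inline page; it stands in for the real site so the example runs offline
snippet <- '<html><body>
  <a href="/inf_corporativa98959_ZNC.html" class="company">ZNC</a>
  <a href="/inf_corporativa58501_YURABC11.html" class="company">YURA</a>
</body></html>'

nodes <- read_html(snippet) %>% html_nodes("a")

# html_attr() pulls ONE named attribute and returns a character vector
hrefs <- nodes %>% html_attr("href")

# html_attrs() returns ALL attributes, as a list with one named vector per node
all_attrs <- nodes %>% html_attrs()
```

Here hrefs is a plain character vector that can go straight into a data frame, while all_attrs needs an extra step such as all_attrs[[1]][["href"]] to reach a single value.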
Update
rvest::read_html() can handle this directly now:
pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
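To mirror the data frame built in the question, the parsed page can be piped through html_nodes() and html_attr() and the result wrapped in data.frame(). The following is only a sketch under assumptions: it parses a made-up inline HTML string instead of the live URL so that it runs without network access, and the column name href and the step that drops anchors without an href are illustrative choices, not part of the original answer.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page
page_source <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html">first</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">second</a>
  <a name="anchor-without-href">third</a>
</body></html>'

pg <- read_html(page_source)

# One href per <a> element; anchors without an href come back as NA
links <- pg %>% html_nodes("a") %>% html_attr("href")

# Drop the missing values and store the rest in a one-column data frame
links <- links[!is.na(links)]
t1Links <- data.frame(href = links, stringsAsFactors = FALSE)
```

With the real page, replacing page_source with the URL from the answer should give the same kind of one-column data frame, assuming the site is reachable.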