Problem description
I am trying to obtain data from a website and, thanks to a helper, I arrived at the following script:
require(httr)
require(rvest)
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
                  body = list(page = "advancedSearch",
                              AttachmentExist = "",
                              family = "",
                              placeOfPub = "",
                              genus = "Arctodupontia",
                              yearPublished = "",
                              species = "scleroclada",
                              author = "",
                              infraRank = "",
                              infraEpithet = "",
                              selectedLevel = "cont"),
                  encode = "form")
pg <- content(res, as = "parsed")
lnks <- html_attr(html_node(pg, "td"), "href")
However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, such as "td", "a.onwardnav" and ".plantname", but none of them produces an object that html_attr can handle correctly. Any hints?
Recommended answer
You are really close to getting the answer you were expecting. If you would like to pull the links off the desired page, then:
lnks <- html_attr(html_nodes(pg,"a"), "href")
This returns a character vector with the "href" attribute of every "a" tag on the page (NA for any "a" tag that has none). Note that the function is html_nodes, not html_node: there are multiple "a" tags, hence the plural.
If you are looking for the information from the table in the body of the page, then try this:
html_table(pg, fill=TRUE)
#or this
html_nodes(pg,"tr")
The second call returns a list of the 9 rows from the table, which one could then parse to obtain the row names ("th") and/or row values ("td").
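As a rough illustration of both points, here is a small sketch that runs offline. The HTML string is an invented stand-in, not the real markup of the Kew page, and the selector "td a" and the variable names are assumptions made for the example. It shows why html_node(pg, "td") yields no link (a "td" element has no "href" of its own) and how the rows can be split into names and values.

```r
library(rvest)

# Invented stand-in for the results page; the real page's markup may differ.
html <- '<html><body><table>
  <tr><th>Name</th><td><a href="example.do?id=1">Example name</a></td></tr>
  <tr><th>Family</th><td>Example family</td></tr>
</table></body></html>'
pg <- read_html(html)

# html_node() keeps only the first match, and a <td> carries no href itself,
# so asking it for "href" gives NA.
first_td_href <- html_attr(html_node(pg, "td"), "href")

# html_nodes() keeps every match; target the <a> elements inside the cells.
lnks <- html_attr(html_nodes(pg, "td a"), "href")

# Split each table row into its header text and its value text.
rows <- html_nodes(pg, "tr")
row_names <- html_text(html_node(rows, "th"), trim = TRUE)
row_values <- html_text(html_node(rows, "td"), trim = TRUE)

print(lnks)
print(data.frame(name = row_names, value = row_values))
```

On the real page the selector would likely need adjusting once you have inspected which "a" elements hold the links you want.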
Hope this helps.