This article covers how to determine the correct CSS selector for extracting a URL in an R script.

Problem description

I am trying to obtain data from a website and, thanks to a helper, I arrived at the following script:

library(httr)
library(rvest)

# Submit the advanced search form with the genus and species filled in
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
                  body = list(page = "advancedSearch",
                              AttachmentExist = "",
                              family = "",
                              placeOfPub = "",
                              genus = "Arctodupontia",
                              yearPublished = "",
                              species = "scleroclada",
                              author = "",
                              infraRank = "",
                              infraEpithet = "",
                              selectedLevel = "cont"),
                  encode = "form")

# Parse the response body into an HTML document
pg <- content(res, as = "parsed")

# Take the first node matching the selector and read its href attribute
lnks <- html_attr(html_node(pg, "td"), "href")

However, in some cases, such as the example above, it does not retrieve the right link because, for some reason, html_attr does not find URLs ("href") within the node detected by html_node. So far, I have tried different CSS selectors, such as "td", "a.onwardnav" and ".plantname", but none of them produces an object that html_attr can handle correctly. Any hints?

Recommended answer

You are really close to getting the answer you were expecting. If you would like to pull the links off the desired page, then:

lnks <- html_attr(html_nodes(pg,"a"), "href")

This returns the "href" value of every "a" tag on the page (NA for any "a" tag without one). Notice that the command is html_nodes and not html_node: there are multiple "a" tags, hence the plural.
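
To make the difference concrete, here is a minimal sketch run against a small inline HTML snippet rather than the live site. The markup, the "onwardnav" placement and the "/example/..." hrefs are made up for illustration; the real results page may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for the results page (assumed markup, not the real one)
html <- '<table>
  <tr><td>no link here</td><td><a class="onwardnav" href="/example/1">A</a></td></tr>
  <tr><td><a class="onwardnav" href="/example/2">B</a></td></tr>
</table>'
pg <- read_html(html)

# html_node() keeps only the first match, and href lives on <a>, not on <td>,
# so asking a <td> for its href yields NA
first_td <- html_attr(html_node(pg, "td"), "href")

# html_nodes() keeps every match, so each <a> contributes its own href
lnks <- html_attr(html_nodes(pg, "a"), "href")

print(first_td)
print(lnks)
```

The selector has to point at the element that carries the attribute. A narrower selector such as "a.onwardnav" would work the same way, provided it is used with html_nodes and actually matches "a" tags on the page.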
If you are looking for the information from the table in the body of the page, then try this:

html_table(pg, fill=TRUE)
#or this
html_nodes(pg,"tr")

The second command returns the 9 rows of the table as a node set, which can then be parsed to obtain the row names ("th") and/or row values ("td").
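
As a rough sketch of that parsing step, the snippet below pairs each row's "th" with its "td". It uses a made-up two-row table (the labels and layout are assumptions, with values borrowed from the search terms in the question), so the real table may need a different selector or extra cleaning.

```r
library(rvest)

# Hypothetical two-row table standing in for the table on the page
html <- '<table>
  <tr><th>Genus</th><td>Arctodupontia</td></tr>
  <tr><th>Species</th><td>scleroclada</td></tr>
</table>'
pg <- read_html(html)

rows <- html_nodes(pg, "tr")

# html_node() applied to a node set returns one match per row,
# so names and values stay aligned row by row
row_names  <- html_text(html_node(rows, "th"), trim = TRUE)
row_values <- html_text(html_node(rows, "td"), trim = TRUE)

# Named character vector: row name -> row value
record <- setNames(row_values, row_names)
print(record)
```

Using html_node (singular) on the row set is deliberate here: a row without a "th" or "td" yields NA instead of being dropped, which keeps the two vectors the same length.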
Hope this helps.
