I don't quite understand why selectors fail for me on some websites when I use rvest.

Example:

library(rvest)

url <- read_html("http://www.cbc.ca/news/politics")

headlines <- url %>%
  html_nodes(".headline") %>%
  html_text()


Another example:

library(RSelenium)

rD <- rsDriver(verbose = FALSE)
rD
remDr <- rD$client

url <- "http://www.cbc.ca/news/politics"
remDr$navigate(url)

remDr$getTitle()
remDr$getCurrentUrl()

webElem <- remDr$findElement(using = "class", value = 'headline')

webElem$getElementAttribute("class")

remDr$close()
rD$server$stop()


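It should be simple enough. When I inspect the page structure, the headlines sit under the class headline. There are also the classes card-content and card-content-top, but no combination of CSS selectors or XPath seems to work.

Best answer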

CSS selectors may not work in rvest because of an issue in the selectr package (at least on Debian). For more information, see:
https://github.com/sjp/selectr/issues/7
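
One way to check whether the CSS-to-XPath translation is what fails on a given machine is to run the translation by hand and compare results. The following is only a sketch under assumptions: the inline HTML is a made-up stand-in for the real page, and it relies on rvest delegating CSS selectors to selectr::css_to_xpath(), which is how current rvest versions behave as far as I know.

```r
library(rvest)
library(selectr)

# Hypothetical stand-in for the downloaded page (not the real CBC markup)
page <- read_html(paste0(
  '<div class="card-content">',
  '<h3 class="headline big">First story</h3>',
  '<h3 class="headline">Second story</h3>',
  '</div>'
))

# Translate the CSS selector to XPath explicitly
xp <- css_to_xpath(".headline")
print(xp)

# Query once through the CSS path and once through the translated XPath
via_css   <- page %>% html_nodes(".headline") %>% html_text()
via_xpath <- page %>% html_nodes(xpath = xp) %>% html_text()

# TRUE means the CSS path and the explicit XPath agree on this machine
print(identical(via_css, via_xpath))
```

If css_to_xpath() throws an error or the two results differ, the selector translation is the likely culprit and passing an XPath expression directly is a reasonable workaround. If both come back empty on the real page, the class is probably not present in the HTML that was downloaded.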

Using SelectorGadget and Chrome Developer Tools, I arrived at the following XPath to locate and identify the headlines on the page. More on how to find the right XPath can be found here:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77

library('rvest')
library('magrittr')

url <- read_html("http://www.cbc.ca/news/politics")

headlines <- url %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "pinnableHeadline", " " ))]') %>%
  html_text()

headlines[1]
"On Trudeau's 2nd trip to China, time may be ripe to advance free trade"
headlines[2]
"Liberals want to be global leader on open government, but face complaints at home"
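
The XPath above uses the concat(" ", @class, " ") idiom, which matches a class name as a whole space-separated token, much like the CSS selector .pinnableHeadline would. A small sketch on made-up markup (the HTML and the class pinnableHeadlineExtra are hypothetical, chosen only for illustration) shows how it differs from a plain contains(@class, ...) test:

```r
library(rvest)

# Hypothetical markup; only the class name pinnableHeadline comes from the answer
page <- read_html(paste0(
  '<div>',
  '<h3 class="pinnableHeadline headline">Story A</h3>',
  '<h3 class="headline pinnableHeadline">Story B</h3>',
  '<h3 class="pinnableHeadlineExtra">Not a headline</h3>',
  '</div>'
))

# Token match: pads the class attribute with spaces, then looks for " name "
token_xpath <- '//*[contains(concat(" ", @class, " "), " pinnableHeadline ")]'

# Naive substring match: also hits classes that merely contain the name
naive_xpath <- '//*[contains(@class, "pinnableHeadline")]'

token_hits <- page %>% html_nodes(xpath = token_xpath) %>% html_text()
naive_hits <- page %>% html_nodes(xpath = naive_xpath) %>% html_text()

print(token_hits)  # only the two elements carrying the exact class
print(naive_hits)  # additionally includes the pinnableHeadlineExtra element
```

The token form is the safer choice when class names on a page share prefixes.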

Source: the Stack Overflow question "css - html_nodes not detecting nodes in rvest": https://stackoverflow.com/questions/47611664/
