I don't understand why selectors fail on certain websites when I use rvest.
Example:
library(rvest)
url <- read_html("http://www.cbc.ca/news/politics")
headlines <- url %>%
  html_nodes(".headline") %>%
  html_text()
Another example:
library(RSelenium)
rD <- rsDriver(verbose = FALSE)
rD
remDr <- rD$client
url <- "http://www.cbc.ca/news/politics"
remDr$navigate(url)
remDr$getTitle()
remDr$getCurrentUrl()
webElem <- remDr$findElement(using = "class", value = 'headline')
webElem$getElementAttribute("class")
remDr$close()
rD$server$stop()
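This should be simple enough. When I inspect the page structure, the headlines sit under the class headline. There are also the classes card-content and card-content-top, but no combination of CSS selectors or XPath seems to work.
Best answer
CSS selectors may fail in rvest because of a problem in the selectr package (at least on Debian). For more information, see: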
https://github.com/sjp/selectr/issues/7
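One rough way to check whether selectr is the culprit on a given machine is to compare a CSS lookup against an equivalent hand-written XPath on a small document. The following is only a sketch, assuming the rvest and selectr packages are installed; the inline HTML is a made-up stand-in and not the real markup of the CBC page.

```r
library(rvest)    # also re-exports %>%
library(selectr)

# Hypothetical markup, used only so the comparison runs offline
doc <- read_html('<div class="card-content"><h3 class="headline pinnableHeadline">First story</h3><h3 class="headline">Second story</h3></div>')

# rvest hands CSS selectors to selectr, which translates them to XPath;
# printing the translation shows what is actually being evaluated
translated <- css_to_xpath(".headline")
print(translated)

# Lookup through the CSS path (goes through selectr)
css_hits <- doc %>% html_nodes(".headline") %>% html_text()

# Lookup through a hand-written XPath (bypasses selectr)
class_xpath <- '//*[contains(concat(" ", normalize-space(@class), " "), " headline ")]'
xpath_hits <- doc %>% html_nodes(xpath = class_xpath) %>% html_text()

# TRUE means both routes agree on this document
print(identical(css_hits, xpath_hits))
```

If the CSS lookup errors or returns nothing while the XPath lookup finds both nodes, that points at the CSS-to-XPath translation step and not at the page itself.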
Using SelectorGadget and Chrome Developer Tools, I found that the following XPath locates the headlines on the page. More on how to find the right XPath is available here:
https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77
library('rvest')
library('magrittr')
url <- read_html("http://www.cbc.ca/news/politics")
headlines <- url %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "pinnableHeadline", " " ))]') %>%
  html_text()
headlines[1]
"On Trudeau's 2nd trip to China, time may be ripe to advance free trade"
headlines[2]
"Liberals want to be global leader on open government, but face complaints at home"
This post, "css - html_nodes not detecting nodes in rvest", is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/47611664/