Problem description
This is the code I am running:
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
html(l)
})
Up to this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
does not work either:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I tried a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text from each element of my list?
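Recommended answer
There are two problems with your code. Have a look at the package's examples for how to use it.
1. You cannot use every function on every kind of object.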
html() is for downloading content
html_node() is for selecting node(s) from the downloaded content of a page
html_text() is for extracting text from a previously selected node
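The three steps above can be sketched without any network access. This is a minimal illustration, assuming a recent rvest in which html() is deprecated in favour of read_html(); the inline HTML string and the a.message selector are made up for the example and are not taken from the GitHub pages in the question.

```r
library(rvest)

# Inline HTML stands in for a downloaded page (hypothetical content).
page_source <- '<html><body>
  <a class="message" href="#">First commit message</a>
  <a class="other" href="#">Unrelated link</a>
</body></html>'

# Step 1: parse the document (read_html() replaces the deprecated html()).
doc <- read_html(page_source)

# Step 2: select the first node matching a CSS selector.
node <- html_node(doc, "a.message")

# Step 3: extract the text of the selected node.
node_text <- html_text(node)
node_text
```

Calling html_text() directly on a list of documents skips step 2 and hands the function an object it has no method for, which is what the errors in the question report.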
Therefore, to download one of your pages and extract the text of an HTML node, use this:
library(rvest)
Old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
...or this...
Hard-to-read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
...or this...
magrittr piping style:
url_mainnode_text <-
html("https://github.com/rails/rails/pull/100") %>%
html_node("*") %>%
html_text()
url_mainnode_text
2. When working with lists, you have to apply the function to each element of the list, e.g. with lapply()
If you want to batch-process several URLs, you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
# Download one page, select a node, and return its text.
get_html_text <- function(url, css_or_xpath = "*") {
  html_text(
    html_node(
      html(url), css_or_xpath  # use the url argument, not a hard-coded address
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
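The difference between messages[1] and messages[[1]] in the question, and the per-element application described above, can be illustrated offline. This is only a sketch: it assumes a recent rvest with read_html(), and the two inline documents and the a.message selector are hypothetical stand-ins for the downloaded pull-request pages.

```r
library(rvest)

# Two small inline documents stand in for downloaded pages (hypothetical content).
sources <- c(
  '<html><body><a class="message">Fix typo</a></body></html>',
  '<html><body><a class="message">Add test</a></body></html>'
)
docs <- lapply(sources, read_html)

class(docs[1])    # single brackets return a sub-list, still of class "list"
class(docs[[1]])  # double brackets return the parsed document itself

# Apply the select-then-extract steps to every element of the list.
texts <- vapply(
  docs,
  function(doc) html_text(html_node(doc, "a.message")),
  character(1)
)
texts
```

vapply() is used here instead of lapply() so the result is a plain character vector; lapply() would work the same way and return a list. Note that html_node() returns only the first match per document, so html_nodes() may be the better choice when a page contains several matching elements.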