Problem description
I'm relatively new to R (and brand new to scraping with R), so apologies in advance if I'm overlooking something obvious here!
I've been trying to learn how to scrape with RSelenium by following this tutorial: https://rawgit.com/petrkeil/Blog/master/2017_08_15_Web_scraping/web_scraping.html#advanced-scraping-with-rselenium
After running the following in Terminal (docker run -d -p 4445:4444 selenium/standalone-firefox), I tried to run the R code below, taken with only slight modifications from the tutorial linked above:
library(RSelenium)

get.tree <- function(genus, species)
{
  # navigate to the page
  browser <- remoteDriver(port = 4445L)
  browser$open(silent = TRUE)
  browser$navigate("http://www.bgci.org/global_tree_search.php?sec=globaltreesearch")
  browser$refresh()
  # create R objects from the web search input and button elements
  genusElem <- browser$findElement(using = 'id', value = "genus-field")
  specElem <- browser$findElement(using = 'id', value = "species-field")
  buttonElem <- browser$findElement(using = 'class', value = "btn_ohoDO")
  # tell R to fill in the fields
  genusElem$sendKeysToElement(list(genus))
  specElem$sendKeysToElement(list(species))
  # tell R to click the search button
  buttonElem$clickElement()
  # get output
  out <- browser$findElement(using = "css", value = "td.cell_1O3UaG:nth-child(4)") # the country origin
  out <- out$getElementText()[[1]] # extract actual text string
  out <- strsplit(out, split = "; ")[[1]] # turns into character vector
  # close browser
  browser$close()
  return(out)
}
# Now let's try it:
get.tree("Abies", "alba")
But after doing all that, I get the following error:
Selenium message: Failed to decode response from marionette
Build info: version: '3.6.0', revision: '6fbf3ec767', time: '2017-09-27T16:15:40.131Z'
System info: host: 'd260fa60d69b', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.9.49-moby', java.version: '1.8.0_131'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
Does anyone have any idea what this means and where I went wrong?
Thanks so much for your help!
Recommended answer
Just take advantage of the XHR request the page makes to retrieve the in-line results, and drop RSelenium altogether:
library(httr)
library(tidyverse)

get_tree <- function(genus, species) {
  GET(
    url = sprintf("https://data.bgci.org/treesearch/genus/%s/species/%s", genus, species),
    add_headers(
      Origin = "http://www.bgci.org",
      Referer = "http://www.bgci.org/global_tree_search.php?sec=globaltreesearch"
    )
  ) -> res
  stop_for_status(res)
  matches <- content(res, flatten=TRUE)$results[[1]]
  flatten_df(matches[c("id", "taxon", "family", "author", "source", "problems", "distributionDone", "note", "wcsp")]) %>%
    mutate(geo = list(map_chr(matches$TSGeolinks, "country"))) %>%
    mutate(taxas = list(map_chr(matches$TSTaxas, "checkTaxon")))
}

xdf <- get_tree("Abies", "alba")
xdf
## # A tibble: 1 x 8
##      id      taxon   family author     source distributionDone        geo      taxas
##   <int>      <chr>    <chr>  <chr>      <chr>            <chr>     <list>     <list>
## 1 58373 Abies alba Pinaceae  Mill. WCSP Phans              yes <chr [21]> <chr [45]>
glimpse(xdf)
## Observations: 1
## Variables: 8
## $ id               <int> 58373
## $ taxon            <chr> "Abies alba"
## $ family           <chr> "Pinaceae"
## $ author           <chr> "Mill."
## $ source           <chr> "WCSP Phans"
## $ distributionDone <chr> "yes"
## $ geo              <list> [<"Albania", "Andorra", "Austria", "Bulgaria", "Croatia", "Czech Republic", "Fr...
## $ taxas            <list> [<"Abies abies", "Abies alba f. columnaris", "Abies alba f. compacta", "Abies a...
It's highly likely you'll need to modify get_tree() at some point, but that's better than having Selenium, Splash, phantomjs, or Headless Chrome as a dependency.
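One modification you might want is a helper that returns only the country vector, which is what the original get.tree() was after. The sketch below is only an illustration: extract_countries() is a hypothetical name, it uses base R rather than purrr to stay dependency-free, and the mock record is invented to mirror the shape that content(res, flatten=TRUE)$results[[1]] appears to have in the code above (a TSGeolinks list whose entries each carry a country field). It is not real API output.

# Hypothetical helper: pull the country names out of one parsed result.
# `match` is assumed to be shaped like content(res, flatten=TRUE)$results[[1]].
extract_countries <- function(match) {
  links <- match$TSGeolinks
  # No geo links at all: return an empty character vector instead of failing
  if (is.null(links) || length(links) == 0) {
    return(character(0))
  }
  # Each link is assumed to be a list with a single-string `country` element
  vapply(links, function(x) x$country, character(1))
}

# Mock record, invented here purely for illustration
mock <- list(
  taxon = "Abies alba",
  TSGeolinks = list(list(country = "Albania"), list(country = "Austria"))
)

extract_countries(mock)
extract_countries(list(taxon = "No such tree"))

With the tibble returned by get_tree(), the same vector should also be reachable as xdf$geo[[1]], since geo is stored as a list column.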