R - Waiting for page load in RSelenium with PhantomJS

Problem description

I put together a crude scraper that scrapes prices/airlines from Expedia:

# Start the Server
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)

# Assign the client
remDr <- rD$client

# Establish an implicit wait for elements (in milliseconds)
remDr$setImplicitWaitTimeout(1000)

# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)

# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10)   # Been testing with 10

###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)

# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)

# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)

# close client/server
remDr$close()
rD$server$stop()

As you can see, I built in a setImplicitWaitTimeout and a Sys.sleep call so that the page has time to load in PhantomJS and so the site isn't overloaded with requests.

Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has yet to load completely, so the elements with class='dollars price-emphasis' don't exist yet. The URL construction is fine.
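One way to harden the extraction against this is to retry on error instead of failing outright. Below is a minimal sketch; `with_retry` is a hypothetical helper (not part of RSelenium), and the tries/delay values are arbitrary choices:

```r
# Hypothetical helper: re-run an extraction that may throw a
# StaleElementReference error while the page is still rendering.
with_retry <- function(action, tries = 5, wait = 2) {
  for (i in seq_len(tries)) {
    result <- tryCatch(action(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    Sys.sleep(wait)  # give the page time to finish rendering, then try again
  }
  stop(result)  # all tries failed: re-throw the last error
}

# Usage against the scraper above (assumes an open `remDr` session):
# prices <- with_retry(function() {
#   webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
#   unlist(lapply(webElem, function(x) x$getElementText()))
# })
```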

Whenever the page successfully loads all the way, the scraper gets nearly 60 prices and flights. I mention this because the script sometimes returns only 15-20 entries (when checking the same date manually in a browser, there are 60). In those cases I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
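Since the expected count (~60) is only known roughly, one pragmatic check is to poll until the number of matched elements stops growing between polls. A sketch, where `wait_for_stable_count` is a hypothetical helper and the timeout/interval values are arbitrary:

```r
# Poll a counting function until the count is positive and unchanged between
# two consecutive polls (the results list has stopped growing), or until the
# deadline passes, in which case we return whatever count we last saw.
wait_for_stable_count <- function(count_fn, timeout = 30, interval = 1) {
  deadline <- Sys.time() + timeout
  last <- -1
  repeat {
    current <- count_fn()
    if (current > 0 && current == last) return(current)  # count has stabilized
    if (Sys.time() > deadline) return(current)           # give up, use what we have
    last <- current
    Sys.sleep(interval)
  }
}

# With RSelenium this could be (assumes an open `remDr` session):
# wait_for_stable_count(function()
#   length(remDr$findElements(using = "css", "[class='dollars price-emphasis']")))
```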

I want to make this script more robust by injecting JavaScript that waits for the page to load fully before looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to my limited knowledge of JS, I'm having trouble adapting these solutions to work syntactically with my script.

Here are several solutions that have been proposed from Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:

Base code (Java, from the linked answers):

WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("[class='dollars price-emphasis']");

Additions to the base code:

1) Check for Staleness of an Element

// get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
// wait for the "Add Item" element to become stale
wait.until(ExpectedConditions.stalenessOf(element));

2) Wait for Visibility of element

wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));

I have tried to use remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even if it isn't.
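That behaviour is expected: document.readyState only reports that the initial HTML document has finished loading, not the fare results Expedia injects afterwards via JavaScript. Also, in R, executeScript() returns a list, so the Java-style .equals() chain doesn't apply; the comparison has to happen on the R side. A sketch (the helper name is mine):

```r
# Check the list returned by remDr$executeScript("return document.readyState;").
# Note this only covers the initial HTML load, not JS-injected content, so it
# is necessary but not sufficient for this page.
is_ready <- function(state) {
  identical(state[[1]], "complete")
}

# Usage (assumes an open `remDr` session):
# is_ready(remDr$executeScript("return document.readyState;"))
```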

Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to load fully, with nearly 60 found elements? I'm still learning, so any help would be greatly appreciated.

Accepted answer

Solution using while/tryCatch:

remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  webElem <- tryCatch({
    remDr$findElement(using = 'name', value = "<value>")
  }, error = function(e) NULL)
  # loop until the element with name <value> is found in <webpage url>
}
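One caveat with the loop above: if the element never appears (bad URL, site redesign), it spins forever. A bounded variant is sketched below; `find_with_limit` is a hypothetical wrapper and the limits are arbitrary choices:

```r
# Retry a finder function until it succeeds or max_tries is exhausted.
find_with_limit <- function(find, max_tries = 30, interval = 1) {
  for (i in seq_len(max_tries)) {
    webElem <- tryCatch(find(), error = function(e) NULL)
    if (!is.null(webElem)) return(webElem)
    Sys.sleep(interval)  # brief pause between polls
  }
  stop("Element still not found after ", max_tries, " tries")
}

# e.g. find_with_limit(function()
#        remDr$findElement(using = "name", value = "<value>"))
```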
