Question
I put together a crude scraper that scrapes prices/airlines from Expedia:
# Load RSelenium and start the PhantomJS server
library(RSelenium)
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish an implicit wait for elements (in milliseconds)
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)
# Give a crawl delay to see if it gives time to load the web page
Sys.sleep(10) # Been testing with 10
### ADD JAVASCRIPT INJECTION HERE ###
# remDr$executeScript(?)
# Extract prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# Close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an ImplicitWaitTimeout and a Sys.sleep call so that the page has time to load in phantomJS and so that the site isn't overloaded with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has yet to load completely, so class='dollars price-emphasis' doesn't exist yet. The URL construction is fine.
Whenever the page successfully loads all the way, the scraper gets close to 60 prices and flights. I mention this because there are times when the script returns only 15-20 entries (when checking the same date manually in a browser, there are 60). From this I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load before looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to my limited knowledge of JS, I'm having trouble adapting these solutions to work syntactically with my script.
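One way to combine executeScript with R-side logic is to have the injected JS return the current number of rendered results, and poll from R until that number is high enough. This is an untested sketch; the selector, the 50-result threshold, and the helper name waitForResults are assumptions based on the question, not part of the original script:

```r
# Sketch: poll the DOM from R via executeScript until enough results render.
# Selector and minCount are assumed values; adjust for the actual page.
waitForResults <- function(remDr, selector, minCount, timeout = 30) {
  js <- sprintf("return document.querySelectorAll(\"%s\").length;", selector)
  start <- Sys.time()
  repeat {
    n <- remDr$executeScript(js)[[1]]          # count matching elements in the page
    if (n >= minCount) return(n)               # enough results are present
    if (difftime(Sys.time(), start, units = "secs") > timeout)
      stop(sprintf("Timed out with only %d of %d elements", n, minCount))
    Sys.sleep(0.5)                             # brief pause before re-polling
  }
}

waitForResults(remDr, ".dollars.price-emphasis", minCount = 50)
```

Polling on a count, rather than on document.readyState, sidesteps the problem described below where readyState reports complete before AJAX-loaded results appear.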
Here are several solutions that have been proposed in Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Basic code (Java):
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");
Additions to the basic script:
1) Check for staleness of an element
// get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
// wait for the "Add Item" element to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for visibility of the element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried to use remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even when it isn't.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to fully load all of the nearly 60 elements? I'm still learning, so any help would be greatly appreciated.
Recommended Answer
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e){NULL})
  # loop until element with name <value> is found in <webpage url>
}
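A possible refinement of the loop above, applied to the question's price selector (an untested sketch; the retry budget of 60 attempts and the 0.5-second pause are assumed values): bounding the loop keeps a page that never finishes loading from hanging the scrape forever.

```r
# Sketch: same while/tryCatch idea, bounded by a retry budget (assumed values).
attempts <- 0
webElem <- NULL
while (is.null(webElem) && attempts < 60) {
  webElem <- tryCatch(
    remDr$findElement(using = "css", "[class='dollars price-emphasis']"),
    error = function(e) NULL)
  attempts <- attempts + 1
  if (is.null(webElem)) Sys.sleep(0.5)  # brief pause between retries
}
if (is.null(webElem)) stop("Results did not appear within ~30 seconds")
```

Once the first price element is present, the original findElements calls can follow to collect the full set.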