Question
I put together a crude scraper that scrapes prices/airlines from Expedia:
# Load RSelenium and start the PhantomJS server
library(RSelenium)
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)
# Assign the client
remDr <- rD$client
# Establish an implicit wait for elements (in milliseconds)
remDr$setImplicitWaitTimeout(1000)
# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)
# Give a crawl delay to see if it gives time to load the web page
Sys.sleep(10) # Been testing with 10
### ADD JAVASCRIPT INJECTION HERE ###
# remDr$executeScript(?)
# Extract prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)
# Extract airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)
# Close client/server
remDr$close()
rD$server$stop()
As you can see, I built in an ImplicitWaitTimeout and a Sys.sleep call so that the page has time to load in phantomJS and so that the site isn't overloaded with requests.
Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason for this is that the page has yet to load completely, so class='dollars price-emphasis' doesn't exist yet. The URL construction is fine.
Whenever the page successfully loads all the way, the scraper gets close to 60 prices and flights. I mention this because there are times when the script returns only 15-20 entries (when checking the same date manually in a browser, there are 60). From this I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.
I want to make this script more robust by injecting JavaScript that waits for the page to fully load before looking for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but due to my limited knowledge of JS, I'm having trouble adapting these solutions to work syntactically with my script.
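One way to combine executeScript with R-side logic is to have the injected JS return the current number of rendered results, and poll from R until that number is high enough. This is an untested sketch; the selector, the 50-result threshold, and the helper name waitForResults are assumptions based on the question, not part of the original script:

```r
# Sketch: poll the DOM from R via executeScript until enough results render.
# Selector and minCount are assumed values; adjust for the actual page.
waitForResults <- function(remDr, selector, minCount, timeout = 30) {
  js <- sprintf("return document.querySelectorAll(\"%s\").length;", selector)
  start <- Sys.time()
  repeat {
    n <- remDr$executeScript(js)[[1]]          # count matching elements in the page
    if (n >= minCount) return(n)               # enough results are present
    if (difftime(Sys.time(), start, units = "secs") > timeout)
      stop(sprintf("Timed out with only %d of %d elements", n, minCount))
    Sys.sleep(0.5)                             # brief pause before re-polling
  }
}

waitForResults(remDr, ".dollars.price-emphasis", minCount = 50)
```

Polling on a count, rather than on document.readyState, sidesteps the problem described below where readyState reports complete before AJAX-loaded results appear.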
Here are several solutions that have been proposed in Wait for page load in Selenium & Selenium - How to wait until page is completely loaded:
Basic code (Java):
WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector("class=dollars price-emphasis");
Additions to the basic script:
1) Check for staleness of an element
// get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
// wait for the "Add Item" element to become stale
wait.until(ExpectedConditions.stalenessOf(element));
2) Wait for visibility of the element
wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));
I have tried to use remDr$executeScript("return document.readyState").equals("complete") as a check before proceeding with the scrape, but the page always shows as complete, even when it isn't.
Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait for the page to fully load all of the nearly 60 elements? I'm still learning, so any help would be greatly appreciated.
Recommended Answer
Solution using while/tryCatch:
remDr$navigate("<webpage url>")
webElem <- NULL
while (is.null(webElem)) {
  webElem <- tryCatch({remDr$findElement(using = 'name', value = "<value>")},
                      error = function(e){NULL})
  # loop until element with name <value> is found in <webpage url>
}
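A possible refinement of the loop above, applied to the question's price selector (an untested sketch; the retry budget of 60 attempts and the 0.5-second pause are assumed values): bounding the loop keeps a page that never finishes loading from hanging the scrape forever.

```r
# Sketch: same while/tryCatch idea, bounded by a retry budget (assumed values).
attempts <- 0
webElem <- NULL
while (is.null(webElem) && attempts < 60) {
  webElem <- tryCatch(
    remDr$findElement(using = "css", "[class='dollars price-emphasis']"),
    error = function(e) NULL)
  attempts <- attempts + 1
  if (is.null(webElem)) Sys.sleep(0.5)  # brief pause between retries
}
if (is.null(webElem)) stop("Results did not appear within ~30 seconds")
```

Once the first price element is present, the original findElements calls can follow to collect the full set.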