Problem description
I am currently having an issue with my scraper when I set options.add_argument("--headless"). However, it works perfectly fine when that line is removed. Could anyone advise how I can achieve the same results in headless mode?
Below is my Python code:
from seleniumwire import webdriver as wireDriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options

chromedriverPath = '/Users/applepie/Desktop/chromedrivermac'

def scraper(search):
    mit = "https://orbit-kb.mit.edu/hc/en-us/search?utf8=✓&query="  # Empty search on mit site
    mit += "+".join(search) + "&commit=Search"
    results = []
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1440, 900")
    driver = webdriver.Chrome(options=options, executable_path= chromedriverPath)
    driver.get(mit)
    # Wait 20 seconds for page to load
    timeout = 20
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "header")))
        search_results = driver.find_element_by_class_name("search-results")
        for result in search_results.find_elements_by_class_name("search-result"):
            resultObject = {
                "url": result.find_element_by_class_name('search-result-link').get_attribute("href")
            }
            results.append(resultObject)
        driver.quit()
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.quit()
    return results
Here is also a screenshot of what I get when I print(driver.page_source) after get():
Recommended answer
This screenshot implies that Cloudflare has detected your requests to the website as coming from an automated bot and is therefore denying you access to the application.
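To confirm this diagnosis from code rather than from a screenshot, one option is to inspect driver.page_source for text that typically appears on a challenge or block page. The following is only a rough sketch: the helper name looks_like_challenge and the marker strings are assumptions chosen for illustration, and real challenge pages vary and change over time.

```python
# Hypothetical marker strings (assumptions, not an official list).
CHALLENGE_MARKERS = (
    "just a moment",
    "attention required",
    "checking your browser",
    "cf-chl",
)


def looks_like_challenge(page_source: str) -> bool:
    """Return True if the HTML contains any of the assumed challenge markers."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Calling looks_like_challenge(driver.page_source) right after driver.get() lets the scraper tell "blocked" apart from "loaded but empty" instead of waiting for the timeout.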
In these cases, a potential solution is to use undetected-chromedriver in headless mode to initialize the google-chrome-headless browsing context.
undetected-chromedriver is an optimized Selenium Chromedriver patch that is designed not to trigger anti-bot services such as Distill Network, Imperva, DataDome, and Botprotect.io. It automatically downloads the driver binary and patches it.
Code block:
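import undetected_chromedriver as uc
from selenium import webdriver

# Configure Chrome to run headless
options = webdriver.ChromeOptions()
options.headless = True
# uc.Chrome downloads and patches a matching chromedriver binary
driver = uc.Chrome(options=options)
# url is the page to load, for example the search URL built in the question
driver.get(url)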
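For context, here is one way the question's scraper might be adapted to this approach. It is a sketch under assumptions, not a tested fix: the function names build_search_url and scrape_links are hypothetical, the "--headless=new" flag assumes a recent Chrome build (older builds use "--headless"), the locator calls assume Selenium 4, and there is no guarantee that Cloudflare will let the headless session through.

```python
from urllib.parse import urlencode

BASE_URL = "https://orbit-kb.mit.edu/hc/en-us/search"


def build_search_url(terms):
    """Build the search URL from a list of terms, with proper URL encoding."""
    query = urlencode({"utf8": "✓", "query": " ".join(terms), "commit": "Search"})
    return f"{BASE_URL}?{query}"


def scrape_links(terms, timeout=20):
    """Return the result links for the given terms, or [] on timeout."""
    # Imported lazily so build_search_url works without the browser stack installed.
    import undetected_chromedriver as uc
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = uc.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: recent Chrome build
    options.add_argument("--window-size=1440,900")  # no space after the comma
    driver = uc.Chrome(options=options)
    try:
        driver.get(build_search_url(terms))
        # Wait for the results container rather than the generic page header.
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CLASS_NAME, "search-results"))
        )
        links = driver.find_elements(By.CLASS_NAME, "search-result-link")
        return [link.get_attribute("href") for link in links]
    except TimeoutException:
        print("Timed out waiting for search results")
        return []
    finally:
        # Runs on every path, so the browser is always closed.
        driver.quit()
```

Putting driver.quit() in a finally block replaces the two separate quit() calls in the question, and waiting on the results container means a challenge page shows up as a timeout instead of an empty list.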