I am trying to scrape https://www.controller.com/ with Python. The page detected a bot when I used pandas.read_html, and again when I used requests with custom user agents and a rotating proxy, so I resorted to Selenium WebDriver. However, this is also detected as a bot, and the site responds with a bot-detection message. Can anybody explain how I can get past this?
Here is my code:
from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
# options.add_argument('headless')
# chrome_options= is deprecated and rejected by recent Selenium 4 releases; use options=.
driver = webdriver.Chrome(options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)
You mention pandas only in the text of your question, and options.add_argument('headless') only as a commented-out line in your code, so I am not sure whether you are actually using either. However, taking the minimum code from your attempt:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
# Recent Selenium 4 releases take the driver path through Service instead of executable_path.
service = Service(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.controller.com/')
print(driver.title)
I ran into the same issue.
When I inspected the HTML DOM, I found that the website references distil_referrer in its window.onbeforeunload handler, as follows:
<script type="text/javascript" id="">
window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>
This clearly indicates that the website is protected by the bot management service provider Distil Networks, and that navigation driven by ChromeDriver is detected and subsequently blocked.
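The same check can be scripted. The following is a minimal sketch, assuming that the literal string distil_referrer in the page source is a good enough marker; that is an assumption drawn from the script snippet above, not a documented Distil signature, and the name mentions_distil is hypothetical.

```python
import re

# Marker taken from the inline script shown above. Treating it as a
# fingerprint for Distil is an assumption, not a documented signature.
DISTIL_MARKER = re.compile(r"distil_referrer", re.IGNORECASE)


def mentions_distil(page_source: str) -> bool:
    """Return True if the HTML contains the distil_referrer marker."""
    return bool(DISTIL_MARKER.search(page_source))


# The script element observed on the page, used here as sample input.
sample = (
    '<script type="text/javascript" id="">'
    'window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&'
    'sessionStorage.removeItem("distil_referrer")};</script>'
)

print(mentions_distil(sample))
print(mentions_distil("<html><body>hello</body></html>"))
```

In a real session you would pass driver.page_source to mentions_distil after driver.get(...) returns.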
Distil
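As per the article There Really Is Something About Distil.it...:
"Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts."
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium as the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".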
Reference
You can find detailed discussions in:
- Distil detects WebDriver driven Chrome Browsing Context
- Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
- Akamai Bot Manager detects WebDriver driven Chrome Browsing Context