Webpage is detecting Selenium WebDriver with ChromeDriver as a bot

Problem description

I am trying to scrape https://www.controller.com/ with Python. The page detected a bot when I used pandas.get_html, and again when I used requests with user agents and a rotating proxy, so I resorted to Selenium WebDriver. However, this is also being detected as a bot. Can anybody explain how I can get past this?

Here is my code:

from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
# options= replaces the deprecated chrome_options= keyword
driver = webdriver.Chrome(options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)

Solution

You mention pandas.get_html only in the question text, and options.add_argument('headless') only in a commented-out line of your code, so I am not sure whether you are actually using either. Taking the minimum from your code attempt:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    # Selenium 4 style: the driver path goes through Service, and options= replaces chrome_options=
    service = Service(r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=options)
    driver.get('https://www.controller.com/')
    print(driver.title)
    

I have faced the same issue.

  • Browser Snapshot:

When I inspected the HTML DOM, I observed that the website references distil_referrer in a window.onbeforeunload handler, as follows:

<script type="text/javascript" id="">
    window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>

Snapshot:

This is a clear indication that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.

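The check described above can be sketched in Python. This is a minimal sketch, not a documented Distil API: it assumes that the string distil_referrer appearing in the page HTML is a usable hint, and the names DISTIL_MARKER and mentions_distil are hypothetical.

```python
import re

# Marker taken from the inline script shown above. Treating its presence as
# a sign of Distil protection is a heuristic (an assumption), not an API.
DISTIL_MARKER = re.compile(r"distil_referrer", re.IGNORECASE)


def mentions_distil(page_source: str) -> bool:
    """Return True if the HTML references the distil_referrer session key."""
    return bool(DISTIL_MARKER.search(page_source))


# Sample HTML modelled on the script found on the page.
sample = (
    '<script type="text/javascript" id="">'
    'window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage'
    '&&sessionStorage.removeItem("distil_referrer")};</script>'
)
print(mentions_distil(sample))
print(mentions_distil("<html><body>Hello</body></html>"))
```

With a live Selenium session, the same function could be called as mentions_distil(driver.page_source).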

Distil

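As per the article There Really Is Something About Distil.it...:

"Distil protects sites against automated content-scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts."

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium as the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".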


Reference

You can find a couple of detailed discussions in:
