本文介绍了使用DataDome的网站在使用Selenium和Python进行抓取时被阻止验证码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我实际上是在尝试从不同的网站上抓取一些汽车数据,我一直在将硒与chromebrowser一起使用,但有些网站实际上通过验证码验证来阻止硒(例如: https://www.leboncoin.fr/),而只需1或2个请求即可.我尝试在chromebrowser中更改$ _cdc,但这不能解决问题,我一直在chromebrowser中使用这些选项

I'm actually trying to scrape some car datas from different websites, i've been using selenium with chromebrowser but some websites actually block selenium with captcha validation(example: https://www.leboncoin.fr/), and this in just 1 or 2 requests.I tried changing $_cdc in the chromebrowser but this didn't resolve the problem, and I've been using those options for the chromebrowser

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
options = webdriver.ChromeOptions()
options.add_argument(f'user-agent={user_agent}')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--profile-directory=Default')
options.add_argument("--incognito")
options.add_argument("--disable-plugins-discovery")
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors", "safebrowsing-disable-download-protection", "safebrowsing-disable-auto-update", "disable-client-side-phishing-detection"])
options.add_argument('--disable-extensions')
browser = webdriver.Chrome(chrome_options=options)

browser.delete_all_cookies()

browser.set_window_size(800,800)

browser.set_window_position(0,0)

我要抓取的网站将DataDome用于机器人安全,有什么线索吗?

The website I'm trying to scrape uses DataDome for bot security, any clue ?

推荐答案

有关从不同网站或 https://www.leboncoin.fr/可以帮助我们构建更规范的答案.但是,我可以使用如下:

A bit more details about your usecase on scraping car datas from different websites or from https://www.leboncoin.fr/ would have helped us to construct a more canonical answer. However, I was able to access the Page Source using Selenium as follows:

  • 代码块:

  • Code Block:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.leboncoin.fr/')
print(driver.page_source)

  • 控制台输出:

  • Console Output:

    <html class="gServer"><head><link rel="preconnect" href="//fonts.googleapis.com" crossorigin=""><link rel="preload" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin="" as="style"><link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin=""><style data-emotion-css=""></style><meta charset="utf-8"><link rel="manifest" href="/manifest.json"><link type="application/opensearchdescription+xml" rel="search" href="/opensearch.xml"><meta name="theme-color" content="#ff6e14"><meta property="og:locale" content="fr_FR"><meta property="og:site_name" content="leboncoin"><meta name="twitter:site" content="leboncoin"><meta http-equiv="P3P" content="CP=&quot;This is not a P3P policy&quot;"><meta name="viewport" content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=0"><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://tp.realytics.io/sync/se/cnktbDNiMG5jb3xyeV83NTFGRUQwMy1CMDdGLTRBQTgtOTAxRi1DNUREMDVGRjkxQTJ8?ct=1&amp;rt=1&amp;u=https%3A%2F%2Fwww.leboncoin.fr%2F&amp;r=&amp;ts=1591306049397"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-766292687&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-667462656&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://cdn-eu.realytics.net/realytics-1.2.min.js"></script><script type="text/javascript" async="" src="https://i.realytics.io/tc.js?cb=1591306047755"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=DC-4167650&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-744431185&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" charset="utf-8" src="//www.googleadservices.com/pagead/conversion_async.js" id="utag_82"></script><script type="text/javascript" async="" charset="utf-8" src="//sdk.mpianalytics.com/pulse.min.js" id="utag_47"></script><script async="true" type="text/javascript" src="https://sslwidget.criteo.com/event?a=50103&amp;v=5.5.0&amp;p0=e%3Dexd%26site_type%3Dd&amp;p1=e%3Dvh&amp;p2=e%3Ddis&amp;adce=1&amp;tld=leboncoin.fr&amp;dtycbr=6569" data-owner="criteo-tag"></script><script type="text/javascript" src="//try.abtasty.com/09643a1c5bc909059579da8aac99e8f1.js"></script><script>window.dataLayer = window.dataLayer || [];
    .
    .
    .
    <iframe height="1" width="1" style="display:none" src="//4167650.fls.doubleclick.net/activityi;src=4167650;type=slbc01;cat=all-site;u1=homepage;ord=9979622847645.51?" id="utag_179_iframe"></iframe></body></html>
    

  • 但是,从 DOM树可以明显看出,该网站已受到保护,免受 Bad的破坏机器人通过> DataDome ,如下所示:

    However, it's quite evident from the DOM Tree that the website is protected from Bad Bots through DataDome as in:

    主要功能如下:

    • DataDome是作为服务交付的唯一机器人保护解决方案.
    • DataDome不需要架构更改或DNS重新路由.
    • DataDome的机器人检测引擎将对网站的每个请求与庞大的内存模式数据库进行比较,并结合使用AI和机器学习来在不到2毫秒的时间内确定是否应授予对页面的访问权限. /li>
    • DataDome可检测并识别100%的OWASP自动威胁.
    • DataDome的自定义规则"功能甚至可以使您阻止不向您出售产品的国家/地区的人流量,或者仅在特定情况下允许合作伙伴机器人访问您的网站.
    • DataDome is the only bot protection solution delivered as-a-service.
    • DataDome requires no architecture changes or DNS rerouting.
    • DataDome's bot detection engine compares every request to the website with a massive in-memory pattern database, and uses a blend of AI and machine learning to decide in less than 2 milliseconds whether access to your pages should be granted or not.
    • DataDome detects and identifies 100% of OWASP automated threats.
    • DataDome's Custom Rules function can even allow you to block human traffic from countries you are not selling to, or to allow partner bots to access your site only in specific circumstances.

    有关DataDoe的文档可在以下位置找到:

    Documentation on DataDoe can be found at:

    • Bot detection
    • Server-side bot detection is not enough

    这篇关于使用DataDome的网站在使用Selenium和Python进行抓取时被阻止验证码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

    06-25 12:37