

For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but around company 80 I get blocked by Google. This is the message I get:

> Our systems have detected unusual traffic from your computer network.
> This page checks to see if it's really you sending the requests, and
> not a robot.


I tried two different ways to get my data:


import requests
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results))

    soup = BeautifulSoup(page.content, "html5lib")


from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

for company_name in data:
    search = company_name
    results = 1

    s = requests.Session()
    retries = Retry(total=3, backoff_factor=0.5)
    s.mount('http://', HTTPAdapter(max_retries=retries))
    s.mount('https://', HTTPAdapter(max_retries=retries))
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))

    soup = BeautifulSoup(page.content, "html5lib")


But I keep getting the same error. Is there a way to overcome this issue? Thanks!


If you just want to make sure you never make more than 1 request every 0.6 seconds, you just need to sleep until it's been at least 0.6 seconds since the last request.

If the amount of time it takes you to process each request is a tiny fraction of 0.6 seconds, you can uncomment the line already in your code. However, it probably makes more sense to do it at the end of the loop, rather than in the middle:

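import time

for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    # Pause at the end of each iteration so requests are at least 0.6 s apart.
    time.sleep(0.6)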

If your processing takes a sizable fraction of 0.6 seconds, then waiting 0.6 seconds is too long. For example, if it sometimes takes 0.1 seconds, sometimes 1.0, then you want to wait 0.5 seconds in the first case, but not at all in the second, right?

In that case, just keep track of the last time you made a request, and sleep until 0.6 seconds after that:

import time

last_req = time.time()
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup

    # Sleep only for whatever is left of the 0.6 s window.
    delay = last_req + 0.600 - time.time()
    if delay > 0:
        time.sleep(delay)
    # Record the time right before the next request goes out.
    last_req = time.time()
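
If you end up needing the same "sleep until 0.6 seconds after the last request" logic in more than one place, it can be wrapped in a small helper. The following is only a minimal sketch: the class name `MinIntervalLimiter` and its `clock` and `sleep` parameters are hypothetical names made up for illustration, and it uses `time.monotonic()` instead of `time.time()` on the assumption that you would rather not be affected by system clock adjustments.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum gap between calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock   # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None     # time at which the previous request was released

    def wait(self):
        """Block until at least `interval` seconds after the previous call."""
        if self._last is not None:
            delay = self._last + self.interval - self._clock()
            if delay > 0:
                self._sleep(delay)
        # Remember the moment the caller is released to make its request.
        self._last = self._clock()
```

In the loop you would create one `MinIntervalLimiter(0.6)` before the `for` and call its `wait()` right before each `s.get(...)`. The first call returns immediately, and later calls sleep only for the part of the 0.6 seconds that your processing has not already used up.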

If you need to make requests exactly once every 0.6 seconds (or as close to that as possible), you could kick off a thread that does that and tosses the results into a queue, while another thread (possibly your main thread) just blocks popping results off that queue and processing them.


But I can't imagine why you'd need that.
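
For completeness, here is a rough sketch of that thread-and-queue arrangement. Everything in it is an assumption for illustration: `fetch_on_schedule` and `process_all` are hypothetical names, `fetch` stands in for whatever function actually makes the request, and a `None` sentinel is just one simple way to tell the consumer that the worker is done.

```python
import queue
import threading
import time


def fetch_on_schedule(items, fetch, interval, out):
    """Call fetch(item) about once per `interval` seconds and queue each result."""
    next_time = time.monotonic()
    for item in items:
        delay = next_time - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        out.put((item, fetch(item)))
        # Schedule from the planned time so small delays do not accumulate,
        # but never schedule in the past if fetch() itself was slow.
        next_time = max(next_time + interval, time.monotonic())
    out.put(None)  # sentinel: no more results


def process_all(items, fetch, interval):
    """Start the fetching thread and handle results as they arrive."""
    out = queue.Queue()
    worker = threading.Thread(
        target=fetch_on_schedule, args=(items, fetch, interval, out), daemon=True
    )
    worker.start()
    results = []
    while True:
        entry = out.get()  # blocks until the worker has produced something
        if entry is None:
            break
        results.append(entry)  # stand-in for the real processing
    worker.join()
    return results
```

With this layout, slow processing in the main thread does not delay the next request, because the worker thread keeps its own schedule. For the scraping loop above, `fetch` could be a small function that calls `s.get(...)` and returns the page.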

