This article describes how to handle the TimeoutError, NewConnectionError, and requests.exceptions.ConnectionError exceptions that can occur when scraping a website in a loop. It should be a useful reference for anyone hitting the same problem; follow along below.

Problem Description



Apologies, I am a beginner at Python and web scraping.

I am web scraping wugniu.com to extract readings for characters that I input. I made a list of 10,273 characters to format into the URL and bring up the page with readings. I then used the Requests module to return the source code, and Beautiful Soup to return all the audio IDs (their strings contain the readings for the input characters; I couldn't use the text that comes up in the table, as those are SVGs). Then I tried to output the characters and their readings to out.txt.

# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

characters = [
#characters go here
]

output = open("out.txt", "a", encoding="utf-8")

tic = time.perf_counter()

for char in characters:
    # Characters from the list are formatted into the url
    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char

    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.text, 'html.parser')

    for audio_tag in soup.find_all('audio'):
        audio_id = audio_tag.get('id').replace("0-","")
        #output.write(char)
        #output.write("  ")
        #output.write(audio_id)
        #output.write("\n")
        print(char)
        time.sleep(60)

output.close()
toc = time.perf_counter()
duration = int(toc) - int(tic)
print("Took %d seconds" % duration)

out.txt is the output file I tried to write the results to. I timed the process to gauge performance.

However, after 50 or so loops, I get this in the cmd:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
    raise err
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
    conn = self._new_conn()
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module>
    page = requests.get(url, verify=False)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

I tried to fix this by adding time.sleep(60) but the errors still happened. When I made this script yesterday, I was able to run it with a list of up to 1500 characters with no errors. Could someone please help me with this? Thanks.

解决方案

That's completely normal, expected behavior on the server's side.

Imagine that you open the Firefox browser, load google.com, close the browser, and then repeat that cycle over and over. That is effectively what calling requests.get() in a loop does: every call opens a brand-new connection.

To the server this looks like a DDoS attack, and modern servers will block your requests and flag your IP, because it genuinely hurts their bandwidth.

The logical and correct approach is to use the same session instead of creating a new connection for every request, as that will not show up under the TCP SYN flood flag. Check the legal TCP flags.
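On top of reusing one session, you can mount urllib3's retry logic on it so transient connection failures are retried automatically with a growing delay. This is a minimal sketch, not part of the original answer; the retry count, backoff factor, and status codes are just example values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # One Session reuses the underlying TCP connection (keep-alive)
    # instead of opening a brand-new one for every request.
    session = requests.Session()
    # Retry transient failures up to 5 times, sleeping roughly
    # 1s, 2s, 4s, ... between attempts (backoff_factor=1), and also
    # retry on common "back off" HTTP status codes.
    retries = Retry(total=5, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retries))
    return session
```

With something like this in place, a burst of WinError 10060 failures becomes a few automatic retries before requests finally gives up and raises ConnectionError.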

On the other side, you really should use a context manager instead of having to remember to close your file handles yourself.

output = open("out.txt", "a", encoding="utf-8")
output.close()

Can be handled via a with statement, which closes the file automatically as soon as you leave the block:

with open('out.txt', 'w', newline='', encoding='utf-8') as output:
    # here you can do your operation.

Also, consider using the new-style format string instead of the old %-formatting:

url = "https://wugniu.com/search?char=%s&table=wenzhou" % char

Can be:

"https://wugniu.com/search?char={}&table=wenzhou".format(char)
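You can even skip manual string building entirely and let Requests assemble and percent-encode the query string from a params dict. This is just an illustrative sketch of the same URL, shown via a prepared request so the final URL is visible:

```python
import requests

# requests percent-encodes non-ASCII values such as 核 for you
# when the query parameters are passed as a dict.
req = requests.Request('GET', 'https://wugniu.com/search',
                       params={'char': '核', 'table': 'wenzhou'})
print(req.prepare().url)
# https://wugniu.com/search?char=%E6%A0%B8&table=wenzhou
```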

I won't write production-grade code here; I've kept it simple so you can understand the concept. Note how I pick out the required element and how I write it to the file. (lxml and html.parser also differ in parsing speed; lxml is generally faster.)

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings()


def main(url, chars):
    with open('result.txt', 'w', newline='', encoding='utf-8') as f, requests.Session() as req:
        req.verify = False
        for char in chars:
            print(f"Extracting {char}")
            r = req.get(url.format(char))
            soup = BeautifulSoup(r.text, 'lxml')
            target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
            print(target)
            f.write(f'{char}\n{str(target)}\n')


if __name__ == "__main__":
    chars = ['核']
    main('https://wugniu.com/search?char={}&table=wenzhou', chars)

Also, to follow the Python DRY principle, you can set req.verify = False once on the session instead of passing verify=False on every request.

As a next step, you should look into threading or async programming to cut down your script's run time: in real projects we don't fetch URLs one by one in a plain for loop (which counts as very slow); instead we send a batch of requests and wait for the responses.
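As a rough illustration of that next step (not part of the original answer), a thread pool can fetch several URLs in parallel while still sharing one session. The worker count and timeout below are arbitrary example values:

```python
import concurrent.futures as cf
import requests

def fetch(session, url):
    # A transient ConnectionError on one URL should not abort the
    # whole run; report it as None so the caller can retry it later.
    try:
        return session.get(url, timeout=10).text
    except requests.exceptions.ConnectionError:
        return None

def fetch_all(urls, workers=5):
    # One shared Session, several worker threads pulling from it.
    with requests.Session() as session, \
         cf.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(session, u), urls))
```

Keep the worker count modest: the whole point of the answer above is not to hammer the server, and a large pool would recreate the rate-limiting problem.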
