Problem description
I am trying to use urllib3 in simple threads to fetch several wiki pages. The script creates one connection for every thread (I don't understand why) and hangs forever. Any tip, advice, or simple example of urllib3 and threading would be appreciated.
import threadpool
from urllib3 import connection_from_url

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

def fetch(url, fields):
    kwargs = {'retries': 6}
    return HTTP_POOL.get_url(url, fields, **kwargs)

pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
@Lennart's script produced this error:
http://en.wikipedia.org/wiki/2010-11_Premier_LeagueTraceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
http://en.wikipedia.org/wiki/List_of_MythBusters_episodeshttp://en.wikipedia.org/wiki/List_of_Top_Gear_episodes http://en.wikipedia.org/wiki/List_of_Unicode_characters result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
After adding import threadpool; import urllib3 and tpool = threadpool.ThreadPool(4) to @user318904's code, I got this error:
Traceback (most recent call last):
File "crawler.py", line 21, in <module>
tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'
Recommended answer
Here is my take, a more current solution using Python 3 and concurrent.futures.ThreadPoolExecutor.
import urllib3
from concurrent.futures import ThreadPoolExecutor

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

def download(url, cmanager):
    # Fetch one page through the shared connection manager.
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])

# One PoolManager is shared by all threads; maxsize matches the thread count.
connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)
Some remarks
- My code is based on a similar example from the Python Cookbook by Beazley and Jones.
- I particularly like the fact that you only need a standard module besides urllib3.
- The setup is extremely simple, and if you are only going for side effects in download (like printing, saving to a file, etc.), there is no additional effort in joining the threads.
- If you want something different, ThreadPoolExecutor.submit returns a Future that wraps whatever download returns.
- I found it helpful to align the number of threads in the thread pool with the number of HTTPConnections in a connection pool (via maxsize). Otherwise you might encounter (harmless) warnings when all threads try to access the same server (as in the example).
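The remark about Futures can be made concrete with a small sketch. This is not part of the original answer: the names fetch_all and make_urllib3_fetcher, the 10-second timeout, and the choice to store exceptions next to results are all assumptions made for illustration. It collects each download's return value through the Future that submit hands back, and keeps maxsize equal to the number of worker threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for every URL in a thread pool.

    Returns a dict mapping each URL to fetch's return value, or to the
    exception it raised. (Hypothetical helper, not from the answer.)
    """
    results = {}
    # Leaving the with-block waits for all submitted tasks to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                # result() returns what fetch returned, or re-raises its error.
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


def make_urllib3_fetcher(max_workers=5):
    """Build a fetch(url) function backed by one shared PoolManager."""
    # Imported here so fetch_all can be tried out without urllib3 installed.
    import urllib3

    # maxsize is aligned with the number of worker threads.
    manager = urllib3.PoolManager(maxsize=max_workers)

    def fetch(url):
        response = manager.request('GET', url, timeout=10.0)
        return response.status, response.data[:1024]

    return fetch


# Example call (performs real HTTP requests):
# results = fetch_all(urls, make_urllib3_fetcher())
```

Because fetch_all takes the fetch function as an argument, it can be exercised with a stub before pointing it at a real server. Storing exceptions in the result dict is just one option; re-raising on the first failure would work as well.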