我尝试组织最多 10 个并发下载的池。该函数应下载基本 url,然后解析此页面上的所有 url 并下载每个 url,但同时下载的 OVERALL 数量不应超过 10。
from lxml import etree
import gevent
from gevent import monkey, pool
import requests
monkey.patch_all()
urls = [
'http://www.google.com',
'http://www.yandex.ru',
'http://www.python.org',
'http://stackoverflow.com',
# ... another 100 urls
]
LINKS_ON_PAGE=[]
POOL = pool.Pool(10)
def parse_urls(page):
html = etree.HTML(page)
if html:
links = [link for link in html.xpath("//a/@href") if 'http' in link]
# Download each url that appears in the main URL
for link in links:
data = requests.get(link)
LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))
def get_base_urls(url):
# Download the main URL
data = requests.get(url)
parse_urls(data.content)
如何组织它以并发方式进行,但要保持所有 Web 请求的通用全局池限制?
最佳答案
我认为以下内容应该可以满足您的需求。我在我的示例中使用 BeautifulSoup 而不是您拥有的链接 strip 化内容。
from bs4 import BeautifulSoup
import requests
import gevent
from gevent import monkey, pool
monkey.patch_all()
jobs = []
links = []
p = pool.Pool(10)
urls = [
'http://www.google.com',
# ... another 100 urls
]
def get_links(url):
r = requests.get(url)
if r.status_code == 200:
soup = BeautifulSoup(r.text)
links.extend(soup.find_all('a'))
for url in urls:
jobs.append(p.spawn(get_links, url))
gevent.joinall(jobs)
关于python - 带有嵌套 Web 请求的 Gevent 池,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/15322701/