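I have a question about the performance of ThreadPoolExecutor compared with the Thread class; it seems I am missing some basic understanding.
I have a web scraper with two functions. The first parses the link to every image on a site's home page, and the second downloads the images from the parsed links: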
import os
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'
# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.find_all('div', {'class': 'top-story__image'})
    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(') + 1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except KeyError:
            # Links might be under the 'data-lazy' attribute w/o parentheses
            links_list.append(i.attrs['data-lazy'])
# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling last element off the link, which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving content in a given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))
The code that follows covers two cases:
Case 1: I am using multiple threads:
threads = []
t1 = threading.Thread(target=img_loader, args=(url, links[:10], path))
t2 = threading.Thread(target=img_loader, args=(url, links[10:20], path))
t3 = threading.Thread(target=img_loader, args=(url, links[20:30], path))
t4 = threading.Thread(target=img_loader, args=(url, links[30:40], path))
t5 = threading.Thread(target=img_loader, args=(url, links[40:50], path))
t6 = threading.Thread(target=img_loader, args=(url, links[50:], path))
threads.extend([t1, t2, t3, t4, t5, t6])
for t in threads:
    t.start()
for t in threads:
    t.join()
The code above takes 10 seconds to run on my machine.
Case 2: I am using ThreadPoolExecutor:
with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)
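The code above takes 18 seconds.
My understanding is that ThreadPoolExecutor creates one thread per worker. So, given that I set max_workers to 50, I would expect 50 threads and therefore a faster run. Can someone explain what I am missing here? I admit I am probably making a silly mistake, but I just don't get it.
Many thanks!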
Best answer
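In Case 2, you are sending all the links to a single worker: one call to submit schedules one task, and that task runs on one thread no matter how large max_workers is. Instead of
exec.submit(img_loader, url, links, path)
you need: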
for link in links:
    exec.submit(img_loader, url, [link], path)
I haven't tried this myself; it is just from reading the documentation of ThreadPoolExecutor.
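To make the difference visible without hitting the network, here is a minimal sketch under some assumptions: fake_loader is a hypothetical stand-in for img_loader that sleeps instead of downloading, and the helper names run_single_submit and run_submit_per_link are made up for illustration. It records which pool threads actually did work in each variant.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_loader(links, used_threads):
    """Hypothetical stand-in for img_loader: sleeps instead of downloading."""
    for _ in links:
        time.sleep(0.02)  # simulated network latency (an assumed value)
        used_threads.add(threading.get_ident())


def run_single_submit(links, max_workers=50):
    """Case 2 as written in the question: one submit call, so one task."""
    used_threads = set()
    with ThreadPoolExecutor(max_workers) as executor:
        executor.submit(fake_loader, links, used_threads)
    return used_threads


def run_submit_per_link(links, max_workers=50):
    """The suggested fix: one task per link, so the pool can spread the work."""
    used_threads = set()
    with ThreadPoolExecutor(max_workers) as executor:
        futures = [executor.submit(fake_loader, [link], used_threads)
                   for link in links]
        for future in futures:
            future.result()  # re-raises any exception from the worker
    return used_threads


if __name__ == "__main__":
    demo_links = ["img_{}.jpg".format(n) for n in range(20)]
    for runner in (run_single_submit, run_submit_per_link):
        start = time.perf_counter()
        threads_used = runner(demo_links)
        elapsed = time.perf_counter() - start
        print("{}: {} thread(s), {:.2f}s".format(
            runner.__name__, len(threads_used), elapsed))
```

The single-submit variant should report one thread, because the whole list is processed sequentially inside one task. Calling future.result() is optional here, but it surfaces exceptions that would otherwise be silently stored on the future.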
Regarding multithreading - ThreadPoolExecutor vs threading.Thread, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47995566/