This article looks at the question "Multithreaded spider Python package with proxy support?" and should be a useful reference for anyone facing the same problem.
Problem description
Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few, such as Twisted, Scrapy, and libcurl, but I don't know enough about them to make a decision, or even whether they can use proxies. Does anyone know of the best one for my purposes? Thanks!
Recommended answer
It's simple to implement this in Python. Below is a small multithreaded crawler built on urllib, BeautifulSoup, and a thread-safe Queue:
# -*- coding: utf-8 -*-
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()   # URLs already seen, shared across worker threads
queue = Queue()   # URLs waiting to be fetched

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                # Extract every link and keep only those under host/root.
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            # Queue is empty; let this worker thread exit.
            pass

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
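The snippet above does not actually route its downloads through a proxy, which was the other half of the question. As a minimal sketch of how that could be added (assuming Python 2's urllib2 and a hypothetical proxy at 127.0.0.1:8118), installing a ProxyHandler makes every subsequent urllib2.urlopen() call go through the proxy:

import urllib2

# Hypothetical proxy address; replace with a real HTTP proxy.
proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# In parse(), fetch through the proxy-aware opener instead of urllib.urlopen:
# content = urllib2.urlopen(url).read().decode(charset)

Alternatively, urllib.urlopen itself accepts a proxies argument, e.g. urlopen(url, proxies={'http': 'http://127.0.0.1:8118'}), if you would rather not switch to urllib2. The crawler is started with the host, root path, and charset on the command line, e.g. python spider.py example.com / utf-8 (example values).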
This concludes the article on "Multithreaded spider Python package with proxy support?". We hope the recommended answer is helpful.