Question
I'm trying to write a web crawler thing and want to make HTTP requests as quickly as possible. Tornado's AsyncHttpClient seems like a good choice, but all the example code I've seen (e.g. http://stackoverflow.com/a/25549675/1650177) basically calls AsyncHttpClient.fetch on a huge list of URLs to let Tornado queue them up and eventually make the requests.
But what if I want to process an indefinitely long (or just a really big) list of URLs from a file or the network? I don't want to load all the URLs into memory.
I Googled around but can't seem to find a way to call AsyncHttpClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap. Is there a way to do something similar in Tornado?
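For reference, the gevent pattern linked above looks roughly like this. This is only a sketch for comparison; it assumes the requests library and a urls.txt file, neither of which is part of the question:

    import gevent.threadpool
    import requests

    def fetch(url):
        return requests.get(url)

    pool = gevent.threadpool.ThreadPool(10)
    urls = (line.strip() for line in open("urls.txt"))
    # imap feeds URLs from the generator to the pool as it goes, without
    # building the whole list first (which is why the question points at it).
    for response in pool.imap(fetch, urls):
        print(response.status_code)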
One solution I've thought of is to only queue up so many URLs initially, then add logic to queue up more when a fetch operation completes, but I'm hoping there's a cleaner solution.
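As a rough illustration of that idea (not code from the question): a callback-style sketch using the Tornado 4.x-era fetch(callback=...) interface. MAX_IN_FLIGHT and urls.txt are made-up names, and shutting the IOLoop down once the iterator is exhausted is left out:

    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    MAX_IN_FLIGHT = 10
    client = AsyncHTTPClient()
    url_iter = (line.strip() for line in open("urls.txt"))

    def start_next():
        # Pull one more URL from the iterator, if any remain.
        try:
            url = next(url_iter)
        except StopIteration:
            return
        client.fetch(url, callback=on_done)

    def on_done(response):
        if response.error:
            print('failed to fetch', response.request.url)
        else:
            print('got response from', response.effective_url)
        start_next()  # one request finished, so start another

    if __name__ == '__main__':
        # Prime a fixed number of in-flight requests, then run the loop.
        for _ in range(MAX_IN_FLIGHT):
            start_next()
        IOLoop.current().start()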
Any help or recommendations would be appreciated!
Answer
I would do this with a Queue and multiple workers, in a variation on https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py
import tornado.queues
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

NUM_WORKERS = 10
QUEUE_SIZE = 100
q = tornado.queues.Queue(QUEUE_SIZE)
AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
http_client = AsyncHTTPClient()

@gen.coroutine
def worker():
    # Each worker pulls one URL at a time from the queue, forever.
    while True:
        url = yield q.get()
        try:
            response = yield http_client.fetch(url)
            print('got response from', url)
        except Exception:
            print('failed to fetch', url)
        finally:
            q.task_done()

@gen.coroutine
def main():
    for i in range(NUM_WORKERS):
        IOLoop.current().spawn_callback(worker)
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            # When the queue fills up, stop here to wait instead
            # of reading more from the file.
            yield q.put(url)
    yield q.join()

if __name__ == '__main__':
    IOLoop.current().run_sync(main)
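Note that nothing in main() requires the URLs to come from a file: because yield q.put(url) pauses whenever QUEUE_SIZE items are already waiting, the producer loop can consume any iterator without materializing the whole list. A minimal variant, assuming a url_source iterable that is not part of the original answer:

    @gen.coroutine
    def produce(url_source):
        # url_source can be any (possibly unbounded) iterable of URL strings,
        # e.g. a generator reading from a socket or a database cursor.
        for url in url_source:
            # Pauses here whenever QUEUE_SIZE URLs are already queued.
            yield q.put(url)

In main(), the with open("urls.txt") block would then be replaced by yield produce(your_iterator) before the final yield q.join().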