Question
I'm trying to write a web crawler thing and want to make HTTP requests as quickly as possible. Tornado's AsyncHttpClient seems like a good choice, but all the example code I've seen (e.g. http://stackoverflow.com/a/25549675/1650177) basically calls AsyncHttpClient.fetch on a huge list of URLs to let Tornado queue them up and eventually make the requests.
But what if I want to process an indefinitely long (or just a really big) list of URLs from a file or the network? I don't want to load all the URLs into memory.
I Googled around but can't seem to find a way to call AsyncHttpClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap. Is there a way to do something similar in Tornado?
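For reference, the gevent pattern linked above looks roughly like this. This is only a sketch for comparison; it assumes the requests library and a urls.txt file, neither of which is part of the question:

    import gevent.threadpool
    import requests

    def fetch(url):
        return requests.get(url)

    pool = gevent.threadpool.ThreadPool(10)
    urls = (line.strip() for line in open("urls.txt"))
    # imap feeds URLs from the generator to the pool as it goes, without
    # building the whole list first (which is why the question points at it).
    for response in pool.imap(fetch, urls):
        print(response.status_code)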
One solution I've thought of is to only queue up so many URLs initially, then add logic to queue up more when a fetch operation completes, but I'm hoping there's a cleaner solution.
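As a rough illustration of that idea (not code from the question): a callback-style sketch using the Tornado 4.x-era fetch(callback=...) interface. MAX_IN_FLIGHT and urls.txt are made-up names, and shutting the IOLoop down once the iterator is exhausted is left out:

    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    MAX_IN_FLIGHT = 10
    client = AsyncHTTPClient()
    url_iter = (line.strip() for line in open("urls.txt"))

    def start_next():
        # Pull one more URL from the iterator, if any remain.
        try:
            url = next(url_iter)
        except StopIteration:
            return
        client.fetch(url, callback=on_done)

    def on_done(response):
        if response.error:
            print('failed to fetch', response.request.url)
        else:
            print('got response from', response.effective_url)
        start_next()  # one request finished, so start another

    if __name__ == '__main__':
        # Prime a fixed number of in-flight requests, then run the loop.
        for _ in range(MAX_IN_FLIGHT):
            start_next()
        IOLoop.current().start()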
Any help or recommendations would be appreciated!
Answer
I would do this with a Queue and multiple workers, in a variation on https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py
import tornado.queues
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

NUM_WORKERS = 10
QUEUE_SIZE = 100
q = tornado.queues.Queue(QUEUE_SIZE)
AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
http_client = AsyncHTTPClient()

@gen.coroutine
def worker():
    # Each worker pulls one URL at a time from the queue, forever.
    while True:
        url = yield q.get()
        try:
            response = yield http_client.fetch(url)
            print('got response from', url)
        except Exception:
            print('failed to fetch', url)
        finally:
            q.task_done()

@gen.coroutine
def main():
    for i in range(NUM_WORKERS):
        IOLoop.current().spawn_callback(worker)
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            # When the queue fills up, stop here to wait instead
            # of reading more from the file.
            yield q.put(url)
    yield q.join()

if __name__ == '__main__':
    IOLoop.current().run_sync(main)
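Note that nothing in main() requires the URLs to come from a file: because yield q.put(url) pauses whenever QUEUE_SIZE items are already waiting, the producer loop can consume any iterator without materializing the whole list. A minimal variant, assuming a url_source iterable that is not part of the original answer:

    @gen.coroutine
    def produce(url_source):
        # url_source can be any (possibly unbounded) iterable of URL strings,
        # e.g. a generator reading from a socket or a database cursor.
        for url in url_source:
            # Pauses here whenever QUEUE_SIZE URLs are already queued.
            yield q.put(url)

In main(), the with open("urls.txt") block would then be replaced by yield produce(your_iterator) before the final yield q.join().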