This post covers how to fetch many pages at once with pycurl (or urllib2); the question and a recommended answer follow.

Problem Description

I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but get the pages' data in Python, not disk files. Can someone please post pycurl code to do this, or fast urllib2 (not one-at-a-time) if that's possible, or else say "forget it, curl is faster and more robust"? Thanks.

Recommended Answer

Here is a solution based on urllib2 and threads.

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0, 3604)  # pages 0-3603, matching curl's inclusive [0000-3603]
THREADS = 2                 # number of concurrent fetcher threads

def main():
    # Give each thread its own slice of the page numbers.
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    # Split seq into num_pieces contiguous chunks of near-equal length.
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums

    def run(self):
        # Fetch each assigned page and print its body (Python 2 / urllib2).
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()
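
Since the question explicitly asked for pycurl: below is a minimal sketch of the same fan-out using pycurl's CurlMulti interface, which multiplexes many transfers on a single thread. This is not from the original answer; the WINDOW size, the fetch_all helper, and the buffer-per-handle bookkeeping are illustrative assumptions.

import pycurl
from cStringIO import StringIO

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
PAGES = range(0, 3604)   # same inclusive range as the curl command
WINDOW = 50              # transfers to keep in flight (assumed tuning knob)

def fetch_all(urls):
    multi = pycurl.CurlMulti()
    queue = list(urls)
    active = {}   # Curl handle -> (url, response buffer)
    results = {}  # url -> page body

    while queue or active:
        # Top up the window with fresh handles.
        while queue and len(active) < WINDOW:
            url = queue.pop(0)
            c = pycurl.Curl()
            buf = StringIO()
            c.setopt(pycurl.URL, url)
            c.setopt(pycurl.WRITEFUNCTION, buf.write)
            multi.add_handle(c)
            active[c] = (url, buf)

        # Drive all transfers until libcurl wants us to wait.
        while True:
            ret, num_handles = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break

        # Collect finished handles, successful or failed.
        while True:
            num_q, ok, err = multi.info_read()
            for c in ok:
                url, buf = active.pop(c)
                results[url] = buf.getvalue()
                multi.remove_handle(c)
                c.close()
            for c, errno, errmsg in err:
                url, buf = active.pop(c)  # drop failed pages here
                multi.remove_handle(c)
                c.close()
            if num_q == 0:
                break

        multi.select(1.0)  # wait for socket activity instead of busy-looping

    return results

if __name__ == '__main__':
    pages = fetch_all(BASE_URL + str(n) for n in PAGES)

Whether this beats the threaded urllib2 version depends on the server and on how many handles you keep in flight; for a few thousand small pages either approach is usually adequate.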

