This article explains what to do when streaming (chunk-by-chunk) reads via Python's urllib2.urlopen return only partial results. It may serve as a useful reference if you hit the same problem.

Problem Description

I found a way to do streaming reads in Python in the top-voted answer to this post:

Stream large binary files with urllib2 to file

But it went wrong: when I did some time-consuming work after each chunk was read, I only got the front part of the data.

from urllib2 import urlopen
from urllib2 import HTTPError

import sys
import time

CHUNK = 1024 * 1024 * 16  # read 16 MiB at a time


try:
    response = urlopen("XXX_domain/XXX_file_in_net.gz")
except HTTPError as e:
    print e
    sys.exit(1)


while True:
    chunk = response.read(CHUNK)

    print 'CHUNK:', len(chunk)

    # some time-consuming work, just as an example
    time.sleep(60)

    if not chunk:
        break

Without the sleep, the output is correct (the chunk sizes are verified to add up to the actual file size):

    CHUNK: 16777216
    CHUNK: 16777216
    CHUNK: 6888014
    CHUNK: 0

With the sleep:

    CHUNK: 16777216
    CHUNK: 766580
    CHUNK: 0

I then decompressed these chunks and found that only the front part of the gz file had been read.
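The short reads suggest that the server (or a proxy in between) closes the idle connection while the client sleeps, so read() starts returning an empty string before the whole file has arrived, without raising any error. A quick way to confirm the truncation before decompressing anything is to compare the bytes actually received with the Content-Length header. This is only a minimal diagnostic sketch under the same Python 2 / urllib2 setup, reusing the placeholder URL from the question:

import time
from urllib2 import urlopen

CHUNK = 1024 * 1024 * 16

response = urlopen("XXX_domain/XXX_file_in_net.gz")
# headers are rfc822-style with lowercased keys; 'content-length'
# is only present if the server sends it
expected = int(dict(response.info()).get('content-length', 0))

received = 0
while True:
    chunk = response.read(CHUNK)
    if not chunk:
        break
    received += len(chunk)
    time.sleep(60)  # the slow per-chunk work that triggers the problem

print 'received %d of %d bytes' % (received, expected)
if expected and received < expected:
    print 'truncated: the connection was closed before the whole file arrived'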

Recommended Answer

Try to support resuming the download with HTTP Range requests, in case the server closes the connection before all the data has been sent.

import socket

from urllib2 import Request
from urllib2 import urlopen
from urllib2 import HTTPError

CHUNK = 16 * 1024 * 1024  # read 16 MiB at a time


def download(the_url):
    handled_size = 0  # bytes read and processed so far
    content_size = 0  # total size reported by the server

    try:
        request = Request(the_url, headers={'Range': 'bytes=0-'})
        response = urlopen(request, timeout=60)
    except HTTPError as e:
        print e
        return 'Connection Error'

    header_dict = dict(response.info())
    if 'content-length' in header_dict:
        content_size = int(header_dict['content-length'])

    while True:
        # Drain the current connection until it times out or is exhausted.
        while True:
            try:
                chunk = response.read(CHUNK)
            except socket.timeout:
                print 'time_out'
                break
            if not chunk:
                break

            DoSomeTimeConsumingJob()  # placeholder for the slow per-chunk work

            handled_size = handled_size + len(chunk)

        if handled_size == content_size and content_size != 0:
            break  # the whole file has been received
        else:
            # Reconnect, resuming from the first byte not handled yet.
            try:
                request = Request(the_url,
                                  headers={'Range': 'bytes=' + str(handled_size) + '-'})
                response = urlopen(request, timeout=60)
            except HTTPError as e:
                print e

    response.close()
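Since urllib2 exists only in Python 2, here is a minimal Python 3 sketch of the same resume-by-Range idea using urllib.request. the_url and do_some_time_consuming_job are placeholders carried over from the answer, and the same caveat applies: the loop can only terminate cleanly when the server reports a Content-Length.

import socket
from urllib.request import Request, urlopen
from urllib.error import HTTPError

CHUNK = 16 * 1024 * 1024  # 16 MiB per read


def download(the_url, do_some_time_consuming_job):
    handled = 0  # bytes read and processed so far

    # Ask for the whole file; the Range header also signals that we
    # intend to resume with partial requests later.
    response = urlopen(Request(the_url, headers={'Range': 'bytes=0-'}),
                       timeout=60)
    total = int(response.headers.get('Content-Length', 0))

    while True:
        # Drain the current connection until it times out or is exhausted.
        while True:
            try:
                chunk = response.read(CHUNK)
            except socket.timeout:
                break
            if not chunk:
                break
            do_some_time_consuming_job(chunk)
            handled += len(chunk)

        if total and handled == total:
            break  # the whole file has been received
        # Reconnect, resuming from the first byte not handled yet.
        try:
            response = urlopen(
                Request(the_url, headers={'Range': 'bytes=%d-' % handled}),
                timeout=60)
        except HTTPError as e:
            print(e)

    response.close()

Note that a server which ignores the Range header answers 200 and resends the file from the beginning, so a production version should also check response.getcode() for 206 before trusting the resumed offset.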

This concludes the article on streaming (chunk-by-chunk) reads with Python's urllib2.urlopen returning only partial results. We hope the recommended answer helps.
