Python cannot read "warc.gz" archives completely

Question

For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.

I noticed that for the majority of files I cannot read them completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.

After some research I found that this problem happens only with the gzipped files. Files with the extension "warc" do not have this problem.

I found that other people have run into this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been posted there.

I suspect there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?

Thanks in advance!

Example:

I create new warc.gz files like this:

import warc
# Raw string, so that "\f" in the path is not parsed as a form-feed escape.
warc_path = r"\\some_path\file_name.warc.gz"
warc_file = warc.open(warc_path, "wb")

To write records I use:

record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)

This creates perfect "warc.gz" files. There are no problems with them: everything, including the "\r\n" line endings, is correct. The problem starts when I read these files.

To read the files I use:

warc_file = warc.open(warc_path, "rb")

To loop over the records I use:

for record in warc_file:
    ...

The problem is that this loop does not find all of the records in a "warc.gz" file, while it finds all of them in a "warc" file. Working with both types of file is handled inside the warc library itself.
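One way to check how many records a "warc.gz" file really holds, independently of the warc library, is to count its gzip members with the stdlib zlib module. This is only a sketch: it assumes each record was written as its own gzip member (common for WARC files, but not confirmed here for warc 0.2.1), and count_gzip_members, gzip_member and the sample data are hypothetical names made up for illustration.

```python
import gzip
import io
import zlib


def count_gzip_members(raw):
    """Count the gzip members in a byte string (hypothetical helper)."""
    count = 0
    while raw:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decompressor.decompress(raw)
        count += 1
        # Bytes after the end of this member belong to the next one.
        # A truncated final member leaves nothing here and is still counted.
        raw = decompressor.unused_data
    return count


def gzip_member(data):
    """Compress data into a single standalone gzip member (sample data only)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


# Three members catenated back to back, standing in for a small warc.gz file.
sample = b"".join(gzip_member(r) for r in [b"one\r\n", b"two\r\n", b"three\r\n"])
member_count = count_gzip_members(sample)
print(member_count)
```

For a real archive, the bytes would come from open(warc_path, "rb").read(); comparing the member count with the number of records the loop yields shows whether records are being dropped on the reading side. Note that this reads the whole file into memory.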

Recommended answer

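It seems that the combination of warc.gzip2.GzipFile, warc.utils.FilePart and reading with warc.warc.WARCReader is broken as a whole (tested with Python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.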

It would seem that the basic stdlib gzip handles catenated files just fine, so this should work as well:

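import gzip
import warc

# Let the stdlib gzip module do the decompression, then hand the
# decompressed stream to warc for record parsing.
with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print(record.payload.read())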
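The claim that stdlib gzip copes with catenated members can be checked without the warc library at all. The following is a minimal sketch with made-up sample data (gzip_member and records are hypothetical names, not part of any library): it builds three separate gzip members, joins them, and reads the result back through gzip.GzipFile.

```python
import gzip
import io


def gzip_member(data):
    """Compress data into one standalone gzip member (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


# Made-up payloads standing in for WARC records.
records = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]

# Catenate the members the way a multi-member .gz file stores them.
catenated = b"".join(gzip_member(r) for r in records)

# A single read() continues across member boundaries.
with gzip.GzipFile(fileobj=io.BytesIO(catenated), mode="rb") as gz:
    decompressed = gz.read()

print(decompressed == b"".join(records))
```

If the comparison holds, the data from all members comes back in order, which is consistent with the stdlib reader not being the component that drops records.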
