Problem description
I'm new to programming and am trying to process a WARC file by splitting it into chunks and storing each chunk in a dictionary.
Each chunk should start with the WARC/1.0 header, and chunks are separated by 3 empty lines. I would also like to remove the first 2 paragraphs:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz
isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin ([email protected])
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
# Keep everything from here onward:
WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372
I've tried using a generator to group the chunks, but it returns a single group (the whole file). Is there a simple way to separate them?
I can't import any libraries.
Any help would be greatly appreciated!
Recommended answer
By far the best way to do this is to use the warcio library, which knows how to properly parse WARC files into records.
Barring that, I would copy the warcio code into yours (the license is permissive).
WARC files are complicated, and using a well-tested, widely used library is the right way to parse them.
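To show where the complexity comes from: a record's body is delimited by its Content-Length header, not by blank lines, so the body can itself contain empty lines or even the text "WARC/1.0". The following is only a minimal sketch using no imports, not a substitute for warcio. It assumes an uncompressed file opened in binary mode, header names spelled exactly as in the sample above, and no folded (multi-line) headers. The function names and the "version" key are made up for illustration.

```python
def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    `stream` is any binary file-like object, e.g. open(path, "rb").
    Hypothetical helper: a sketch, not a full WARC parser.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        # "version" is an illustrative key, not a real WARC header name.
        headers = {"version": line.strip().decode("utf-8")}
        # Header lines run until the first empty line.
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes. It is read by length,
        # never scanned, because it may contain blank lines or "WARC/1.0".
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def load_records(stream):
    """Return {WARC-Record-ID: (headers, payload)}, skipping warcinfo records."""
    records = {}
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Type") == "warcinfo":
            continue  # drops the leading warcinfo header and its body
        records[headers["WARC-Record-ID"]] = (headers, payload)
    return records
```

Calling load_records on a file opened with open(path, "rb") returns a dictionary with one entry per record, keyed by WARC-Record-ID, with the leading warcinfo record left out. A .warc.gz file would need to be decompressed first, for example with the standard library's gzip.open if that import is allowed.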
If you're downloading data from Common Crawl, I would also recommend checking out my Python package cdx_toolkit. It uses warcio under the hood and handles the downloading steps.