How to obtain random access to a gzip-compressed file

Question

I know about Biopython 1.60, which:

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency that the native zlib (zlib.net) offers for random access to the compressed stream. How do I leverage that random access capability in Python?

Recommended answer

I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython to get the virtual_offset of line number 1 million of the bgzipped file. After that, I was able to seek rapidly to that virtual_offset:

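from Bio import bgzf

file = 'large_file.bgz'

# First pass: read past one million lines, then record the virtual offset
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Second pass: jump straight to the saved virtual offset
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

# Both reads should return the same line
assert line1 == line2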
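
The value returned by handle.tell() above is not a plain byte count. A BGZF virtual offset packs two numbers into one 64-bit integer: the file offset where a compressed block starts (upper 48 bits) and the position inside that block's uncompressed data (lower 16 bits). The sketch below illustrates that packing with standard-library Python only. The function names pack_virtual_offset and unpack_virtual_offset are made up for this illustration; Bio.bgzf provides its own make_virtual_offset and split_virtual_offset helpers for the same job.

```python
def pack_virtual_offset(block_start, within_block):
    """Combine a compressed block start and an in-block position (illustrative helper)."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = pack_virtual_offset(100000, 10)
block_start, within_block = unpack_virtual_offset(voffset)
```

Because of this layout, virtual offsets can be stored and passed back to seek(), but adding or subtracting them does not give a meaningful position.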

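I would also point to examples/zran.c in the zlib distribution, which demonstrates building an index for random access into a gzip stream.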
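
The idea behind zran.c (decompress once, remember resumable access points, then restart from the nearest one) can be roughly emulated in pure Python. This is a minimal sketch under stated assumptions, not a port of zran.c: it keeps snapshots of the decompressor made with Decompress.copy() in memory, whereas zran.c saves the 32 KB window and a bit offset so the index can be stored. The names build_index and read_at are hypothetical, the compressed data is assumed to fit in memory as bytes, and only single-member gzip streams are handled.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header and trailer


def build_index(compressed, span=1 << 16, chunk=4096):
    """Decompress once, saving a snapshot roughly every `span` output bytes."""
    d = zlib.decompressobj(GZIP_WBITS)
    # Each entry: (uncompressed_pos, compressed_pos, decompressor snapshot)
    index = [(0, 0, d.copy())]
    out_pos = 0
    last_saved = 0
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
        if d.eof:
            break
        if out_pos - last_saved >= span:
            index.append((out_pos, in_pos + chunk, d.copy()))
            last_saved = out_pos
    return index


def read_at(compressed, index, offset, size, chunk=4096):
    """Return `size` uncompressed bytes at `offset`, resuming from the nearest snapshot."""
    # Pick the last access point at or before the requested offset.
    out_pos, in_pos, snapshot = max(
        (entry for entry in index if entry[0] <= offset),
        key=lambda entry: entry[0],
    )
    d = snapshot.copy()  # never advance the stored snapshot itself
    skip = offset - out_pos
    buf = bytearray()
    while len(buf) < skip + size and in_pos < len(compressed) and not d.eof:
        buf += d.decompress(compressed[in_pos:in_pos + chunk])
        in_pos += chunk
    return bytes(buf[skip:skip + size])


# Small self-check on synthetic data
data = b"".join(b"line %d\n" % i for i in range(200000))
compressed = gzip.compress(data)
index = build_index(compressed)
assert read_at(compressed, index, 1234567, 50) == data[1234567:1234617]
```

A smaller span makes reads faster at the cost of more snapshots held in memory. For an index that survives between runs, the block-gzip approach above or a real zran-style index is the better fit.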
