Problem description
According to the FAQ on zlib.net, it is possible to:
I know about Biopython 1.60, which:
But for my use case I don't want to use that format. Basically, I want something that emulates the code below:
import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz', 'rt') as f:
    f.seek(large_integer_new_line_start)
but with the efficiency that native zlib (zlib.net) offers for random access to the compressed stream. How do I leverage that random access capability in Python?
Recommended answer
I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
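Why a block-compressed file is cheap to seek in can be sketched with the standard library alone. The example below is not the BGZF format that bgzip writes; it is a simplified illustration that stores each group of lines as its own gzip member and keeps an index of member offsets. The helper names `write_blocked_gzip` and `read_line` and the block size of 1000 lines are assumptions made for this sketch.

```python
import gzip
import io


def write_blocked_gzip(lines, block_lines=1000):
    """Compress lines as independent gzip members; return (data, index).

    index[i] is the compressed byte offset of the member that holds
    lines i*block_lines .. (i+1)*block_lines - 1.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(lines), block_lines):
        index.append(out.tell())
        out.write(gzip.compress(b"".join(lines[start:start + block_lines])))
    return out.getvalue(), index


def read_line(data, index, line_no, block_lines=1000):
    """Return one line by decompressing only the member that contains it."""
    block, within = divmod(line_no, block_lines)
    start = index[block]
    end = index[block + 1] if block + 1 < len(index) else len(data)
    member = gzip.decompress(data[start:end])
    return member.splitlines(keepends=True)[within]


lines = [b"line %d\n" % i for i in range(10_000)]
data, index = write_blocked_gzip(lines)

# The concatenated members still form a valid gzip stream as a whole.
assert gzip.decompress(data) == b"".join(lines)

# Only one small member is decompressed to fetch this line.
picked = read_line(data, index, 7_654)
```

The trade-off is a slightly worse compression ratio, because every block starts with an empty dictionary, in exchange for reads that touch only one block.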
Then I used Biopython and its tell() method to get the virtual_offset of line number 1 million in the bgzipped file. After that, I was able to seek to that virtual_offset rapidly:
from Bio import bgzf

path = 'large_file.bgz'

# First pass: read one million lines, then record where the next line starts.
handle = bgzf.BgzfReader(path)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: jump straight to the recorded virtual offset.
handle = bgzf.BgzfReader(path)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2
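The value returned by tell() here is not a plain byte position. A BGZF virtual offset packs two numbers into one 64-bit integer: the byte offset of the compressed block in the file (upper 48 bits) and the position inside that block once decompressed (lower 16 bits). The arithmetic can be sketched as follows; `pack_virtual_offset` and `unpack_virtual_offset` are hypothetical names for this illustration, not Biopython functions.

```python
def pack_virtual_offset(block_start, within_block):
    """Combine a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A block starting at compressed byte 100000, position 10 inside the block.
voffset = pack_virtual_offset(100_000, 10)
block_start, within_block = unpack_virtual_offset(voffset)
```

Because the block start sits in the high bits, virtual offsets sort in the same order as positions in the uncompressed data, but the difference between two of them is not a byte count.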
See also examples/zran.c in the zlib distribution.
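zran.c works by decompressing the stream once and recording access points from which decompression can resume later. A rough in-memory analogue of that idea is possible in Python with `zlib.decompressobj()` and its `copy()` method. This is a sketch under assumptions, not a port of zran.c: it assumes a single-member gzip file that fits in memory, the snapshots live only in memory, and `build_checkpoints`, `read_at`, `CHUNK`, and `span` are names invented for the example.

```python
import gzip
import zlib

CHUNK = 16 * 1024  # compressed bytes fed to the decompressor per step


def build_checkpoints(data, span=1 << 20):
    """Decompress once, snapshotting the decompressor about every `span` output bytes.

    Each checkpoint is (uncompressed_pos, compressed_pos, decompressor_snapshot).
    """
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(data), CHUNK):
        out_pos += len(d.decompress(data[in_pos:in_pos + CHUNK]))
        if out_pos - checkpoints[-1][0] >= span:
            checkpoints.append((out_pos, in_pos + CHUNK, d.copy()))
    return checkpoints


def read_at(data, checkpoints, offset, size):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    # Resume from the latest checkpoint at or before the target offset.
    out_pos, in_pos, snapshot = max(
        (c for c in checkpoints if c[0] <= offset), key=lambda c: c[0])
    d = snapshot.copy()  # copy so the stored snapshot stays reusable
    buf = bytearray()
    while in_pos < len(data) and out_pos + len(buf) < offset + size:
        buf += d.decompress(data[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    skip = offset - out_pos
    return bytes(buf[skip:skip + size])


# Demo on synthetic data: a few megabytes of numbered lines.
raw = b"".join(b"%d\n" % (i * i) for i in range(400_000))
data = gzip.compress(raw)
checkpoints = build_checkpoints(data)

target = len(raw) * 3 // 4
piece = read_at(data, checkpoints, target, 50)
```

One full pass over the file is still needed to build the checkpoints; the gain is that later reads decompress at most about one `span` of data instead of everything before the target.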