Problem description
Is there a compression tool that will let you output its dictionary (or similar) separate from the compressed output such that the dictionary can be re-used on a subsequent compression? The idea would be to transfer the dictionary one time, or use a reference dictionary at a remote site, and make a compressed file even smaller to transfer.
I've looked at the docs for a bunch of common compression tools, and I can't really find one that supports this. But most common compression tools aren't straight dictionary compression.
The usage I'm imagining is something like:
compress_tool --dictionary compressed.dict -o compressed.data uncompressed
decompress_tool --dictionary compressed.dict -o uncompressed compressed.data
To expand on my use case: I have a 500 MB binary file F that I want to copy over a slow network. Compressing the file alone yields a size of 200 MB, which is still larger than I'd like. However, both my source and destination have a file F' which is very similar to F, but sufficiently different that binary diff tools don't work well. I was thinking that if I compressed F' at both sites and then re-used information about that compression to compress F at the source, I could possibly eliminate some information from the transfer that could be rebuilt at the destination using F'.
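To make the mechanics concrete, here is a rough sketch of what I have in mind, using Python's zlib module and its zdict preset-dictionary parameter. This is only an illustration, not an existing tool; the file names and the naive choice of "just use the tail of F'" as the dictionary are placeholders.

import zlib

WINDOW = 32 * 1024  # deflate can only make use of roughly 32 KiB of preset dictionary

def shared_dictionary(path):
    # Naive shared dictionary: the last ~32 KiB of the reference file F'
    with open(path, "rb") as fh:
        return fh.read()[-WINDOW:]

def compress_with_dict(src, dst, ref):
    comp = zlib.compressobj(level=9, zdict=shared_dictionary(ref))
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(1 << 20), b""):
            fout.write(comp.compress(chunk))
        fout.write(comp.flush())

def decompress_with_dict(src, dst, ref):
    decomp = zlib.decompressobj(zdict=shared_dictionary(ref))
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(1 << 20), b""):
            fout.write(decomp.decompress(chunk))
        fout.write(decomp.flush())

# source site:      compress_with_dict("F", "F.z", "F_prime")
# destination site: decompress_with_dict("F.z", "F", "F_prime")

(F' would of course have to be byte-for-byte identical on both sides, since the decompressor checks that the dictionary matches the one used by the compressor.)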
Recommended answer
Preset dictionaries aren't really useful for files that size. They're great for small data (think compressing fields in a database, RPC queries/responses, snippets of XML or JSON, etc.), but for files as large as yours the algorithm builds up its own dictionary very quickly.
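To give a feel for that small-data sweet spot, here is a toy illustration using Python's zlib and its zdict parameter; the JSON-ish strings are invented for the example.

import zlib

# A tiny "preset dictionary" of substrings that typical records share
zdict = b'{"user_id": , "status": "active", "status": "inactive", "last_login": "20'
record = b'{"user_id": 4821, "status": "active", "last_login": "2015-06-01"}'

plain = zlib.compress(record, 9)

comp = zlib.compressobj(level=9, zdict=zdict)
with_dict = comp.compress(record) + comp.flush()

print(len(record), len(plain), len(with_dict))
# For a record this small the preset dictionary should give noticeably
# smaller output; on a 500 MB file the same trick is lost in the noise,
# because deflate has long since accumulated equivalent context on its own.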
That said, it just so happens that I was playing with preset dictionaries in Squash fairly recently, and I do have some code which does pretty much what you're talking about for the zlib plugin. I'm not going to push it to master (I have a different API in mind if I decide to support preset dictionaries), but I've just pushed it to the 'deflate-dictionary-file' branch if you want to take a look. To compress, do something like
squash -ko dictionary-file=foo.dict -c zlib:deflate uncompressed compressed.deflate
and to decompress,
squash -dko dictionary-file=foo.dict -c zlib:deflate compressed.deflate decompressed
AFAIK there is nothing in zlib which supports building a dictionary--you have to do that yourself. The zlib documentation (for deflateSetDictionary) describes the expected "format": the dictionary should consist of byte sequences that are likely to occur in the data, with the most commonly used ones placed toward the end of the dictionary.
For testing I was using something like this (YMMV):
cat input | tr ' ' '\n' | sort | uniq -c | awk '{printf "%06d %s\n",$1,$2}' | sort | cut -b8- | tail -c32768
That turns each space into a newline so the input becomes one token per line, counts how often each token occurs, zero-pads the counts so a plain sort orders tokens from least to most frequent, strips the counts again, and keeps the last 32768 bytes--the most common tokens--since that is as much preset dictionary as zlib's 32 KiB window can use.