This post discusses re-using a compression dictionary across compressions, so that the dictionary only needs to be transferred once.

Problem Description

Is there a compression tool that will let you output its dictionary (or similar) separate from the compressed output such that the dictionary can be re-used on a subsequent compression? The idea would be to transfer the dictionary one time, or use a reference dictionary at a remote site, and make a compressed file even smaller to transfer.

I've looked at the docs for a bunch of common compression tools, and I can't really find one that supports this. But most common compression tools aren't straight dictionary compression.

The usage I have in mind would be something like:

compress_tool --dictionary compressed.dict -o compressed.data uncompressed
decompress_tool --dictionary compressed.dict -o uncompressed compressed.data

To expand on my use case, I have a binary 500MB file F I want to copy over a slow network. Compressing the file alone yields a size of 200MB, which is still larger than I'd like. However, both my source and destination have a file F' which is very similar to F, but sufficiently different that binary diff tools don't work well. I was thinking that if I compress F' on both sites and then re-use information about that compression to compress F on the source, I could possibly eliminate some information from the transfer that could be rebuilt on the destination using F'.

Recommended Answer

Preset dictionaries aren't really useful for files that size. They're great for small data (think compressing fields in a database, RPC queries/responses, snippets of XML or JSON, etc.), but for larger files like you have the algorithm builds up its own dictionary very quickly.
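
To make "preset dictionary" concrete, here is a minimal sketch (not the Squash branch mentioned below) using Python's zlib bindings; the file names f_prime.bin and f.bin are placeholders for the F' and F from the question:

import zlib

# F' (already present on both sides) provides the preset dictionary; F is the payload.
reference = open("f_prime.bin", "rb").read()  # placeholder for F'
data = open("f.bin", "rb").read()             # placeholder for F

# deflate's window is 32 KiB, so only the last 32 KiB of the dictionary are used.
dictionary = reference[-32768:]

# Compress with the preset dictionary on the source...
comp = zlib.compressobj(level=9, zdict=dictionary)
compressed = comp.compress(data) + comp.flush()

# ...and decompress with the same dictionary on the destination.
decomp = zlib.decompressobj(zdict=dictionary)
restored = decomp.decompress(compressed) + decomp.flush()
assert restored == data

As noted above, for a 500 MB file this usually buys very little: the dictionary can only influence roughly the first 32 KiB of the stream before the compressor's own history takes over.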

That said, it just so happens that I was playing with preset dictionaries in Squash fairly recently, and I do have some code which does pretty much what you're talking about for the zlib plugin. I'm not going to push it to master (I have a different API in mind if I decide to support preset dictionaries), but I've just pushed it to the 'deflate-dictionary-file' branch if you want to take a look. To compress, do something like

squash -ko dictionary-file=foo.dict -c zlib:deflate uncompressed compressed.deflate

and to decompress,

squash -dko dictionary-file=foo.dict -c zlib:deflate compressed.deflate decompressed

AFAIK there is nothing in zlib which supports building a dictionary--you have to do that yourself. The zlib documentation describes the "format": essentially just byte strings that are likely to occur in the data to be compressed, with the most commonly used strings placed toward the end (only the last 32 KiB matter, since that is deflate's window size).

For testing I was using something like this (YMMV):

cat input | tr ' ' '\n' | sort | uniq -c | awk '{printf "%06d %s\n",$1,$2}' | sort | cut -b8- | tail -c32768
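
(That pipeline splits the input into space-separated words, counts how often each word occurs, sorts them so the most frequent words end up last--matching zlib's preference for common strings at the end of the dictionary--strips the count prefix, and keeps only the final 32 KiB, which is all deflate's window can use.)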
