Problem description
We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and for reference, here are the problems which I won't repeat:
My question is: is BZip2 the best archival compression that will allow a single archive file to be processed in parallel by Hadoop? Gzip is definitely not, and from my reading LZO has some problems.

Solution
BZIP2 is splittable in hadoop - it provides a very good compression ratio, but it is not optimal in terms of CPU time and performance, since compression is very CPU intensive.
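The reason BZip2 can be split is that it compresses data in independent blocks, so readers can start at a block boundary. As a rough sketch of the principle (using Python's standard `bz2` module rather than Hadoop itself, and concatenated streams as a stand-in for independent blocks):

```python
import bz2

# Two chunks compressed independently, then concatenated into one
# "archive" - analogous to the independent blocks inside a bz2 file.
part1 = bz2.compress(b"first split\n")
part2 = bz2.compress(b"second split\n")
archive = part1 + part2

# The concatenation still decompresses as one logical file...
assert bz2.decompress(archive) == b"first split\nsecond split\n"

# ...while each stream can also be decompressed on its own, which is
# what makes parallel processing of a single archive possible.
assert bz2.decompress(part2) == b"second split\n"
```

Gzip, by contrast, is a single DEFLATE stream with back-references, so a reader cannot start in the middle, which is why it is not splittable.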
LZO is splittable in hadoop - leveraging hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel; the library provides all the means of generating these indexes, in either a local or distributed manner.
LZ4 is splittable in hadoop - leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside hadoop. 4mc makes LZ4 available on hadoop at any level of the speed/compression-ratio trade-off: from fast mode, reaching 500 MB/s compression speed, up to high/ultra modes providing an increased compression ratio, almost comparable to GZip's.
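For either of the last two options the codec classes have to be registered with Hadoop, typically via `io.compression.codecs` in `core-site.xml`. A hedged sketch follows; the class names are taken from the hadoop-lzo and hadoop-4mc projects and should be verified against the versions you actually deploy:

```xml
<!-- core-site.xml: register splittable codecs alongside the built-in
     BZip2Codec. LzoCodec/LzopCodec come from hadoop-lzo; FourMcCodec is
     assumed from hadoop-4mc - check its docs for the exact class name. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,com.hadoop.compression.fourmc.FourMcCodec</value>
</property>
```

With the codecs registered, Hadoop picks the right one by file extension (.bz2, .lzo, .4mc) when reading job input.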
This concludes this article on "Best splittable compression for Hadoop input = bz2?". We hope the recommended answers are helpful, and thank you for your support!