Problem Description
I am new to Hadoop and trying to process the Wikipedia dump. It is a 6.7 GB gzip-compressed XML file. I have read that Hadoop supports gzip-compressed files, but such a file can only be handled by a single mapper in a job, since only one mapper can decompress it. This seems to put a limit on the processing. Is there an alternative? For example, decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip.
I read about this at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.
Recommended Answer
A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.
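For context, whether Hadoop splits a compressed input comes down to whether the codec registered for the file's extension implements the SplittableCompressionCodec interface: in recent Hadoop versions GzipCodec does not, while BZip2Codec does. Below is a minimal sketch of that check, assuming the standard org.apache.hadoop.io.compress classes are on the classpath; the file names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Look up the codec by file extension, the same way TextInputFormat does,
        // then test whether it supports splitting.
        for (String name : new String[] {"dump.xml.gz", "dump.xml.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name
                    + " -> codec=" + (codec == null ? "none" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}
```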
There are at least three ways around that limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
- As a preprocessing step: uncompress the file, split it into smaller sets, and recompress them. (See this; a rough sketch of this approach follows the list.)
- Use this patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip
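The second option can be done with a small standalone program outside Hadoop. The sketch below is not part of the original answer: it streams the dump once, cuts it at `</page>` boundaries, and gzips each chunk separately. The input path and the pages-per-chunk value are arbitrary assumptions, and the resulting chunks are XML fragments rather than complete documents, so they assume a record reader keyed on `<page>`…`</page>` (e.g. Mahout's XmlInputFormat).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Rough sketch: split a large gzip-compressed Wikipedia XML dump into
 * smaller gzip chunks, cutting only at </page> boundaries.
 */
public class SplitWikiDump {
    public static void main(String[] args) throws IOException {
        String input = "enwiki-dump.xml.gz";   // assumed input path
        int pagesPerChunk = 50_000;            // arbitrary chunk size

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(input)), StandardCharsets.UTF_8))) {

            int chunk = 0;
            int pagesInChunk = 0;
            BufferedWriter out = openChunk(chunk);
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                // Start a new chunk once enough pages have been written.
                if (line.trim().equals("</page>") && ++pagesInChunk >= pagesPerChunk) {
                    out.close();
                    out = openChunk(++chunk);
                    pagesInChunk = 0;
                }
            }
            out.close();
        }
    }

    private static BufferedWriter openChunk(int n) throws IOException {
        String name = String.format("chunk-%04d.xml.gz", n);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
    }
}
```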
HTH
This concludes this article on Hadoop gzip-compressed files. We hope the recommended answer is helpful, and thank you for your support!