Problem Description
I am new to Hadoop and trying to process the Wikipedia dump. It is a 6.7 GB gzip-compressed XML file. I have read that Hadoop supports gzip-compressed files, but such a file can only be handled by a single mapper in a job, since only one mapper can decompress it. This seems to put a limit on the processing. Is there an alternative? For example, decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip.
I read about this at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.
Recommended Answer
A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.
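For context, whether Hadoop splits a compressed input comes down to whether the codec registered for the file's extension implements the SplittableCompressionCodec interface: in recent Hadoop versions GzipCodec does not, while BZip2Codec does. Below is a minimal sketch of that check, assuming the standard org.apache.hadoop.io.compress classes are on the classpath; the file names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Look up the codec by file extension, the same way TextInputFormat does,
        // then test whether it supports splitting.
        for (String name : new String[] {"dump.xml.gz", "dump.xml.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name
                    + " -> codec=" + (codec == null ? "none" : codec.getClass().getSimpleName())
                    + ", splittable=" + splittable);
        }
    }
}
```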
There are at least three ways around that limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
- As a preprocessing step: uncompress the file, split it into smaller sets, and recompress them. (See this; a rough sketch of this approach follows the list.)
- Use this patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip
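The second option can be done with a small standalone program outside Hadoop. The sketch below is not part of the original answer: it streams the dump once, cuts it at `</page>` boundaries, and gzips each chunk separately. The input path and the pages-per-chunk value are arbitrary assumptions, and the resulting chunks are XML fragments rather than complete documents, so they assume a record reader keyed on `<page>`…`</page>` (e.g. Mahout's XmlInputFormat).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Rough sketch: split a large gzip-compressed Wikipedia XML dump into
 * smaller gzip chunks, cutting only at </page> boundaries.
 */
public class SplitWikiDump {
    public static void main(String[] args) throws IOException {
        String input = "enwiki-dump.xml.gz";   // assumed input path
        int pagesPerChunk = 50_000;            // arbitrary chunk size

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(input)), StandardCharsets.UTF_8))) {

            int chunk = 0;
            int pagesInChunk = 0;
            BufferedWriter out = openChunk(chunk);
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                // Start a new chunk once enough pages have been written.
                if (line.trim().equals("</page>") && ++pagesInChunk >= pagesPerChunk) {
                    out.close();
                    out = openChunk(++chunk);
                    pagesInChunk = 0;
                }
            }
            out.close();
        }
    }

    private static BufferedWriter openChunk(int n) throws IOException {
        String name = String.format("chunk-%04d.xml.gz", n);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
    }
}
```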
HTH
This concludes this article on Hadoop gzip-compressed files. We hope the recommended answer is helpful, and thank you for your support!