Providing several non-text files to a single map in Hadoop MapReduce

Problem description

I'm currently writing a distributed application that parses PDF files with the help of Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (mostly ranging from 100KB to ~2MB), and the output is a set of parsed text files.

For testing purposes, I initially used the WholeFileInputFormat provided in Tom White's book Hadoop: The Definitive Guide, which feeds a single file to a single map. This worked fine with a small number of input files, but it does not scale to thousands of files: launching a separate map for a task that takes around a second to complete is inefficient.

So, what I want to do is submit several PDF files to one map (for example, by combining several files into a single chunk of roughly the HDFS block size, ~64MB). I found that CombineFileInputFormat is useful for my case. However, I cannot work out how to extend that abstract class so that I can process each file and its filename as a single key-value record.

Any help is appreciated. Thanks!

Recommended answer

I think a SequenceFile will suit your needs here: http://wiki.apache.org/hadoop/SequenceFile

Essentially, you put all your PDFs into a sequence file, and each mapper will receive as many PDFs as fit into one HDFS block of that sequence file. When you create the sequence file, you set the key to the PDF filename and the value to the binary contents of the PDF.
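To make the key/value layout concrete, here is a minimal sketch that uses only the JDK. It is not the real SequenceFile format and does not touch HDFS. With Hadoop itself, the container would typically be written with `SequenceFile.Writer`, using a `Text` key and a `BytesWritable` value. The class name `PdfContainer` and its `pack` and `unpack` methods are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: packs many small files into one key/value container. */
class PdfContainer {

    /** Writes each entry as: file name (UTF), content length (int), raw bytes. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());          // key: the file name
                out.writeInt(entry.getValue().length); // length of the value
                out.write(entry.getValue());           // value: the file's bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the records back in write order, one (name, bytes) pair per file. */
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

Calling `PdfContainer.unpack(PdfContainer.pack(files))` returns the same filename-to-bytes pairs, which mirrors what a mapper sees when it iterates over the records of a sequence file: one key (the filename) and one value (the PDF bytes) per call.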
