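I am facing the following situation and would appreciate some help. I am using Hadoop MapReduce to process XML files.
I used the approach from this gist to read my records: https://gist.github.com/sritchie/808035
However, when the XML file is larger than the block size, I don't get the proper values,
so I need to read the entire file.
For that, I found this link:
https://github.com/pyongjoo/MapReduce-Example/blob/master/mysrc/XmlInputFormat.java
But now the problem is how to combine the two input formats into a single input format.
Please help me as soon as possible.
Thanks
Update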
public class XmlParser11
{
    public static class XmlInputFormat1 extends TextInputFormat {
        public static final String START_TAG_KEY = "xmlinput.start";
        public static final String END_TAG_KEY = "xmlinput.end";

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
            return new XmlRecordReader();
        }

        /**
         * XMLRecordReader class to read through a given xml document to output
         * xml blocks as records as specified by the start tag and end tag
         *
         */
        public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
            private byte[] startTag;
            private byte[] endTag;
            private long start;
            private long end;
            private FSDataInputStream fsin;
            private DataOutputBuffer buffer = new DataOutputBuffer();
            private LongWritable key = new LongWritable();
            private Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                Configuration conf = context.getConfiguration();
                startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
                endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
                FileSplit fileSplit = (FileSplit) split;
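but it does not work.
Best answer
Override the isSplitable method to specify that a file must not be split, even if it is larger than the block size. This is typically used when you want to make sure that a single mapper processes a large file from start to end.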
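import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class XmlInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one mapper reads the whole file
    }
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        // return your version of XML record reader,
        // e.g. the XmlRecordReader class from the question
        return new XmlRecordReader();
    }
}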
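The record reader left as a placeholder above mostly comes down to scanning the byte stream for the start and end tags. Below is a minimal sketch of that scanning step, written against plain java.io so it can be tried without a Hadoop cluster. The class name XmlRecordScanner and its methods are made up for illustration and are not taken from either linked project. The simple matcher assumes that each tag starts with '<' and contains no other '<', which holds for ordinary XML tags.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts startTag...endTag blocks from a byte stream.
public class XmlRecordScanner {
    private final byte[] startTag;
    private final byte[] endTag;

    public XmlRecordScanner(String startTag, String endTag) {
        this.startTag = startTag.getBytes(StandardCharsets.UTF_8);
        this.endTag = endTag.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns every complete startTag...endTag block in the stream, in order. */
    public List<String> readRecords(InputStream in) throws IOException {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Skip bytes until a start tag is seen, then copy bytes until the end tag.
        while (readUntilMatch(in, startTag, null)) {
            buffer.write(startTag);
            if (!readUntilMatch(in, endTag, buffer)) {
                break; // input ended inside a record: drop the incomplete record
            }
            records.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            buffer.reset();
        }
        return records;
    }

    /**
     * Reads byte by byte until 'match' has been seen; returns false at end of input.
     * When 'sink' is non-null, every byte read is also copied into it.
     */
    private static boolean readUntilMatch(InputStream in, byte[] match,
                                          ByteArrayOutputStream sink) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) {
                return false;
            }
            if (sink != null) {
                sink.write(b);
            }
            if (b == match[i]) {
                i++;
                if (i >= match.length) {
                    return true;
                }
            } else {
                // Restart the match; the current byte may itself begin the tag.
                i = (b == match[0]) ? 1 : 0;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root><rec>a</rec>junk<rec>b</rec></root>";
        XmlRecordScanner scanner = new XmlRecordScanner("<rec>", "</rec>");
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String record : scanner.readRecords(in)) {
            System.out.println(record);
        }
    }
}
```

In an actual RecordReader the same loop would read from the FSDataInputStream opened in initialize() and return one record per nextKeyValue() call. Because isSplitable returns false, the reader can start at offset 0 and run to the end of the file without worrying about records that straddle a split boundary.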
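Alternatively, you can increase the split size so that a larger file still fits into a single split:
// Set the maximum split size
// (setMaxSplitSize is a protected method of CombineFileInputFormat, so call it from a subclass of that class)
setMaxSplitSize(MAX_INPUT_SPLIT_SIZE);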
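For a plain FileInputFormat in the new (org.apache.hadoop.mapreduce) API, the split size is derived from the block size together with a minimum and a maximum, which can be set per job through FileInputFormat.setMinInputSplitSize(job, size) and FileInputFormat.setMaxInputSplitSize(job, size). The sketch below reproduces only the arithmetic, so it runs without Hadoop. The class name SplitSizeDemo and the sample sizes are illustrative assumptions. It suggests that raising the maximum alone does not produce splits larger than a block, while raising the minimum does.

```java
// Hypothetical demo of how FileInputFormat picks a split size.
public class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    /** Same formula as FileInputFormat.computeSplitSize: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumed HDFS block size for this example

        // Defaults: the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / MB + " MB");

        // A larger maximum changes nothing, because the block size still caps the split.
        System.out.println(computeSplitSize(blockSize, 1L, 512 * MB) / MB + " MB");

        // A larger minimum forces splits bigger than one block.
        System.out.println(computeSplitSize(blockSize, 512 * MB, Long.MAX_VALUE) / MB + " MB");
    }
}
```

Even with a large minimum split size, a file bigger than that value is still split, so overriding isSplitable remains the more direct way to guarantee one mapper per file.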
Regarding "java - Multiple input formats in Hadoop as a single format", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20213974/