Problem description
I went through the question "How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?". Though it explains the concept, I have not been able to turn it into working code.
Basically, I want the file name as the key and the file data as the value. For that I wrote a custom RecordReader, as recommended in the aforementioned question, but I couldn't work out how to get the file name as the key in this class. Also, while writing the custom FileInputFormat class, I couldn't work out how to return the custom RecordReader I had written.
The RecordReader code is:
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordReader extends RecordReader<Text, Text> {

    private static final String LINE_SEPARATOR = System.getProperty("line.separator");

    private StringBuffer valueBuffer = new StringBuffer("");
    private Text key = new Text();
    private Text value = new Text();
    private RecordReader<Text, Text> recordReader;

    public SPDRecordReader(RecordReader<Text, Text> recordReader) {
        this.recordReader = recordReader;
    }

    @Override
    public void close() throws IOException {
        recordReader.close();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return recordReader.getProgress();
    }

    @Override
    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {
        recordReader.initialize(arg0, arg1);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (valueBuffer.equals("")) {
            while (recordReader.nextKeyValue()) {
                valueBuffer.append(recordReader.getCurrentValue());
                valueBuffer.append(LINE_SEPARATOR);
            }
            value.set(valueBuffer.toString());
            return true;
        }
        return false;
    }
}
And the incomplete FileInputFormat class is:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit arg0, JobConf arg1,
            Reporter arg2) throws IOException {
        return null;
    }
}
Recommended answer
Have this code in your CustomRecordReader class:
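// Old (org.apache.hadoop.mapred) API: the class implements RecordReader<Text, Text>.
private LineRecordReader lineReader;
private LongWritable lineOffset; // LineRecordReader keys are byte offsets, not Text
private String fileName;

public CustomRecordReader(JobConf job, FileSplit split) throws IOException {
    lineReader = new LineRecordReader(job, split);
    lineOffset = lineReader.createKey();
    fileName = split.getPath().getName();
}

public boolean next(Text key, Text value) throws IOException {
    // Read the next line into value; the offset key is discarded
    if (!lineReader.next(lineOffset, value)) {
        return false;
    }
    // Emit the file name as the key for every line
    key.set(fileName);
    return true;
}

public Text createKey() {
    return new Text("");
}

public Text createValue() {
    return new Text("");
}

// Remaining interface methods simply delegate to the wrapped reader
public long getPos() throws IOException { return lineReader.getPos(); }
public float getProgress() throws IOException { return lineReader.getProgress(); }
public void close() throws IOException { lineReader.close(); }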
Remove the SPDRecordReader constructor (it is an error).
And have this code in your CustomFileInputFormat class:
public RecordReader<Text, Text> getRecordReader(
        InputSplit input, JobConf job, Reporter reporter)
        throws IOException {
    reporter.setStatus(input.toString());
    return new CustomRecordReader(job, (FileSplit) input);
}
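Note that a line-based reader keyed this way produces one record per line, each carrying the file name as its key. If the goal is a single record per file (file name as key, entire contents as value), as the question describes, the reader can return exactly one record and then report that it is done. The sketch below illustrates that one-shot contract in plain Java, with no Hadoop classes, so it can be run on its own. The class `WholeFileReaderSketch` and everything inside it are hypothetical names introduced for illustration; a real Hadoop reader would presumably open the split's path through Hadoop's FileSystem API rather than java.nio. The sketch also shows why the `valueBuffer.equals("")` check in the question's nextKeyValue() can never be true.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, Hadoop-free stand-in for a RecordReader that emits a single
// (file name, whole file contents) record per file.
class WholeFileReaderSketch {
    private final Path file;
    private String key;
    private String value;
    // Explicit flag instead of testing a buffer for emptiness
    private boolean processed = false;

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Mirrors the nextKeyValue() contract: true exactly once, then false.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        key = file.getFileName().toString();
        value = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        processed = true;
        return true;
    }

    String getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    // StringBuffer does not override equals(), so comparing it with a String
    // falls back to Object identity and is always false.
    static boolean bufferEqualsEmptyString() {
        return new StringBuffer("").equals("");
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> "
                    + reader.getCurrentValue().length() + " chars");
        }
        Files.delete(tmp);
    }
}
```

In a buffer-based variant, checking `valueBuffer.length() == 0` would be the working equivalent of the emptiness test, though a boolean flag also handles empty files correctly.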