The previous demo was too boring, so I decided to rework it.

1. Input format. In the previous program, StatMapper was mysteriously handed a pile of key/value pairs; that had to be some default input format at work, and a little digging found it: org.apache.hadoop.mapred.InputFormatBase, which implements the InputFormat interface. The interface declares:

    FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits) throws IOException;

So it seems all input and output must be stored as files, much like in Lucene. Since input data is usually delimited by lines, this InputFormatBase should be fine for the common case. (Those "mysterious" keys, by the way, are just each line's byte offset within the file, and the value is the line itself; a small standalone sketch at the end of this post makes that concrete.)

2. Input data. This thing is built precisely for processing huge amounts of data efficiently, which made me think of the iHome ActionLog... several hundred MB in total, so it should qualify. The task: count how many times each command was invoked over the last few days.

3. Modified program.

StatMapper.java:

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        // Each log line is space-delimited; field 6 is the user id, field 7 the action.
        String[] token = value.toString().split(" ");
        String id = token[6];   // parsed but not used in this statistic
        String act = token[7];
        // Emit (action, 1); the framework groups these by action for the reducer.
        output.collect(new UTF8(act), new LongWritable(1));
    }

StatReducer.java:

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        // Sum up the 1s emitted by the mappers for this action.
        long sum = 0;
        while (values.hasNext()) {
            sum += ((LongWritable) values.next()).get();
        }
        System.out.println("Action: " + key + ", Count: " + sum);
        output.collect(key, new LongWritable(sum));
    }

4. Running. This time the log reads clearly:

    ...
    060328 162626 E:\workground\opensource\hadoop-nightly\tmp\input\action_log.txt.2006-03-21:0+21357898
    060328 162626 map 8% reduce 0%
    060328 162627 E:\workground\opensource\hadoop-nightly\tmp\input\action_log.txt.2006-03-21:0+21357898
    060328 162627 map 22% reduce 0%
    060328 162628 E:\workground\opensource\hadoop-nightly\tmp\input\action_log.txt.2006-03-21:0+21357898
    060328 162628 map 37% reduce 0%
    060328 162629 E:\workground\opensource\hadoop-nightly\tmp\input\action_log.txt.2006-03-21:0+21357898
    060328 162629 map 52% reduce 0%
    060328 162630 E:\workground\opensource\hadoop-nightly\tmp\input\action_log.txt.2006-03-21:0+21357898
    060328 162631 E:\workground\opensou
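A footnote to point 1: to make those "mysterious" keys and values concrete, here is a tiny standalone sketch (plain Java, no Hadoop classes) of what a line-based record reader conceptually produces. The class name LineRecordDemo is mine, purely for illustration, and the offset arithmetic is simplified:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LineRecordDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            long offset = 0;   // byte position where the current line starts
            String line;
            while ((line = in.readLine()) != null) {
                // Each (offset, line) pair is roughly what map() receives as (key, value).
                System.out.println("(" + offset + ", \"" + line + "\")");
                offset += line.getBytes().length + 1;  // +1 for '\n'; ignores '\r' for brevity
            }
            in.close();
        }
    }

Run it over one of the action logs and each printed pair corresponds to one map() call.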
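And for completeness, the job needs a driver to wire the mapper and reducer together. I haven't shown mine above, so the following is only a reconstruction against the later stable "old" mapred API (JobConf/JobClient); the 2006 nightly I actually ran had slightly different method names, and StatJob plus the input/output paths are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.mapred.*;

    public class StatJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(StatJob.class);
            conf.setJobName("action-count");

            conf.setMapperClass(StatMapper.class);     // emits (action, 1)
            conf.setReducerClass(StatReducer.class);   // sums the 1s per action

            conf.setOutputKeyClass(UTF8.class);        // UTF8 was later superseded by Text
            conf.setOutputValueClass(LongWritable.class);

            // Line-based input by default: key = byte offset, value = log line.
            FileInputFormat.setInputPaths(conf, new Path("tmp/input"));
            FileOutputFormat.setOutputPath(conf, new Path("tmp/output"));

            JobClient.runJob(conf);  // blocks until the job finishes
        }
    }

JobClient.runJob is what prints the "map x% reduce y%" progress lines shown in the log above.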