How Hadoop Implements a Join

People choose Hadoop mainly for its low cost and high scalability, but its development efficiency is far from satisfactory. Take a join as an example. Suppose HDFS holds two files, customer information and order information, and customerID is the field that relates them. How do we join them so that the customer name is added to the order list?

The usual approach: feed both source files in as input. In the map phase, handle each record according to the name of the file it came from: if it is an order, tag its foreign key with "O" to form a combined key; if it is a customer, tag it with "C". The map output is partitioned by customerID, sorted by the combined key, and grouped by customerID. Finally, the reducer merges the results and writes them out. Implementation:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class J {

    // Mark every row with "O" or "C" according to the file name.
    public static class JMapper extends Mapper<LongWritable, Text, TextPair, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] values = value.toString().split("\t");
            if (pathName.contains("order.txt")) { // identify orders by file name
                TextPair tp = new TextPair(new Text(values[1]), new Text("O")); // mark with "O"
                context.write(tp, new Text(values[0] + "\t" + values[2]));
            } else if (pathName.contains("customer.txt")) { // identify customers by file name
                TextPair tp = new TextPair(new Text(values[0]), new Text("C")); // mark with "C"
                context.write(tp, new Text(values[1]));
            }
        }
    }

    // Partition by the first part of the key, i.e. customerID.
    public static class JPartitioner extends Partitioner<TextPair, Text> {
        @Override
        public int getPartition(TextPair key, Text value, int numPartitions) {
            // Mask the sign bit instead of Math.abs, which stays negative for Integer.MIN_VALUE.
            return ((key.getFirst().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group by customerID only, ignoring the "C"/"O" tag.
    public static class JComparator extends WritableComparator {
        public JComparator() {
            super(TextPair.class, true);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            TextPair t1 = (TextPair) a;
            TextPair t2 = (TextPair) b;
            return t1.getFirst().compareTo(t2.getFirst());
        }
    }

    // Merge and output.
    public static class JReduce extends Reducer<TextPair, Text, Text, Text> {
        @Override
        protected void reduce(TextPair key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Text pid = key.getFirst();
            Iterator<Text> it = values.iterator();
            // "C" sorts before "O", so the first value is expected to be the customer name.
            String desc = it.next().toString();
            while (it.hasNext()) {
                context.write(pid, new Text(it.next().toString() + "\t" + desc));
            }
        }
    }

    // Composite key: (customerID, tag).
    public static class TextPair implements WritableComparable<TextPair> {
        private Text first;
        private Text second;

        public TextPair() {
            set(new Text(), new Text());
        }

        public TextPair(String first, String second) {
            set(new Text(first), new Text(second));
        }

        public TextPair(Text first, Text second) {
            set(first, second);
        }

        public void set(Text first, Text second) {
            this.first = first;
            this.second = second;
        }

        public Text getFirst() {
            return first;
        }

        public Text getSecond() {
            return second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }

        @Override
        public int compareTo(TextPair tp) {
            int cmp = first.compareTo(tp.first);
            if (cmp != 0) {
                return cmp;
            }
            return second.compareTo(tp.second);
        }
    }

    // Job entry point.
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        String[] otherArgs = parser.getRemainingArgs();
        if (otherArgs.length < 3) {
            System.err.println("Usage: J <input1> <input2> <output>");
            System.exit(2);
        }
        Job job = new Job(conf, "J");
        job.setJarByClass(J.class);                           // join class
        job.setMapperClass(JMapper.class);                    // map class
        job.setMapOutputKeyClass(TextPair.class);             // map output key class
        job.setMapOutputValueClass(Text.class);               // map output value class
        job.setPartitionerClass(JPartitioner.class);          // partitioner class
        job.setGroupingComparatorClass(JComparator.class);    // grouping after partitioning
        job.setReducerClass(JReduce.class);                   // reduce class
        job.setOutputKeyClass(Text.class);                    // reduce output key class
        job.setOutputValueClass(Text.class);                  // reduce output value class
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // one of the source files
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));   // the other source file
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);     // run until the job ends
    }
}

You cannot work with the raw data directly. Instead, you have to write a pile of code to handle the tags, work around MapReduce's native architecture, and design and compute the relationship between the data sets from the ground up. And this is the simplest kind of join: for multi-table joins or joins with more complex logic in MapReduce, the complexity grows geometrically.

Reposted from: http://hi.baidu.com/rwvzjwhehncntye/item/da8cdcf335e40b2dfe3582db

I also stumbled on another article on the same topic. I can't tell whether it is an advertorial, but reading never hurts: http://blog.sina.com.cn/s/blog_e4de31d00101efat.html
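To see why the "C"/"O" tag matters, the tag, sort, and group steps described above can be imitated in plain Java without a cluster. This is only a minimal sketch, not the Hadoop job itself: the class name ReduceSideJoinSketch, the join method, and the sample rows are hypothetical, and the column layout (customerID, name for customers; orderID, customerID, amount for orders) is an assumption inferred from the field indexes the mapper uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, cluster-free imitation of the tag / sort / group steps.
class ReduceSideJoinSketch {

    // customers: "customerID\tname" (assumed layout)
    // orders:    "orderID\tcustomerID\tamount" (assumed layout)
    static List<String> join(List<String> customers, List<String> orders) {
        // "Map": turn every row into {customerID, tag, payload}.
        List<String[]> tagged = new ArrayList<>();
        for (String line : customers) {
            String[] f = line.split("\t");
            tagged.add(new String[] {f[0], "C", f[1]});
        }
        for (String line : orders) {
            String[] f = line.split("\t");
            tagged.add(new String[] {f[1], "O", f[0] + "\t" + f[2]});
        }

        // "Shuffle": sort by the combined key (customerID, tag).
        // "C" sorts before "O", so the customer row leads each group.
        tagged.sort(Comparator.comparing((String[] r) -> r[0])
                              .thenComparing(r -> r[1]));

        // "Reduce": walk each customerID group and attach the name to its orders.
        List<String> out = new ArrayList<>();
        String currentId = null;
        String currentName = null;
        for (String[] r : tagged) {
            if (!r[0].equals(currentId)) { // a new group starts
                currentId = r[0];
                currentName = null;
            }
            if (r[1].equals("C")) {
                currentName = r[2];
            } else if (currentName != null) { // orders without a customer are skipped
                out.add(currentId + "\t" + r[2] + "\t" + currentName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = List.of("1\tAlice", "2\tBob");
        List<String> orders = List.of("100\t2\t9.5", "101\t1\t20", "102\t1\t5");
        join(customers, orders).forEach(System.out::println);
    }
}
```

One difference from the reducer above: this sketch skips orders whose customerID has no customer row, whereas the original reducer would treat the first order in such a group as if it were the customer name.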

jiongtoast
