Getting Started with Hadoop: A First Look at Map-Reduce

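Hadoop is a framework for building distributed systems. Getting several machines to compute in parallel normally means handling many details yourself, such as distributing tasks across nodes, tracking their progress, and recovering after a failure. Hadoop lets those of us who are not distributed-systems experts set up our own distributed system and write distributed applications with relatively little effort. A key part of Hadoop is its implementation of the Map-Reduce programming model. The Map phase prepares the ground for the Reduce phase: it processes the input splits that Hadoop supplies and hands its output to the Reduce program, which produces the final result. The basic data flow is shown in the figure: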
[Figure: basic Map-Reduce data flow]

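The input is first divided into splits (64 MB each, matching the HDFS block size), and each split is assigned to one map task. The map output is then partitioned according to the number of reduce tasks; in the process the original split order is lost, much like shuffling a deck of cards. Each partition is then handed to a reduce task, which writes the result. Programs for the Hadoop framework can be written in several languages, commonly Python or C++; you call the input/output interfaces Hadoop provides, such as Pipes for C++, to implement the program. Below is a C++ implementation of a weather-data processing example from Tom White, which gives a first glimpse: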


// Hadoop MapReduce example in C++
#include <algorithm>
#include <limits>
#include <stdint.h>
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
class MaxTemperatureMapper : public HadoopPipes::Mapper
{
public:
    MaxTemperatureMapper(HadoopPipes::TaskContext& context)
    {
    }
    // Override map() to implement our own per-record logic
    void map(HadoopPipes::MapContext& context)
    {
        std::string line = context.getInputValue();
        std::string year = line.substr(15, 4);            // year field
        std::string airTemperature = line.substr(87, 5);  // signed temperature field
        std::string q = line.substr(92, 1);               // quality code
        // q is a std::string, so compare against string literals, not chars
        if (airTemperature != "+9999" &&
            (q == "0" || q == "1" || q == "4" || q == "5" || q == "9"))
        {
            context.emit(year, airTemperature);
        }
    }
};
class MapTemperatureReducer : public HadoopPipes::Reducer
{
public:
    MapTemperatureReducer(HadoopPipes::TaskContext& context)
    {
    }
    // Override reduce(): called once per key with all of that key's values
    void reduce(HadoopPipes::ReduceContext& context)
    {
        // Start from the smallest int so negative temperatures are handled
        int maxValue = std::numeric_limits<int>::min();
        while (context.nextValue())
        {
            maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
    }
};
int main(int argc, char *argv[])
{
    // The factory creates the mapper and reducer instances for the task
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<MaxTemperatureMapper, MapTemperatureReducer>());
}

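This is my first time reading the book in its original edition and my first contact with Hadoop, so I am still a little lost and some concepts have not sunk in yet. I will set them aside for now and keep reading; they should become clear over time!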
PS: Study material: the book by Tom White, O'REILLY

dirk2014
