How do I sort numerically in Hadoop's shuffle/sort phase?

Problem description

The data looks like this; the first field is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines by the first field numerically rather than alphabetically, so that after sorting they look like this:

1 ...
2 ...
3 ...
11 ...

But Hadoop keeps giving me this:

1 ...
11 ...
2 ...
3 ...

How can I fix this?

Recommended answer

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.

Specify the type of sorting you need with mapred.text.key.comparator.options. Some useful options are -n (numeric sort) and -r (reverse sort).
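To make the difference concrete, here is a minimal Python sketch of the two orderings, using the sample keys from the question. It only imitates the comparison locally and does not call Hadoop; treating -n as "compare by integer value" is a simplification assumed for illustration.

```python
# Keys as a streaming job sees them by default: plain text.
keys = ["3", "1", "2", "11"]

# Default behaviour: text keys compare character by character,
# so "11" sorts before "2".
lexicographic = sorted(keys)

# Roughly what -n asks for: compare the keys by numeric value.
numeric = sorted(keys, key=int)

# Roughly what -n together with -r asks for: numeric order, reversed.
numeric_reversed = sorted(keys, key=int, reverse=True)

print(lexicographic)     # ['1', '11', '2', '3']
print(numeric)           # ['1', '2', '3', '11']
print(numeric_reversed)  # ['11', '3', '2', '1']
```

The first result matches the unwanted output in the question, and the second matches the desired one.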

Example:

Create an identity mapper and an identity reducer with the following code.

This is mapper.py and reducer.py (both files have the same contents):

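#!/usr/bin/env python
import sys

# Identity script: echo each input line, stripped of surrounding whitespace.
# print() with a single argument works on both Python 2 and Python 3.
for line in sys.stdin:
    print(line.strip())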

This is input.txt, a simple single-key input such as the sample data in the question.

This is the streaming command:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options=-n \
-input /user/input.txt \
-output /user/output.txt \
-file ~/mapper.py \
-mapper ~/mapper.py \
-file ~/reducer.py \
-reducer ~/reducer.py

You will get the required output.
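If you launch the job from a script, the streaming command above can be assembled as an argument list. The sketch below is only an illustration: build_streaming_command is a hypothetical helper, not part of Hadoop, and the paths are the ones used in this answer. The one detail worth encoding is that the generic -D options are placed before the streaming options (-input, -output, and so on), as the Hadoop Streaming documentation requires.

```python
import os


def build_streaming_command(hadoop_home, comparator_options="-n"):
    """Return the argv list for the streaming job (hypothetical helper)."""
    home = os.path.expanduser("~")  # no shell is involved, so expand ~ ourselves
    return [
        os.path.join(hadoop_home, "bin", "hadoop"),
        "jar", os.path.join(hadoop_home, "hadoop-streaming.jar"),
        # Generic options (-D ...) come first, before any streaming option.
        "-D", "mapred.output.key.comparator.class="
              "org.apache.hadoop.mapred.lib.KeyFieldBasedComparator",
        "-D", "mapred.text.key.comparator.options=" + comparator_options,
        # Streaming options follow, exactly as in the command above.
        "-input", "/user/input.txt",
        "-output", "/user/output.txt",
        "-file", os.path.join(home, "mapper.py"),
        "-mapper", os.path.join(home, "mapper.py"),
        "-file", os.path.join(home, "reducer.py"),
        "-reducer", os.path.join(home, "reducer.py"),
    ]


# Assumes HADOOP_HOME is set; falls back to a placeholder path otherwise.
command = build_streaming_command(os.environ.get("HADOOP_HOME", "/opt/hadoop"))
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Hadoop is installed; the sketch itself only prints the command.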

Notes:

I have used a simple single-key input. If you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.

An identity mapper is needed because an MR job needs at least one mapper to run.

An identity reducer is needed because the shuffle/sort phase does not take place in a map-only job.
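To illustrate the first note about multiple keys: the comparator options follow the style of Unix sort's -k flag, so an option string along the lines of -k1,1n -k2,2nr would ask for the first field ascending numerically and the second field descending numerically. The Python sketch below only imitates that ordering locally. The two-field records are made up for illustration, and a real job would also need to be told how many fields make up the key, which is not shown here.

```python
# Hypothetical records with a two-field key: "<group> <score>".
records = ["2 5", "1 7", "11 3", "1 20", "2 40"]


def sort_key(record):
    group, score = record.split()
    # First field ascending by numeric value; second field descending,
    # achieved here by negating it.
    return (int(group), -int(score))


ordered = sorted(records, key=sort_key)
for record in ordered:
    print(record)
# 1 20
# 1 7
# 2 40
# 2 5
# 11 3
```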
