How do I sort numerically in Hadoop's shuffle/sort phase?

Question

The data looks like this; the first field is a number:

3 ...
1 ...
2 ...
11 ...

I want to sort these lines by the first field numerically rather than alphabetically, so that after sorting they look like this:

1 ...
2 ...
3 ...
11 ...

But Hadoop keeps giving me this:

1 ...
11 ...
2 ...
3 ...

How can I fix this?

Recommended answer

Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

  1. Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.

  2. Specify the type of sorting you need with mapred.text.key.comparator.options. Useful options include -n (numeric sort) and -r (reverse sort).
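To see why the default order comes out as 1, 11, 2, 3, it can help to reproduce both comparisons locally. The sketch below is plain Python, not Hadoop code. It assumes only that keys are compared as text by default, that -n compares them by numeric value, and that -r reverses the result. The function name sort_keys is hypothetical and exists only for this illustration.

```python
def sort_keys(keys, numeric=False, reverse=False):
    """Locally imitate text vs. numeric key ordering (illustration only)."""
    # As text, "11" sorts before "2" because the character '1' is less than '2'.
    key_func = int if numeric else str
    return sorted(keys, key=key_func, reverse=reverse)


keys = ["3", "1", "2", "11"]
print(sort_keys(keys))                              # text order: 1, 11, 2, 3
print(sort_keys(keys, numeric=True))                # like -n:    1, 2, 3, 11
print(sort_keys(keys, numeric=True, reverse=True))  # like -nr:   11, 3, 2, 1
```

The first call reproduces the unwanted order from the question, and the second reproduces the order the question asks for.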

Example:

Create an identity mapper and reducer with the following code.

This is mapper.py and reducer.py (both files have the same content)

#!/usr/bin/env python
import sys

# Identity step: echo every input line, with surrounding whitespace stripped.
for line in sys.stdin:
    print(line.strip())

This is input.txt

1
11
2
20
7
3
40

This is the streaming command

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options=-n \
-input /user/input.txt \
-output /user/output.txt \
-file ~/mapper.py \
-mapper ~/mapper.py \
-file ~/reducer.py \
-reducer ~/reducer.py

You will get the desired output

1   
2   
3   
7   
11  
20  
40

Note:

  1. I have used a simple single-key input. If you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.

  2. An identity mapper is needed because an MR job needs at least one mapper to run.

  3. An identity reducer is needed because the shuffle/sort phase does not run in a map-only job.
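For the multi-key case mentioned in the note, the comparator options follow the style of Unix sort's -k flag, so a two-field key might use something like -k1,1n -k2,2nr. The sketch below is a local Python imitation of that ordering, useful for checking what output to expect before running a job. The sample data, the tab separator, and the function name sort_two_field_keys are assumptions made for illustration; a real streaming job would also need to be told how the key is split into fields, which depends on your Hadoop version.

```python
def sort_two_field_keys(lines, sep="\t"):
    """Order lines by field 1 ascending, then field 2 descending, both numeric."""
    def sort_key(line):
        fields = line.split(sep)
        # Negating the second number reverses its order among equal first fields.
        return (int(fields[0]), -int(fields[1]))
    return sorted(lines, key=sort_key)


# Hypothetical two-field records: "<first key>\t<second key>"
sample = ["2\t5", "11\t1", "2\t40", "1\t7"]
for line in sort_two_field_keys(sample):
    print(line)
```

Here the two records whose first field is 2 stay together, with 40 listed before 5 because the second field is sorted in reverse.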

