Problem description
The data looks like this; the first field of each line is a number:
3 ...
1 ...
2 ...
11 ...
I want to sort these lines by the first field numerically rather than alphabetically, so after sorting the output should look like this:
1 ...
2 ...
3 ...
11 ...
But Hadoop keeps giving me this:
1 ...
11 ...
2 ...
3 ...
How can I fix this?
Recommended answer
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
Add -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator to the streaming command.
You also need to specify the type of sorting you want through mapred.text.key.comparator.options. Two useful options are -n (numeric sort) and -r (reverse sort).
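To see why the default output looks wrong, it may help to reproduce the two orderings outside Hadoop. The following is only a minimal local sketch in plain Python, not Hadoop's own comparator code: it assumes the keys reach the sort phase as text, and it uses Python's sorted() as a stand-in for the text comparison and for what -n and -r ask for.

```python
# Keys as they arrive from the mapper: plain text, not numbers.
keys = ["3", "1", "2", "11"]

# Text comparison: "11" sorts before "2" because the character '1' < '2'.
text_order = sorted(keys)

# Numeric comparison, analogous in spirit to the -n option.
numeric_order = sorted(keys, key=int)

# Numeric and reversed, analogous in spirit to combining -n and -r.
reverse_numeric_order = sorted(keys, key=int, reverse=True)

print(text_order)
print(numeric_order)
print(reverse_numeric_order)
```

The first list matches the unwanted order from the question, and the second matches the desired one.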
Example:
Create an identity mapper and an identity reducer with the following code.
This is mapper.py and reducer.py (both files have the same content):
#!/usr/bin/env python
import sys

# Identity step: echo every input line back, with surrounding whitespace removed.
for line in sys.stdin:
    print(line.strip())
This is input.txt:
1
11
2
20
7
3
40
This is the streaming command:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py
You will get the desired output:
1
2
3
7
11
20
40
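Before submitting the job, the expected result can be checked locally. The sketch below is a rough stand-in for the whole pipeline, not something Hadoop provides: simulate_job and identity are hypothetical helper names, and the numeric sort in the middle only approximates what the shuffle/sort phase does when the comparator is given -n.

```python
def identity(lines):
    """Mimic mapper.py / reducer.py: strip each line and pass it through."""
    return [line.strip() for line in lines]


def simulate_job(lines):
    """Hypothetical local dry run: identity map, numeric sort, identity reduce."""
    mapped = identity(lines)
    # Stand-in for the shuffle/sort phase with numeric key comparison.
    shuffled = sorted(mapped, key=int)
    return identity(shuffled)


# Same contents as input.txt, one key per line.
input_lines = ["1\n", "11\n", "2\n", "20\n", "7\n", "3\n", "40\n"]
result = simulate_job(input_lines)
print("\n".join(result))
```

This only covers the single-reducer case; with several reducers each output part would be sorted on its own.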
Note:
I have used a simple input with a single key. If you have multiple key fields and/or partitions, you will have to adjust mapred.text.key.comparator.options accordingly. Since I do not know your use case, my example is limited to this.
The identity mapper is needed because an MR job requires at least one mapper to run.
The identity reducer is needed because the shuffle/sort phase does not run in a map-only job.
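For a key made of several fields, the idea is the same but each field gets its own comparison rule. As an illustration only, the sketch below assumes a hypothetical tab-separated key whose first field is text and whose second field is a number; the records and the sort_key helper are made up for the example, and the code emulates the ordering in plain Python rather than calling anything in Hadoop. As far as I know, the comparator options accept sort-style field specifications such as -k2,2n for this purpose, but check the streaming documentation for your Hadoop version.

```python
# Hypothetical records: first field is text, second field is a number.
records = ["b\t10", "a\t2", "a\t11", "b\t9"]


def sort_key(line):
    """Compare the first field as text and the second field as a number."""
    first, second = line.split("\t")[:2]
    return (first, int(second))


ordered = sorted(records, key=sort_key)
for record in ordered:
    print(record)
```

Within each first-field group, 2 now comes before 11 and 9 before 10, which a pure text comparison would get wrong.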