Problem description
For a Python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter being passed in?
I understand that streaming jobs are invoked in the format:
hadoop jar hadoop-streaming.jar -input -output -mapper mapper.py -reducer reducer.py ...
I want to affect reducer.py.
The argument to the -reducer command line option can be any command, so you can try:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input inputDirs \
-output outputDir \
-mapper myMapper.py \
-reducer 'myReducer.py 1 2 3' \
-file myMapper.py \
-file myReducer.py
assuming myReducer.py is made executable. Disclaimer: I have not tried this, but I have passed similarly complex strings to -mapper and -reducer before.
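By way of illustration, here is a minimal sketch of how myReducer.py might pick up those extra arguments through sys.argv. It is not from the original answer: the meaning of the first argument as a threshold and the tab-separated summing logic are assumptions.

#!/usr/bin/env python
# Sketch of a reducer invoked as 'myReducer.py 1 2 3': the extra arguments
# arrive in sys.argv, and the tab-separated key/value stream arrives on stdin.
import sys

args = sys.argv[1:]                      # e.g. ['1', '2', '3']
threshold = int(args[0]) if args else 0  # assumed meaning of the first argument

current_key, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    if key != current_key:
        if current_key is not None and total >= threshold:
            print('%s\t%d' % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None and total >= threshold:
    print('%s\t%d' % (current_key, total))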
That said, have you tried the
-cmdenv name=value
option, and just had your Python reducer get its value from the environment? It's just another way to do things.
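And a minimal sketch of that route, assuming the job is launched with -cmdenv MY_PARAM=upper; the variable name and the uppercasing behavior are made up for the example.

#!/usr/bin/env python
# Sketch of a reducer for the -cmdenv approach: the value set with
# -cmdenv MY_PARAM=... shows up in the reducer's environment.
import os
import sys

mode = os.environ.get('MY_PARAM', 'default')  # hypothetical variable name

for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    if mode == 'upper':        # behave differently based on the parameter
        key = key.upper()
    print('%s\t%s' % (key, value))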