Following my earlier post
http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html
as well as the official documentation
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
I now have a CPU-intensive computation to run, for example timestamp conversion.
# The raw input file
> [email protected] 1019 20110622230010
# First split the file into as many pieces as the number of map tasks we want (by line count)
$ split -l 500000 userid_appid_time.pplog.day.data
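The split pieces (xaa, xab, ...) still have to land on HDFS before the job can read them; a minimal sketch, assuming the /tmp/pplog directory used in the -input path below:

$ hadoop fs -mkdir /tmp/pplog/userid_appid_time.pplog.day
$ hadoop fs -put x* /tmp/pplog/userid_appid_time.pplog.day/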
$ vim run.pm
#!/usr/bin/perl -n
# Reformat field 3 from YYYYMMDDhhmmss to "YYYY-MM-DD hh:mm:ss", then
# convert it to a Unix timestamp with `date`, caching each distinct
# value in %h so `date` is forked only once per unique timestamp.
chomp;
@F = split "\t";
$F[2] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$1-$2-$3 $4:$5:$6/;
$h{$F[2]} = `date -d "$F[2]" +%s` if not exists $h{$F[2]};
# the backtick output already ends in "\n", so none is added here
print "$F[0]\t$F[1]\t$h{$F[2]}";
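Because a streaming mapper is just a filter from stdin to stdout, run.pm can be smoke-tested locally before submitting the job:

$ chmod +x run.pm
$ head -3 userid_appid_time.pplog.day.data | ./run.pm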
$ hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x* \
-mapper run.pm \
-file /opt/sohudba/20111230/uniqname_pool/run.pm \
-output hdfs:///tmp/lky/streamingx3
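Note that no -reducer is given here, so the job falls back to the default identity reduce and still pays for a full sort/shuffle. Since this is a pure per-line transform, adding

 -numReduceTasks 0 \

to the command above (equivalent to -reducer NONE, as the help output below explains) makes the map output the final output and skips the shuffle entirely.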
# Output: the same lines, with the third field converted to a Unix timestamp
# Help
$ hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar -info
12/01/09 16:38:00 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <cmd|JavaClassName>      The streaming command to run
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -io <identifier>  Optional.
  -verbose

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

In -input: globbing on <path> is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8
  the key part ends at first TAB, the rest of the line is the value
Custom input format: -inputformat package.MyInputFormat
Map output format, reduce input/output format:
  Format defined by what the mapper command outputs. Line-oriented
The files named in the -file argument[s] end up in the
  working directory when the mapper and reducer are run.
  The location of this working directory is unspecified.
To set the number of reduce tasks (num. of output files):
  -D mapred.reduce.tasks=10
To skip the sort/combine/shuffle/sort/reduce step:
  Use -numReduceTasks 0
  A Task's Map output then becomes a 'side-effect output' rather than a reduce input
  This speeds up processing, This also feels more like "in-place" processing
  because the input filename and the map input order are preserved
  This equivalent -reducer NONE
To speed up the last maps:
  -D mapred.map.tasks.speculative.execution=true
To speed up the last reduces:
  -D mapred.reduce.tasks.speculative.execution=true
To name the job (appears in the JobTracker Web UI):
  -D mapred.job.name='My Job'
To change the local temp directory:
  -D dfs.data.dir=/tmp/dfs
  -D stream.tmpdir=/tmp/streaming
Additional local temp directories with -cluster local:
  -D mapred.local.dir=/tmp/local
  -D mapred.system.dir=/tmp/system
  -D mapred.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
  -D stream.non.zero.exit.is.failure=false
Use a custom hadoopStreaming build along a standard hadoop install:
  $HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...] \
    [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
  http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
Shortcut:
   setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"
Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
           -file /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script, invokes the non-shipped perl interpreter
  Shipped files go to the working directory so filter.pl is found by perl
  Input files are all the daily logs for days in month 2006-04

Streaming Command Failed!
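A final note on the CPU cost: even with the %h cache, run.pm forks one `date` process per distinct timestamp. A pure-Perl variant using the core Time::Local module avoids the fork entirely; a minimal sketch (same input/output format as run.pm, not benchmarked here):

#!/usr/bin/perl -n
use Time::Local;
# Parse YYYYMMDDhhmmss directly and convert with timelocal(),
# caching each distinct raw value in %h as before.
chomp;
@F = split "\t";
if (not exists $h{$F[2]}) {
    my ($y, $mo, $d, $H, $M, $S) = $F[2] =~ /(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/;
    $h{$F[2]} = timelocal($S, $M, $H, $d, $mo - 1, $y);  # timelocal months are 0-based
}
print "$F[0]\t$F[1]\t$h{$F[2]}\n";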