Following my earlier post
http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html
and the official documentation
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html


I now have a CPU-intensive computation to run, for example timestamp conversion.

# raw input file (userid \t appid \t timestamp)
> [email protected]    1019    20110622230010

# First split the file into as many chunks as the number of map tasks we want (by line count)
$ split -l 500000 userid_appid_time.pplog.day.data
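The -input path in the streaming command below points at the split chunks (split names them xaa, xab, ...) on HDFS, so they have to be uploaded first. A minimal sketch, assuming the same HDFS directory that the command below reads from:

# assumed upload step: this is how the x* chunks end up under the -input path used below
$ hadoop fs -mkdir /tmp/pplog/userid_appid_time.pplog.day
$ hadoop fs -put x* /tmp/pplog/userid_appid_time.pplog.day/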

$ vim run.pm
#!/usr/bin/perl -n
chomp;
@F = split "\t";
# 20110622230010 -> "2011-06-22 23:00:10" so date(1) can parse it
$F[2] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$1-$2-$3 $4:$5:$6/;
# shell out to date(1) only once per distinct timestamp and cache the epoch value
chomp($h{$F[2]} = `date -d "$F[2]" +%s`) if not exists $h{$F[2]};
print "$F[0]\t$F[1]\t$h{$F[2]}\n";
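Since a streaming mapper is just a stdin-to-stdout filter, run.pm can be sanity-checked locally before the job is submitted (the script also has to be executable for -mapper run.pm to work on the cluster):

# quick local test of the mapper on the first input line
$ chmod +x run.pm
$ head -1 userid_appid_time.pplog.day.data | ./run.pm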
# Run the streaming job
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar  \
      -input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x*  \
      -mapper   run.pm  \
      -file /opt/sohudba/20111230/uniqname_pool/run.pm  \
      -output hdfs:///tmp/lky/streamingx3
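When the job finishes, the converted records can be pulled back from the -output directory given above, e.g.:

$ hadoop fs -cat /tmp/lky/streamingx3/part-* | head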

# Output:


# Help
  hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar -info
  12/01/09 16:38:00 ERROR streaming.StreamJob: Missing required options: input, output
  Usage: $HADOOP_HOME/bin/hadoop jar \
            $HADOOP_HOME/hadoop-streaming.jar [options]
  Options:
    -input    <path>                   DFS input file(s) for the Map step
    -output   <path>                   DFS output directory for the Reduce step
    -mapper   <cmd|JavaClassName>      The streaming command to run
    -combiner <cmd|JavaClassName>      The streaming command to run
    -reducer  <cmd|JavaClassName>      The streaming command to run
    -file     <file>                   File/dir to be shipped in the Job jar file
    -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
    -outputformat TextOutputFormat(default)|JavaClassName Optional.
    -partitioner JavaClassName         Optional.
    -numReduceTasks <num>              Optional.
    -inputreader <spec>                Optional.
    -cmdenv   <n>=<v>                  Optional. Pass env.var to streaming commands
    -mapdebug <path>                   Optional. To run this script when a map task fails
    -reducedebug <path>                Optional. To run this script when a reduce task fails
    -io <identifier>                   Optional.
    -verbose

  Generic options supported are
    -conf <configuration file>                      specify an application configuration file
    -D <property=value>                             use value for given property
    -fs <local|namenode:port>                       specify a namenode
    -jt <local|jobtracker:port>                     specify a job tracker
    -files <comma separated list of files>          specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>         specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

  The general command line syntax is
    bin/hadoop command [genericOptions] [commandOptions]

  In -input: globbing on <path> is supported and can have multiple -input
  Default Map input format: a line is a record in UTF-8
    the key part ends at first TAB, the rest of the line is the value
  Custom input format: -inputformat package.MyInputFormat
  Map output format, reduce input/output format:
    Format defined by what the mapper command outputs. Line-oriented

  The files named in the -file argument[s] end up in the
    working directory when the mapper and reducer are run.
    The location of this working directory is unspecified.

  To set the number of reduce tasks (num. of output files):
    -D mapred.reduce.tasks=10
  To skip the sort/combine/shuffle/sort/reduce step:
    Use -numReduceTasks 0
    A Task's Map output then becomes a 'side-effect output' rather than a reduce input
    This speeds up processing, This also feels more like "in-place" processing
    because the input filename and the map input order are preserved
    This equivalent -reducer NONE

  To speed up the last maps:
    -D mapred.map.tasks.speculative.execution=true
  To speed up the last reduces:
    -D mapred.reduce.tasks.speculative.execution=true
  To name the job (appears in the JobTracker Web UI):
    -D mapred.job.name='My Job'
  To change the local temp directory:
    -D dfs.data.dir=/tmp/dfs
    -D stream.tmpdir=/tmp/streaming
  Additional local temp directories with -cluster local:
    -D mapred.local.dir=/tmp/local
    -D mapred.system.dir=/tmp/system
    -D mapred.temp.dir=/tmp/temp
  To treat tasks with non-zero exit status as SUCCEDED:
    -D stream.non.zero.exit.is.failure=false
  Use a custom hadoopStreaming build along a standard hadoop install:
    $HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
      [...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
  For more details about jobconf parameters see:
    http://wiki.apache.org/hadoop/JobConfFile
  To set an environement variable in a streaming command:
    -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

  Shortcut:
    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"

  Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
             -file /local/filter.pl -input "/logs/0604*/*" [...]
    Ships a script, invokes the non-shipped perl interpreter
    Shipped files go to the working directory so filter.pl is found by perl
    Input files are all the daily logs for days in month 2006-04

  Streaming Command Failed!
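Applying the help text above to this job: there is no reducer at all, so the map output can be written out directly with -numReduceTasks 0, and the job can be given a readable name for the JobTracker UI. A sketch of what the earlier command might look like with those tweaks (the job name and the streamingx4 output path are made-up examples):

# -D generic options go first; job name and output dir below are examples only
hadoop jar ./contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar  \
      -D mapred.job.name='pplog timestamp convert'  \
      -input hdfs:///tmp/pplog/userid_appid_time.pplog.day/x*  \
      -mapper   run.pm  \
      -file /opt/sohudba/20111230/uniqname_pool/run.pm  \
      -numReduceTasks 0  \
      -output hdfs:///tmp/lky/streamingx4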









