



Because we need to read a bunch of files into the mapper, in a non-Hadoop environment I use os.walk(dir) and file = open(path, mode) to read in each file.


However, in a Hadoop environment, I have read that Hadoop Streaming converts file input to the stdin of the mapper and converts the stdout of the reducer to file output, so I have a few questions about how to input files:

  1. Do we have to set input from STDIN in mapper.py and let Hadoop Streaming convert the files in the HDFS input directory to STDIN?


  2. If I want to read in each file separately and parse each line, how can I set input from a file in mapper.py?

My previous Python code for the non-Hadoop environment uses: for root, dirs, files in os.walk('path of non-hdfs') .....

However, in the Hadoop environment, I need to change 'path of non-hdfs' to the HDFS path that I copied the files to with copyFromLocal. I tried many paths with no success, such as os.walk('/user/hadoop/in') (this is what I see when running bin/hadoop dfs -ls), os.walk('home/hadoop/files') (this is my local path in the non-Hadoop environment), and even os.walk('hdfs://host:fs_port/user/hadoop/in')....


Can anyone tell me whether I can read input from a file using file operations in mapper.py, or whether I have to read input from STDIN?



Hadoop Streaming has to take input from STDIN. I think the confusion you're having is that you're trying to write code to do some of the things that Hadoop Streaming is already doing for you. I did that when I first started Hadooping.


Hadoop Streaming can read in multiple files, and even multiple zipped files, which it then feeds, one line at a time, into the STDIN of your mapper. This is a helpful abstraction because you then write your mapper to be independent of file names and locations. You can reuse your mappers and reducers for any input, which is handy later. Plus, you don't want your mapper trying to grab files itself, because you have no way of knowing how many mappers you will have later. If file paths were hard-coded into the mapper and a single mapper failed, you would never get output for the files hard-coded in that mapper. So let Hadoop do the file management and keep your code as generic as possible.
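
To make the point above concrete, here is a minimal sketch of what a mapper.py that reads only from STDIN might look like. The word-count logic and the helper names map_lines and current_input_file are made up for illustration and are not from the question. The environment variable lookup assumes that Hadoop Streaming exports job properties to the mapper's environment with dots replaced by underscores; the exact variable name depends on the Hadoop version, so treat both names tried here as assumptions to verify on your cluster.

```python
#!/usr/bin/env python
"""mapper.py: hypothetical word-count style mapper for Hadoop Streaming."""
import os
import sys


def current_input_file(environ=os.environ):
    """Best-effort lookup of the file the current input split came from.

    Assumption: Streaming exposes the input file property as an environment
    variable; the name differs between Hadoop versions, so try both.
    """
    return (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")


def map_lines(lines):
    """Yield one 'word<TAB>1' record per word, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for word in line.split():
            yield "%s\t1" % word


def main():
    # Hadoop Streaming feeds the contents of the HDFS input files here,
    # one line at a time; no os.walk() or open() on HDFS paths is needed.
    for record in map_lines(sys.stdin):
        sys.stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

The same script can be tried without a cluster by piping local files into it, for example cat somefile | python mapper.py. On the cluster, the HDFS directory (such as /user/hadoop/in) is given to the streaming job through its -input option rather than inside the mapper; the location of the streaming jar varies by installation.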


08-20 13:45