STDIN or file as mapper input in a Hadoop environment

Problem description

Since we need to read a bunch of files into the mapper, in a non-Hadoop environment I use os.walk(dir) and file=open(path, mode) to read in each file.

However, in the Hadoop environment, I have read that Hadoop Streaming converts the file input into the mapper's stdin and converts the reducer's stdout into the file output, so I have a few questions about how to handle input:

  1. Do we have to read input from STDIN in mapper.py and let Hadoop Streaming convert the files in the HDFS input directory into STDIN?

  2. If I want to read each file separately and parse each line, how can I read input from a file in mapper.py?

My previous Python code for the non-Hadoop environment uses: for root, dirs, files in os.walk('path of non-hdfs') .....
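A minimal sketch of that pre-Hadoop pattern, assuming plain-text files under a local directory; 'path of non-hdfs' is the asker's placeholder, and process_line stands in for whatever per-line parsing the job does:

    import os

    # Walk a local (non-HDFS) directory tree and parse every file line by line.
    for root, dirs, files in os.walk('path of non-hdfs'):
        for name in files:
            path = os.path.join(root, name)
            with open(path, 'r') as f:
                for line in f:
                    process_line(line)  # hypothetical per-line parser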

However, in the Hadoop environment, I need to change 'path of non-hdfs' to the HDFS path that I copyFromLocal to, but I have tried many variants with no success, such as os.walk('/user/hadoop/in') (which is what I see when I run bin/hadoop dfs -ls), os.walk('home/hadoop/files') (my local path in the non-Hadoop environment), and even os.walk('hdfs://host:fs_port/user/hadoop/in')....
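Those calls fail because os.walk() only traverses the local POSIX filesystem, so an HDFS path such as /user/hadoop/in simply does not exist from Python's point of view. If you genuinely need to enumerate HDFS paths from Python, one option (a sketch, not something from the original post) is to shell out to the same CLI the asker used for checking, hadoop fs -ls (spelled hadoop dfs -ls in older releases):

    import subprocess

    # os.walk() cannot see HDFS; ask the hadoop CLI instead.
    out = subprocess.check_output(['hadoop', 'fs', '-ls', '/user/hadoop/in'])
    for line in out.decode().splitlines():
        print(line)

As the answer below explains, though, a streaming mapper normally should not do this at all.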

Can anyone tell me whether I can read input from a file by using file operations in mapper.py, or whether I have to read from STDIN?

Thanks.

Recommended answer

Hadoop Streaming has to take its input from STDIN. I think the confusion you are having is that you are trying to write code to do some of the things Hadoop Streaming already does for you. I did the same when I first started Hadooping.

Hadoop Streaming can read in multiple files, and even multiple compressed files, which it then feeds, one line at a time, into the STDIN of your mapper. This is a helpful abstraction, because you write your mapper to be independent of file names and locations, and you can then reuse your mappers and reducers for any input, which is handy later. You also don't want your mapper trying to grab files itself, because you have no way of knowing how many mappers you will have later: if file paths were hard-coded into a mapper and that single mapper failed, you would never get output for the files hard-coded into it. So let Hadoop do the file management and keep your code as generic as possible.
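To make that concrete, here is a minimal sketch of a streaming mapper that reads whatever Hadoop feeds it on STDIN. The word-count logic is purely illustrative, not the asker's actual job; the tab-separated key/value output is the usual Hadoop Streaming convention. And if you do need to know which file a line came from (question 2 above), Hadoop Streaming exposes the current input path to the mapper as an environment variable (map_input_file in older releases, mapreduce_map_input_file in newer ones), so per-file logic is still possible without opening files yourself:

    #!/usr/bin/env python
    import os
    import sys

    # Hadoop Streaming pipes every input file, line by line, into STDIN.
    # The source file's path is exposed via an environment variable,
    # whose name depends on the Hadoop version.
    current_file = os.environ.get('mapreduce_map_input_file',
                                  os.environ.get('map_input_file', 'unknown'))

    for line in sys.stdin:
        # Illustrative word-count mapping; replace with your own parsing.
        for word in line.strip().split():
            # Streaming expects tab-separated key/value pairs on STDOUT.
            print('{0}\t1'.format(word))

You would then launch the job with the streaming jar, along the lines of the following (the jar's location varies by installation; /user/hadoop/in is the HDFS input directory from the question):

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /user/hadoop/in \
        -output /user/hadoop/out \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py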
