

I'm using Hortonworks HDP for Windows and have successfully configured it with a master and 2 slaves.

I'm running the following command:

bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27

The mapper runs fine, but the log reports that the reduce.py file wasn't found. From the exception, it looks like the Hadoop task runner is creating the symlink for the reducer to the mapper.py file.

When I check the job configuration file, I see that mapred.cache.files is set to:

hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py

It looks as though reduce.py is being added to the jar file but isn't being included in the configuration correctly, so it can't be found when the reducer tries to run.
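To make that check repeatable, here is a minimal sketch that takes a mapred.cache.files value and reports which expected script names have no entry. The helper names (cached_link_names, missing_scripts) are made up for illustration, and the fallback to the file's base name when an entry has no "#link" part is an assumption about how the link gets named.

```python
import posixpath
from urllib.parse import urlsplit


def cached_link_names(cache_files):
    """Return the link names provided by a comma-separated mapred.cache.files value."""
    names = set()
    for entry in cache_files.split(","):
        entry = entry.strip()
        if not entry:
            continue
        parts = urlsplit(entry)
        # The "#fragment" names the symlink in the task's working directory.
        # Assumption: without a fragment, the file's base name is used.
        names.add(parts.fragment or posixpath.basename(parts.path))
    return names


def missing_scripts(cache_files, scripts):
    """List the scripts that have no matching entry in the cache-files value."""
    have = cached_link_names(cache_files)
    return [s for s in scripts if s not in have]


value = ("hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/"
         "job_201304251054_0021/files/mapper.py#mapper.py")
print(missing_scripts(value, ["mapper.py", "reduce.py"]))  # ['reduce.py']
```

With the value from this job, only mapper.py has an entry, which matches what the log complains about.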

I think my command is correct. I've also tried using -file parameters instead, but then neither file is found.
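One thing worth ruling out before blaming the cluster: in the command as posted, -files ships reducer.py while -reducer invokes reduce.py. That may only be a typo in the question, but a small consistency check is cheap. The sketch below is illustrative only; check_streaming_args is a hypothetical helper, not part of Hadoop, and it only looks at tokens ending in ".py".

```python
import posixpath
import shlex


def check_streaming_args(files_arg, mapper_cmd, reducer_cmd):
    """Report scripts named in -mapper/-reducer that are not shipped via -files."""
    shipped = {
        posixpath.basename(uri.strip())
        for uri in files_arg.split(",")
        if uri.strip()
    }
    problems = []
    for role, cmd in (("mapper", mapper_cmd), ("reducer", reducer_cmd)):
        for token in shlex.split(cmd):
            # Only script arguments are checked; "python" itself is skipped.
            if token.endswith(".py") and posixpath.basename(token) not in shipped:
                problems.append(f"{role} runs {token!r}, which is not in -files")
    return problems


files = "file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py"
for problem in check_streaming_args(files, "python mapper.py", "python reduce.py"):
    print(problem)
```

If the names really do differ on disk, either renaming the file or changing the -reducer argument so the two agree would be the first thing to try.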

Can anyone see an obvious reason for this?

Please note, this is on Windows.

EDIT: I've just run it locally and it worked, so it looks like my problem may be with the copying of the files around the cluster.
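For a check that doesn't involve Hadoop at all, the scripts can also be piped together by hand. This is a rough sketch of that idea, not what I ran above: run_streaming_locally is a hypothetical helper that feeds input to the mapper, sorts the mapper's output lines, and feeds them to the reducer. Sorting whole lines is only an approximation of Hadoop's sort-by-key shuffle.

```python
import subprocess
import sys


def run_streaming_locally(mapper_path, reducer_path, input_text):
    """Emulate map -> sort -> reduce on one machine, without Hadoop."""
    mapped = subprocess.run(
        [sys.executable, mapper_path],
        input=input_text, capture_output=True, text=True, check=True,
    ).stdout
    # Hadoop sorts map output by key before the reduce phase; sorting whole
    # lines is a rough stand-in that works for simple key<TAB>value output.
    shuffled = "".join(line + "\n" for line in sorted(mapped.splitlines()))
    return subprocess.run(
        [sys.executable, reducer_path],
        input=shuffled, capture_output=True, text=True, check=True,
    ).stdout


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python local_run.py mapper.py reducer.py input.txt
    with open(sys.argv[3]) as handle:
        print(run_streaming_locally(sys.argv[1], sys.argv[2], handle.read()), end="")
```

If this works but the cluster job doesn't, the scripts themselves are probably fine and the problem is in how they are shipped or named.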

Still welcome input!


Well, that's embarrassing... my first question and I answer it myself.

I found the problem by renaming the Hadoop conf file to force the default settings, which meant the job used the local job tracker.
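The reason this works is that Hadoop 1.x falls back to mapred.job.tracker = local when no site configuration sets it. A minimal sketch for checking which mode a given mapred-site.xml would select follows; job_tracker_setting is a hypothetical helper, and the host and port in the sample XML are placeholders, not values from my cluster.

```python
import xml.etree.ElementTree as ET


def job_tracker_setting(mapred_site_xml):
    """Return the mapred.job.tracker value from the XML text, or 'local' if unset."""
    root = ET.fromstring(mapred_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "mapred.job.tracker":
            return prop.findtext("value", "").strip()
    # Hadoop 1.x default when the property is absent: run jobs in-process.
    return "local"


cluster_conf = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example:8021</value>  <!-- placeholder host:port -->
  </property>
</configuration>"""

empty_conf = "<configuration></configuration>"

print(job_tracker_setting(cluster_conf))  # jobtracker.example:8021
print(job_tracker_setting(empty_conf))    # local
```

In local mode nothing has to be copied to other nodes, so a job that succeeds there but fails on the cluster points at distribution rather than at the scripts.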

The job ran properly, which gave me room to work out what the problem is. It looks like communication around the cluster isn't as complete as it needs to be.


07-29 15:45