This article explains how to get the name of the input file inside a Hadoop streaming program; it may be a useful reference for anyone facing the same problem.

Problem description


I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.

Is there a corresponding way to do this when I write the program in Python (using streaming)?

I found the following in the Hadoop streaming documentation on Apache:

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks

Solution

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
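Applying that mechanism to the original question: the path of the current input split is exposed through the map.input.file property (mapreduce.map.input.file under the newer property names), which becomes map_input_file / mapreduce_map_input_file in the environment. A minimal sketch of a helper a streaming mapper could use, checking both names:

```python
import os

def get_input_filename():
    """Return the input split path for this streaming mapper task.

    Hadoop streaming exports job configuration properties to the mapper
    as environment variables, replacing dots with underscores, so
    "mapreduce.map.input.file" (new API) becomes mapreduce_map_input_file
    and "map.input.file" (old API) becomes map_input_file.
    """
    return (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file")
            or "unknown")
```

Outside an actual Hadoop task neither variable is set, so the helper falls back to "unknown"; inside a streaming job it returns the HDFS path of the file the mapper is reading.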

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra

That concludes this article on getting the input file name in a Hadoop streaming program; hopefully the answer above is helpful.
