This article explains how to get the name of the input file inside a Hadoop streaming program; it may be a useful reference for anyone facing the same problem.

Problem description


I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.

Is there a corresponding way to do this when I write the program in Python (using streaming)?

I found the following in the Hadoop streaming documentation on Apache:

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks

Solution

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
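Applying that mechanism to the original question: the path of the current input split is exposed through the map.input.file property (mapreduce.map.input.file under the newer property names), which becomes map_input_file / mapreduce_map_input_file in the environment. A minimal sketch of a helper a streaming mapper could use, checking both names:

```python
import os

def get_input_filename():
    """Return the input split path for this streaming mapper task.

    Hadoop streaming exports job configuration properties to the mapper
    as environment variables, replacing dots with underscores, so
    "mapreduce.map.input.file" (new API) becomes mapreduce_map_input_file
    and "map.input.file" (old API) becomes map_input_file.
    """
    return (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file")
            or "unknown")
```

Outside an actual Hadoop task neither variable is set, so the helper falls back to "unknown"; inside a streaming job it returns the HDFS path of the file the mapper is reading.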

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra

That concludes this article on getting the input file name in a Hadoop streaming program; hopefully the answer above is helpful.
