Problem description
I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main script, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step.
- How can I import external (Python) libraries into Hadoop, so that they can be used in a reduce step written in Python?
- Is it possible to access S3 like that from inside the Hadoop environment?
Thanks in advance, Thomas
Recommended answer
When launching a Hadoop process you can specify external files that should be made available. This is done by using the -files argument.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
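The same -files mechanism applies to Hadoop Streaming, which is what a Python MapReduce script runs under: files listed with -files land in the task's working directory. A minimal sketch of how a Python reducer could then import a shipped library (the archive name boto.zip and the helper function are assumptions, not part of the original answer); it relies on Python's standard ability to import modules from a zip archive placed on sys.path:

```python
import os
import sys


def add_shipped_library(archive_name):
    """Put a zip archive shipped via -files onto sys.path.

    Hadoop copies -files payloads into the task's working directory;
    Python's zipimport machinery then lets modules inside the archive
    be imported directly.
    """
    archive_path = os.path.join(os.getcwd(), archive_name)
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    return archive_path


# In the reducer, before any boto import (archive name is an assumption):
# add_shipped_library('boto.zip')
# import boto
```

This avoids installing anything on the cluster nodes; the library travels with the job.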
I don't know if the files HAVE to be on HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can do something similar to
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // localFile is a Hadoop Path; convert it to a java.io.File
            File file = new File(localFile.toUri().getPath());
        }
    }
}
This is all but copied and pasted directly from working code inside several of our Mappers.
I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)
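On that second part: S3 is generally reachable from inside EMR tasks, since the cluster nodes have network access and AWS credentials can be supplied through the task environment. A minimal sketch, assuming boto 2's S3Connection API, a tab-separated "item, similar item, score" input format, and a single bucket holding one key per item (the bucket name and key layout are assumptions; S3 accounts allow only a limited number of buckets, so one key per item in one bucket is usually more practical than one bucket per item):

```python
def group_similarities(lines):
    """Group reducer input of the assumed form "item<TAB>similar<TAB>score"
    into a dict mapping item -> list of "similar<TAB>score" strings."""
    groups = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue  # skip malformed lines
        groups.setdefault(parts[0], []).append('\t'.join(parts[1:]))
    return groups


def upload_groups(groups, bucket_name):
    """Hypothetical upload step: write each item's similarity list to S3
    as one key, using the boto 2 API. Assumes boto is importable in the
    task (e.g. shipped via -files) and AWS_ACCESS_KEY_ID /
    AWS_SECRET_ACCESS_KEY are set in the task environment."""
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection()  # picks credentials up from the environment
    bucket = conn.create_bucket(bucket_name)
    for item, similars in groups.items():
        key = Key(bucket)
        key.key = 'similarities/%s' % item
        key.set_contents_from_string('\n'.join(similars))
```

The grouping step is plain Python and testable locally; only upload_groups needs boto and live credentials.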
In addition to -files there is -libjars for including additional jars; I have a little information about that here - If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?