Problem description
I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main script, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step.
- How can I import external (Python) libraries into Hadoop, so that they can be used in a reduce step written in Python?
- Is it possible to access S3 like that from inside the Hadoop environment?
Thanks in advance, Thomas
Recommended answer
When launching a Hadoop process you can specify external files that should be made available. This is done by using the -files argument.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
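The same -files mechanism applies to Hadoop Streaming, which is what a Python MapReduce script runs under: files listed with -files land in the task's working directory. A minimal sketch of how a Python reducer could then import a shipped library (the archive name boto.zip and the helper function are assumptions, not part of the original answer); it relies on Python's standard ability to import modules from a zip archive placed on sys.path:

```python
import os
import sys


def add_shipped_library(archive_name):
    """Put a zip archive shipped via -files onto sys.path.

    Hadoop copies -files payloads into the task's working directory;
    Python's zipimport machinery then lets modules inside the archive
    be imported directly.
    """
    archive_path = os.path.join(os.getcwd(), archive_name)
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    return archive_path


# In the reducer, before any boto import (archive name is an assumption):
# add_shipped_library('boto.zip')
# import boto
```

This avoids installing anything on the cluster nodes; the library travels with the job.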
I don't know if the files HAVE to be on HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can do something similar to
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // localFile is a Hadoop Path; convert it to a java.io.File
            File file = new File(localFile.toUri().getPath());
        }
    }
}
This is all but copied and pasted directly from working code inside several of our Mappers.
I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)
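On that second part: S3 is generally reachable from inside EMR tasks, since the cluster nodes have network access and AWS credentials can be supplied through the task environment. A minimal sketch, assuming boto 2's S3Connection API, a tab-separated "item, similar item, score" input format, and a single bucket holding one key per item (the bucket name and key layout are assumptions; S3 accounts allow only a limited number of buckets, so one key per item in one bucket is usually more practical than one bucket per item):

```python
def group_similarities(lines):
    """Group reducer input of the assumed form "item<TAB>similar<TAB>score"
    into a dict mapping item -> list of "similar<TAB>score" strings."""
    groups = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue  # skip malformed lines
        groups.setdefault(parts[0], []).append('\t'.join(parts[1:]))
    return groups


def upload_groups(groups, bucket_name):
    """Hypothetical upload step: write each item's similarity list to S3
    as one key, using the boto 2 API. Assumes boto is importable in the
    task (e.g. shipped via -files) and AWS_ACCESS_KEY_ID /
    AWS_SECRET_ACCESS_KEY are set in the task environment."""
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection()  # picks credentials up from the environment
    bucket = conn.create_bucket(bucket_name)
    for item, similars in groups.items():
        key = Key(bucket)
        key.key = 'similarities/%s' % item
        key.set_contents_from_string('\n'.join(similars))
```

The grouping step is plain Python and testable locally; only upload_groups needs boto and live credentials.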
In addition to -files there is -libjars for including additional jars; I have a little information about that here - If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?