Adding a streaming step to an MR job in boto3 running on AWS EMR 5.0

Question

I'm trying to migrate a couple of MR jobs that I have written in Python from AWS EMR 2.4 to AWS EMR 5.0. Until now I have been using boto 2.4, but it doesn't support EMR 5.0, so I'm trying to move to boto3. With boto 2.4, I used the StreamingStep module to specify the input and output locations, as well as the locations of my mapper and reducer source files. With this module I didn't have to create or upload any jar to run my jobs. However, I cannot find an equivalent for it anywhere in the boto3 documentation. How can I add a streaming step to my MR job in boto3, so that I don't have to upload a jar file to run it?

Recommended answer

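Unfortunately, boto3 and the EMR API are rather poorly documented. The step below runs command-runner.jar, which is already present on the cluster, with hadoop-streaming as its first argument, so there is no jar to upload. A minimal word-count example would look as follows: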

import boto3

emr = boto3.client('emr')

resp = emr.run_job_flow(
    Name='myjob',
    ReleaseLabel='emr-5.0.0',
    Instances={
        'InstanceGroups': [
            {'Name': 'master',
             'InstanceRole': 'MASTER',
             'InstanceType': 'c1.medium',
             'InstanceCount': 1,
             'Configurations': [
                 {'Classification': 'yarn-site',
                  'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
            {'Name': 'core',
             'InstanceRole': 'CORE',
             'InstanceType': 'c1.medium',
             'InstanceCount': 1,
             'Configurations': [
                 {'Classification': 'yarn-site',
                  'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
        ]},
    Steps=[
        {'Name': 'My word count example',
         'HadoopJarStep': {
             'Jar': 'command-runner.jar',
             'Args': [
                 'hadoop-streaming',
                 '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                 '-mapper', 'python2.7 wordSplitter.py',
                 '-input', 's3://mybucket/input/',
                 '-output', 's3://mybucket/output/',
                 '-reducer', 'aggregate']}
         }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)

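I don't remember needing to do this with boto, but I have had issues getting even a simple streaming job to run properly without disabling the check, that is, without setting yarn.nodemanager.vmem-check-enabled to 'false' in the yarn-site classification.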
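
Since the same yarn-site override is repeated for every instance group, it can be built once and reused. The following is only a sketch: the helper name instance_group and its defaults are made up for illustration and are not part of boto3; it just produces the dictionaries that run_job_flow expects in Instances['InstanceGroups'].

```python
# Hypothetical helper (not a boto3 API): builds one InstanceGroups entry
# with the YARN virtual-memory check disabled, as in the answer's example.
def instance_group(name, role, instance_type='c1.medium', count=1):
    return {
        'Name': name,
        'InstanceRole': role,
        'InstanceType': instance_type,
        'InstanceCount': count,
        'Configurations': [
            {'Classification': 'yarn-site',
             'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}},
        ],
    }


# Same master/core layout as in the run_job_flow call.
instances = {
    'InstanceGroups': [
        instance_group('master', 'MASTER'),
        instance_group('core', 'CORE'),
    ],
}
```

The resulting instances dictionary can then be passed as Instances=instances to emr.run_job_flow.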

Also, if your script is located somewhere on S3, download it using -files (appending #filename to the argument makes the downloaded file available as filename in the cluster).
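
As a rough stand-in for the old StreamingStep class, the step dictionary can be assembled by a small function. This is a sketch under assumptions: streaming_step, s3_file and add_step_to_cluster are hypothetical names, not boto3 APIs, and the cluster id passed to add_step_to_cluster is whatever id you already have for a running cluster. The only boto3 calls used are boto3.client('emr') and add_job_flow_steps.

```python
import posixpath


def s3_file(uri, alias=None):
    """Return 'uri#alias' so Hadoop exposes the downloaded file under alias."""
    return '{}#{}'.format(uri, alias or posixpath.basename(uri))


def streaming_step(name, mapper, reducer, input_uri, output_uri, files=()):
    """Build a step dict that runs hadoop-streaming through command-runner.jar."""
    args = ['hadoop-streaming']
    if files:
        # -files is a generic Hadoop option, so it goes before the
        # streaming-specific options; several files are comma-separated.
        args += ['-files', ','.join(files)]
    args += ['-mapper', mapper,
             '-reducer', reducer,
             '-input', input_uri,
             '-output', output_uri]
    return {'Name': name,
            'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': args}}


def add_step_to_cluster(cluster_id, step):
    """Submit the step to an already running cluster (needs AWS credentials)."""
    import boto3  # imported here so the builders above work without boto3
    emr = boto3.client('emr')
    resp = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return resp['StepIds']


step = streaming_step(
    'My word count example',
    mapper='python2.7 wordSplitter.py',
    reducer='aggregate',
    input_uri='s3://mybucket/input/',
    output_uri='s3://mybucket/output/',
    files=[s3_file('s3://mybucket/wordSplitter.py')],
)
```

The step built here has the same arguments as the one in the run_job_flow example, apart from the order of the options after -files. It can be passed in Steps=[step] when creating the cluster, or handed to add_step_to_cluster for a cluster that is already up.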
