AWS EMR parallel mappers?

Question

I am trying to determine how many nodes I need for my EMR cluster. As part of best practices, the recommendation is:

(Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time), as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89.

The question is: how do I determine how many parallel mappers an instance will support, since AWS doesn't publish this? https://aws.amazon.com/emr/pricing/

Sorry if I missed something obvious.

Wayne

Recommended answer

To determine the number of parallel mappers, check the EMR documentation page called Task Configuration. EMR has a predefined set of configurations for every instance type, and these determine the number of mappers/reducers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html

For example: let's say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type in the EMR docs, we have

mapreduce.map.memory.mb = 768
yarn.nodemanager.resource.memory-mb = 12288
yarn.scheduler.maximum-allocation-mb = 12288 (same as above)

You can simply divide the latter setting (yarn.nodemanager.resource.memory-mb) by the former (mapreduce.map.memory.mb) to get the maximum number of mappers supported by one m1.xlarge node: 12288 / 768 = 16.
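The division above can be written as a tiny helper. This is only a sketch of the memory-only estimate described in this answer; the function name is made up for illustration, and the two constants are the m1.xlarge values quoted above.

```python
def max_containers_per_node(node_memory_mb: int, container_memory_mb: int) -> int:
    """Memory-only estimate: how many containers of one size fit on one node."""
    if container_memory_mb <= 0:
        raise ValueError("container_memory_mb must be positive")
    # Integer division: a partial container cannot be scheduled.
    return node_memory_mb // container_memory_mb


# Values quoted above for m1.xlarge.
YARN_NODEMANAGER_RESOURCE_MEMORY_MB = 12288
MAPREDUCE_MAP_MEMORY_MB = 768

mappers_per_node = max_containers_per_node(
    YARN_NODEMANAGER_RESOURCE_MEMORY_MB, MAPREDUCE_MAP_MEMORY_MB
)
print(mappers_per_node)  # 16
```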

So, for the 5-node cluster, a maximum of 16 * 5 = 80 mappers can run in parallel (considering a map-only job). The same applies to the maximum number of parallel reducers (30). You can do similar math for a combination of mappers and reducers.
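As a rough sketch of the cluster-wide math, including a mix of mappers and reducers: the class and method names below are hypothetical, and the reducer container size of 2048 MB is an assumption inferred from the figure of 30 reducers (6 per node), not a value quoted in this answer. Check the Task Configuration page for your instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMemory:
    nodes: int
    node_memory_mb: int    # yarn.nodemanager.resource.memory-mb
    map_memory_mb: int     # mapreduce.map.memory.mb
    reduce_memory_mb: int  # mapreduce.reduce.memory.mb (ASSUMED value, see above)

    def max_mappers(self) -> int:
        """Parallel mappers for a map-only job."""
        return self.nodes * (self.node_memory_mb // self.map_memory_mb)

    def max_reducers(self) -> int:
        """Parallel reducers if only reducers are running."""
        return self.nodes * (self.node_memory_mb // self.reduce_memory_mb)

    def mappers_alongside(self, reducers_per_node: int) -> int:
        """Mappers that still fit cluster-wide when every node also
        runs `reducers_per_node` reducers."""
        left_mb = self.node_memory_mb - reducers_per_node * self.reduce_memory_mb
        if left_mb < 0:
            raise ValueError("reducers alone exceed node memory")
        return self.nodes * (left_mb // self.map_memory_mb)


cluster = ClusterMemory(nodes=5, node_memory_mb=12288,
                        map_memory_mb=768, reduce_memory_mb=2048)
print(cluster.max_mappers())         # 80
print(cluster.max_reducers())        # 30
print(cluster.mappers_alongside(2))  # 50
```

With 2 reducers per node, 8192 MB remain on each node, which fits 10 mappers, hence 50 across 5 nodes.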

So, if you want to run more mappers in parallel, you can either resize the cluster or reduce mapreduce.map.memory.mb (and its heap, mapreduce.map.java.opts) on every node and restart NM to
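To compare the two options, here is a minimal sketch using the same memory-only estimate. The target of 120 parallel mappers and the reduced container size of 512 MB are hypothetical numbers chosen for illustration; whether your map tasks actually fit in a smaller container depends on the job.

```python
def mappers_per_node(node_memory_mb: int, map_memory_mb: int) -> int:
    """Memory-only estimate of parallel mappers on one node."""
    return node_memory_mb // map_memory_mb


def nodes_needed(target_mappers: int, node_memory_mb: int, map_memory_mb: int) -> int:
    """Smallest node count whose combined capacity reaches target_mappers."""
    per_node = mappers_per_node(node_memory_mb, map_memory_mb)
    if per_node == 0:
        raise ValueError("a single mapper does not fit on a node")
    return -(-target_mappers // per_node)  # ceiling division


NODE_MEMORY_MB = 12288  # yarn.nodemanager.resource.memory-mb, quoted above
TARGET = 120            # hypothetical target parallelism

# Option 1: keep mapreduce.map.memory.mb = 768 and resize the cluster.
print(nodes_needed(TARGET, NODE_MEMORY_MB, 768))  # 8

# Option 2: keep 5 nodes and shrink the map container to a hypothetical 512 MB.
print(5 * mappers_per_node(NODE_MEMORY_MB, 512))  # 120
```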

To understand what the above mapred-site.xml properties mean and why you need to do these calculations, refer to: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Note: The above calculations and statements hold if EMR stays in its default configuration, using the YARN capacity scheduler with DefaultResourceCalculator. If, for example, you configure the capacity scheduler to use DominantResourceCalculator, it will consider vCPUs + memory on every node (not just memory) to decide on the number of parallel mappers.
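A simplified way to picture the difference: with DominantResourceCalculator, a node can only host as many mappers as both its memory and its vcores allow. The sketch below is an approximation of that limit, not the scheduler's actual algorithm. The node vcore count of 8 is a hypothetical value, and `map_vcores` stands for mapreduce.map.cpu.vcores, assumed here to be 1.

```python
def max_mappers_per_node_drc(node_memory_mb: int, node_vcores: int,
                             map_memory_mb: int, map_vcores: int = 1) -> int:
    """Per-node mapper estimate when both memory and vcores are enforced."""
    by_memory = node_memory_mb // map_memory_mb
    by_vcores = node_vcores // map_vcores
    # The scarcer resource sets the limit.
    return min(by_memory, by_vcores)


# Memory values quoted above; 8 vcores per node is a HYPOTHETICAL figure.
per_node = max_mappers_per_node_drc(node_memory_mb=12288, node_vcores=8,
                                    map_memory_mb=768)
print(per_node)      # 8
print(5 * per_node)  # 40
```

Under these assumed numbers the vcore limit (8) is tighter than the memory limit (16), so the 5-node cluster would run 40 mappers in parallel rather than 80.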
