This article looks at the question "Why does submitting a job to MapReduce usually take so much time?" and at how to understand where the overhead comes from.

Problem description

So usually, for a 20-node cluster, submitting a job to process 3 GB of data (200 splits) takes about 30 seconds, and the actual execution takes about 1 minute. I want to understand what the bottleneck is in the job submission process, and to understand the following quote:

Per-MapReduce overhead is significant: starting/ending a MapReduce job costs time.

Some of the steps I'm aware of:

  1. data splitting
  2. jar file sharing

Recommended answer

A few things about HDFS and M/R help explain this latency:

  1. HDFS stores your files as data chunks distributed over multiple machines called datanodes.
  2. M/R runs multiple programs, called mappers, on each of these data chunks or blocks. The (key, value) output of these mappers is compiled together by reducers (think of summarizing the various results coming from multiple mappers).
  3. Each mapper and reducer is a full-fledged program that is spawned on these distributed systems. Spawning a full-fledged program takes time, even if it does nothing at all (a no-op map-reduce program; see the sketch after this list).
  4. When the amount of data to be processed becomes very big, these spawn times become insignificant, and that is when Hadoop shines.
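
To make item 3 concrete, here is a minimal sketch of a no-op (identity) map-only job in Java. The class and job names are my own illustration, not part of the original answer; even a job like this pays the full cost of computing splits, shipping the jar, and spawning a task per block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NoOpJob {
      // Identity mapper: emits each input line unchanged and does no real work.
      public static class IdentityLineMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "no-op");
        job.setJarByClass(NoOpJob.class);
        job.setMapperClass(IdentityLineMapper.class);
        job.setNumReduceTasks(0);              // map-only: skip the shuffle entirely
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // On an otherwise idle cluster, the time this takes is almost pure framework overhead.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }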

If you were to process a file with 1000 lines of content, you would be better off using a normal file-read-and-process program. The Hadoop infrastructure for spawning processes on a distributed system will not yield any benefit; it will only contribute the additional overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting results.
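
As a point of contrast, here is a sketch (my own, not from the original answer) of the kind of plain local program that handles a 1000-line file with essentially no startup cost:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LocalLineCount {
      public static void main(String[] args) throws Exception {
        // Read the file once and count its lines; no cluster, no task spawning.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
          System.out.println("lines: " + lines.count());
        }
      }
    }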

Now expand that to 100 petabytes of data, and these overheads look completely insignificant compared to the time it would take to process the data. The parallelization of the processors (mappers and reducers) shows its advantage here.

So before analyzing the performance of your M/R jobs, you should first benchmark your cluster so that you understand the overheads better.

How much time does it take to run a no-operation map-reduce program on the cluster?

Use MRBench for this purpose:

  1. MRBench loops a small job a number of times.
  2. It checks whether small job runs are responsive and running efficiently on your cluster.
  3. Its impact on the HDFS layer is very limited.

To run this program, try the following (check the correct invocation for the latest versions):

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
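
The jar path above is from the old Hadoop 0.20 layout. On Hadoop 2.x/3.x the mrbench program ships in the job-client tests jar instead; the exact path below is an assumption and varies by distribution, so adjust it to your install:

    # Path and version glob are distribution-specific (assumption); mrbench and -numRuns are standard options.
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar mrbench -numRuns 50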

Surprisingly, on one of our dev clusters it was 22 seconds.

Another issue is file size.

If the file sizes are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop typically tries to spawn one mapper per block. That means if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though the files are tiny. This is a real waste, because the per-program overhead is significant compared to the time spent processing such a small file.
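
To see where those 30 mappers come from, here is a rough back-of-the-envelope sketch (my own illustration of the default one-split-per-block, at-least-one-split-per-file rule, not the exact FileInputFormat logic):

    public class SplitCountSketch {
      // Each file gets at least one map task; larger files get roughly one per block.
      static long estimateMappers(long[] fileSizes, long blockSize) {
        long mappers = 0;
        for (long size : fileSizes) {
          mappers += Math.max(1, (size + blockSize - 1) / blockSize); // ceil(size / blockSize)
        }
        return mappers;
      }

      public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                  // 128 MB, a common HDFS default

        long[] thirtySmallFiles = new long[30];
        java.util.Arrays.fill(thirtySmallFiles, 5L * 1024);   // 30 files of 5 KB each
        System.out.println(estimateMappers(thirtySmallFiles, blockSize)); // 30 mappers for ~150 KB of data

        long[] oneLargeFile = { 3L * 1024 * 1024 * 1024 };    // a single 3 GB file
        System.out.println(estimateMappers(oneLargeFile, blockSize));     // 24 mappers for 3 GB
      }
    }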

That concludes this look at why submitting a job to MapReduce usually takes so much time; hopefully the answer above is helpful.
