Problem description
I have a 4 GB file that I am trying to share across all mappers through the distributed cache, but I am seeing a significant delay before the map task attempts start. Specifically, there is a long gap between the time I submit the job (via job.waitForCompletion()) and the time the first map task starts.
I would like to know what the side effects of having large files in the DistributedCache are. How many times is a file in the distributed cache replicated? Does the number of nodes in the cluster have any effect on this?
(My cluster has about 13 nodes running on very powerful machines, where each machine can host close to 10 map slots.)
Thanks

Recommended answer

"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.
For files larger than 200 MB I usually put them directly into the filesystem and set the replication to a higher value than the default (in your case I would set it to 5-7). You can then read the file directly from the distributed filesystem in every task through the usual FileSystem API, like:
// config is the job's Configuration (e.g. context.getConfiguration() inside a mapper)
FileSystem fs = FileSystem.get(config);
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
This saves space in the cluster and should not delay task startup either. However, a non-local HDFS read has to stream the data to the task, which can use a considerable amount of bandwidth.
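As a rough sketch of that approach, the replication factor of a file already in HDFS can be raised through the same FileSystem API (the path and the factor 7 below are illustrative values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raise the replication of the shared file so that more map tasks find a
// local (or at least rack-local) replica when they open it.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("/path/to/the/larger/file"), (short) 7);

The same can be done from the shell with hadoop fs -setrep 7 /path/to/the/larger/file, or at write time through the dfs.replication property.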