Question
When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job completes? If they are deleted, which I presume they are, is there a way to make the cache remain for multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?
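For context, here is a minimal sketch of how a job typically registers a file with the distributed cache, assuming the classic (pre-YARN) org.apache.hadoop.filecache.DistributedCache API; the namenode address, path, and #lookup symlink name are made up for illustration. Streaming options such as -cacheFile feed the same mechanism under the hood.

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSubmitExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheSubmitExample.class);

        // Register an HDFS file with the distributed cache; the framework
        // copies it onto each task node before the job's tasks run.
        // The #lookup fragment becomes a symlink name in the task's
        // working directory (path and namenode address are hypothetical).
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/cache/lookup.dat#lookup"), conf);
        DistributedCache.createSymlink(conf);

        // ... configure mapper/reducer, input/output paths, and submit as usual.
    }
}
```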
Answer
I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute when their reference count drops to zero. The TaskRunner explicitly releases all its files at the end of a task. Maybe you should edit TaskRunner to not do this, and control the cache through more explicit means yourself?
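To make the lifecycle concrete, here is a rough sketch of the task side under the same classic mapred API; CacheReadingMapper and the cached lookup file are hypothetical. The localized copy read in configure() is what TaskRunner releases when the task ends, after which the cache manager is free to delete it once nothing else references it.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheReadingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String firstLine;

    @Override
    public void configure(JobConf conf) {
        try {
            // The localized copies live on the task node's local disk.
            // TaskRunner releases them when the task finishes, and the
            // cache manager can delete them once nothing references them.
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader =
                    new BufferedReader(new FileReader(cached[0].toString()));
            firstLine = reader.readLine();
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read cached file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Trivial use of the cached data: tag every record with the
        // first line of the cached file.
        output.collect(new Text(firstLine), value);
    }
}
```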