Problem Description
In the Dask distributed documentation, they have the following information:
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Does it send computation to the nodes where the data is local? Is the scheduler optimized to take data locality on HDFS into account?
Recommended Answer
Quite right: with the appearance of Arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since Arrow's implementation doesn't include a get_block_locations() method.
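For context, here is a rough sketch of the difference (the hostname, port, and path are placeholders, and actually running it requires a reachable HDFS cluster with libhdfs available): hdfs3 exposed per-block host placement, whereas the Arrow-based filesystem exposes file metadata but has no equivalent block-location call.

    # Old hdfs3 interface: block placement was queryable, so a scheduler could
    # in principle route a task to a node that holds the block.
    # from hdfs3 import HDFileSystem
    # fs = HDFileSystem(host="namenode", port=8020)
    # fs.get_block_locations("/data/events.parquet")  # per-block hosts/offsets

    # Arrow-based interface: file metadata is available, but there is no
    # get_block_locations(), so block placement simply isn't consulted.
    import pyarrow.fs as pafs

    hdfs = pafs.HadoopFileSystem(host="namenode", port=8020)
    info = hdfs.get_file_info("/data/events.parquet")
    print(info.size, info.type)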
However, we had already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was perfectly adequate, so block-aware scheduling made little practical difference in most workloads. The extra constraints between the size of the HDFS blocks and the size of the partitions you would like in memory added a further layer of complexity.
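To make that constraint concrete, here is a hypothetical sketch (the path and sizes are made up): HDFS stores a file in fixed-size blocks, commonly 128 MB, while the in-memory partition size is chosen independently, so a partition can straddle blocks that live on different datanodes.

    import dask.dataframe as dd

    # Asking for 256 MB partitions over files stored in 128 MB HDFS blocks means
    # each partition spans roughly two blocks, possibly on different datanodes,
    # so "run the task where the block lives" no longer has a single answer.
    df = dd.read_csv("hdfs://namenode:8020/logs/*.csv", blocksize="256MB")
    print(df.npartitions)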
By removing the specialised code, we could avoid the very special case that was being made for HDFS, as opposed to external cloud storage (s3, gcs, azure), where it doesn't matter which worker accesses which part of the data.
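A minimal sketch of what this looks like in practice (the namenode address, bucket name, paths, and column name are placeholders): the same read call works across hdfs://, s3:// and the other protocols, and any worker may fetch any byte range. If locality still matters for a particular workload, tasks can be pinned by hand, e.g. via the workers= argument of Client.submit, rather than through automatic block-aware scheduling.

    import dask.dataframe as dd

    # HDFS is treated like any other remote filesystem: only the URL protocol differs.
    df_hdfs = dd.read_parquet("hdfs://namenode:8020/data/events/")
    df_s3 = dd.read_parquet("s3://my-bucket/data/events/")

    # Any worker may read any part of either dataset; the scheduler does not
    # try to match tasks to the datanodes holding the underlying blocks.
    result = df_hdfs["value"].mean().compute()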
In short, yes, the docs should be updated.