What do KilledWorker exceptions mean in Dask?

Problem description

My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?

Recommended answer

This error is generated when the Dask scheduler no longer trusts your task because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example through segfaults or memory errors.

Whenever a worker dies unexpectedly, the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die, the scheduler eventually gives up retrying it and instead marks it as failed with the exception KilledWorker.
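The bookkeeping described above can be sketched as a toy model. This is not Dask's actual scheduler code: ToyScheduler, worker_died, and the local KilledWorker class are hypothetical names invented for illustration, and the exact comparison against the failure limit is an assumption.

```python
from collections import defaultdict


class KilledWorker(Exception):
    """Local stand-in for the real exception so the sketch runs without Dask."""


class ToyScheduler:
    """Hypothetical, simplified model of the 'suspicious task' counter."""

    def __init__(self, allowed_failures=3):
        self.allowed_failures = allowed_failures
        # task key -> number of unexpected worker deaths it was present for
        self.suspicious = defaultdict(int)
        # task key -> exception recorded once the scheduler gives up
        self.failed = {}

    def worker_died(self, running_tasks):
        """Record one unexpected worker death; return the tasks to retry."""
        retry = []
        for key in running_tasks:
            if key in self.failed:
                continue  # already given up on this task
            self.suspicious[key] += 1
            if self.suspicious[key] > self.allowed_failures:
                # Assumed rule: fail once the count exceeds the limit.
                self.failed[key] = KilledWorker(key)
            else:
                retry.append(key)  # reschedule on another worker
        return retry


sched = ToyScheduler(allowed_failures=3)

# "bad-task" is present at four worker deaths in a row.
for _ in range(4):
    sched.worker_died(["bad-task"])

# "innocent-task" is present at only one death, so it is merely retried.
sched.worker_died(["innocent-task"])

print(sorted(sched.failed))
print(dict(sched.suspicious))
```

In this model a task that happens to share a worker with a crashing task only accumulates one count per death, which is why an unlucky but healthy task usually survives.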

Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. That is likely a bigger issue than the task failure itself.

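You can control this behavior by modifying the following entry in your ~/.config/dask/distributed.yaml file (in current releases it sits under the distributed.scheduler section).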

allowed-failures: 3     # number of retries before a task is considered bad
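The same setting can also be changed from Python. The sketch below assumes the key is exposed as distributed.scheduler.allowed-failures, which is how current releases name it; older releases may differ. The value 5 is arbitrary and chosen only for illustration.

```python
import dask

# Assumption: the config key is "distributed.scheduler.allowed-failures".
# The override applies only inside the "with" block.
with dask.config.set({"distributed.scheduler.allowed-failures": 5}):
    value = dask.config.get("distributed.scheduler.allowed-failures")

print(value)
```

The scheduler reads this value when it starts, so the override needs to be in place before the scheduler or local cluster is created. Raising the limit only hides the symptom; the worker logs are still the place to find the cause.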


10-23 09:02