Long-running delayed_job jobs stay locked after a restart on Heroku

Problem description

When a Heroku worker is restarted (either on command or as the result of a deploy), Heroku sends SIGTERM to the worker process. In the case of delayed_job, the SIGTERM signal is caught and then the worker stops executing after the current job (if any) has stopped.

If the worker takes too long to finish, then Heroku will send SIGKILL. In the case of delayed_job, this leaves a locked job in the database that won't get picked up by another worker.

I'd like to ensure that jobs eventually finish (unless there's an error). Given that, what's the best way to approach this?

I see two options. But I'd like to get other input:

Modify delayed_job to stop working on the current job (and release the lock) when it receives a SIGTERM.
Figure out a (programmatic) way to detect orphaned locked jobs and then unlock them.

Any thoughts?

Solution

TLDR:

Put this at the top of your job method:

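begin
  term_now = false
  # trap returns the handler that was installed before ours (the worker's)
  old_term_handler = trap 'TERM' do
    term_now = true
    # pass the signal on so the worker still learns it should stop
    old_term_handler.call
  end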

AND

Make sure this is called at least once every ten seconds:

  if term_now
    puts 'told to terminate'
    return true
  end

AND

At the end of your method, put this:

ensure
  trap 'TERM', old_term_handler
end
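
For reference, here is a minimal sketch of how the three fragments above might fit together in one method. The LongRunningJob class, its chunk count and process_chunk are hypothetical stand-ins for real work; only the TERM handling follows the fragments. The respond_to?(:call) guard is an addition for the case where no custom handler was installed, in which case trap returns a String such as "DEFAULT" rather than a Proc.

```ruby
# Hypothetical job class; only the TERM-handling pattern comes from the answer.
class LongRunningJob
  attr_reader :chunks_done

  def initialize(chunks)
    @chunks = chunks
    @chunks_done = 0
  end

  def perform
    term_now = false
    # trap returns the previously installed handler: a Proc if one was set,
    # otherwise a String such as "DEFAULT".
    old_term_handler = trap('TERM') do
      term_now = true
      old_term_handler.call if old_term_handler.respond_to?(:call)
    end

    @chunks.times do
      # Check the flag between short units of work.
      if term_now
        puts 'told to terminate'
        return true
      end
      process_chunk
    end
    true
  ensure
    # Always put the previous handler back, even on early return or error.
    trap('TERM', old_term_handler)
  end

  private

  # Stand-in for one short slice of the real work.
  def process_chunk
    sleep 0.01
    @chunks_done += 1
  end
end

# Usage: with no TERM signal the job simply runs all of its chunks.
job = LongRunningJob.new(3)
job.perform
puts job.chunks_done # prints 3
```

Returning true here treats an interrupted run as successful; return false instead if an interrupted job should count as unsuccessful.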

Explanation:

I was having the same problem and came upon this Heroku article.

The job contained an outer loop, so I followed the article and added a trap('TERM') and exit. However, delayed_job treats the resulting SystemExit as a failure and marks the task as failed.
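
A small illustration of why exit shows up as a failure: exit works by raising SystemExit, so any runner that rescues Exception around the job sees it as an error. The run_job method below is a hypothetical stand-in for such a runner, not delayed_job's actual code.

```ruby
# Hypothetical stand-in for a job runner that rescues every exception.
def run_job
  yield
  :succeeded
rescue Exception => e # deliberately broad: SystemExit is not a StandardError
  "failed with #{e.class}"
end

puts run_job { exit } # prints "failed with SystemExit"
```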

Because our trap now intercepts SIGTERM, the worker's own handler is never called, so the worker immediately restarts the job and then receives SIGKILL a few seconds later. Back to square one.

I tried a few alternatives to exit:

A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.
Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.
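
The difference between the two exits can be seen in isolation. In this sketch (plain Ruby on a platform that supports fork, with no delayed_job involved) a child process writes to a pipe from an ensure block, which stands in for the worker's cleanup. The cleanup_ran_after? helper is hypothetical.

```ruby
# Runs the given block in a forked child and reports whether the child's
# ensure block (a stand-in for the worker's cleanup) got to run.
def cleanup_ran_after?
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    begin
      yield
    ensure
      writer.write('cleanup ran')
    end
  end
  writer.close
  Process.wait(pid)
  ran = reader.read == 'cleanup ran'
  reader.close
  ran
end

puts cleanup_ran_after? { exit }  # prints true: SystemExit unwinds through ensure
puts cleanup_ran_after? { exit! } # prints false: the process ends immediately
```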

My final solution was the one given at the top of my answer. It comprises three parts:

Before we start the potentially long job, we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true.
But we must also grab the old_term_handler that the delayed_job worker code set (it is returned by trap) and remember to call it.
We must still return control to the delayed_job worker (Delayed::Worker) with enough time for it to clean up and shut down, so we should check term_now at least (just under) every ten seconds and return if it is true.
You can either return true or return false, depending on whether you want the job to be considered successful or not.
Finally, it is vital to remove your handler and reinstall the worker's one when you have finished. If you fail to do this, you will keep a dangling reference to the one we added, which can result in a memory leak if you add another one on top of it (for example, when the worker starts this job again).
