Problem description
We have some software that uses Slurm to submit jobs to a queue, and it works as expected on our on-site cluster as well as on a variety of our clients' Slurm setups.
The issue we are seeing is that when we submit a multi-node job on CycleCloud's Slurm, the correct number of resources spins up; however, the jobs never seem to transition into the "Running" state. They remain stuck in the "Pending (Resources)" state.
I have run a test script that does the bare minimum to submit multi-node jobs. Those test jobs properly spin up the appropriate number of resources and run. So, clearly, something in our configuration must be off.
Can anyone share some pointers on where to look for the reasons jobs get stuck in a pending state?
Thanks,
Eric