Problem Description
I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of:
#!/bin/bash
#SBATCH <options>

# Running the actual job in background
srun my_program input.in output.out &

# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
# sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    # update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done
However, I'm not really convinced of this solution:
- sstat unfortunately doesn't show how many CPUs are used at the moment (only averages)
- MaxRSS is also not helpful if I try to record memory usage over time
- there still seems to be some error (the script doesn't stop after the job finishes)
Does anyone have an idea how to do this properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Recommended Answer
Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network IO for some technologies) into an HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
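For example, a submission script could request task profiling like this (a minimal sketch, assuming the HDF5 profiling plugin is enabled on your cluster; the 30-second interval set via --acctg-freq is only an illustration and may be capped by the site configuration, and my_program stands in for your executable):

#!/bin/bash
#SBATCH <other options>
# Record a per-task CPU/memory time series in the HDF5 profile
#SBATCH --profile=task
# Sampling interval in seconds (site limits on --acctg-freq may apply)
#SBATCH --acctg-freq=task=30

srun my_program input.in output.out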
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
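After the job finishes, the per-node profile files can typically be merged into a single HDF5 file with sh5util and inspected with standard HDF5 tools (a sketch only, assuming sh5util and h5dump are available to regular users and that sh5util writes its default output file name; 1234567 is a placeholder job id):

# merge the per-node profiles of job 1234567 into an HDF5 file (by default job_1234567.h5)
sh5util -j 1234567
# list the datasets it contains (one time series per tracked measure)
h5dump -n job_1234567.h5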
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your job never terminates simply because it will only terminate when the while loop terminates, and the while loop will only terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script: by the time the job is completed, the script has already been killed.