Problem Description
I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of:
#!/bin/bash
#SBATCH <options>

# Running the actual job in background
srun my_program input.in output.out &

# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
# sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    # update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done
However, I'm not really convinced of this solution:
- sstat unfortunately doesn't show how many CPUs are used at the moment (only averages)
- MaxRSS is also not helpful if I try to record memory usage over time
- there still seems to be some error (the script doesn't stop after the job finishes)
Does anyone have an idea how to do this properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Recommended Answer
Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network IO for some technologies) into an HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
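For example, a submission script could request task profiling like this (a minimal sketch, assuming the HDF5 profiling plugin is enabled on your cluster; the 30-second interval set via --acctg-freq is only an illustration and may be capped by the site configuration, and my_program stands in for your executable):

#!/bin/bash
#SBATCH <other options>
# Record a per-task CPU/memory time series in the HDF5 profile
#SBATCH --profile=task
# Sampling interval in seconds (site limits on --acctg-freq may apply)
#SBATCH --acctg-freq=task=30

srun my_program input.in output.out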
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
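After the job finishes, the per-node profile files can typically be merged into a single HDF5 file with sh5util and inspected with standard HDF5 tools (a sketch only, assuming sh5util and h5dump are available to regular users and that sh5util writes its default output file name; 1234567 is a placeholder job id):

# merge the per-node profiles of job 1234567 into an HDF5 file (by default job_1234567.h5)
sh5util -j 1234567
# list the datasets it contains (one time series per tracked measure)
h5dump -n job_1234567.h5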
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your job never terminates simply because it will only terminate when the while loop terminates, and the while loop will only terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script: by the time the job is completed, the script has already been killed.