Slurm Stats

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. On a nightly basis we pull data from the scheduler for the day and display a summary for you when you log in to the cluster in an easy to read table. This should help you to understand how your jobs are performing as well as help you track your usage on a daily basis. Below is description of the statistics we are providing along with recommendations of where to go to get more information or to improve your performance.

The Statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     N/A |       1.0 |        N/A |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+

Above is what you will see when you login to the cluster if you have run jobs in the last day.  This data is pulled from the scheduler and is for jobs that finished in the 24-hour day listed. If you would like similar summary information but for a longer time period of time, use the seff-account command. For instance if you wanted the data for the last week you would do:

seff-account -u USERNAME -S 2024-08-13 -E 2024-08-20

For more detailed information on specific jobs you can use the seff and sacct commands. If you want summary plots of various statistics please see our XDMod instance (requires RC VPN). For fairshare usage plots see our Cannon and FASSE Fairshare Dashboards (requires RC VPN).  Below we will describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. This is as of the end of the day indicated. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Job State

If you have jobs that finished in the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they asked for GPU.  Next the total number of jobs in that category is given, followed by a break down by state. Completed jobs are those that finished cleanly with no errors that slurm could detect (there may still be errors that your code has generated internally). Canceled jobs are those jobs which were terminated via the scancel command either by yourself or the administrator. Failed jobs are those jobs that the scheduler has detected as having a faulty exit. Out of Memory jobs are those that hit the requested memory limit set in the job script. Timed Out jobs are those that hit the requested time limit set in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time. Average Used is the average amount actually used by the job. Average Allocated is the average amount of resources allocated by the job script for the job. Average Efficiency is the ratio of the amount of resource Used by the job to the amount of resources Allocated per job, averaged over all the jobs. In an ideal world your jobs should use exactly, or as close as possible, as much resources as they request and hence have a Average Efficiency of 100%. In practice, some jobs use all the resources they request, others do not.  Have unused resources that you have allocated means that your code is not utilizing all the space you’ve set aside for it. This wasted space ends up driving down your fairshare as cores, memory, and GPUs you do not use are still charged against your fairshare.

To learn more about which jobs are the culprits, we recommend using tools like seff-account, seff, and sacct. These tools can give you an overview of your jobs and more detailed information about specific jobs.  We have also have an in depth guide to Job Efficiency and Optimization which goes into more depth regarding techniques for improving your efficiency.

Finally in the case of GPUs, slurm does not currently gather statistics on actual usage and thus we can’t construct an efficiency metric. That said if you want to learn more about how your job is performing check out the Job Efficiency and Optimization doc as well as our GPU monitoring documentation. Tools like nvidia-smi and nvtop can be useful for monitoring your usage interactively.

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage of the jobs that finished on the day indicated. Note that this is the total usage for a job, so a job that ran for multiple days will have all its usage show up at once in this number and not just its usage for that day only. This usage is also not weighted by the type of CPU or GPU requested which can impact how much fairshare the usage would cost. For more on how we handle usage and fairshare, see our general fairshare document.

Wait Time

The number in the lower right hand corner of the Job Stats table in the Time row, is our average wait time per job. This is a useful number as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. Wait time varies depending on partition used, size of job, and relative priority of your jobs versus other jobs in the queue. To lower wait time investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found in the Job Efficiency and Optimization page.