Slurm Stats

Overview

When you log on to the FASRC clusters you will be greeted by Slurm Stats. Each night we pull the day’s data from the scheduler and display a summary in an easy-to-read table when you log in to the cluster. This should help you understand how your jobs are performing and help you track your usage on a daily basis. Below is a description of the statistics we provide, along with recommendations on where to get more information or how to improve your performance.

The Statistics

+---------------- Slurm Stats for Aug 20 -----------------------+
|                  End of Day Fairshare                         |
|                    test_lab: 0.003943                         |
+-------------------- Jobs By State ----------------------------+
|       Total | Completed | Canceled | Failed | Out of |  Timed |
|             |           |          |        | Memory |    Out |
| CPU:     25 |         4 |        1 |     20 |      0 |      0 |
| GPU:     98 |        96 |        1 |      1 |      0 |      0 |
+---------------------- Job Stats ------------------------------+
|        | Average | Average   | Average    | Total Usage /     |
|        | Used    | Allocated | Efficiency | Ave. Wait Time    |
| Cores  |     4.3 |       5.5 |      69.4% |    133.00 CPU Hrs |
| Memory |   22.2G |     27.2G |      68.3% |                   |
| GPUS   |     N/A |       1.0 |        N/A |    100.20 GPU Hrs |
| Time   |  14.57h |    45.38h |      45.9% |             0.00h |
+---------------------------------------------------------------+

Above is what you will see when you log in to the cluster if you have run jobs in the last day. This data is pulled from the scheduler and covers jobs that finished in the 24-hour day listed. If you would like similar summary information for a longer period of time, use the seff-account command. For instance, if you wanted the data for the last week you would run:

seff-account -u USERNAME -S 2024-08-13 -E 2024-08-20
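If you look at the last week regularly, the start date can be computed instead of typed by hand. The following is a minimal sketch, assuming GNU date is available (as on most Linux systems). The helper name week_range_cmd is hypothetical, and it only prints the seff-account command so that you can inspect it before running it.

```shell
#!/bin/bash
# Hypothetical helper: print a seff-account command covering the 7 days
# that end on END_DATE (YYYY-MM-DD, defaults to today).
week_range_cmd() {
    local user=$1
    local end=${2:-$(date -u +%F)}
    local start
    # GNU date: resolve "END_DATE 7 days ago" to a calendar date
    start=$(date -u -d "$end 7 days ago" +%F)
    echo "seff-account -u $user -S $start -E $end"
}

# Print the command for the current user and today's date
week_range_cmd "${USER:-USERNAME}"
```

Calling week_range_cmd jharvard 2024-08-20 prints the same command as the example above, with jharvard in place of USERNAME.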

For more detailed information on specific jobs you can use the seff and sacct commands. If you want summary plots of various statistics please see our XDMod instance (requires RC VPN). For fairshare usage plots see our Cannon and FASSE Fairshare Dashboards (requires RC VPN).  Below we will describe the various fields and what they mean.

Fairshare

The first thing listed is the fairshare for the lab accounts that you belong to. This is as of the end of the day indicated. Lower fairshare means lower priority for your jobs on the cluster.  For more on fairshare and how to improve your score see our comprehensive fairshare document.

Job State

If you have jobs that finished in the day indicated, then a breakdown of their end states is presented. Jobs are sorted first by whether or not they requested a GPU. Next the total number of jobs in that category is given, followed by a breakdown by state. Completed jobs are those that finished cleanly with no errors that Slurm could detect (there may still be errors that your code has generated internally). Canceled jobs are those that were terminated via the scancel command, either by yourself or by an administrator. Failed jobs are those that the scheduler has detected as having a faulty exit. Out of Memory jobs are those that hit the requested memory limit set in the job script. Timed Out jobs are those that hit the requested time limit set in the job script.

Used, Allocated, and Efficiency

For all the jobs that were not Canceled, we calculate statistics averaged over all the jobs run. These are broken down by Cores, Memory, GPUs, and Time. Average Used is the average amount actually used by the job. Average Allocated is the average amount of resources allocated by the job script for the job. Average Efficiency is the ratio of the amount of resource Used by the job to the amount of resource Allocated per job, averaged over all the jobs. In an ideal world your jobs would use exactly, or as close as possible to, the resources they request and hence have an Average Efficiency of 100%. In practice, some jobs use all the resources they request and others do not. Having unused resources that you have allocated means that your code is not utilizing all the space you’ve set aside for it. This wasted space ends up driving down your fairshare, as cores, memory, and GPUs you do not use are still charged against your fairshare.
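Because the efficiency is computed per job and then averaged, it generally differs from dividing Average Used by Average Allocated. The sketch below illustrates that arithmetic with awk. The three jobs and the helper name avg_efficiency are made up for illustration and are not how the Slurm Stats table is actually generated.

```shell
#!/bin/bash
# Hypothetical input: one job per line as "cores_used cores_allocated".
# Average the per-job ratio used/allocated and print it as a percentage.
avg_efficiency() {
    awk '$2 > 0 { sum += $1 / $2; n++ }
         END { if (n) printf "%.1f%%\n", 100 * sum / n }'
}

# Three made-up jobs: 2 of 8 cores used, 4 of 4 used, 6 of 8 used
printf '%s\n' "2 8" "4 4" "6 8" | avg_efficiency   # prints 66.7%
```

For these made-up jobs the average used is 4 cores and the average allocated is about 6.7 cores, a ratio of 60%, while the per-job average is 66.7%. A single badly oversized job can therefore move the reported efficiency noticeably.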

To learn more about which jobs are the culprits, we recommend using tools like seff-account, seff, and sacct. These tools can give you an overview of your jobs and more detailed information about specific jobs. We also have an in-depth guide to Job Efficiency and Optimization which covers techniques for improving your efficiency.

Finally in the case of GPUs, slurm does not currently gather statistics on actual usage and thus we can’t construct an efficiency metric. That said if you want to learn more about how your job is performing check out the Job Efficiency and Optimization doc as well as our GPU monitoring documentation. Tools like nvidia-smi and nvtop can be useful for monitoring your usage interactively.

Total Usage

Total usage is the total number of hours allocated for CPUs and GPUs respectively. This is a measure of your total usage of the jobs that finished on the day indicated. Note that this is the total usage for a job, so a job that ran for multiple days will have all its usage show up at once in this number and not just its usage for that day only. This usage is also not weighted by the type of CPU or GPU requested which can impact how much fairshare the usage would cost. For more on how we handle usage and fairshare, see our general fairshare document.
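As a worked example of this accounting, the sketch below sums allocated cores times elapsed time over a set of finished jobs. The input format and the helper name total_cpu_hours are assumptions made for illustration. Note that the multi-day job contributes all of its hours at once.

```shell
#!/bin/bash
# Hypothetical input: one finished job per line as
# "allocated_cores elapsed_seconds".
total_cpu_hours() {
    awk '{ total += $1 * $2 / 3600 }
         END { printf "%.2f CPU Hrs\n", total }'
}

# An 8-core job that ran for 36 hours plus a 1-core job that ran for 1 hour
printf '%s\n' "8 129600" "1 3600" | total_cpu_hours   # prints 289.00 CPU Hrs
```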

Wait Time

The number in the lower right-hand corner of the Job Stats table, in the Time row, is the average wait time per job. This is a useful number as your total Time to Science (TtS) is your wait time (aka pending time) plus your run time. Wait time varies depending on the partition used, the size of the job, and the relative priority of your jobs versus other jobs in the queue. To lower wait time, investigate using a different partition, submitting to multiple partitions, resizing your job, or improving your fairshare. A deeper discussion can be found in the Job Efficiency and Optimization page.
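Submitting to multiple partitions is done by giving -p a comma-separated list, in which case Slurm starts the job in whichever listed partition can run it first. The following is a minimal sketch. The partition pair and resource values are only examples, and the helper name report_partition is hypothetical.

```shell
#!/bin/bash
#SBATCH -p sapphire,shared   # Comma-separated list of acceptable partitions
#SBATCH -c 1                 # Number of cores
#SBATCH -t 0-01:00           # Runtime in D-HH:MM
#SBATCH --mem=1000           # Memory in MB

# Slurm sets SLURM_JOB_PARTITION inside a running job; outside of a job
# the variable is unset and "unknown" is printed instead.
report_partition() {
    echo "running in partition: ${SLURM_JOB_PARTITION:-unknown}"
}

report_partition
```

Only list partitions whose limits all fit your job, since the job may land in any of them.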

Running Jobs

Tip: Along with this document, please also see our Data Management Best Practices guide.

Overview: The FASRC cluster uses Slurm to manage jobs

Slurm (aka SLURM) is a queue management system and stands for Simple Linux Utility for Resource Management. Slurm was originally developed at the Lawrence Livermore National Lab, but is now primarily developed by SchedMD. Slurm is the scheduler that currently runs some of the largest compute clusters in the world.
Slurm is similar in many ways to most other queuing systems. You write a batch script then submit it to the queue manager. The queue manager then schedules your job to run on the queue (or partition in Slurm parlance) that you designate. Below we will provide an outline of how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.
Slurm has a number of valuable features compared to other job management systems:

  • Stop and Requeue: Slurm’s ability to kill and requeue is superior to that of other systems. It waits for jobs to be cleared before scheduling the high priority job. It also requeues on memory rather than just on core count.
  • Memory requests are sacrosanct in Slurm. Thus the amount of memory you request at runtime is guaranteed to be there. No one can infringe on that memory space and you cannot exceed the amount of memory that you request.
  • Slurm has a concept called GRES (Generic Resource) that allows for fair scheduling on GPUs and other accelerators.  This is very handy in a dynamic research environment like RC’s where various different hardware technologies can be put into the scheduler.
  • Slurm has a back-end database which stores historical information about the cluster. This information can be queried by users who are curious about how many resources they have used.  It is also used for adjudicating job priority on the cluster.

Cluster jobs are generally run from the command line

Once you’ve gone through the account setup procedure, you can login to the cluster via ssh to a login node and begin using the cluster.

FASRC cluster nodes run the CentOS distribution of the Linux operating system and commands are run under the “bash” shell. As with most supercomputers work is done via command line, typing commands into a prompt, and not via a GUI (graphical user interface).  There are a number of Linux and bash references, cheat sheets and tutorials available on the web. RC’s own training is also available.

Cluster applications should not be run from login nodes

Once you have logged in to the cluster, you will be on one of a handful of login nodes. These nodes are shared entry points for all users and so cannot be used to run computationally intensive software. Think of them as front-ends for your work, not the place where you do your work.
Simple file copies, light text processing or editing, etc. are fine, but you should not run large graphical applications like Matlab, Mathematica, RStudio, or computationally intensive command line tools. A culling program runs on these nodes that will kill any application that exceeds memory and computational limits.
For interactive work, please start an interactive session or, if you require a GUI use our VDI system.


Software – Using modules to access software

Because of the diversity of projects currently supported by FAS, and because the cluster is not a single computer on which you install software directly, thousands of applications and libraries are supported on the FASRC cluster. Technically, it is impossible to include all of these tools in every user’s environment.

Search available modules here
(https://portal.rc.fas.harvard.edu/apps/modules)

The Research Computing and Informatics departments have developed an enhanced Linux module system, Helmod, based on the hierarchical Lmod module system from TACC. Helmod enables applications much the same way as Linux modules, but also prevents multiple versions of the same tool from being loaded at the same time and separates tools that use particular compilers or MPI libraries entirely.
A module load command enables a particular application in the environment, mainly by adding the application to your PATH variable and pulling in dependencies. For example, to enable the 3.4.2 version of the R package:
module load R/3.4.2-fasrc01
Once a module is loaded inside a session/shell, it is available just as though you’d just installed it.

[jharvard@boslogin01 ~]$ which R
 R: Command not found.
[jharvard@boslogin01 ~]$ module load R/3.4.2-fasrc01
[jharvard@boslogin01 ~]$ which R
/n/helmod/apps/centos7/Core/R_core/3.4.2-fasrc01/bin/R


Loading more complex modules can affect a number of environment variables including PYTHONPATH, LD_LIBRARY_PATH, PERL5LIB, etc. Modules may also load dependencies. Bear in mind that you will need to include module load statements in your SBATCH scripts. If you load a module on, say, a login node and then launch a job, that job will run on another node and in a new shell where the module has not been loaded.
To determine what has been loaded in your environment, the module list command will print all loaded modules.
The module purge command will remove all currently loaded modules. This is particularly useful if you have to run incompatible software (e.g. python 2.x or python 3.x). The module unload command will remove a specific module.
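Putting these points together, a batch script can start from a clean module environment and then load exactly what it needs. The following is a minimal sketch. The function name setup_modules is hypothetical, and the guard around the module command is only there so the script fails gently on a machine without the module system.

```shell
#!/bin/bash
#SBATCH -c 1                # Number of cores
#SBATCH -t 0-00:10          # Runtime in D-HH:MM
#SBATCH -p serial_requeue   # Partition to submit to
#SBATCH --mem=100           # Memory in MB

# Clear any loaded modules, load what the job needs, and record the result
# in the job output so the environment can be checked afterwards.
setup_modules() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not found; skipping module setup" >&2
        return 0
    fi
    module purge
    module load python/3.10.9-fasrc01
    module list
}

setup_modules
```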
Finding the modules that are appropriate for your needs can be done in a couple of different ways. The module search page allows you to browse and search the list of modules that have been deployed to the cluster.
There are a number of command line options for module searching, including the module avail command for browsing the entire list of applications and the module-query command for keyword searching. But please note: the online module search is much more thorough and has additional information on each module. module avail may not show you all available options.
Though there are many modules available by default, the hierarchical Helmod system enables additional modules after loading certain key libraries such as compilers and MPI packages. The module avail command output reflects this.

[jharvard@boslogin01 ~]$ module load gcc/7.1.0-fasrc01
[jharvard@boslogin01 ~]$ module avail
---------------------------- /n/helmod/modulefiles/centos7/Core ----------------------------
ADOL-C/2.5.2-fasrc01                  bzip2/1.0.6-fasrc01                   julia/0.6.2-fasrc01      phyml/2014Oct16-fasrc01
ATAC-seq/0.1-fasrc02                  cd-hit/4.6.4-fasrc02                  julia/0.6.2-fasrc02      plink/1.90-fasrc01
Anaconda/5.0.1-fasrc01                cellranger/2.1.0-fasrc01              julia/0.6.3-fasrc01      progressiveCactus/20180313-fasrc01
Anaconda3/5.0.1-fasrc01               centos6/0.0.1-fasrc01                 julia/0.6.3-fasrc02 (D)  proj/4.9.3-fasrc01
BEAST/2.4.8-fasrc01                   centrifuge/1.0.3.5c51ac-fasrc02       kalign/2.0-fasrc01       proj/5.0.1-fasrc01 (D)
BaitFisher-package/e92dbf28b-fasrc01  centrifuge/1.0.3.8a9a820-fasrc01 (D)  kallisto/0.43.1-fasrc02  prokka/1.12-fasrc02
CLAPACK/3.2.1-fasrc01                 clustalo/1.2.0-fasrc01                kraken/1.1-fasrc01       psmc/0.6.5-fasrc01
--More--


The module-query command supports more sophisticated queries and returns additional information for modules. If you query by the name of an application or library (e.g. hdf5), you’ll retrieve a consolidated report showing all of the modules grouped together for a particular application. The online module search is much more thorough as it will show you all available versions.

[jharvard@boslogin01 ~]$ module-query hdf5
------------------------------------------------------------------------------------------------------------
hdf5
------------------------------------------------------------------------------------------------------------
Built for: centos7
Description:
HDF5 is a data model, library, and file format for storing and managing data. It supports
an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for
high volume and complex data. HDF5 is portable and is extensible, allowing applications to
evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for
managing, manipulating, viewing, and analyzing data in the HDF5 format. HDF5 is used as a
basis for many other file formats, including NetCDF.
Versions:
hdf5/1.10.1-fasrc03..................... Core Core module for CentOS 7
hdf5/1.10.1-fasrc02..................... Comp
hdf5/1.10.1-fasrc01..................... MPI
hdf5/1.8.12-fasrc12..................... MPI
hdf5/1.8.12-fasrc09..................... Comp Compiler-specific build
hdf5/1.8.12-fasrc08..................... Core Added c++ bindings
To find detailed information about a module, enter the full name.
For example,
module-query hdf5/1.8.12-fasrc08


A query for a single module, however, will return details about that build including module load statements and build comments (if any exist).
 

[jharvard@boslogin01 ~]$ module-query hdf5/1.10.1-fasrc01
------------------------------------------------------------------------------------------------------------
hdf5 : hdf5/1.10.1-fasrc01
------------------------------------------------------------------------------------------------------------
Built for: centos7
Description:
HDF5 is a data model, library, and file format for storing and managing data. It supports
an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for
high volume and complex data. HDF5 is portable and is extensible, allowing applications to
evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for
managing, manipulating, viewing, and analyzing data in the HDF5 format. HDF5 is used as a
basis for many other file formats, including NetCDF.
This module can be loaded as follows:
module load gcc/7.1.0-fasrc01 openmpi/2.1.0-fasrc02 hdf5/1.10.1-fasrc01
module load gcc/7.1.0-fasrc01 mvapich2/2.3b-fasrc02 hdf5/1.10.1-fasrc01
module load intel/17.0.4-fasrc01 openmpi/2.1.0-fasrc02 hdf5/1.10.1-fasrc01
module load intel/17.0.4-fasrc01 mvapich2/2.3b-fasrc02 hdf5/1.10.1-fasrc01
This module also loads:
zlib/1.2.8-fasrc07 szip/2.1-fasrc02

For more details about the Helmod module system, check out the Software on The Cluster page (this has been updated to reflect our upgrade to CentOS7).
For more details about errors in loading modules after the O3 upgrade, check out the CentOS7 FAQ.


SLURM resources

The primary source for documentation on Slurm usage and commands can be found at the Slurm site. You can also use the docs at the SchedMD site, but note that these always describe the latest version of Slurm, which may differ from the version we run. A great way to get details on the Slurm commands for the version of Slurm we run is the man pages available from the cluster. For example, if you type the following command:
man sbatch
you’ll get the manual page for the sbatch command.
Though Slurm is not as common as SGE or LSF, documentation is readily available.

Summary of Slurm commands

The table below shows a summary of Slurm commands. These commands are described in more detail below along with links to the Slurm doc site.

Task | SLURM | SLURM Example
Submit a batch serial job | sbatch | sbatch runscript.sh
Run a script or application interactively (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1000 [script or app]
Start interactive session (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1000
Kill a job | scancel | scancel 999999
View status of your jobs | squeue | squeue -u akitzmiller
Check current job by id number | sacct | sacct -j 999999
Schedule recurring batch job | scrontab | see scrontab document for example

NOTE: No single user can submit more than 10,000 jobs at a time.

Slurm partitions

Partition is the term that Slurm uses for queues. Partitions can be thought of as a set of resources and parameters around their use (See also: Convenient Slurm Commands).  You can find out what partitions you have access to using the spart command.  FASSE has different partitions than Cannon.  When no resources have been requested explicitly, the defaults allocated to a job on Cannon or FASSE are: the serial_requeue partition, 10 minutes of run time, 1 core, and 100 MB of memory. Here is a list of Cannon partitions:

Partition | Nodes | Cores per Node | CPU Core Types | Mem per Node (GB) | Time Limit | Max Jobs | Max Cores | MPI Suitable? | GPU Capable? | /scratch size (GB)
sapphire | 192 | 112 | Intel "Sapphire Rapids" | 990 | 3 days | none | none | Yes | No | 396
shared | 310 | 48 | Intel "Cascade Lake" | 184 | 3 days | none | none | Yes | No | 68
bigmem | 4 | 112 | Intel "Sapphire Rapids" | 2000 | 3 days | none | none | No | No | 396
bigmem_intermediate | 3 | 64 | Intel "Ice Lake" | 2000 | 14 days | none | none | No | No | 396
gpu | 36 | 64 | Intel "Ice Lake" | 990 | 3 days | none | none | Yes | Yes (4 A100/node) | 396
intermediate | 12 | 112 | Intel "Sapphire Rapids" | 990 | 14 days | none | none | Yes | No | 396
unrestricted | 8 | 48 | Intel "Cascade Lake" | 184 | none | none | none | Yes | No | 68
test | 12 | 112 | Intel "Sapphire Rapids" | 990 | 12 Hours | 5 | 112 | Yes | No | 396
gpu_test | 14 | 64 | Intel "Ice Lake" | 499 | 12 Hours | 5 | 64 | Yes | Yes (8 A100 MIG 3g.20GB/node) - Limit 8 per job | 172
remoteviz | down | 32 | Intel "Cascade Lake" | 373 | 3 days | none | none | No | Shared V100 GPUs for rendering | 396
serial_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | No | Yes | varies
gpu_requeue | varies | varies | Intel (mixed) | varies | 3 days | none | none | No | Yes | varies
PI/Lab nodes | varies | varies | varies | varies | none | none | none | varies | varies | varies

  • sapphire – The sapphire partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This queue is governed by backfill and FairShare (explained below). The sapphire partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This partition has 192 nodes connected by an InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Sapphire Rapids CPUs, 1004 GB of RAM, and 400 GB of local scratch space. Each Intel CPU has 56 cores and 100 MB of cache. When submitting MPI jobs on the sapphire partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by the InfiniBand fabric, there are multiple switches routing the MPI traffic and Slurm will by default schedule you wherever it can find space.  Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric.  Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
  • shared – The shared partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This queue is governed by backfill and FairShare (explained below). The shared partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This partition has 310 nodes connected by an InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Cascade Lake CPUs, 184 GB of RAM, and 70 GB of local scratch space. Each node has 48 cores in total (24 per CPU), and each CPU has 48 MB of cache. When submitting MPI jobs on the shared partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by the InfiniBand fabric, there are multiple switches routing the MPI traffic and Slurm will by default schedule you wherever it can find space.  Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric.  Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
  • bigmem – This partition should be used for large memory work requiring greater than 1000 GB RAM per job, like genome / transcript assemblies. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler. There is a 3 day limit for work here. MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 4 nodes with 2000 GB of RAM.
  • bigmem_intermediate – This partition should be used for large memory work requiring greater than 1000 GB RAM per job, like genome / transcript assemblies. Jobs requesting less than 1000 GB RAM are automatically rejected by the scheduler. There is a 14 day limit for work here. MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 3 nodes with 2000 GB of RAM.
  • ultramem Deprecated as of 1/22/2024. Please use bigmem instead.
  • gpu  – This 36 node partition is for individuals wishing to use GPGPU resources. One will need to include #SBATCH --gres=gpu:n where n=1-4 in your SLURM submission scripts. Each node has 64 cores and is equipped with 4 x NVidia A100s per node. There are also private partitions that may have more GPU resources, but to which access may be controlled by the owners. See our GPU Computing section for more info on using and specifying GPU resources.
  • gpu_mig – Deprecated as of 1/22/2024. Please use gpu_test instead.
  • intermediate – Serial and parallel (including MPI) jobs are permitted on this partition, which is intended for runs needing 3 to 14 days of runtime. This partition has an allocation of 12 nodes of the same configuration as above for the sapphire partition (112 cores per node, as listed in the table above).
  • unrestricted – Serial and parallel (including MPI) jobs are permitted on this partition, and there is no restriction on run time. Given this, there is no guarantee of 100% uptime. Running on this partition is done at the user’s own risk. Users should understand that if the queue is full it could take weeks or even months for a job to be scheduled to run. unrestricted is made up of 8 nodes of the same configuration as above for the shared partition.
  • test – This partition is dedicated for interactive (foreground / live) work and for testing (interactively) code before submitting in batch and scaling. Small numbers (1 to 5) of serial and parallel jobs with small resource requirements (RAM/cores) are permitted on this partition; large numbers of interactive jobs or those requiring large resource requirements should really be done on another partition. This partition is made up of 12 nodes of the same configuration as above for the sapphire partition. This smaller queue has a 12 hour maximum run time. This queue has a maximum of 112 cores and 1000 GB RAM. Jobs in this queue are not charged fairshare.
  • gpu_test – This 14 node partition is for individuals wishing to test GPGPU resources. One will need to include #SBATCH --gres=gpu:n where n=1-8 in your SLURM submission scripts. These nodes have 64 cores and are equipped with 4 x NVidia A100s in Multi-Instance GPU (MIG) mode. Each GPU has two 3g.20GB MIG instances. This queue has a maximum of 5 jobs, 64 cores, 1000 GB RAM, 8 GPUs, and a 12 hour run time. See our GPU Computing section for more info on using and specifying GPU resources. Jobs in this queue are not charged fairshare.
  • remoteviz – This single node partition is for individuals who wish to use shared GPUs for rendering graphics.  The V100 cards on this node are in shared mode and are intended for rendering rather than for computational use.  You do not need to request a gpu to use this partition. For computation please use the gpu and gpu_test partitions.
  • serial_requeue – This partition is appropriate for single core (serial) jobs, jobs that require up to 8 cores for small periods of time (less than 1 day), or job arrays where each job instance uses less than 8 cores. The maximum runtime for this queue is 3 days. MPI jobs or jobs that are tightly coupled across multiple nodes are not appropriate for this partition. If you do not specify a partition you will be sent to this partition by default. As this partition is made up of an assortment of nodes owned by other groups in addition to the general nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in. Because serial_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than the shared partition. Since jobs may be killed, requeued, and run a 2nd time, ensure that the jobs are a good match for this partition. For example, jobs that append output would not be good for serial_requeue unless the data files were zeroed out at the start to ensure output from a previous (killed) run was removed. Also, to ensure your job need not redo all its compute again, it would be advisable to have breakpoints or branching instructions to bypass parts of work that have already been completed. We do advise that you use the --open-mode=append to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.
  • gpu_requeue  – This partition is appropriate for gpu jobs that require small periods of time (less than 1 day). The maximum runtime for this queue is 3 days. One will need to include #SBATCH --gres=gpu:1 in your SLURM submission scripts to get access to this partition. MPI jobs are not appropriate for this partition. As this partition is made up of an assortment of gpu nodes owned by other groups in addition to the public nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in. Because gpu_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than the shared partition. Since jobs may be killed, requeued, and run a 2nd time, ensure that the jobs are a good match for this partition. For example, jobs that append output would not be good for gpu_requeue unless the data files were zeroed out at the start to ensure output from a previous (killed) run was removed. Also, to ensure your job need not redo all its compute again, it would be advisable to have breakpoints or branching instructions to bypass parts of work that have already been completed. We do advise that you use the --open-mode=append to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events. See our GPU Computing section for more info on using and specifying GPU resources.
  • Please refer to the HUCE and SEAS partition pages for details about those partitions.
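
The advice above for serial_requeue and gpu_requeue (use --open-mode=append, and skip work that has already been completed) can be sketched as follows. This is one possible approach under stated assumptions, not a prescribed method: the run_step helper, the marker files, and the CHECKPOINT_DIR variable are all hypothetical, and the echo commands stand in for real work. SLURM_RESTART_COUNT is set by Slurm when a job has been requeued.

```shell
#!/bin/bash
#SBATCH -p serial_requeue     # Partition where jobs may be killed and requeued
#SBATCH -c 1                  # Number of cores
#SBATCH -t 0-02:00            # Runtime in D-HH:MM
#SBATCH --mem=1000            # Memory in MB
#SBATCH --open-mode=append    # Keep log contents across requeues
#SBATCH -o myoutput_%j.out    # STDOUT, %j inserts jobid
#SBATCH -e myerrors_%j.err    # STDERR, %j inserts jobid

# Hypothetical location for marker files that record finished steps.
CHECKPOINT_DIR=${CHECKPOINT_DIR:-./checkpoints}
mkdir -p "$CHECKPOINT_DIR"

# run_step NAME COMMAND...: run COMMAND unless NAME already has a marker
# file, and create the marker only if COMMAND succeeds.
run_step() {
    local name=$1
    shift
    local marker="${CHECKPOINT_DIR}/${name}.done"
    if [ -e "$marker" ]; then
        echo "skipping ${name} (already completed)"
        return 0
    fi
    "$@" && touch "$marker"
}

echo "starting run (restart count: ${SLURM_RESTART_COUNT:-0})"
run_step preprocess echo "preprocessing..."
run_step analyze    echo "analyzing..."
```

On a requeued run, steps whose marker files exist are skipped, and the appended log shows both the original and the restarted attempt.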

Slurm Limits

Slurm has several internal limits that users submitting large jobs or large numbers of jobs should be aware of and should plan around.  These limits exist to prevent any one person from taking over the cluster and also serve to prevent the cluster being overwhelmed due to poorly formed jobs.  Users must work within these limits and should plan their work accordingly.  This is typically done by breaking up their workflow into smaller chunks or by deliberately serializing their jobs to increase the job time and decrease the number of cores needed.  The limits are as follows:

  • Maximum Number of Jobs per User: 10,100.  This is meant to prevent any one user from monopolizing the cluster.
  • Maximum Array Size: 10,000.  This is both array index and size.  This is meant to prevent any one user from monopolizing the cluster.  Note that each array index counts as a single job for purposes of the Maximum Number of Jobs per User, so this is intentionally redundant.
  • Maximum Number of Steps: 40,000.  A job step is recorded by slurm for each invocation of srun by a job.  This is meant to prevent run-away jobs.
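
One way to break a large workflow into chunks that stay within these limits is a job array in which each task handles a range of inputs. The following is a minimal sketch, assuming 25,000 inputs processed 100 per task. Those numbers and the helper name chunk_range are made up for illustration. The %50 suffix on --array limits how many array tasks run at once.

```shell
#!/bin/bash
#SBATCH --array=1-250%50    # 250 array tasks, at most 50 running at once
#SBATCH -c 1                # Number of cores per task
#SBATCH -t 0-01:00          # Runtime in D-HH:MM
#SBATCH -p serial_requeue   # Partition to submit to
#SBATCH --mem=1000          # Memory in MB

# chunk_range TASK_ID CHUNK_SIZE TOTAL: print the first and last input
# index handled by one array task (1-based, last chunk may be shorter).
chunk_range() {
    local task_id=$1 chunk_size=$2 total=$3
    local first=$(( (task_id - 1) * chunk_size + 1 ))
    local last=$(( task_id * chunk_size ))
    if [ "$last" -gt "$total" ]; then
        last=$total
    fi
    echo "$first $last"
}

# Inside a job array Slurm sets SLURM_ARRAY_TASK_ID; default to 1 otherwise.
set -- $(chunk_range "${SLURM_ARRAY_TASK_ID:-1}" 100 25000)
echo "this task would process inputs $1 to $2"
```

With this layout 25,000 inputs need only 250 array indices instead of 25,000 separate jobs.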

Submitting batch jobs using the sbatch command

The main way to run jobs on the cluster is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.
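
Because sbatch returns immediately, it can be handy to capture the job id at submission time and use it to check on the job. The sketch below assumes it is run on a node where the Slurm commands are available. The helper names are hypothetical, and the cluster name in the last line is only sample input. The --parsable option makes sbatch print just the job id, optionally followed by a semicolon and the cluster name.

```shell
#!/bin/bash
# sbatch --parsable prints "jobid" or "jobid;cluster"; keep only the job id.
parse_job_id() {
    echo "${1%%;*}"
}

# Submit a script, then show its queue entry and accounting record.
# This function needs the Slurm commands and is not called here.
submit_and_report() {
    local raw jobid
    raw=$(sbatch --parsable "$1") || return 1
    jobid=$(parse_job_id "$raw")
    echo "submitted job $jobid"
    squeue -j "$jobid"
    sacct -j "$jobid" --format=JobID,State,Elapsed
}

# Usage on the cluster: submit_and_report runscript.sh
parse_job_id "999999;cannon"   # prints 999999
```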

Tip: You can see your jobs on portal.rc.fas.harvard.edu/jobs

A typical submission script, in this case loading a Python module and having Python print a message, will look like this:

NOTE: It is important to keep all #SBATCH lines together and at the top of the script; no comments, bash code, or variable settings should appear until after the #SBATCH lines. Otherwise, Slurm may assume it’s done interpreting and skip any that follow.

#!/bin/bash
#SBATCH -c 1                # Number of cores (-c)
#SBATCH -t 0-00:10          # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p serial_requeue   # Partition to submit to
#SBATCH --mem=100           # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o myoutput_%j.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors_%j.err  # File to which STDERR will be written, %j inserts jobid

# load modules
module load python/3.10.9-fasrc01

# run code
python -c 'print("Hi there.")'

In general, the script is composed of 4 parts.

  • the #!/bin/bash line allows the script to be run as a bash script
  • the #SBATCH lines are technically bash comments, but they set various parameters for the SLURM scheduler
  • loading any necessary modules and setting any variables, paths, etc.
  • the command line itself, in this case calling python and having it print a message

The #SBATCH lines shown above set key parameters. N.B. The Slurm system copies many environment variables from your current session to the compute host where the script is run including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./project/myfiles/myfile.txt).
#SBATCH -c 1
This line sets the number of cores (threads) that you’re requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, Slurm assumes -c 1.  For more on parallel work see: threads, MPI
#SBATCH -t 0-00:10
This line specifies the running time for the job, here given in D-HH:MM form (10 minutes). A bare number is read as minutes. Other acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. If your job runs longer than the value you specify here, it will be canceled. Jobs have a maximum run time which varies by partition (see table above), though extensions can be granted. There is no fairshare penalty for over-requesting time, though it will be harder for the scheduler to backfill your job if you overestimate. NOTE! If this parameter is omitted on any partition, your job will be given the default of 10 minutes.
#SBATCH -p serial_requeue
This line specifies the Slurm partition (AKA queue) under which the script will be run. The serial_requeue partition is good for routine jobs that can handle being occasionally stopped and restarted. PENDING times are typically short for this queue. See the partitions description above for more information.  If you do not specify this parameter you will be given serial_requeue by default.
#SBATCH --mem=100
The FASRC cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g. MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you’re requesting, whether it is on one node or multiple nodes. If this parameter is omitted, then you are granted 100 MB by default.  Chances are good that your job will be killed as it will likely go over this amount, so one should always specify how much memory you require.
#SBATCH -o myoutput_%j.out
This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
#SBATCH -e myerrors_%j.err
This line specifies the file to which standard error will be appended. Slurm submission and processing errors will also appear in the file. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, standard error will be written to the same file as standard output (slurm-JOBID.out in the current directory by default).
#SBATCH --test-only
While not shown above, adding this option to your script will tell the scheduler to return information on what would happen if you submit this job. This is a good and easy way to determine if your script is viable as well as to get a rough estimate of how long it would take to schedule under the current queue load.
#SBATCH --account=some_lab
If you are in more than one lab, please ensure that you are charging your Fairshare to the appropriate group by using this option in all of your job scripts and specifying the lab group.
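The time formats accepted by -t, listed above, can be normalized to minutes with a small helper, which is handy when comparing a request against a partition limit. This is only a sketch: the function name slurm_time_to_minutes is hypothetical (it is not part of Slurm), and rounding partial minutes up is an assumption made here.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): convert a -t value to whole minutes.
# Handles: minutes, minutes:seconds, hours:minutes:seconds,
#          days-hours, days-hours:minutes, days-hours:minutes:seconds
slurm_time_to_minutes() {
    local spec="$1" days=0 hours=0 minutes=0 seconds=0 rest total
    local -a parts
    if [[ "$spec" == *-* ]]; then
        # A dash means the part before it is days and the rest starts at hours
        days="${spec%%-*}"
        rest="${spec#*-}"
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        IFS=: read -r -a parts <<< "$spec"
        case "${#parts[@]}" in
            1) minutes="${parts[0]}" ;;
            2) minutes="${parts[0]}"; seconds="${parts[1]}" ;;
            3) hours="${parts[0]}"; minutes="${parts[1]}"; seconds="${parts[2]}" ;;
            *) echo "unrecognized time format: $spec" >&2; return 1 ;;
        esac
    fi
    # Fill in any fields that were left out
    hours="${hours:-0}"; minutes="${minutes:-0}"; seconds="${seconds:-0}"
    # 10# forces base 10 so that values such as "08" are not read as octal
    total=$(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
    # Round partial minutes up (an assumption of this sketch)
    echo $(( (total + 59) / 60 ))
}

slurm_time_to_minutes 0-00:10
```

For example, slurm_time_to_minutes 0-12:30 prints 750, matching the 12.5 hours that 0-12:30 represents.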

Notifications by email:

The scheduler can send email to you for various job states (FAIL and END being the most useful). But please bear in mind that this must be used responsibly as one user can quickly overwhelm the mail system and affect the notifications of all users by clogging up the mail queue. Keep in mind that tens or even hundreds of thousands of jobs may be in flight at a given time. This is why below we will strongly caution against using the ALL mail type. If you are using a metascheduler, job arrays, or just many jobs, please try to avoid adding too much burden to the email queue; sending hundreds or thousands of emails can cause email backups, not to mention fill up your inbox.

To add mail notification to your job script you can use the --mail-type #SBATCH option. Example:

#SBATCH --mail-type=END #This command would send an email when the job ends.

Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL (Please avoid: Equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT), INVALID_DEPEND (dependency never satisfied), STAGE_OUT (burst buffer stage out and teardown completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), TIME_LIMIT_50 (reached 50 percent of time limit) and ARRAY_TASKS (Please also avoid: Send emails for each array task).

Multiple type values may be specified in a comma separated list. The user to be notified is indicated with --mail-user. Unless the ARRAY_TASKS option is specified, mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.

#SBATCH --mail-user=ajk@123.com #Email to which notifications will be sent

It is important to accurately request resources, especially memory

The FASRC cluster is a large, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs. If insufficient memory is allocated, your program may crash (often in an unintelligible way); if too much memory is allocated, resources that could be used for other jobs will be wasted. Additionally, your “fairshare”, a number used in calculating the priority of your job for scheduling purposes, can be adversely affected by over-requesting. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).
Many scientific computing tools can take advantage of multiple processing cores, but many cannot. A typical R script, for example will not use multiple cores. On the other hand, RStudio, a graphical console for R is a Java program that is improved substantially by using multiple cores. Or, you can use the Rmpi package and spawn “slaves” that correspond to the number of cores you’ve selected.
The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single core jobs, the two are equivalent). --mem sets total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 Gb with --mem, each core will receive 2 Gb RAM. If you specify 4 Gb with --mem-per-cpu, each core will receive 4 Gb for a total of 8 Gb.  A good distinction between the two is that --mem-per-cpu is for MPI jobs and --mem is for all other types.
The #SBATCH --test-only option is a good way to sanity check your scripts before submitting them. Just remember to remove it after running your test.
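The --mem versus --mem-per-cpu arithmetic described above can be checked with two one-line helpers. This is a minimal sketch; both function names are hypothetical, values are in MB, and the per-core figure for --mem is only an average because --mem is a single pool shared by all cores.

```shell
#!/bin/bash
# Hypothetical helpers for the memory arithmetic (all values in MB).

# --mem-per-cpu: total memory is the per-core value times the core count
total_mb_from_mem_per_cpu() {
    local cores="$1" mb_per_cpu="$2"
    echo $(( cores * mb_per_cpu ))
}

# --mem: the pool is shared, so this is the average share per core
avg_mb_per_core_from_mem() {
    local cores="$1" total_mb="$2"
    echo $(( total_mb / cores ))
}

# Two cores and 4 GB (4096 MB), as in the example above
echo "--mem=4096         -> $(avg_mb_per_core_from_mem 2 4096) MB per core on average"
echo "--mem-per-cpu=4096 -> $(total_mb_from_mem_per_cpu 2 4096) MB in total"
```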

Monitoring job progress with squeue and sacct

squeue and sacct are two different commands that allow you to monitor job activity in SLURM. sacct talks directly to the slurm accounting database and provides both live and historic data (up to 6 months). sacct without any options will print out all the jobs you have run in the past day. sacct -j 999999 will show you a specific job. Note that sacct data is almost, but not quite, live. In addition, the various accounting fields (such as memory usage) are incomplete until the job finishes. If you want current data on memory usage or other counters, use the sstat command.

sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory and CPU usage for an array job (see below for details about job arrays):

[jharvard@boslogin01 ~]$ sacct -j 44375501 --format JobID,Elapsed,ReqMem,MaxRSS,AllocCPUs,TotalCPU,State
JobID      Elapsed    ReqMem   MaxRSS AllocCPUS TotalCPU State
------------ ---------- --------- ------- ---------- ---------- ----------
44375501_[1+ 00:00:00   40000Mc           8    00:00:00   PENDING
44375501_1   2-03:50:53 40000Mc           8    2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc 34372176K 6    2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc 1236K     8    00:00.004  COMPLETED
44375501_2   1-23:47:35 40000Mc           8    1-23:47:18 COMPLETED
44375501_2.+ 1-23:47:35 40000Mc 34467196K 6    1-23:47:17 COMPLETED
44375501_2.+ 1-23:47:36 40000Mc 1116K     8    00:00.003  COMPLETED
44375501_3   1-23:32:36 40000Mc           8    1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:36 40000Mc 34389040K 6    1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:37 40000Mc 1224K     8    00:00.004  COMPLETED
44375501_4   1-21:59:30 40000Mc           8    1-21:59:07 COMPLETED
44375501_4.+ 1-21:59:30 40000Mc 34389044K 6    1-21:59:07 COMPLETED

The seff and seff-account commands are summary commands based off the data in sacct.
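When reading sacct output like the report above, it can help to convert MaxRSS into the same unit as the memory request. The following is a sketch under stated assumptions: the function name maxrss_to_mb is hypothetical, and it assumes MaxRSS is a whole number followed by a K, M, or G suffix, as in the sample output.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value such as 34372176K to whole MB.
# Assumes an integer followed by a K, M, or G suffix.
maxrss_to_mb() {
    local value="$1" number unit
    number="${value%[KMG]}"   # strip the trailing unit letter
    unit="${value: -1}"       # last character is the unit
    case "$unit" in
        K) echo $(( number / 1024 )) ;;
        M) echo "$number" ;;
        G) echo $(( number * 1024 )) ;;
        *) echo "unsupported MaxRSS value: $value" >&2; return 1 ;;
    esac
}

# Value taken from the sample sacct report above
maxrss_to_mb 34372176K
```

The value could come from a call such as sacct -j JOBID --format=JobID,MaxRSS --noheader --parsable2, which prints pipe-separated fields that are easier to parse than the padded table.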

Running squeue without arguments will list all your currently running, pending, and completing jobs. If you include the -l option (for “long” output) you can get useful data, including the running state of the job.

[jharvard@boslogin01 ~]$ squeue -u jharvard -l
Thu May 31 10:59:05 2018
    JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
44768543_24 shared longseq2 mmcfee RUNNING 20:34:20 1-12:00:00 1 holy7c09106
44768543_23 shared longseq2 mmcfee RUNNING 20:34:55 1-12:00:00 1 holy7c15302
44768543_22 shared longseq2 mmcfee RUNNING 20:52:34 1-12:00:00 1 holy7c15310
44768543_10 shared longseq2 mmcfee RUNNING 23:30:38 1-12:00:00 1 holy7c05312
44768543_11 shared longseq2 mmcfee RUNNING 23:30:38 1-12:00:00 1 holy7c09211
44768518_24 shared shortseq mmcfee RUNNING 23:32:21 1-12:00:00 1 holy7c13111

The default squeue tool in your PATH (/usr/local/bin/squeue) is a modified version developed by FAS Informatics. To reduce the load on the SLURM scheduler (RC processes 2.5 million jobs each month), this tool actually queries a centrally collected result from the ‘real’ squeue tool, which can be found at /usr/bin/squeue. This data is collected approximately every 30 seconds. Many, but not all, of the options from the original tool are supported. Check this using the squeue --help command. If you absolutely need to use all of the options from the real squeue tool, simply call it directly: /usr/bin/squeue. But please do not use it for any scripted checks or processes as it increases overhead for the scheduler.

Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED.

PENDING Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING Job is running.
COMPLETED Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED Job has been terminated by the user or administrator using scancel.
FAILED Job finished with an exit code other than 0.
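The states above split into those where a job may still change and those where it is done. A script that reacts to job state could encode that split as follows. This is a sketch only: the function name job_is_finished is hypothetical, it covers just the five states listed here, and the CANCELLED* pattern assumes sacct may append extra text after CANCELLED.

```shell
#!/bin/bash
# Hypothetical helper: classify a job State string.
# Returns 0 if the job is finished, 1 if it is still active,
# and 2 for any state not covered by this sketch.
job_is_finished() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED*) return 0 ;;
        PENDING|RUNNING)             return 1 ;;
        *)                           return 2 ;;
    esac
}

if job_is_finished "RUNNING"; then
    echo "finished"
else
    echo "not finished"
fi
```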

See broader queue with showq

The showq command can be used to show what the rest of the partition looks like.  Often your job is pending due to other people in the partition.  The showq command then shows you an overview of all the jobs for a specific partition.  showq is invoked by doing:
showq -o -p shared
Where -o orders the pending queue by priority, with the next job to be scheduled at the top.  -p specifies the partition that you want to look at.

Canceling jobs with scancel

If for any reason, you need to cancel a job that you’ve submitted, just use the scancel command with the job ID.
scancel 9999999
If you don’t keep track of the job ID returned from sbatch, you should be able to find it with the squeue or sacct command described above.

Interactive jobs and salloc

PLEASE NOTE: If you are attempting to use salloc on FASSE, please use the FASSE VDI instead.

Though batch submission is the best way to take full advantage of the compute power of the cluster, foreground/interactive jobs can also be run. These can be useful for things like:

  • Iterative data exploration at the command line
  • RAM intensive graphical applications like MATLAB or SAS
  • Interactive “console tools” like R and iPython
  • Significant software development and compiling efforts

An interactive job differs from a batch job in two important aspects: 1) the partition to be used is the test partition (though any partition in Slurm can be used for interactive work) and, 2) jobs should be initiated with the salloc command instead of sbatch. The command salloc will start a command line shell on a compute node with default settings. In this case, since no resources have been requested explicitly, default resources will get allocated to a job on Cannon or FASSE, which are, serial_requeue for the partition, 10 minutes for the time, 1 core, and 100 MB for the memory.

Note that you should not include /bin/bash as part of your salloc line, as it will simply execute that command and exit. Instead simply run salloc with only your resource parameters and it will put you in an interactive session.

This command: salloc -p test --mem 500 -t 0-06:00 will start a command line shell on the test queue with 500 MB of RAM for 6 hours; 1 core on 1 node is assumed as this parameter (-c 1) was left out. When the interactive session starts, you will notice that you are no longer on a login node, but rather one of the compute nodes dedicated to this queue.

salloc -p test --x11 --mem 4G -t 0-06:00

In this case, we’ve asked for more memory because we plan to run MATLAB which requires a larger memory footprint. The --x11 option allows XWindows to operate between the login and compute nodes. See also: Virtual Desktop (VDI)

Interactive sessions require you to be active in the session. If you go more than an hour without any kind of input, the scheduler will assume that you have left the session and will terminate it. If you have interactive tasks that must stretch over days, we recommend you print to screen occasionally to keep the connection open.


Remote desktop access

For a GUI native X11 interface, you can connect to the cluster using our Open OnDemand VDI system. This is more reliable and stable than X11 forwarding back to your computer. Remote desktop access is particularly useful for heavy client applications like MATLAB, Jupyter, and R Studio where the performance of X11 forwarding is decidedly poor.
Note: The NX/NoMachine servers have been retired. Please use VDI/Open OnDemand


Storage and Scratch on the Cluster

Cluster partitions have many owned and general purpose file systems attached for use by labs and individuals to store data long-term. These are shared filesystems and are typically located in a different datacenter from the compute nodes. As such high I/O (Input/Output) from production jobs is not the best use case for your lab storage, as lab storage is not designed for jobs that need to write large amounts of data or need quick access to storage.
For best performance while running jobs please use the temporary scratch storage found at /n/holyscratch01. This is a Lustre file system with 1.2 PB of storage and connected via Infiniband fabric. This temporary scratch space is available from all compute nodes.  In addition Lustre has the ability to stripe data across its servers to enhance performance.  See both of these handy guides for more details about how to do Lustre striping.
There is a lab-based 50TB quota and a 90 day retention policy on holyscratch01 scratch. Please review the scratch policy page here. If you have not moved your data after 90 days it will be deleted to make space for other users. Please use holyscratch01 only for reading and writing data from the cluster. Please create a subdirectory in your lab group’s folder here under /n/holyscratch01/[lab name]. Please contact us if your lab does not have a holyscratch01 directory or you are unable to create a sub-directory for yourself.


Troubleshooting Jobs and Resource Usage

A number of factors, including fair-share are used for job scheduling

We use a multifactor method of job scheduling on the cluster. Job priority is assigned by a combination of fair-share and length of time a job has been sitting in the queue.  You can find out the priority calculation for your jobs by using the sprio command, such as sprio -j JOBID.
You can find a description of how SLURM calculates Fair-share here.  Fairshare is shared on a lab basis, so usage by any member of the lab will impact the score of the whole lab as the lab is pulling from a common pool.  Fairshare has a 3 day halflife and naturally recovers if your lab does not run any jobs.  Thus it is wise to store up fairshare if you need to do significant runs, and plan your runs accordingly in order to maintain a good fairshare score.  You can learn more about your fairshare score and slurm usage by using the sshare command, such as sshare -U which shows your current score.  Contact RC if you want to get graphs of your usage and fairshare over time.
The other factor in priority is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows, out to a maximum of 3 days. If everyone’s priority is equal then FIFO (first in first out) is the scheduling method.  We weight the age of a job that has pended for 3 days to be equal to a fairshare score of 0.5.
We also have backfill turned on. This allows for jobs which are smaller to sneak in while a larger higher priority job is waiting for nodes to free up. If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period. This means knowing how long your code will run for is very important and must be declared if you wish to leverage this feature. Otherwise the scheduler will just assume you will use the maximum allowed time for the partition when you run.  The better you constrain your job in terms of CPU, Memory, and Time the easier it will be for the backfill scheduler to find you space and let your job jump ahead in the queue.
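The 3 day half-life mentioned above can be pictured as exponential decay of a lab's recorded usage while the lab runs nothing. The sketch below shows only that half-life arithmetic, not Slurm's full fairshare calculation; the function name decayed_usage and the usage units are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: remaining weight of past usage after some idle days.
#   $1 = usage recorded today (arbitrary units)
#   $2 = days with no new usage
#   $3 = half-life in days (defaults to the 3 days quoted above)
decayed_usage() {
    awk -v u="$1" -v d="$2" -v h="${3:-3}" \
        'BEGIN { printf "%.2f\n", u * 0.5 ^ (d / h) }'
}

for days in 0 3 6 9; do
    echo "after ${days} idle days: $(decayed_usage 100 "$days")"
done
```

Under this model usage of 100 counts as 50 after 3 idle days and 25 after 6, which is why pausing before a large run lets fairshare recover.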

Troubleshooting common problems

A variety of problems can arise when running jobs on the cluster. Many are related to resource misallocation, but there are other common problems as well.

Error: JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
Likely cause: You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours).

Error: Job <jobid> exceeded <mem> memory limit, being killed
Likely cause: Your job is attempting to use more memory than you’ve requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. For jobs that require truly large amounts of memory (>256 Gb), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.

Error: SLURM_receive_msg: Socket timed out on send/recv operation
Likely cause: This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out if SLURM is working use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to.

Error: JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
Likely cause: This message may arise for a variety of reasons, but it typically indicates that the host on which your job was running can no longer be contacted by SLURM.  Jobs that die from NODE_FAILURE are automatically requeued by the scheduler.

Using GPUs

To request a single GPU on slurm just add #SBATCH --gres=gpu to your submission script and it will give you access to a GPU. To request multiple GPUs add #SBATCH --gres=gpu:n where ‘n’ is the number of GPUs. Note that --gres specifies the resources on a per node basis, so for multinode work you only need to specify how many GPUs you need per node. For more on GPU computing see our more in-depth GPGPU Document.

Specifying GPU Type

For users who wish to specify which type of GPU they wish to use, especially for those using heterogeneous partitions like gpu_requeue, there are two methods that can be used. The first is using --constraint="<tag>", which will constrain the job to only run on GPUs of a certain class. A full listing of constraints can be found below. The second method is defining the specific model you want using --gres=gpu:<model>:1. For example, if you want an A100 with 80GB of onboard memory then you would specify --gres=gpu:nvidia_a100-sxm4-80gb:1. Below is a list of classes of GPUs (listed by the constraint tag) that are on the cluster along with the models that fall under that class.

v100

  • tesla_v100-pcie-16gb: Nvidia V100 PCIe 16GB
  • tesla_v100-pcie-32gb: Nvidia V100 PCIe 32GB
  • tesla_v100-sxm2-16gb: Nvidia V100 SXM2 16GB
  • tesla_v100-sxm2-32gb: Nvidia V100 SXM2 32GB
  • tesla_v100s-pcie-32gb: Nvidia V100S PCIe 32GB

a40

  • nvidia_a40: Nvidia A40 40GB

a100-mig

  • nvidia_a100_1g.5gb: Nvidia A100 1g MIG 5GB
  • nvidia_a100_1g.10gb: Nvidia A100 1g MIG 10GB
  • nvidia_a100_2g.10gb: Nvidia A100 2g MIG 10GB
  • nvidia_a100_3g.20gb: Nvidia A100 3g MIG 20GB
  • nvidia_a100_3g.39gb: Nvidia A100 3g MIG 40GB
  • nvidia_a100_4g.20gb: Nvidia A100 4g MIG 20GB
  • nvidia_a100_4g.39gb: Nvidia A100 4g MIG 40GB

a100

  • nvidia_a100-pcie-40gb: Nvidia A100 PCIe 40GB
  • nvidia_a100-sxm4-40gb: Nvidia A100 SXM4 40GB
  • nvidia_a100-sxm4-80gb: Nvidia A100 SXM4 80GB

h100

  • nvidia_h100_80gb_hbm3: NVIDIA H100 80GB HBM3

To find out what specific types of GPUs are available on a partition run scontrol show partition <PartitionName> and look under the TRES category.
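The two --gres forms described above (a plain count, or a model plus a count) can be assembled by a small helper when a wrapper script needs to switch between them. This is a minimal sketch; the function name gres_for is hypothetical, and the model string is one of the names from the list above.

```shell
#!/bin/bash
# Hypothetical helper: build the value for --gres.
#   $1 = number of GPUs per node (default 1)
#   $2 = optional GPU model name from the list above
gres_for() {
    local count="${1:-1}" model="${2:-}"
    if [[ -n "$model" ]]; then
        echo "gpu:${model}:${count}"
    else
        echo "gpu:${count}"
    fi
}

gres_for 2
gres_for 1 nvidia_a100-sxm4-80gb

# Possible use on the command line (not executed here):
#   sbatch --gres="$(gres_for 1 nvidia_a100-sxm4-80gb)" runscript.sh
```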

Using Threads such as OpenMP

One of the basic methods for parallelization is to use a threading library, such as pthreads, OpenMP, or applications that use OpenMP under the hood (e.g. numpy, OpenBLAS).  By default, Slurm does not know which cores to assign to which process it runs. In addition, for threaded applications you need to make sure that all the cores you request are on the same node.  Below is an example OpenMP script that both ensures all the cores are on the same node, and lets Slurm know which process gets the cores that you requested for threading.

#!/bin/bash
#SBATCH -c 8 # Number of threads
#SBATCH -t 0-00:30:00 # Amount of time needed DD-HH:MM:SS
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 #Memory per cpu
module load intel/21.2.0-fasrc01
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt

The most important aspect of the threaded script above is the -c option which tells Slurm how many threads you intend to run with.
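Because SLURM_CPUS_PER_TASK is only set when -c is given, a script that is sometimes run without -c (or outside Slurm for a quick test) can fall back to a single thread. The sketch below shows one way to do that; the function name set_omp_threads is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: set OMP_NUM_THREADS from the Slurm allocation,
# falling back to 1 thread when SLURM_CPUS_PER_TASK is not set.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```

With #SBATCH -c 8 in the job script this prints 8 threads; run by hand on a login node it prints 1.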

Using MPI

MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. OpenMPI and MVAPICH2 are available as modules on the cluster as well as an Intel specific library.
As described in the Helmod documentation, MPI libraries are a special class of module, called “Comp”, that is compiler dependent. To load an MPI library, load the compiler first.

module load intel/21.2.0-fasrc01 openmpi/4.1.1-fasrc01

Once an MPI module is loaded, applications built against that library are made available. This dynamic loading mechanism prevents conflicts that can arise between compiler versions and MPI library flavors.
An example MPI script with comments is shown below:

#!/bin/bash
#SBATCH -n 128 # Number of cores
#SBATCH -t 10 # Runtime in minutes
#SBATCH -p sapphire # Partition to submit to
#SBATCH --mem-per-cpu=100 # Memory per cpu in MB (see also --mem)
module load intel/21.2.0-fasrc01 openmpi/4.1.1-fasrc01
module load MYPROGRAM
srun -n $SLURM_NTASKS --mpi=pmix MYPROGRAM > output.txt 2> errors.txt

There are a number of important aspects to an MPI SLURM job.

  • MPI jobs must be run on a partition that supports MPI interconnects.  sapphire, shared, test, general, unrestricted are MPI-enabled, but serial_requeue includes non-MPI resources and should be avoided.
  • Memory should be allocated with the --mem-per-cpu option instead of --mem so that memory matches core utilization.
  • The -np option for mpirun or mpiexec (when these runners are used) should use the bash variable $SLURM_NTASKS so that the correct number of cores is passed to the MPI engine at runtime.
  • If network topology and communications overhead is a concern for your code, try using the  --contiguous option which will ensure that all the cores you get will be adjacent to each other.  Use this with caution though as it will make your job pend longer, as finding contiguous blocks of compute is difficult.  Verify that the boost in performance is worth the extra wait time in the queue.  If you do not include this option you will be given cores on whatever nodes Slurm can find, which may be scattered across the cluster.   Depending on your code this may or may not be a concern.  Test your code in both modes to see if it is an option that is worth including if you don’t know off hand.  It may not be worth including --contiguous as the aggregate time of waiting plus runtime may be longer with --contiguous.  The sbatch and srun documentation have more information on various fine tuning options.
  • The application must be MPI-enabled. Applications cannot take advantage of MPI parallelization unless the source code is specifically built for it. All such applications in the Helmod module system can only be loaded if an MPI library is loaded first.
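For codes launched with mpirun rather than srun, the point above about passing $SLURM_NTASKS to -np can be sketched as below. The function only prints the command line so it can be inspected outside a job; the function name mpi_launch_cmd is hypothetical, and MYPROGRAM is the same placeholder used in the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the mpirun command line for this allocation.
# Falls back to 1 rank when SLURM_NTASKS is not set (e.g. outside a job).
mpi_launch_cmd() {
    local program="$1"
    echo "mpirun -np ${SLURM_NTASKS:-1} ${program}"
}

mpi_launch_cmd MYPROGRAM
```

Inside a job submitted with -n 128 this prints mpirun -np 128 MYPROGRAM, so the rank count always follows the allocation rather than a hard-coded number.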

Job arrays

SLURM allows you to submit a number of “near identical” jobs simultaneously in the form of a job array. To take advantage of this, you will need a set of jobs that differ only by an “index” of some kind.
For example, say that you would like to run tophat, a splice-aware transcript-to-genome mapping tool, on 30 separate transcript files named trans1.fq, trans2.fq, trans3.fq, etc. First, construct a SLURM batch script, called tophat.sh, using special SLURM job array variables:

#!/bin/bash
#SBATCH -J tophat # A single job name for the array
#SBATCH -c 1 # Number of cores
#SBATCH -p serial_requeue # Partition
#SBATCH --mem 4000 # Memory request (4Gb)
#SBATCH -t 0-2:00 # Maximum execution time (D-HH:MM)
#SBATCH -o tophat_%A_%a.out # Standard output
#SBATCH -e tophat_%A_%a.err # Standard error
module load tophat/2.0.13-fasrc02
tophat /n/holyscratch01/informatics_public/ref/ucsc/Mus_musculus/mm10/chromFa trans"${SLURM_ARRAY_TASK_ID}".fq

Then launch the batch process using the --array option to specify the indexes.
sbatch --array=1-30 tophat.sh
In the script, two types of substitution variables are available when running job arrays. The first, %A and %a, represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names. The second, SLURM_ARRAY_TASK_ID, is a bash environment variable that contains the current array index and can be used in the script itself. In this example, 30 jobs will be submitted each with a different input file and different standard error and standard out files.
More detail can be found on the SLURM job array documentation page.
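When input files do not follow a numbered pattern like trans1.fq, a common approach is to list them in a text file and let each array task pick the line matching its index. The following is a sketch of that idea; the function name input_for_task and the sample file names are hypothetical, and the fallback index of 2 exists only so the demo runs outside Slurm.

```shell
#!/bin/bash
# Hypothetical helper: print line N (1-based) of a list file,
# where N is the array task index.
input_for_task() {
    local list_file="$1" task_id="$2"
    sed -n "${task_id}p" "$list_file"
}

# Demo: build a small list, then look up the entry for this task.
list_file="$(mktemp)"
printf '%s\n' sampleA.fq sampleB.fq sampleC.fq > "$list_file"
input_for_task "$list_file" "${SLURM_ARRAY_TASK_ID:-2}"
rm -f "$list_file"
```

With a list of 30 files, the job would be submitted with --array=1-30 exactly as in the tophat example.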

Checkpointing

Slurm does not automatically checkpoint, i.e. create files that your job can restart from.  To protect against job failure (due to code error or node failure) and to allow your job to be broken up into smaller chunks it is always advisable to checkpoint your code so it can restart from where it left off.  This is especially valuable for jobs on partitions subject to requeue, but is also just generally useful for any type of job.  Checkpointing varies from code type to code type and needs to be implemented by the user as part of their code base.
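As an illustration of the idea, the sketch below records the last completed step in a file and resumes from it on the next run. It is a toy pattern, not a FASRC-provided tool: the function name run_with_checkpoint, the checkpoint file, and the step counter are all hypothetical, and real codes usually need to save far more state than a single number.

```shell
#!/bin/bash
# Hypothetical checkpoint/restart loop.
#   $1 = checkpoint file
#   $2 = total number of steps in the whole calculation
#   $3 = maximum steps to do in this run (defaults to all of them)
# Prints the last completed step.
run_with_checkpoint() {
    local checkpoint_file="$1" total_steps="$2" max_steps="${3:-$2}"
    local step=0 done_this_run=0
    # Resume from the last completed step if a checkpoint exists
    if [[ -f "$checkpoint_file" ]]; then
        step="$(cat "$checkpoint_file")"
    fi
    while (( step < total_steps && done_this_run < max_steps )); do
        step=$(( step + 1 ))
        # ... the real work for this step would go here ...
        # Write to a temporary file and rename it, so an interrupted
        # write does not leave a truncated checkpoint behind
        echo "$step" > "${checkpoint_file}.tmp"
        mv "${checkpoint_file}.tmp" "$checkpoint_file"
        done_this_run=$(( done_this_run + 1 ))
    done
    echo "$step"
}

# Demo: 10 steps done in chunks of 4, as if the job were requeued twice
demo_ckpt="$(mktemp -u)"
run_with_checkpoint "$demo_ckpt" 10 4
run_with_checkpoint "$demo_ckpt" 10 4
run_with_checkpoint "$demo_ckpt" 10 4
rm -f "$demo_ckpt"
```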

Job dependencies

Many scientific computing tasks consist of serial processing steps. A genome assembly pipeline, for example, may require sequence quality trimming, assembly, and annotation steps that must occur in series. Launching each of these jobs without manual intervention can be done by repeatedly polling the controller with squeue / sacct until the State is COMPLETED. However, it’s much more efficient to let the SLURM controller handle this using the --dependency option.

[akitzmiller@boslogin01 examples]$ sbatch assemble_genome.sh
Submitted batch job 53013437
[akitzmiller@boslogin01 examples]$ sbatch --dependency=afterok:53013437 annotate_genome.sh
[akitzmiller@boslogin01 examples]$

When submitting a job, specify a combination of “dependency type” and job ID in the --dependency option. afterok is an example of a dependency type that will run the dependent job if the parent job completes successfully (state goes to COMPLETED). The full list of dependency types can be found on the SLURM doc site in the man page for sbatch.  It is best not to create a chain of dependencies that is greater than 2-3 levels.  Any more than that and the scheduler will become significantly slower.  Dependencies should only be used if the resource requirements between each step are significantly different, or if you need to wait for an array to complete before you run a single job that processes all the array results.  Be sure to think about whether you truly need dependencies or not.
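To chain the two submissions shown above from a script, the job ID has to be captured from the first sbatch call. The sketch below parses the "Submitted batch job" line and builds the matching option; both function names are hypothetical. sbatch also offers a --parsable option that prints the job ID directly, which avoids the parsing step.

```shell
#!/bin/bash
# Hypothetical helper: pull the job ID out of the line printed by sbatch,
# e.g. "Submitted batch job 53013437".
extract_jobid() {
    local line="$1"
    local pattern='^Submitted batch job ([0-9]+)'
    if [[ "$line" =~ $pattern ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "could not find a job ID in: $line" >&2
        return 1
    fi
}

# Hypothetical helper: build the afterok dependency option for a job ID.
dependency_flag() {
    echo "--dependency=afterok:$1"
}

jobid="$(extract_jobid "Submitted batch job 53013437")"
dependency_flag "$jobid"

# Possible use on the cluster (not executed here):
#   jobid="$(extract_jobid "$(sbatch assemble_genome.sh)")"
#   sbatch "$(dependency_flag "$jobid")" annotate_genome.sh
```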

Job Constraints

Sometimes, especially on the requeue partitions, jobs need to be constrained to run on specific hardware.  Many times this is due to either the code being compiled for a specific architecture or because the code runs more efficiently on a specific type of host.  Slurm provides for this functionality via the --constraint option (see the sbatch documentation for usage details).  The features for constraint are defined by FASRC and fall into three broad categories: Processor, GPU, and Network.  You can match against several of these, but keep in mind that the more constraints you use, the longer your job will pend, as the scheduler will find it more difficult to find nodes that fit your needs.  A list of the features available on the cluster follows. You can also see the features for a specific node by doing scontrol show node NODENAME.

Processor

  • amd: All AMD processors
  • intel: All Intel processors
  • avx: All processors that are AVX capable
  • avx2: All processors that are AVX2 capable
  • avx512: All processors that are AVX512 capable
  • milan: AMD Milan chips
  • genoa: AMD Genoa chips
  • skylake: Intel Skylake chips
  • sapphirerapids: Intel Sapphire Rapids
  • cascadelake: Intel Cascade Lake chips
  • icelake: Intel Ice Lake chips

GPU

To specify a GPU model (for example, an A100 with 80GB), refer to Specifying GPU Type.

  • cc6.1, cc7.0, cc7.5, cc8.0, cc8.6, cc9.0: Level of Nvidia Compute Capability
  • a40: Nvidia A40 GPU
  • v100: Nvidia V100 GPU
  • a100: Nvidia A100 GPU
  • a100-mig: Nvidia A100 GPU MIG
  • h100: Nvidia H100 GPU

Network

  • holyhdr: Holyoke HDR Infiniband Fabric
  • holyndr: Holyoke NDR Infiniband Fabric
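Several of the features above can be required together by joining them with &, which Slurm's --constraint option treats as a logical AND. The helper below is a small sketch of building such an expression; the function name join_constraints is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: join feature names with '&' (logical AND).
join_constraints() {
    local IFS='&'   # "$*" joins the arguments using the first character of IFS
    echo "$*"
}

join_constraints intel avx512 holyhdr

# Possible use on the command line (not executed here):
#   sbatch --constraint="$(join_constraints intel avx512)" runscript.sh
```

Keep the earlier caution in mind: every extra feature narrows the set of eligible nodes and can lengthen the time a job pends.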


© The President and Fellows of Harvard College
Except where otherwise noted, this content is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.