Running Jobs
Introduction
Faculty of Arts and Sciences Research Computing (FASRC) hosts several collections of computers known as clusters. Each cluster is a large number of individual compute servers networked together with a high-speed interconnect and integrated with storage (see our Data Management guide for more). To manage work on these clusters FASRC uses Slurm.
Slurm is an open-source scheduler from SchedMD. The job of Slurm is:
1. To govern which users get which resources on the cluster, and when.
2. To create allocations for individual units of work which are called jobs.
3. To ensure maximum utilization of the cluster.
4. To keep a historical record of usage.
Users interact with Slurm by submitting a job to the scheduler. The scheduler then puts that job in the pending queue for the selected subsection of the cluster (called a partition) for consideration. The scheduler weights the job’s priority based on the user’s prior usage to ensure a fair distribution of resources. It then tries to schedule the highest priority work by playing a large-scale game of Tetris. In addition, Slurm will take lower priority jobs and try to fit them into various gaps it finds in order to maximize usage without impacting when the higher priority work would run.
Below we will walk you through how to submit jobs to the scheduler for work. We will also discuss how the cluster is organized and some best practices for use. For more details on the architecture of the cluster, please see our Job Efficiency and Optimization Best Practices page.
Getting Started
To submit jobs you will first need to set up your account. Once you’ve gone through the account setup procedure, you can log in to the cluster via ssh to a login node and/or use Open OnDemand. The guide below assumes that you will be using the command line interface (CLI) for interaction with Slurm.
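For example, a typical login from your local terminal looks like the following (replace jharvard with your own FASRC username; the hostname shown is assumed to be the general FASRC login alias, so check your account setup instructions if your group uses a different login node):
ssh jharvard@login.rc.fas.harvard.edu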
FASRC cluster nodes run the Rocky distribution of the Linux operating system and commands are run under the bash shell. There are a number of Linux and bash references, cheat sheets and tutorials available on the web. FASRC’s own training is also available.
Storage and Scratch on the Cluster
Cluster nodes have file systems mounted for use by labs and individuals to store data on both a temporary (scratch) and long-term basis. The Data Storage page covers the various storage options. Please use the appropriate storage for your jobs, as each storage type has different purposes and performance characteristics.
Slurm Documentation
Comprehensive documentation for Slurm can be found at the official Slurm website. Note that these docs always cover the latest version of Slurm; while FASRC tries to keep up with the latest version, you will want to cross-check the version we run against the version the docs describe. To find the version of Slurm the cluster is running, run sinfo --version.
You can also get documentation on individual commands by using the Unix man command. This shows you the manual for the version of Slurm the cluster is using. For instance, if you want the manual for the sinfo command you would run: man sinfo
Some other useful documentation sites are:
Summary of Slurm Commands
The table below shows a brief list of common Slurm commands. These commands are described in more detail below along with links to the Slurm doc site.
| What you want to do | Slurm command | Example |
|---|---|---|
| Submit a batch serial job | sbatch | sbatch runscript.sh |
| Run a script or application interactively (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1G [script or app] |
| Start interactive session (do not use salloc on FASSE) | salloc | salloc -p test -t 10 --mem 1G |
| Kill a job | scancel | scancel JOBID |
| View status of your jobs | sacct | sacct -u USERNAME |
| Check job by id number | sacct | sacct -j JOBID |
| Check efficiency of job | jobstats | jobstats JOBID |
| List of available partitions | spart | spart |
| Check current partition queue state | showq | showq -o -p PARTITIONNAME |
| Details on current job, node, partition | scontrol | scontrol show job JOBID |
| Schedule recurring batch job | scrontab | see scrontab document for example |
| Check fairshare | sshare | sshare -U |
Slurm Global Limits and Defaults
Before submitting any jobs, users should familiarize themselves with the global limits below. FASRC has set these limits to prevent any one person from taking over the cluster and to prevent the cluster from being overwhelmed by poorly formed jobs. Users must work within these limits and should plan their work accordingly. This is typically done by breaking the workflow into smaller chunks or by deliberately serializing jobs to increase the job time and decrease the number of cores needed. The limits are as follows:
- Maximum Number of Jobs per User: 10,100. This is meant to prevent any one user from monopolizing the cluster.
- Maximum Array Size: 10,000. This is both array index and size. This is meant to prevent any one user from monopolizing the cluster. Note that each array index counts as a single job for purposes of the Maximum Number of Jobs per User, so this is intentionally redundant.
- Maximum Number of Steps: 40,000. A job step is recorded by Slurm for each invocation of srun by a job. This is meant to prevent run-away jobs.
All other limits are partition or node dependent. More on that below.
FASRC also sets the following defaults if nothing is requested:
- Core Count: 1
- Memory: 100 MB
- GPU Count: 0
- Partition: serial_requeue
- Time: There is no default time set. Users must always declare time.
Users can set their own defaults by creating a definition file in $HOME/.slurm/defaults; for more, see the CLI Filter doc.
Slurm Partitions
Partitions are blocks of nodes on the cluster with their own scheduling policy. Partitions have various limits governing what types of jobs are appropriate to run in them. When a job is submitted it is routed to the specified partition(s) and joins the pending queue. When the job is scheduled in a partition it joins the running queue for that partition. You can find out what partitions you have access to using the spart command. To learn more about a given partition run: scontrol show partition PARTITIONNAME. To learn more about an individual node run: scontrol show node NODENAME. Below is a list of the public partitions on Cannon (FASSE can be found here).
| Partition | Nodes | Cores per Node | CPU Core Types | Mem per Node (GB) | Time Limit | Max Jobs | Max Cores | GPU Capable? | /scratch size (GB) |
|---|---|---|---|---|---|---|---|---|---|
| sapphire | 186 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | No | 396 |
| shared | 310 | 48 | Intel “Cascade Lake” | 184 | 3 days | none | none | No | 68 |
| bigmem | 4 | 112 | Intel “Sapphire Rapids” | 1988 | 3 days | none | none | No | 396 |
| bigmem_intermediate | 3 | 64 | Intel “Ice Lake” | 2000 | 14 days | none | none | No | 396 |
| gpu | 36 | 64 | Intel “Ice Lake” | 990 | 3 days | none | none | Yes (4 A100/node) | 396 |
| gpu_h200 | 22 | 112 | Intel “Sapphire Rapids” | 990 | 3 days | none | none | Yes (4 H200/node) | 843 |
| intermediate | 12 | 112 | Intel “Sapphire Rapids” | 990 | 14 days | none | none | No | 396 |
| unrestricted | 8 | 48 | Intel “Cascade Lake” | 184 | none | none | none | No | 68 |
| test | 18 | 112 | Intel “Sapphire Rapids” | 990 | 12 hours | 5 | 112 | No | 396 |
| gpu_test | 12 | 64 | Intel “Ice Lake” | 487 | 12 hours | 2 | 64 | Yes (8 A100 MIG 3g.20GB/node) – limit 8 per job | 172 |
| remoteviz | down | 32 | Intel “Cascade Lake” | 373 | 3 days | none | none | Shared V100 GPUs for rendering | 396 |
| serial_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | No | varies |
| gpu_requeue | varies | varies | AMD/Intel | varies | 3 days | none | none | Yes | varies |
| PI/Lab nodes | varies | varies | varies | varies | none | none | none | varies | varies |
Partition Details
sapphire
The sapphire partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This partition has 186 nodes connected by an NDR InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Sapphire Rapids CPUs, 990 GB of RAM, and 400 GB of local scratch space. Each Intel CPU has 56 cores and 100 MB of cache.
When submitting MPI jobs on the sapphire partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by the InfiniBand fabric, there are multiple switches routing the MPI traffic, and Slurm will by default schedule you wherever it can find space, so your job may end up scattered across the cluster. The --contiguous option ensures that the job runs on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
shared
The shared partition has a maximum run time of 3 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This partition has 310 nodes connected by an HDR InfiniBand (IB) fabric, where each node is configured with 2 Intel Xeon Cascade Lake CPUs, 184 GB of RAM, and 70 GB of local scratch space. Each Intel CPU has 24 cores and 48 MB of cache.
When submitting MPI jobs on the shared partition, it may be advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by the InfiniBand fabric, there are multiple switches routing the MPI traffic, and Slurm will by default schedule you wherever it can find space, so your job may end up scattered across the cluster. The --contiguous option ensures that the job runs on nodes that are adjacent to each other on the IB fabric. Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
bigmem
This partition should be used for large memory work requiring more than 1000 GB of RAM per job. Jobs requesting less than 1000 GB of RAM are automatically rejected by the scheduler.
There is a 3-day limit for work here. MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 4 nodes with 1988 GB of RAM.
bigmem_intermediate
This partition should be used for large memory work requiring more than 1000 GB of RAM per job. Jobs requesting less than 1000 GB of RAM are automatically rejected by the scheduler. There is a minimum run time of 3 days and a maximum run time of 14 days.
MPI or low memory work is not appropriate for this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 3 nodes with 2000 GB of RAM.
gpu
This 36 node partition is for individuals wishing to use GPGPU resources. You will need to include #SBATCH --gres=gpu:n, where n=1-4, in your SLURM submission scripts. Each node has 64 cores and is equipped with 4 Nvidia A100s. See our GPU Computing section for more info on using and specifying GPU resources.
gpu_h200
This 22 node partition is for individuals wishing to use GPGPU resources. You will need to include #SBATCH --gres=gpu:n, where n=1-4, in your SLURM submission scripts. Each node has 112 cores and is equipped with 4 Nvidia H200s. See our GPU Computing section for more info on using and specifying GPU resources.
intermediate
Serial and parallel (including MPI) jobs are permitted on this partition and this partition is intended for runs needing 3 to 14 days of runtime. This partition has an allocation of 12 nodes of the same configuration as above for the sapphire partition.
unrestricted
Serial and parallel (including MPI) jobs are permitted on this partition, and there is a 365-day limit on run time. Given this, there is no guarantee of 100% uptime; running on this partition is done at the user's own risk. Users should understand that if the queue is full it could take weeks or even months for your job to be scheduled to run. unrestricted is made up of 8 nodes of the same configuration as the shared partition above.
test
This partition is dedicated to interactive (foreground/live) work and for testing code interactively before submitting in batch and scaling up. Small numbers (1 to 5) of serial and parallel jobs with small resource requirements (RAM/cores) are permitted on this partition; large numbers of interactive jobs, or those with large resource requirements, should be run on another partition. Multiple partition submissions to this partition are forbidden (i.e. one is not permitted to do #SBATCH -p test,sapphire).
This partition is made up of 18 nodes of the same configuration as above for the sapphire partition. This smaller queue has a 12 hour maximum run time. This queue has a maximum of 112 cores and 1000 GB RAM. Jobs in this queue are not charged fairshare.
gpu_test
This 14 node partition is for individuals wishing to test GPGPU resources. You will need to include #SBATCH --gres=gpu:n, where n=1-8, in your SLURM submission scripts. These nodes have 64 cores and are equipped with 4 Nvidia A100s in Multi-Instance GPU (MIG) mode. Each GPU has two 3g.20GB MIG instances. This queue has a maximum of 2 jobs, 64 cores, 1000 GB RAM, 8 GPUs, and a 12 hour run time. This partition is intended for interactive use, testing, and experimentation only. Multiple partition submissions to this partition are forbidden. See our GPU Computing section for more info on using and specifying GPU resources. Jobs in this queue are not charged fairshare.
remoteviz
This single node partition is for individuals who wish to use shared GPUs for rendering graphics. The V100 cards on this node are in shared mode and are intended for rendering rather than computation. You do not need to request a GPU to use this partition. Multiple partition submissions to this partition are forbidden. For computation please use the gpu and gpu_test partitions.
serial_requeue
This partition is appropriate for single core (serial) jobs, jobs that require up to 8 cores for short periods of time (less than 1 day), or job arrays where each job instance uses fewer than 8 cores. Multinode jobs may be run in this partition, but be advised that it is a heterogeneous partition, so users are strongly encouraged to use the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. GPU jobs are rejected from this partition and should be run in gpu_requeue. As this partition is made up of an assortment of nodes owned by other groups in addition to the general nodes, jobs in this partition may be killed and requeued if a higher priority job (e.g. the job of a node owner) comes in.
Because serial_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than in the shared and sapphire partitions. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for serial_requeue unless the data files were zeroed out at the start to remove output from a previous (killed) run. Also, to ensure your job does not need to redo all of its computation, it is advisable to enable checkpointing in your code. We advise that you use --open-mode=append so that requeue status/error messages are visible in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.
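As a minimal sketch, the relevant lines of a serial_requeue submission script following this advice might look like the following (MYPROGRAM is a placeholder for your own checkpointing-aware executable):
#SBATCH -p serial_requeue
#SBATCH --open-mode=append   # keep log contents across requeues instead of truncating
./MYPROGRAM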
gpu_requeue
This partition is appropriate for GPU jobs that require short periods of time (less than 1 day). Multinode jobs may be run in this partition, but be advised that it is a heterogeneous partition, so users are strongly encouraged to use the --constraint option to get a homogeneous block of compute and networking. The maximum runtime for this queue is 3 days. You will need to include #SBATCH --gres=gpu:1 in your SLURM submission scripts to get access to this partition. As this partition is made up of an assortment of GPU nodes owned by other groups in addition to the public nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in.
Because gpu_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than in the gpu and gpu_h200 partitions. Since jobs may be killed, requeued, and run a second time, ensure that your jobs are a good match for this partition. For example, jobs that append output would not be good for gpu_requeue unless the data files were zeroed out at the start to remove output from a previous (killed) run. Also, to ensure your job does not need to redo all of its computation, it is advisable to enable checkpointing in your code. We advise that you use --open-mode=append so that requeue status/error messages are visible in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events. See our GPU Computing section for more info on using and specifying GPU resources.
ITC, Kempner, HSPH, HUCE, and SEAS
Submitting Batch Jobs Using the sbatch Command
The main way to run jobs on the cluster is by submitting a script with the sbatch command. The command to submit a job is as simple as:
sbatch runscript.sh
The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won’t stop if you disconnect from the cluster.
When sbatch is run, Slurm copies the current user environment and the submission script into the scheduler, so the user is free to update their environment and the submission script afterwards. Note that this behavior does not apply to anything else: files, folders, executables, etc. are used as they exist on disk at the moment the running script accesses them, so do not update those files if you do not want the changes propagated. When the scheduler launches the script, the script starts in the directory the user submitted the job from.
A typical submission script, in this case loading a Python module and having Python print a message, will look like this:
NOTE: It is important to keep all #SBATCH lines together and at the top of the script; no comments, bash code, or variables settings should be done until after the #SBATCH lines. Otherwise, Slurm may assume it’s done interpreting and skip any that follow.
#!/bin/bash
#SBATCH -c 1 # Number of cores (-c)
#SBATCH -t 0-00:10 # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o myoutput_%j.out # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors_%j.err # File to which STDERR will be written, %j inserts jobid
# load modules
module load python/3.10.9-fasrc01
# run code
python -c 'print("Hi there.")'
In general, a submission script is composed of 4 parts:
- The #!/bin/bash line allows the script to be run as a bash script.
- The #SBATCH lines, which are instructions for Slurm.
- Commands loading any necessary modules and setting any variables, paths, etc.
- The execution line itself, in this case calling python and having it print a message.
The #SBATCH lines shown above set the following key parameters:
- #SBATCH -c 1: Sets the number of cores (threads) that you’re requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, Slurm assumes -c 1. For more on parallel work see: threads, MPI.
- #SBATCH -t 0-01:00: Specifies the running time for the job in day-hour:minute (DD-HH:MM) format. Other acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, and “days-hours:minutes:seconds”. If your job runs longer than the value you specify here, it will be canceled. Jobs have a maximum run time which varies by partition (see table above), though extensions can be done. There is no fairshare penalty for over-requesting time, though it will be harder for the scheduler to backfill your job if you overestimate.
- #SBATCH -p serial_requeue: Specifies the Slurm partition under which the script will be run. See the partition descriptions above for more information. If you do not specify this parameter you will be given serial_requeue by default.
- #SBATCH --mem=100: Specifies how much memory you require per node. Default units are MB, and users can use suffixes for other units [K|M|G|T]. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool per node. If you must do work across multiple compute nodes (e.g. MPI code) and want to scale your memory allocation on a per-core basis, then you should use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you’re requesting, whether they are on one node or multiple nodes. If this parameter is omitted, you are granted 100 MB by default; chances are good that your job will be killed as it will likely exceed this amount, so you should always specify how much memory you require.
- #SBATCH -o myoutput_%j.out: Specifies the file to which standard output will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
- #SBATCH -e myerrors_%j.err: Specifies the file to which standard error will be appended. Slurm submission and processing errors will also appear in this file. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, standard error is sent to the same slurm-JOBID.out file as standard output.
Other useful options not shown above are:
- #SBATCH --gpus=1: Specifies how many GPUs are needed for the computation. For more see the GPU-specific section.
- #SBATCH --test-only: Adding this option to your script tells the scheduler to return information on what would happen if you submitted this job. This is a good and easy way to determine whether your script is viable, as well as get a rough estimate of how long it would take to schedule under the current queue load.
- #SBATCH --account=jharvard_lab: If you are in more than one lab, this option charges your usage to the appropriate group.
It should be noted that all options that are prefixed by #SBATCH can also be set on the command line, and vice versa. For example, if you wanted to set the partition via the command line instead you would do: sbatch -p PARTITIONNAME runscript.sh
Notifications by Email
The scheduler can send email to you for various job states (FAIL and END being the most useful). Please bear in mind that this must be used responsibly, as one user can quickly overwhelm the mail system and affect the notifications of all users by clogging up the mail queue. Keep in mind that tens or even hundreds of thousands of jobs may be in flight at a given time; this is why we strongly caution against using the ALL mail type. If you are using a metascheduler, job arrays, or just many jobs, please try to avoid adding too much burden to the email queue; sending hundreds or thousands of emails can cause email backups, not to mention fill up your inbox.
To add mail notification to your job script you can use the --mail-type option. You can find all the options available in the sbatch documentation. In addition if you specify END you will receive a summary of your job performance from jobstats.
The user to be notified is indicated with --mail-user. If no mail user is specified, Slurm uses the email address that is listed with your account.
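For example, to be emailed when a job ends or fails, you would add lines like the following to your script (the address shown is a placeholder; substitute your own):
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=jharvard@fas.harvard.edu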
Monitoring Job Progress
To monitor jobs use sacct. Running sacct without any options prints all the jobs you have run in the past day, and sacct -j JOBID shows a specific job. Note that sacct data is nearly live, but the various accounting fields (such as memory usage) are incomplete until the job finishes. For monitoring live performance stats use the jobstats command. Slurm keeps past job records, so users can look back at their historic usage for up to 6 months. If you need data from further back, contact FASRC to get access to our job archive.
sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory and CPU usage for an array job (see below for details about job arrays):
[jharvard@boslogin01 ~]$ sacct -j 44375501 --format JobID,Elapsed,ReqMem,MaxRSS,AllocCPUs,TotalCPU,State
JobID        Elapsed    ReqMem    MaxRSS     AllocCPUS  TotalCPU   State
------------ ---------- --------- ---------- ---------- ---------- ----------
44375501_[1+ 00:00:00   40000Mc              8          00:00:00   PENDING
44375501_1   2-03:50:53 40000Mc              8          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   34372176K  6          2-03:50:23 COMPLETED
44375501_1.+ 2-03:50:53 40000Mc   1236K      8          00:00.004  COMPLETED
44375501_2   1-23:47:35 40000Mc              8          1-23:47:18 COMPLETED
44375501_2.+ 1-23:47:35 40000Mc   34467196K  6          1-23:47:17 COMPLETED
44375501_2.+ 1-23:47:36 40000Mc   1116K      8          00:00.003  COMPLETED
44375501_3   1-23:32:36 40000Mc              8          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:36 40000Mc   34389040K  6          1-23:32:15 COMPLETED
44375501_3.+ 1-23:32:37 40000Mc   1224K      8          00:00.004  COMPLETED
44375501_4   1-21:59:30 40000Mc              8          1-21:59:07 COMPLETED
44375501_4.+ 1-21:59:30 40000Mc   34389044K  6          1-21:59:07 COMPLETED
The jobstats and seff-account commands are summary commands based on the data in sacct.
Slurm provides information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED.
| State | Meaning |
|---|---|
| PENDING | Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING. |
| RUNNING | Job is running. |
| COMPLETED | Job has finished and the command(s) have returned successfully (i.e. exit code 0). |
| CANCELLED | Job has been terminated by the user or administrator using scancel. |
| FAILED | Job finished with an exit code other than 0. |
To learn more detailed information about individual jobs that are in the PENDING or RUNNING state, you can run the scontrol command. For example:
[jharvard@boslogin06 general]# scontrol show job 7000364
JobId=7000364 JobName=run_pros
   UserId=jharvard(21442) GroupId=jharvard_lab(10483) MCS_label=N/A
   Priority=313513 Nice=0 Account=jharvard_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2026-04-21T05:51:54 EligibleTime=2026-04-21T05:51:54
   AccrueTime=2026-04-21T05:51:54
   StartTime=2026-04-22T00:20:00 EndTime=2026-04-22T04:20:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-21T09:28:15 Scheduler=Main
   Partition=sapphire,shared AllocNode:Sid=holylogin06:928788
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList= SchedNodeList=holy8a24607
   StepMgrEnabled=Yes
   NumNodes=1-1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=250G,node=1,billing=36
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=/n/netscratch/jharvard_lab/Lab/jharvard/run_BlueJay.sh
   SubmitLine=sbatch run_BlueJay.sh
   WorkDir=/n/netscratch/jharvard_lab/Lab/jharvard/finaladd5perc_v28
   StdErr=
   StdIn=/dev/null
   StdOut=/n/netscratch/jharvard_lab/Lab/jharvard/finaladd5perc_v28/slurm-7000364.out
Of particular interest will be the Reason and StartTime fields. The Reason field will state why the job is pending, while the StartTime will give the current best estimate based on current cluster state as to when the job will start. Note that for job arrays this command will print out all elements, so it is best to specify which element you are interested in.
See the Broader Queue
The showq command can be used to show what the rest of the partition looks like. Often your job is pending due to other people in the partition. The showq command then shows you an overview of all the jobs for a specific partition. showq is invoked by doing:
showq -o -p PARTITIONNAME
Where -o orders the pending queue by priority, with the next job to be scheduled at the top. -p specifies the partition that you want to look at.
The sinfo command is used to get the general state of nodes in a partition. Nodes can be in the following states:
| State | Meaning |
|---|---|
| IDLE | Node is available for work. |
| MIXED | Node is partially used. |
| ALLOCATED | Node is fully used. |
| COMPLETING | Node has jobs which are finishing up. |
| PLANNED | Node will be used by a future job. |
| RESERVED | Node is part of a Reservation. |
| DRAINING | Node is closed to new jobs and existing jobs will run to completion. |
| DOWN | Node is offline. |
You can then use scontrol show node NODENAME to get information on a given node including why it may be DOWN or DRAINING.
Canceling Jobs
If for any reason, you need to cancel a job that you’ve submitted, just use the scancel command with the job ID.
scancel JOBID
If you don’t keep track of the job ID returned from sbatch, you should be able to find it with the sacct command described above. scancel can also do bulk cancellations based on various parameters such as Job Name and Partition.
Interactive Jobs and salloc
Though batch submission is the best way to take full advantage of the compute power of the cluster, foreground/interactive jobs can also be run. These can be useful for things like:
- Iterative data exploration at the command line
- RAM intensive graphical applications like MATLAB or SAS
- Interactive “console tools” like R and Jupyter
- Significant software development and compiling efforts
There are two main types of interactive sessions: Graphical User Interface (GUI) and Command Line Interface (CLI). For graphical sessions FASRC provides Open OnDemand (OOD). With Open OnDemand a user can launch a job which will start a Remote Desktop on the cluster or some other application in OOD.
Command line interactive jobs are instead launched directly from the login nodes using salloc. Please note that salloc is disabled on FASSE due to security considerations; use FASSE OOD instead. salloc accepts all the same options as sbatch. To start an interactive session run:
salloc -p test -c 1 --mem=4G -t 0-6:00:00
This will ask for 1 core and 4 GB of memory on the test partition for 6 hours. With salloc, if you append a command it will run it and then exit (this includes /bin/bash, which will just exit), but if you append no command it will simply start a remote shell on the node the scheduler selects. Jobs submitted via salloc behave like normal jobs for the sake of scheduling, so salloc may hang for a while if the partition you select is busy. It is therefore wise to select a partition like test or gpu_test where you are guaranteed immediate access. If you intend to use a busy partition, we recommend switching to Open OnDemand Remote Desktop instead.
Command line interactive sessions require you to be active in the session. If you go more than an hour without any kind of input, it will assume that you have left the session and will terminate it. If you have interactive tasks that must stretch over days, we recommend switching to Open OnDemand Remote Desktop.
Software
Users are permitted to install whatever software relevant to their research on the cluster, provided it complies with our Acceptable Use Policy. FASRC clusters run a unified Operating System (Rocky Linux 8) and system architecture (x86-64), so software built on one system should generally work on the entire cluster (unless built against a specific hardware type). Users are responsible for managing and maintaining their own software stack. Under no circumstances will a user be given sudo access to install software. See the software guide for more on how to use FASRC provided software modules, how to use Podman or Singularity containers, and how to install software of various types.
Using GPUs
To request a single GPU in Slurm, just add #SBATCH --gpus=1 to your submission script and it will give you access to a GPU. For more on GPU computing see our more in-depth GPGPU document.
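As a minimal sketch, a single-GPU batch script might look like the following (the partition, resources, and script name are illustrative placeholders; the module name is taken from the Python example above, so adjust everything to your own work):
#!/bin/bash
#SBATCH -p gpu                      # GPU partition
#SBATCH --gpus=1                    # number of GPUs
#SBATCH -c 4                        # CPU cores
#SBATCH --mem=16G                   # memory per node
#SBATCH -t 0-06:00                  # runtime in D-HH:MM
module load python/3.10.9-fasrc01
python my_gpu_script.py             # placeholder for your own GPU-enabled code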
Specifying GPU Type
For users who wish to specify which type of GPU to use, especially those using heterogeneous partitions like gpu_requeue, there are two methods. The first is using --constraint="<tag>", which constrains the job to run only on GPUs of a certain class; a full listing of constraints can be found below. The second is naming the specific model you want using --gpus=<model>:1. For example, if you want an A100 with 80GB of onboard memory you would specify --gpus=nvidia_a100-sxm4-80gb:1.
a100
- nvidia_a100-sxm4-40gb: Nvidia A100 SXM4 40GB
- nvidia_a100-sxm4-80gb: Nvidia A100 SXM4 80GB
h100 & h200
- nvidia_h100_80gb_hbm3: Nvidia H100 80GB HBM3
- nvidia_h200: Nvidia H200 140GB
mig
- nvidia_a100_1g.5gb: Nvidia A100 1g MIG 5GB
- nvidia_a100_1g.10gb: Nvidia A100 1g MIG 10GB
- nvidia_a100_3g.20gb: Nvidia A100 3g MIG 20GB
v100
- tesla_v100-pcie-16gb: Nvidia V100 PCIe 16GB
- tesla_v100-pcie-32gb: Nvidia V100 PCIe 32GB
- tesla_v100s-pcie-32gb: Nvidia V100S PCIe 32GB
a40
- nvidia_a40: Nvidia A40 40GB
rtx
- nvidia_rtx_a6000: Nvidia RTX A6000 PCIe 48GB
Some of the GPUs listed here were purchased by specific groups and are only available via gpu_requeue. To find out what specific types of GPUs are available on a partition, run scontrol show partition <PartitionName> and look under the TRES category.
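For example, the two methods look like this in a submission script (a sketch; choose one rather than combining them, and note that the constraint form matches any A100 while the model form pins the 80 GB variant):
#SBATCH --constraint="a100"                # any node with an A100
#SBATCH --gpus=nvidia_a100-sxm4-80gb:1     # specifically an 80 GB A100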
Parallelization
Using Threads such as OpenMP
One of the basic methods of parallelization is to use a threading library, such as pthreads or OpenMP, or applications that use OpenMP under the hood (e.g. numpy, OpenBLAS). Slurm by default does not know which cores to assign to which process it runs; in addition, for threaded applications you need to make sure that all the cores you request are on the same node. Below is an example script that both ensures all the cores are on the same node and lets Slurm know which process gets the cores that you requested for threading.
#!/bin/bash
#SBATCH -c 8                 # Number of threads
#SBATCH -t 0-00:30:00        # Amount of time needed DD-HH:MM:SS
#SBATCH -p sapphire          # Partition to submit to
#SBATCH --mem-per-cpu=100    # Memory per cpu
module load intel/25.3.1-fasrc01
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt
The most important aspect of the threaded script above is the -c option, which tells Slurm how many threads you intend to run with. If you are using OpenMP, you will also want to tell it how many threads it can use by setting OMP_NUM_THREADS before the executable:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
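In the threaded script above, that export would sit between the module load and the srun line, for example (MYPROGRAM remains a placeholder for your own executable):
module load intel/25.3.1-fasrc01
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt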
Using MPI
MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. OpenMPI, MPICH, and Intel MPI are available as modules on the cluster. As described in the module documentation, MPI libraries are a special class of module, called “Comp”, that is compiler dependent. To load an MPI library, load the compiler first.
module load intel/25.3.1-fasrc01 openmpi/5.0.10-fasrc01
Once an MPI module is loaded, applications built against that library are made available. This dynamic loading mechanism prevents conflicts that can arise between compiler versions and MPI library flavors.
An example MPI script with comments is shown below:
#!/bin/bash
#SBATCH -n 128               # Number of cores
#SBATCH -t 10                # Runtime in minutes
#SBATCH -p sapphire          # Partition to submit to
#SBATCH --mem-per-cpu=100    # Memory per cpu in MB (see also --mem)
module load intel/25.3.1-fasrc01 openmpi/5.0.10-fasrc01
module load MYPROGRAM
srun -n $SLURM_NTASKS --mpi=pmix MYPROGRAM > output.txt 2> errors.txt
There are a number of important aspects to an MPI SLURM job.
- Most partitions have a unified InfiniBand fabric, except for the requeue partitions. If you use the requeue partitions you will want to specify an IB fabric via the constraint option.
- Memory should be allocated with the --mem-per-cpu option instead of --mem so that memory matches core utilization.
- The -np option for mpirun or mpiexec (when these runners are used) should use the bash variable $SLURM_NTASKS so that the correct number of cores is passed to the MPI engine at runtime (see the short sketch after this list).
- If network topology and communications overhead is a concern for your code, try using the --contiguous option, which ensures that all the cores you get will be adjacent to each other. Use this with caution, though, as it will make your job pend longer since finding contiguous blocks of compute is difficult; verify that the boost in performance is worth the extra wait time in the queue. If you do not include this option you will be given cores on whatever nodes Slurm can find, which may be scattered across the cluster. Depending on your code this may or may not be a concern. If you don't know offhand, test your code in both modes to see whether the option is worth including; the aggregate time of waiting plus runtime may be longer with --contiguous. The sbatch and srun documentation have more information on various fine-tuning options.
- The application must be MPI-enabled. Applications cannot take advantage of MPI parallelization unless the source code is specifically built for it.
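If your application's documentation calls for mpirun or mpiexec rather than srun, the launch line would look like the following sketch (MYPROGRAM is a placeholder for your MPI-enabled executable):
mpirun -np $SLURM_NTASKS MYPROGRAM > output.txt 2> errors.txt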
Job Arrays
SLURM allows you to submit a number of “near identical” jobs simultaneously in the form of a job array. To take advantage of this, you will need a set of jobs that differ only by an “index” of some kind.
For example, say that you would like to run tophat, a splice-aware transcript-to-genome mapping tool, on 30 separate transcript files named trans1.fq, trans2.fq, trans3.fq, etc. First, construct a SLURM batch script, called tophat.sh, using special SLURM job array variables:
#!/bin/bash
#SBATCH -J tophat # A single job name for the array
#SBATCH -c 1 # Number of cores
#SBATCH --array=1-30 # Array range
#SBATCH -p serial_requeue # Partition
#SBATCH --mem 4000 # Memory request (4Gb)
#SBATCH -t 0-2:00 # Maximum execution time (D-HH:MM)
#SBATCH -o tophat_%A_%a.out # Standard output
#SBATCH -e tophat_%A_%a.err # Standard error
source activate tophat
tophat /n/netscratch/informatics_public/ref/ucsc/Mus_musculus/mm10/chromFa trans"${SLURM_ARRAY_TASK_ID}".fq
The --array flag sets the range of array indices to be run. Each array element is treated by the scheduler as an independent job for the sake of scheduling.
In the script, two types of substitution variables are available when running job arrays. The first, %A and %a, represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names. The second, SLURM_ARRAY_TASK_ID, is a bash environment variable that contains the current array index and can be used in the script itself. In this example, 30 jobs will be submitted each with a different input file and different standard error and standard out files. More detail can be found on the SLURM job array documentation page and our Submitting Large Numbers of Jobs page.
Checkpointing
Slurm does not automatically checkpoint, i.e. create files that your job can restart from. To protect against job failure (due to code error or node failure) and to allow your job to be broken up into smaller chunks it is always advisable to checkpoint your code so it can restart from where it left off. This is especially valuable for jobs on partitions subject to requeue, but is also just generally useful for any type of job. Checkpointing varies from code type to code type and needs to be implemented by the user as part of their code base. Some resources for checkpointing codes that do not have them built-in include Distributed MultiThreaded CheckPointing (DMTCP) and Checkpoint/Restore in Userspace (CRIU).
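For codes without built-in checkpointing, a DMTCP-wrapped run generally follows this pattern (a sketch only; it assumes DMTCP is installed or available as a module, and MYPROGRAM is a placeholder for your executable):
dmtcp_launch --interval 3600 ./MYPROGRAM     # write a checkpoint roughly every hour
# after a failure or requeue, resume from the most recent checkpoint files
dmtcp_restart ckpt_*.dmtcp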
Job dependencies
Many scientific computing tasks consist of serial processing steps. A genome assembly pipeline, for example, may require sequence quality trimming, assembly, and annotation steps that must occur in series. Launching each of these jobs without manual intervention can be done by repeatedly polling the controller with sacct until the State is COMPLETED. However, it’s much more efficient to let the SLURM controller handle this using the --dependency option.
[jharvard@boslogin01 examples]$ sbatch assemble_genome.sh
Submitted batch job 53013437
[jharvard@boslogin01 examples]$ sbatch --dependency=afterok:53013437 annotate_genome.sh
[jharvard@boslogin01 examples]$
When submitting a job, specify a combination of “dependency type” and job ID in the --dependency option. afterok is an example of a dependency type that will run the dependent job if the parent job completes successfully (state goes to COMPLETED). The full list of dependency types can be found on the SLURM doc site in the man page for sbatch. It is best not to create a chain of dependencies that is greater than 2-3 levels. Any more than that and the scheduler will become significantly slower. Dependencies should only be used if the resource requirements between each step are significantly different, or if you need to wait for an array to complete before you run a single job that processes all the array results. Be sure to think about whether you truly need dependencies or not.
Job Constraints
Sometimes, especially on the requeue partitions, jobs need to be constrained to run on specific hardware. Often this is because the code was compiled for a specific architecture or because the code runs more efficiently on a specific type of host. Slurm provides this functionality via the --constraint option (see the sbatch documentation for usage details). The features available for constraints are defined by FASRC and fall into three broad categories: Processor, GPU, and Network. You can match against several of these, but keep in mind that the more constraints you use, the longer your job will pend, as the scheduler will have a harder time finding nodes that fit your needs. A list of the features available on the cluster follows; you can also see the features for a specific node by running scontrol show node NODENAME.
Processor
- amd: All AMD processors
- intel: All Intel processors
- avx: All processors that are AVX capable
- avx2: All processors that are AVX2 capable
- avx512: All processors that are AVX512 capable
- milan: AMD Milan chips
- genoa: AMD Genoa chips
- skylake: Intel Skylake chips
- sapphirerapids: Intel Sapphire Rapids
- cascadelake: Intel Cascade Lake chips
- icelake: Intel Ice Lake chips
GPU
To specify a GPU model (for example, an A100 with 80GB), refer to Specifying GPU Type above.
- rtxa6000: Nvidia RTX A6000 GPU
- a40: Nvidia A40 GPU
- v100: Nvidia V100 GPU
- a100: Nvidia A100 GPU
- a100-mig: Nvidia A100 GPU MIG
- h100: Nvidia H100 GPU
- h200: Nvidia H200 GPU
Network
- holyhdr: Holyoke HDR Infiniband Fabric
- holyndr: Holyoke NDR Infiniband Fabric
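For example, in a submission script (a sketch; features can be combined with & for a logical AND per the sbatch documentation, and the combination shown is purely illustrative, so it may not match every node type):
#SBATCH --constraint="icelake"          # only Intel Ice Lake nodes
#SBATCH --constraint="intel&holyndr"    # Intel nodes on the Holyoke NDR InfiniBand fabric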
Fairshare and Job Prioritization
We use a multifactor method of job scheduling on the cluster. Job priority is assigned by a combination of fairshare and the length of time a job has been sitting in the queue. You can find out the priority calculation for your jobs by using the sprio command, such as sprio -j JOBID.
Fairshare is shared on a lab basis, so usage by any member of the lab impacts the score of the whole lab, as the lab pulls from a common pool. Fairshare has a 3-day half-life and naturally recovers if your lab does not run any jobs. Thus it is wise to bank fairshare if you need to do significant runs, and to plan your runs accordingly in order to maintain a good fairshare score. You can learn more about your fairshare score and Slurm usage by using the sshare command, such as sshare -U, which shows your current score.
The other factor in priority is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows, out to a maximum of 3 days. If everyone’s priority is equal then FIFO (first in first out) is the scheduling method. We weight the age of a job that has pended for 3 days to be equal to a fairshare score of 0.1.
We also have backfill turned on. This allows smaller jobs to sneak in while a larger, higher priority job is waiting for nodes to free up. If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period. This means knowing how long your code will run is very important and must be declared if you wish to leverage this feature; otherwise the scheduler will assume you will use the maximum allowed time for the partition. The better you constrain your job in terms of CPU, memory, and time, the easier it will be for the backfill scheduler to find you space and let your job jump ahead in the queue.
For more see:
- Fairshare and Job Accounting
- Job Efficiency and Optimization Best Practices
- jobstats
- Job Defense Shield
- Slurm Stats
Troubleshooting
A variety of problems can arise when running jobs on the cluster. Many are related to resource misallocation, but there are other common problems as well.
| Error | Likely cause |
|---|---|
| JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT | You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours). |
| Job <jobid> exceeded <mem> memory limit, being killed | Your job is attempting to use more memory than you’ve requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option, which could potentially be reduced. For jobs that require truly large amounts of memory (>1 TB), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp. |
| slurm_receive_msg: Socket timed out on send/recv operation | This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out whether SLURM is working, use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to. |
| JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE | This message may arise for a variety of reasons, but it typically indicates that the host on which your job was running can no longer be contacted by SLURM. Jobs that die from NODE_FAILURE are automatically requeued by the scheduler. |
Bookmarkable Links
- 1 Introduction
- 2 Getting Started
- 3 Slurm Documentation
- 4 Slurm Global Limits and Defaults
- 5 Slurm Partitions
- 6 Submitting Batch Jobs Using the sbatch Command
- 7 Monitoring Job Progress
- 8 Canceling Jobs
- 9 Interactive Jobs and salloc
- 10 Software
- 11 Using GPUs
- 12 Parallelization
- 13 Checkpointing
- 14 Job dependencies
- 15 Job Constraints
- 16 Fairshare and Job Prioritization
- 17 Troubleshooting