Cluster Quick Start Guide
- 1 Prerequisites
- 2 Accessing the Cluster and Cluster Resources
- 2.1 Use a terminal to SSH to login.rc.fas.harvard.edu
- 2.2 Transfer any files you may need
- 2.3 Familiarize yourself with proper decorum on the cluster
- 2.4 Determine what software you’d like to load and run
- 2.5 Determine where your files will be stored
- 2.6 Run a batch job…
- 2.7 … or an interactive job.
- 2.8 Getting further help
- 2.9 A note on requesting memory (--mem or --mem-per-cpu)
This guide provides the basic information needed to get up and running on the FASRC cluster for simple command line access. For more detail, each section links to fuller documentation.
Prerequisites
1. Get a FASRC account using the account request tool.
Before you can access the cluster you need to request a Research Computing account.
2. Set a password and set up OpenAuth for two-factor authentication
Once you have your new FASRC account, you will need to do two things to allow you to authenticate:
2a – Set your password – You will be unable to log in until you set your password via the RC password reset link: https://portal.rc.fas.harvard.edu/pwreset/
You will need to enter the same email address you used to sign up for your account. You will then receive an email with a reset link (this email and link expire and are for one-time use; they are never needed again). Once you’ve set your password, you can use your username and password to request your OpenAuth token.
2b – Set up OpenAuth Two-Factor – You will need to set up our OpenAuth two-factor authentication (desktop or app). See the OpenAuth Guide for instructions if you have not yet set up OpenAuth.
3. Use the FASRC VPN when connecting to storage, VDI, or other resources
Note that VPN is not required to SSH into a login node, but is required for many other tasks including VDI and mounting shares on your desktop/laptop.
4. Review our introductory training
Accessing the Cluster and Cluster Resources
Use a terminal to SSH to login.rc.fas.harvard.edu
NOTE: If you did not request cluster access when signing up, you will not be able to log into the cluster or login node as you have no home directory. You will simply be asked for your password over and over. See this doc for how to add cluster access as well as additional groups.
For command line access to the cluster, connect to login.rc.fas.harvard.edu using ssh. If you are running Linux or Mac OSX, open a terminal and type:
ssh USERNAME@login.rc.fas.harvard.edu
where USERNAME is the name you were assigned when you received your account (example: jharvard, but not jharvard@fasrc; that form is only necessary for VPN).
Enter the password you set after receiving your account confirmation email. When prompted for the Verification code, use the OpenAuth supplied number.
The OpenAuth application (upper right corner) displays the value to be used for the Verification code prompt.
You can also add the -CY options to the ssh command if you have an X11 server installed and want graphics support:
ssh -CY USERNAME@login.rc.fas.harvard.edu
For help with X11 forwarding, start with our Access and Login page.
For Windows users, we recommend PuTTY for SSH. HUIT (Harvard IT) also provides newer versions of SecureCRT (SSH) and SecureFX (SFTP). If you are in FAS and would like to try them, go to the HUIT download page (uses HarvardKey). Older versions of these programs will not work with modern SSH.
See our Access and Login page for more details on ways to connect to FASRC resources, including terminal applications.
You can also log into our OpenOnDemand Virtual Desktop Interface for GUI apps and interactive applications: Virtual Desktop through Open OnDemand
Transfer any files you may need
There are also graphical scp tools available. The Filezilla SFTP client is available cross-platform for Mac OSX, Linux, and Windows. See our SFTP file transfer using Filezilla document for more information. Windows users who prefer SCP can download it from WinSCP.net.
NOTE: If you are off campus or behind a firewall and wish to connect to FASRC servers other than the login servers, you should first connect to the Research Computing VPN.
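For command-line transfers, a minimal sketch using scp and rsync follows. It assumes both tools are installed on your local machine. USERNAME, the helper names (remote_path, push_to_cluster, pull_from_cluster), and the paths in the usage note are placeholders for illustration, not part of FASRC's tooling.

```shell
# Placeholder: replace USERNAME with your FASRC username.
FASRC_USER="USERNAME"
FASRC_HOST="login.rc.fas.harvard.edu"

# Build a remote location string; the path is relative to your cluster home directory.
remote_path() {
    printf '%s@%s:%s\n' "$FASRC_USER" "$FASRC_HOST" "$1"
}

# Copy one local file to the cluster (hypothetical helper).
push_to_cluster() {
    scp "$1" "$(remote_path "$2")"
}

# Copy a remote directory to the local machine; --partial keeps partially
# transferred files so an interrupted copy can be rerun without starting over.
pull_from_cluster() {
    rsync -av --partial "$(remote_path "$1")" "$2"
}
```

For example, push_to_cluster input.dat data/ would copy input.dat into the data folder under your cluster home directory. You will be prompted for your password and verification code, as with ssh.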
Familiarize yourself with proper decorum on the cluster
The FASRC cluster is a massive system of shared resources. While much effort is made to ensure that you can do your work in relative isolation, some rules must be followed to avoid interfering with other users’ work.
The most important rule on the cluster is to avoid performing computations on the login nodes. Once you’ve logged in, you must either submit a batch processing script or start an interactive session (see below). Any significant processing (high memory requirements, long running time, etc.) that is attempted on the login nodes will be killed.
See the full list of Cluster Customs and Responsibilities.
Determine what software you’d like to load and run
An enhanced module system called Helmod is used on the cluster to control the run-time environment for individual applications. To find out what modules are available, you can either look at the module list on the RC / Informatics portal or use the module avail command. By itself, module avail will print out the entire list of packages. To find a specific tool, use the module spider or module-query command.
Once you’ve determined what software you would like to use, load the module:
module load MODULENAME
where MODULENAME is the specific software you want to use. You can use module unload MODULENAME to unload a module. To see what modules you have loaded, type module list. This is very helpful information to provide when you submit help tickets.
For errors in loading modules after the O3 upgrade, see the Modules on CentOS7 upgrade page.
For details on finding and using modules effectively, see the Software on the cluster page.
For details on running software on the cluster, including graphical applications, see the module section of the Running Jobs page.
Determine where your files will be stored
Users of the cluster are granted 100 GB of storage in their home directory. This volume has decent performance and is regularly backed up. For many, this is enough to get going. However, there are a number of other storage locations that are important to consider when running software on the FASRC cluster.
- /n/holyscratch01 – Our global scratch (environment variable $SCRATCH) is a large, high-performance temporary Lustre filesystem. We recommend that people use this filesystem as their primary job working area, as this area is highly optimized for cluster use. Use this for processing large files, but realize that files will be removed after 90 days and the volume is not backed up. Create your own folder inside the folder of your lab group. If that doesn’t exist, contact RCHelp.
- /scratch – Local on-node scratch. When running batch jobs (see below), /scratch is a large, very fast temporary store for files created while a tool is running. This space is on the node’s local hard drive. It is a good place for temporary files created while a tool is executing because the disks are local to the node that is performing the computation, which makes access very fast. However, data is only accessible from that node, so you cannot directly retrieve it after calculations are finished. If you use /scratch, make moving any results onto another storage system part of your job.
- Lab storage – Each lab that is doing regular work on the cluster can request an initial 4 TB of group-accessible storage at no charge. Like home directories, this is a good place for general storage, but it is not high performance and should not be used during I/O intensive processing. See global scratch above.
Do NOT use your home directory or lab storage for significant computation.
This degrades performance for everyone on the cluster.
For details on different types of storage and how to obtain more, see the Cluster Storage page.
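One way to follow this advice inside a job script is sketched below: do the heavy I/O in local /scratch, then copy the results to global scratch before the job ends. The lab folder name (example_lab), the demo directory names, and the fallback to /tmp when the cluster paths are absent are assumptions for illustration. The echo line stands in for your real tool.

```shell
# Global scratch: $SCRATCH is set on the cluster; fall back to /tmp elsewhere (assumption).
GLOBAL_SCRATCH="${SCRATCH:-/tmp}"
# example_lab is a hypothetical lab group folder; use your own lab's folder.
RESULTS_DIR="${GLOBAL_SCRATCH}/example_lab/${USER:-jharvard}/quickstart_demo"
mkdir -p "$RESULTS_DIR"

# Local on-node scratch: use /scratch when it is writable, otherwise /tmp.
if [ -d /scratch ] && [ -w /scratch ]; then
    LOCAL_BASE=/scratch
else
    LOCAL_BASE=/tmp
fi
WORK_DIR="$(mktemp -d "${LOCAL_BASE}/quickstart_XXXXXX")"

# Stand-in for the real computation: write temporary output to local scratch.
echo "demo output" > "${WORK_DIR}/result.txt"

# Copy results off the node before the job ends, then clean up local scratch.
cp "${WORK_DIR}/result.txt" "${RESULTS_DIR}/"
rm -rf "$WORK_DIR"
```

Keeping the copy and cleanup steps in the job itself matters because local scratch is only reachable from the node that ran the job.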
Run a batch job…
The cluster is managed by a batch job control system called SLURM. Tools that you want to run are embedded in a command script and the script is submitted to the job control system using an appropriate SLURM command.
For a simple example that prints the hostname of a compute host and sends standard out and standard err to separate files, create a file called hostname.slurm with the following content:
#!/bin/bash
#SBATCH -n 1 # Number of cores requested
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 15 # Runtime in minutes
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=100 # Memory per node in MB (see also --mem-per-cpu)
#SBATCH -o hostname_%j.out # Standard out goes to this file
#SBATCH -e hostname_%j.err # Standard err goes to this file
hostname
Then submit this job script to SLURM:
sbatch hostname.slurm
When command scripts are submitted, SLURM looks at the resources you’ve requested and waits until an acceptable compute node is available on which to run it. Once the resources are available, it runs the script as a background process (i.e. you don’t need to keep your terminal open while it is running), returning the output and error streams to the locations designated by the script.
You can monitor the progress of your job using the squeue -j JOBID command, where JOBID is the ID returned by SLURM when you submit the script. The output of this command will indicate if your job is PENDING, RUNNING, COMPLETED, FAILED, etc. If the job is completed, you can get the output from the file specified by the -o option. If there are errors, they should appear in the file specified by the -e option.
If you need to terminate a job, the scancel JOBID command can be used (JOBID is the number returned when the job is submitted).
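To reuse the job ID in later commands, you can capture it at submission time. The sketch below relies on sbatch printing a confirmation line of the form "Submitted batch job JOBID". extract_job_id is a hypothetical helper, and the SLURM commands run only where sbatch is installed.

```shell
# Hypothetical helper: pull the numeric ID out of sbatch's confirmation line.
extract_job_id() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

if command -v sbatch >/dev/null 2>&1; then
    JOBID="$(extract_job_id "$(sbatch hostname.slurm)")"
    squeue -j "$JOBID"      # check whether the job is PENDING, RUNNING, etc.
    # scancel "$JOBID"      # uncomment to terminate the job
else
    echo "sbatch not found; run this on a cluster login node"
fi
```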
SLURM-managed resources are divided into partitions (known as queues in other batch processing systems). Normally, you will be using the serial_requeue partition, but there are others for interactive jobs (see below), large memory jobs, etc.
For more information on the partitions on the cluster, please see the SLURM partitions page.
For more information on running batch jobs, including MPI code, please see the Running Jobs page.
For a list of useful SLURM commands, please see the Convenient SLURM Commands page.
… or an interactive job.
Batch jobs are great for long-lasting computationally intensive data processing. However, many activities like one-off scripts, graphics and visualization, and exploratory analysis do not work well in a batch system, but are too resource intensive to be done on a login node. There is a special partition on the cluster called “test” that is designed for responsive, interactive shell and graphical tool usage.
You can start an interactive session using a specific flavor of the srun command:
srun -p test --pty --mem 500 -t 0-08:00 /bin/bash
srun is like sbatch, but it runs synchronously (i.e. it does not return until the job is finished). The example starts a job on the “test” partition, with pseudo-terminal mode on (--pty), an allocation of 500 MB RAM (--mem 500), and a time limit of 8 hours (-t 0-08:00, in D-HH:MM format). It also assumes one core on one node. The final argument is the command that you want to run. In this case you’ll just get a shell prompt on a compute host. Now you can run any normal Linux commands without taking up resources on a login node. Make sure you choose a reasonable amount of memory (--mem) for your session.
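The D-HH:MM time format is easy to misread. As a quick check, the sketch below converts such a value to minutes with awk. to_minutes is a hypothetical helper for illustration, not a SLURM command.

```shell
# Hypothetical helper: convert a D-HH:MM time limit to total minutes.
to_minutes() {
    echo "$1" | awk -F'[-:]' '{ print $1 * 1440 + $2 * 60 + $3 }'
}

to_minutes "0-08:00"   # prints 480, i.e. 8 hours
```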
For graphical tools we have the Virtual Desktop through Open OnDemand. Simply get on the RC VPN and go to https://vdi.rc.fas.harvard.edu to get started using OnDemand.
Getting further help
If you have any trouble with running jobs on the cluster, first check the comprehensive Running Jobs page and our FAQ. Then, if your questions aren’t answered there, feel free to contact us at RCHelp. Tell us the job ID of the job in question. Also provide us with what script you ran, the error and output files, and where they’re located as well. The output of module list is helpful, too.
A note on requesting memory (--mem or --mem-per-cpu)
In SLURM you must declare how much memory you are using for your job using the --mem or --mem-per-cpu command switches. By default SLURM assumes you need 100 MB. If you don’t request enough, the job can be terminated, often without very useful information (error files can show segfaults, file write errors, etc. that are downstream symptoms). If you request too much, it can increase your wait time (it’s harder to allocate a lot of memory than a little), crowd out jobs for other users, and lower your fairshare.
You can view the runtime and memory usage for a past job with
sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed
where JOBID is the numeric job ID of a past job:
JobID         JobName   ReqMem    MaxRSS    Elapsed
531306        sbatch                        00:02:03
531306.batch  batch     750000K   513564K   00:02:03
531306.0      true                916K      00:00:00
The .batch portion of the job is usually what you’re looking for, but the output may vary. This job had a maximum memory footprint of about 500 MB, and took a little over two minutes to run.
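One way to turn a MaxRSS reading into the next memory request is to convert it to MB and add some headroom. The sketch below assumes MaxRSS is reported in kilobytes with a trailing K, as in the sample output. The 20% headroom is an arbitrary illustrative margin rather than a FASRC recommendation, and suggest_mem_mb is a hypothetical helper.

```shell
# Hypothetical helper: convert a MaxRSS value such as 513564K to MB,
# add 20% headroom (an assumed margin), and round up to a whole number.
suggest_mem_mb() {
    kb="${1%K}"
    awk -v kb="$kb" 'BEGIN { mb = kb / 1024 * 1.2; r = int(mb); if (r < mb) r++; print r }'
}

suggest_mem_mb 513564K   # prints 602, a candidate value for --mem
```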