GPU Computing on the FASRC cluster

The FASRC cluster has a number of nodes equipped with NVIDIA general-purpose graphics processing units (GPGPUs). You can use CUDA tools to run computational work on them and, in some use cases, see very significant speedups. Details on public partitions can be found here. SEAS users should check here for available partitions.

GPGPUs on SLURM

To request a single GPU on SLURM, just add #SBATCH --gres=gpu to your submission script and it will give you access to one GPU. To request multiple GPUs, add #SBATCH --gres=gpu:n where ‘n’ is the number of GPUs. You can use this method to request CPUs and GPGPUs independently. So if you want 1 CPU and 2 GPUs from our general-use GPU nodes in the ‘gpu’ partition, you would specify:

#SBATCH -p gpu
#SBATCH -n 1
#SBATCH --gres=gpu:2
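
Putting this together, a minimal complete batch script might look like the sketch below; the time, memory, module version, and program name are placeholders to adapt to your own workload:

#!/bin/bash
#SBATCH -p gpu                 # general-use GPU partition
#SBATCH -n 1                   # 1 CPU core
#SBATCH --gres=gpu:2           # 2 GPUs
#SBATCH -t 0-01:00             # runtime in D-HH:MM
#SBATCH --mem=8000             # memory in MB

module load cuda/<version>     # see the CUDA Runtime section below
./my_gpu_program               # placeholder for your GPU-enabled executable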

When you submit a GPU job, SLURM automatically selects the GPUs and restricts your job to those devices. In your code you reference them using zero-based indexing from [0, n), where n is the number of GPUs requested. For example, if you are using a GPU-enabled TensorFlow build and requested 2 GPUs, you would simply reference gpu:0 or gpu:1 from your code.
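
For instance, inside a job that requested --gres=gpu:2 you can quickly confirm which devices were assigned (a simple check, assuming the usual Slurm device binding on our GPU nodes):

echo $CUDA_VISIBLE_DEVICES    # typically prints 0,1 for a 2-GPU job
nvidia-smi -L                 # lists only the GPUs visible to this job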

To request a specific type of GPU, add #SBATCH --gres=gpu:name:n, where name is the GPU model being requested. The GPU models currently available on our cluster can be found here. See the official Nvidia website for more details.
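
As an illustration, a request for one GPU of a specific model might look like the lines below; v100 is just an example name, so substitute one of the model strings from the list linked above:

#SBATCH -p gpu
#SBATCH -n 1
#SBATCH --gres=gpu:v100:1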

For an interactive session to work with the GPUs, you can use the following:

salloc -p gpu_test -t 0-01:00 --mem 8000 --gres=gpu:1

While on a GPU node, you can run nvidia-smi to get information about the assigned GPUs.

The partition gpu_requeue is a backfill partition similar to serial_requeue and allows you to submit jobs to idle GPU-enabled nodes. Please note that the hardware in that partition is heterogeneous. SLURM is aware of the model name and compute capability of the GPU devices on each compute node.

Name or compute capability can be requested as a constraint in your job submission. When running in gpu_requeue, nodes with a specific model can be selected using --constraint=modelname, or, more generally, nodes offering a card with a specific compute capability can be selected using --constraint=ccx.x (e.g. --constraint=cc7.0 for compute capability 7.0).

For example, if your code needs to run on devices with at least compute capability 3.7, you would specify:

#SBATCH -p gpu_requeue 
#SBATCH -n 1 
#SBATCH --gres=gpu:1
#SBATCH --constraint=cc3.7

CUDA Runtime

The version of the Nvidia driver installed on GPU-enabled nodes may vary over time, so it's best to request an interactive job and then run nvidia-smi to check the current driver and CUDA runtime versions:

Tue Jun  8 06:12:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   44C    P0    31W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

To use the toolkit and additional runtime libraries (cublas, cufftw, …), always load the cuda module in your Slurm job script or interactive session:

$ module-query cuda

$ module load cuda/<version>
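
Once the module is loaded, the CUDA compiler and runtime libraries are available in your environment. For example, you can check the toolkit version and compile a source file (my_kernel.cu is a placeholder for your own code):

$ nvcc --version                  # confirm the toolkit version provided by the module
$ nvcc -o my_kernel my_kernel.cu  # compile a CUDA source file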

NOTE: In the past our CUDA installations were heterogeneous and different nodes on the cluster provided different versions of the CUDA driver. For this reason you might have used the Slurm flag --constraint=cuda-$version (for example --constraint=cuda-7.5) in your job submissions to specifically request nodes that supported that version. This is no longer needed, as our CUDA modules are the same throughout the cluster, and you should remove those flags from your scripts.

Using CUDA-dependent modules

CUDA-dependent applications are accessed on the cluster in a manner that is similar to compilers and MPI libraries. For these applications, a CUDA module must first be loaded before an application is available. For example, to use cuDNN, a CUDA-based neural network library from NVIDIA, the following command will work:

$ module load cuda/11.1.0-fasrc01 cudnn/8.0.4.30_cuda11.1-fasrc01

If you don’t load the CUDA module first, the cuDNN module is not available.

$ module purge
$ module load cudnn/8.0.4.30_cuda11.1-fasrc01
Lmod has detected the following error:
The following module(s) are unknown: “cudnn/8.0.4.30_cuda11.1-fasrc01”
Please use the command module-query or our user Portal to find available versions and how to load them.
More information on software modules can be found here, and on how to run jobs here.

Example Codes

We experiment with different libraries based on user requests and try to document simple examples for our users.

Please visit https://github.com/fasrc/User_Codes

Performance Monitoring

Nvidia

Besides nvidia-smi and nvtop, Nvidia also provides Nsight and Data Center GPU Manager (DCGM) for monitoring job performance. You can find a walkthrough on how to use DCGM here. It is recommended to name the GPU group something other than allgpus, which is the name used in their example.
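
If you only need a lightweight utilization log, one simple approach (a sketch, not an FASRC-specific tool) is to sample nvidia-smi in the background from your batch script while your workload runs:

# Append GPU utilization and memory usage to a CSV every 30 seconds
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used \
    --format=csv -l 30 > gpu_usage_${SLURM_JOB_ID}.csv &
MONITOR_PID=$!
./my_gpu_program      # placeholder for your actual GPU workload
kill $MONITOR_PID     # stop the background monitor once the workload finishes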

Weights & Biases

Weights & Biases is also an excellent tool for monitoring job performance. It can display performance plots on a per-job basis. Follow their guides for how to add monitoring to your jobs.

Training

You can download training slides from here.
