Job Defense Shield

Overview

Job Defense Shield (JDS) is an application built by Princeton University Research Computing for job efficiency monitoring. Leveraging the statistics collected by jobstats, JDS allows administrators to trigger automated emails and other actions (including job cancellation) based on various thresholds. JDS is being used on FASRC clusters to send out weekly emails to users regarding job inefficiencies with the goal of aiding users in improving their usage and job throughput.

JDS Emails

On Tuesday mornings JDS evaluates the previous week’s worth of cluster usage (Tuesday of the previous week through Monday of the current week). It then sends emails to users who exceed certain thresholds of job inefficiency. The emails contain a description of the problem, a list of the offending jobs, and recommendations for improving efficiency. To stop receiving the emails, users simply need to improve their efficiency beyond the recommended thresholds. The thresholds and recommendations for each alert email are as follows:

Jobs with Zero CPU Utilization

This alert indicates that your job allocated cores but used none of the cores allocated on one or more nodes. This can indicate that:

  • Your job did not start properly: Check your runscript and test to make sure that your script is working as intended.
  • You were idle: This means you started an interactive session but did not do anything in it. Only start interactive sessions if you intend to do work. Close sessions that you are done with.
  • Your job is not properly parallelizing: Check your code’s documentation to see if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.
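As a minimal sketch, a batch script that actually uses the cores it requests might look like the following. The program name, module setup, and resource values here are placeholders; Slurm only allocates the cores, and the program itself must be told to use them (for example via OpenMP's thread-count variable):

```shell
#!/bin/bash
#SBATCH --job-name=parallel_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # only request cores the program will use
#SBATCH --time=0-01:00
#SBATCH --mem=8G

# Slurm does not parallelize code for you: the program must be told how
# many threads to run. OMP_NUM_THREADS is the usual convention for
# OpenMP codes; ./my_program is a placeholder for your executable.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_program
```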

Serial Jobs Allocating Multiple Cores

This alert indicates that your job used only a single core but requested multiple cores. This means your job either did not parallelize properly or cannot be parallelized. Check your code’s documentation to see if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.
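If the code turns out to be serial, the fix is simply to request a single core. A sketch, with the executable name as a placeholder:

```shell
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1   # serial code: one core is all it can use
#SBATCH --time=0-02:00
#SBATCH --mem=4G

./my_serial_program          # placeholder for your actual executable
```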

Jobs with Low CPU Efficiency

This alert indicates that your job was below 80% CPU utilization for the run. This can indicate that:

  • Your job is not well optimized: Check your code’s documentation to see if there are methods for improving optimization. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. We also have a general guide on code optimization that you can use to diagnose problems.
  • Your job is not properly parallelizing: Check your code’s documentation to see if and how your code parallelizes. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. Note that Slurm does not automatically parallelize code, even if you ask for more than one core.
  • Your job is not scaling: Run a scaling test to find out how many cores your code can use effectively. See if there are newer versions of the code or compilers that scale better, or work to better optimize your code for higher core counts.
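One simple way to run a scaling test is to submit the same workload at several core counts and compare runtimes afterwards. The loop below is a sketch; run_workload.sh is a placeholder for your own batch script:

```shell
# Submit the same workload at increasing core counts, then compare the
# Elapsed time reported by sacct/seff for each job. Beyond the point
# where doubling the cores stops meaningfully reducing the runtime,
# the extra cores are being wasted.
for n in 1 2 4 8 16 32; do
    sbatch --cpus-per-task=$n --job-name=scale_$n run_workload.sh
done
```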

Jobs Requesting Too Much CPU Memory

This alert indicates that your job’s peak memory usage was below 80% of the memory it requested. To fix this, request an amount of memory closer to what your job actually uses.
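You can check what a completed job actually used with seff or sacct and then size the request with a modest safety margin. The job ID below is a placeholder:

```shell
# Inspect what a finished job actually used:
seff 12345678                                   # reports Memory Utilized
sacct -j 12345678 --format=JobID,MaxRSS,ReqMem  # peak RSS vs. requested

# Then request just above the observed peak. For example, if MaxRSS
# was around 6 GB, a future submission might use:
#SBATCH --mem=8G
```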

Requesting Too Much Time for CPU Jobs

This alert indicates that your job used less than 50% of the time that it requested. To fix this, request a time limit closer to your job’s actual runtime.
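As with memory, you can compare a job’s actual runtime against its request and tighten future submissions accordingly. The job ID and times below are illustrative:

```shell
# Compare how long the job ran against what it requested:
sacct -j 12345678 --format=JobID,Elapsed,Timelimit

# If jobs consistently finish in ~5 hours against a 24-hour request,
# tighten the limit while leaving some headroom, e.g.:
#SBATCH --time=0-08:00
```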

Jobs with Zero GPU Utilization

This alert indicates that your job allocated GPUs but used none of them. This can indicate that:

  • Your job did not start properly: Check your runscript and test to make sure that your script is working as intended.
  • Your code was built for a specific GPU: Make sure your code is GPU-type agnostic, or use Slurm flags to request the specific GPU type you need.
  • You were idle: This means you started an interactive session but did not do anything in it. Only start interactive sessions if you intend to do work. Close sessions that you are done with.
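If your code requires a particular GPU architecture, pinning the GPU type in the request avoids silently landing on hardware the code cannot use. The partition and GPU type names below are examples only; check your cluster’s documentation or sinfo for the names actually available:

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1               # request one GPU; a type can be added,
                                   # e.g. --gres=gpu:a100:1 (example name)
#SBATCH --time=0-01:00

# Quick sanity check that the job can actually see its GPU before the
# real work starts:
nvidia-smi
./my_gpu_program                   # placeholder for your executable
```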

Jobs with Low GPU Efficiency

This alert indicates that your job was below 25% GPU utilization for the run. This can indicate that:

  • Your job is not well optimized: Check your code’s documentation to see if there are methods for improving optimization. If the documentation does not specify, talk to your colleagues who run the same code or reach out to the primary developer. We also have a general guide on code optimization that you can use to diagnose problems.
  • The GPU is too powerful: There may in fact be nothing more you can do to optimize your code; the GPU you are using is simply overkill for your workload. In that case it is beneficial to switch to a less powerful GPU that is more closely matched to your code’s performance needs.

Repeat Offense

Users should endeavour to rectify their workflows. Those who do not will be contacted by FASRC staff. Continued failure to improve may lead to fairshare reduction, job cancellation, and, ultimately, account bans. The cluster is a shared resource, so it behooves all users to use it as efficiently as possible, as doing so helps accelerate research for all cluster users.

© The President and Fellows of Harvard College.
Except where otherwise noted, this content is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.