Job Efficiency and Optimization Best Practices

Overview

The art of High Performance Computing is really the art of getting the most out of the computational resources you have access to. This applies to working on a laptop, to working in the cloud, or working on a supercomputer. While this diversity of different systems and environments may seem intimidating, in reality there are some good general rules and best practices that you can use to get the most out of your code and the computer you are on.

As defined in our Glossary, the term job has a broad and a narrow sense. In the broad sense, a job is an individual run of an application, code, or script; and may be used interchangeable with those terms. This includes whether you run it from the command line, cronjob, or use a scheduler. In the narrow sense, a job is an individual allocation for a user by the scheduler. It is usually obvious from the context which is meant.

By Job Efficiency, we mean that the parameters of the job in terms of cpu, gpus, memory, network, time, etc. (refer to Glossary for definitions) are accurately defined and match what the job actually uses. As an example, a job that asks for 100 cores but only uses 1 is not efficient. A job that asks for 100GB and uses 99GB is efficient.  Efficiency is a measure of how well the user has scoped their job so that it can run in the space defined.

Finally Job Optimization means to make the job run at the maximum speed possible with the least amount of resources used. For example, a poorly optimized code may only use 50% of the gpu it was allocated, whereas a well optimized one could use 100% and see acceleration commensurate with that improved usage. Similarly, a poorly optimized code may use 1TB of memory, but a well optimized code may only use 100GB. Optimization is a measure of how well structured a code is numerically, both in terms of algorithm and implementation, so that it can get to the solution in the fastest, most accurate, and most economical way.

Efficiency and Optimization are thus two sides of the same coin. Efficiency is about accurately defining the resources that you will use and optimization is about reducing that usage. Both have the goal of getting the most out of the resources the job is using.

Architecture

Schematic showing how cores, memory, and nodes are arranged on the cluster.
Schematic showing how cores, memory, and nodes are arranged on the cluster.

Before we get into talking about Job Efficiency and Optimization, we should first discuss general cluster architecture. Supercomputers (aka clusters) are essentially a bunch of computers of similar type networked together by a high speed interconnect so that they can act in unison with each other and have a common computational environment. The fundamental building block of a cluster is a node. Each node is composed of a bunch of cores that all talk to the same block of memory. GPU nodes have in addition to cores and memory, GPU’s which can be used for specialized workflows such as machine learning. The nodes are then strung together with a network, typically Infiniband, and then a scheduler is put in front to handle what jobs get what resources.

CPU/GPU Type

Typically a cluster will be made up of a uniform set of hardware. However that is not necessarily the case. At FASRC we run a variety of hardware, spanning multiple generations and vendors. These different types of CPU and GPU have various performance characteristics and features that may have impacts on your job. We will talk about this later but being cognizant of what hardware your job works best on will be important for efficient and optimal use of the cluster.  At FASRC we split up our partitions such that each partition has a uniform set of hardware, unless otherwise noted (e.g. gpu_requeue and serial_requeue). A comprehensive list of available hardware can be found in the Job Constraints section of the Running Jobs page. You can learn more about a specific node’s hardware by running scontrol show node NODENAME.

Network Topology

How the nodes are interconnected on the Infiniband is known as the topology.  Topology becomes very important as you run larger and larger jobs, more on that later. At FASRC we generally follow a Fat Tree Infiniband topology, with a hierarchy of switching and adjacent nodes being close to each other network-wise.

We name our nodes after their location in the datacenter, which makes it easy to figure out which nodes are next to each other both in terms of space and in terms of network. Our naming convention for nodes is Datacenter/Node has a GPU or Not/Row/Pod/Rack/Chassis/Blade. So for example the node holy7c04301 is at our Holyoke Data Center in Row 7 Pod C Rack 04 Chassis 3 Blade 01, the adjacent nodes would be holy7c04212 (which is the last blade on the chassis below Chassis 3 as there are 12 blades per chassis for this hardware) and holy7c04302. Another is example is holygpu8a11404, this node is a GPU node in our Holyoke Data Center in Row 8 Pod A Rack 11 Chassis 4 Blade 4, the adjacent nodes would be holygpu8a11403 and holygpu8a11501 (which is the next blade on the chassis above Chassis 4 as there are only 4 blades per chassis for this hardware). You can see the full topology of the cluster by doing scontrol show topology.

Job Efficiency

The first step in improving job efficiency is understanding your job as it exists today. Understanding means you have a good handle on the resource needs and characteristics of your job, and thus you are able to accurately allocate resources to it thereby improving efficiency. As a general rule, one should always understand the jobs you run regardless of size. This knowledge is both beneficial for right sizing your requested resources, but also for noticing any pitfalls that may occur when scaling the job up.

There are two ways of learning about your job. The first is to have a fundamental understanding of the job you are running. Based on your knowledge of the algorithm, code, job script, and cluster architecture, you know what you should request for core count, gpu count, memory, time, and storage. Knowing your code at this level will allow you to make the most accurate estimates for what you will use.

While a full understanding of your job is ideal, often it is not possible. You may not control the code base, you may just be getting started, or you may not have time to obtain a deep understanding of the job. Even in cases where you have a good theoretical understanding of your job, you need to confirm that knowledge with hard data. In which case the second method is to test your job empirically and find out what the best job parameters are. Simply take an example that you know will be akin to what you will run in production and run it as a test job. Then once the test job is done, check to see how it performed. You then repeat, changing the job parameters until you have a good understanding of how your job performs in different situations.

That’s the rough sketch, but the details are a bit different depending on what you want to understand. Below are some methods for finding out how much memory, cores, gpus, time, and storage your job will need. These may not cover every job but should work for most situations.

Memory

Memory on the cluster is doled out in two different ways, either by node (--mem) or by core (--mem-per-cpu). If your job exceeds its memory request, you will see an error either containing Out of Memory or oom. This indicates that the scheduler terminated your job for exceeding your memory request. You will need to increase your memory allocation to give your job more space.

Here is a test plan for figuring out how much memory you should request:

  1. Come up with an initial guess as to how much memory your job will require. A good first guess is usually 2-3 times the size of any data files you are reading in, or 2-3 times the size of the data you will be generating. If you do not know either of those then a safe initial guess is 4GB. Most of the cluster has 4GB per core, so its a good initial guess that will allow you to get through the scheduler in a quick manner.
  2. Run a test job on the test partition with your guess.
  3. Check the result of your run using seff or sacct. (Note: GPU stats (usage and onboard GPU memory) are not given by either of these commands, only CPU usage and CPU memory)
  4. If your job ran out of memory then double the amount and return to step 2.  If it ran properly (i.e, no out of memory error), then look at how much your job actually used and update your request to match with an additional 10% buffer as the scheduler’s sampling of the memory usage runs every 30 seconds and it may have missed any short term memory spikes.

Every time you change a parameter in your code, you should check to see how the memory changes. Some parameters will not change the memory usage at all. Others will change it dramatically.  If you do not know if a parameter will change the memory usage run a test to see how it behaves.

If you are working to scale up a job its good to understand how your memory usage will scale as the job increases. For example, say you are running a three dimensional code and you increase the resolution of the box you are simulating by 2. That means that your memory usage will grow by a factor of 8 because each dimension grew by a factor of 2. Likewise if you are running a simulation that ingests data, it will likely scale linearly with the amount of data you ingest. Testing by increments is the best way to validate how your memory usage will grow depending on situation.

One important warning is to make sure to use the correct memory for each type of job. When your job runs, the scheduler blocks off a segment of memory for you to use, regardless of if you actually use it. If your job asks for 100GB but only uses 1GB, the scheduler will give you 100GB and your fairshare will also be charged for that 100GB. In addition, if you had asked for 1GB your job may have been better able to fit into the gaps in the scheduler, as 1GB of memory is easier to find than 100GB. Efficient use of the cluster means selecting the right amount of memory for whatever job you are running at that time. A quick way to spot if you have any jobs with incorrect memory settings is to use the seff-account command which will plot a histogram of your job memory efficiency over a specified period.

Cores

Slurm does not automatically parallelize jobs. Even if you ask for 1000’s of cores, if your job is not set up to run in parallel, your job will just run on a single core and the other cores will remain idle. Thus when in doubt about your code, err on the side of asking for a single core and then check the code’s documentation or contact the primary author to find out whether it is parallel and what method it uses.

Broadly parallel applications fall into two categories: thread based and rank based. Thread based parallelism relies on a shared memory space and thus is constrained to a single node. This includes things like OpenMP, pthreads, and python multi-processessing. Rank based parallelism relies on individual processes that have their own dedicated memory space which communicate with each other to share information. The main example of this is MPI (Message Passing Interface). It is important to understand which method your job uses as that will make a difference how you ask for resources in Slurm and how many cores you can reasonably ask for.

Once you figure out if your code is thread based or rank based, you can then do a scaling test to figure out how your code behaves as you add more cores.  There are two types of scaling tests you can do, both test slightly different parts of your code. The first type is called strong scaling. In this test, you keep the size of the problem the same while increasing the number of cores you use. In an ideal world your job should go twice as a fast every time you double the amount of cores you use. Most codes though do not have ideal scaling. Instead various inefficiencies in the algorithm or the size of the job itself mean that there is a point of diminishing returns where adding more cores does not improve speed.  Typically when you plot a chart of strong scaling you will see:

Chart showing strong scaling with ideal line in red and experimental points in black. The black line bends away from the red showing the point where scaling becomes inefficient.
Strong Scaling Plot. Plot is Log-Log with the ideal scaling line in red and the experimental data in black.

In this example, the user would not want to run their code with more than 256 cores because after that point adding more cores has diminishing returns with respect to improving performance.

The second type is called weak scaling. In this test you increase the size of the job proportional to the number of cores asked for. So if you double the cores, you would double the job size. Job size in this case is the amount of total computational work your job does. For instance you might double the amount of data ingested (assuming your code computational needs increase linearly as you increase the data ingested) or double one of the dimensions of multidimensional grid.  In an ideal world, your job should take the same amount of time to run if the job size grows linearly with the core count.  Most codes though do not have this ideal scaling. Instead various communications inefficiencies or nonlinear growth in processing time can impact the performance of the job and thus adding more cores would be inefficient beyond a certain point. A typical plot for weak scaling looks like:

 

Chart showing weak scaling with ideal line in red and experimental points in black. The black line bends away from the red showing the point where scaling becomes inefficient.
Weak Scaling Plot. Plot is Log-Linear with the ideal scaling line in red and the experimental data in black.

In this example, the user would not want to run this job with more than about 1000 cores as, after that point, the run time grows substantially from the ideal.

Besides these more robust scaling tests, you can get a quick view of your job core usage efficiency by using the seff or seff-account command. Those commands will take the ratio of two numbers. The first number is how much time you actually used of the cores for the job (t_cpu), this is known as the system CPU time. Note that for historical reasons CPU and Core are used interchangably with respect to CPU time. Regardless of the name what is meant is the amount of time that the system detects as being spent computing on a specific set of cores.  The second number is your elapsed run time multiplied by the number of cores (t_elapsed*n). If your job scales perfectly, your CPU efficiency (t_cpu/(t_elapsed*n)) will be 100%. If it is less than 100%, then the ratio of that will be roughly the number of cores you should reduce your job by. So say your job uses 8 cores, but you have an efficiency of 50% in seff, then you should reduce your ask to 4 cores instead. This is also a good way to check quickly if your job is scaling at all as if you see your job only using one core you know that either your job is not parallel or alternatively something is wrong and you need to investigate why your job is not scaling.

Time to Science (TtS)

With these two tests you can figure out the maximum number of cores you should ask for. That said, even if your core scales perfectly you will probably not want to ask for the maximum number of cores you can.  The reason for this is that the more cores you ask for the longer your job will pend in the queue, waiting for resources to become available. Time to Science (TtS) is the sum of the amount of time your job pends for plus the amount of time your job runs for. You want to minimize both. Counterintuitively, it may be the case that asking for less cores will mean your job will pend for substantially shorter, enough to make up for the loss in the run’s speed.

As an illustration, say your code scales perfectly and your job of 256 cores will take 1 day to run.  However it turns out that you will be spending 2 days pending in the queue waiting for your job to launch, thus your total TtS is 3 days. After more investigation you find out that if you ask for 128 cores, your job will take 2 days to run but the scheduler will be able to launch it in 4 hours leading to TtS of 2.25 days. You can see that the 128 core job was “faster” than the 256 core job, simply due to the fact that the 128 core job fit better in the scheduler at that moment.

It should be noted that the scheduler state is fluid and thus one should inspect the queue before submitting. You can test when your job is scheduled to run by adding the --test-only flag to your sbatch line, that will cause the scheduler to print back when it thinks the job will execute. This is a good way of right sizing your job.

Topology

For certain codes, layout on the node (i.e. which cores on which CPU) and cluster (i.e. where the nodes are located relative to each other) matters. In these cases the topology of the run is critical to getting the peak speed out of the job. Without deep knowledge of the code base, it’s hard to know if your code is one of these codes, and in most cases your code is not.

In cases where the topology of the run matters, Slurm provides a number of options to require the scheduler to give you a certain layout for the job. Both the sbatch and srun commands have options to this effect. Note that the more constraints you add to a job the longer it will take the scheduler to find resources for your job to use. One should set the minimum necessary restrictions on a job to give the scheduler maximum flexibility. As before, it may be the case that you may see a significant speed up if given the right topology but if it comes at the cost of having to wait significantly longer to run, your TtS may actually not improve or even get worse.

GPUs

Many of the same rules that applied to cores also apply to GPU’s. For most codes, your job will use a single GPU. If your code uses multiple GPU’s then you can follow the same process as above for cores to see how your code scales. Note that currently GPU efficiency is not recorded in Slurm. As such you will want to use other tools like DCGM and nvtop to get statistics on how your job is doing.

Time

It should be stated upfront that Slurm does not charge you fairshare for time you do not use. If you ask for 3 days and only use 2 hours, the scheduler will only charge you for the 2 hours you actually used. This is different than Memory, Cores, and GPU’s where you will be charged for allocating those resources whether you use them or not as the scheduler had to block them off for you to use and could not give them to anyone else.

To accurately estimate time is important not for the sake of fairshare, but rather for the sake of scheduling. The scheduler only knows what you tell it; if you tell it that a job takes 3 days, it will assume it takes 3 days even if it really takes 2 hours.  Thus when the scheduler goes to consider the job for scheduling, it will look for an allocation block the size of the length of job you request. A more accurate time estimate means that the scheduler can fit your job into tighter spots in the giant game of Tetris it is playing.  Taking our previous example, it may be that there are no spots right now for a 3 day job, but a 2 hour job may run immediately because there is a gap that the scheduler can fit it into while waiting to schedule a large high priority job. This behavior is called backfill, and is one of the two loops the scheduler engages in when scheduling. Leveraging the backfill loop is important as it is the main method through which low priority jobs, even those with zero fairshare, get scheduled. You can leap frog ahead of higher priority jobs because your job happens to fill a gap.

Assuming you are running on the same hardware (for considerations regarding different types of hardware see the next section) then you can reliably predict the runtime for certain classes of jobs. Simply run a test job and then look at how long it took using sacct or seff. If you run a bunch of jobs you can use seff-account to see the distribution of run times. Once you have the runtime, round it up the nearest hour and that should cover most situations. Run times can vary for various reasons but typically not more than 10%, so if your job takes 10 hours, you should ask for around 12 hours.

Finally a word about minimum run times. As described above your goal is to minimize Time to Science (TtS). You may naively think that asking for very short amounts of time would decrease TtS even more, but this is incorrect. The scheduler takes time to actually schedule jobs no matter how small your job is. To put it bluntly, you do not want the scheduler doing more work to schedule your job than your actual job is doing. For super short jobs the scheduler can get into a thrashing state where it schedules a job, the job exits immediately, and then the scheduler has to fill that slot again, similar to trying to fill a tub with the drain open. To prevent this, we require jobs to run for at least 10 minutes. Ideally jobs would last for an hour or longer. Thus when you are doing work on the cluster try to make sure you batch in increments of longer than 10 minutes and ideally longer than an hour. This will help the scheduler, and make sure your TtS is as short as possible.

Hardware

For similar job types, the run time is usually the same, with an important caveat being that you need to run on the same hardware. Different types of cores and GPU’s have different capabilities and speeds.  It is important to know how your job behaves as you switch between them. We have a table of relative speeds on the Fairshare page. It should be noted that that table only applies if your code is fully utilizing the hardware in question (more on that in the optimization section), you should always test your code to see how it actually performs as certain CPU and GPU types may work better for your code than others despite what the officially advertised benchmarks say.  While we generally validate vendor advertised performance numbers, they only apply to heavily optimized codes designed for those specific chips, as such your code speed may vary substantially.

If your are submitting to gpu_requeue or serial_requeue you will notice that your run times will vary quite a bit. This is because gpu_requeue and serial_requeue are mosaic partitions with a wide variety of hardware and thus a wide variety of performance. In cases like that you can either be very specific about which type of hardware you want using the --constraint option, or you can simply increase your time estimate to be the maximum you expect it to take on the slowest hardware available. A good rule of thumb is a factor of three variance in speed. So if your job takes 3 hours on most hardware, give it 9 hours on serial_requeue as you may end up on a substantially slower host.

Storage

The final thing that can impact Job Efficiency is the storage you use. Nothing can drag down a fast code faster than slow IO speed (Input/Output). To select the right storage, please read our Data Management Best Practices page. In general, for jobs you will want to use either Global Scratch or Local Scratch. If your job is IO heavy (i.e. it is constantly talking to the storage), Local Scratch is strongly preferred. Please also see the Data Management page for how to best lay out your file structure, as file structure layout can impact job performance as well.

Job Optimization

Now that we have dealt with Job Efficiency the next thing to look at is Job Optimization, after all the only way to improve your Time to Science (TtS) and increase your code capability after properly structuring your job is to improve the code itself. Job optimization can be very beneficial but can also take significant time. There are in general three methods to optimize your code, each taking different amounts of time.

  1. Compiler Version, Library Version, Containers and Optimization: Compilers, Libraries, and Containers are code as well and thus subject to improvement. Simply changing or updating your compiler, libraries, and container can sometimes lead to dramatic increases in performance. In addition compilers have different optimization flags you can use that will automatically optimize your code. This option is the fastest way to get optimized code as all the work is already done for you, you just need to select the right compiler, libraries, container, and options.
  2. Partial Code Rewrite: Looking through the code as it exists now and reworking portions can create speed ups. The process of reworking your code consists of finding places where the code is spending significant time and then refactoring the code; either by updating the logic, replacing the numerical method, or by substituting an optimized library. This process can take a few weeks to months but can give substantial increases in speed. However, this method cannot fix the basic structural problems with the code.
  3. Full Code Rewrite: This can could take a significant amount of time depending on the complexity of the code (for large codes this can take up to six months to a year to complete) but is the best way to optimize your code. It will allow you to fundamentally understand how your code operates and fix any major structural problems resulting in transformative increases in performance. If you go this route you should try the other two options first; as when you do the first two steps you will have a good understanding of the quirks of your code. You should also do a cost benefit analysis to figure out if the time spent is worth the potential gains. Make sure the project has a firm end goal in mind. If your code needs continual improvement, it may be time to hire a Research Software Engineer to do that very important and necessary code development work.

Regardless of the method, you will need to grow more acquainted with your code, its numerical methods, and how it interacts with the underlying hardware. While there are some generalized rules and things to look for when optimizing code, in the end it will depend on you turning your code from a blackbox into something you understand at a fundamental level. This is also where learning how to use various debuggers and code inspectors can be very beneficial as they can help identify which portions of the code to focus on.

Important final note: Always reconfirm the results of your code whenever you change your optimization. This goes for any changes to your code, but especially when you recompile with different compilers, optimization levels, libraries, etc. You should have a standard battery of tests you know the results of that you can run to reconfirm that the results did not change, or if they did they are acceptable changes. Optimization can change the numerical methods and order of operations leading to numerical drift. Sometimes that drift is fine as its at the edge of the mantissa. Sometimes though those changes at the edge of the mantissa can build and lead to substantial changes in results. Even if your code is confirmed as working, always have a healthy suspicion of your code results and engage in independent verification as code bugs and faults can produce results that look legitimate but were arrived at by faulty logic or code.

Below we are going to give some general rules regarding optimization as well as suggestions as to different ways to go about it.

Compiler Optimization

Compiler optimization means letting the compiler look through the code for things it can improve automatically. Maybe it will change the memory layout to make it more optimal, maybe it will notice that you are doing a certain numerical technique and then substitute in a better one, maybe it will change the order of operations to improve numerical speed. Regardless of what it tries, compiler optimization relies on the authors of the compiler and their deep knowledge of numerical methods and the underlying hardware to get improved speed. Compiler optimization really applies to those compiling from C, C++, or Fortran but higher level codes like Python and R, which lean on libraries that are written in C, C++, and Fortran can also benefit. Thus if you want to really optimize your Python or R code, getting those underlying libraries built in an optimized way can lead to speed ups.

There is a generally agreed upon standard for most compilers with regard to level of optimization. After all not all optimizations are numerically safe, or will produce gains in speed. Some may in fact slow things down. As such when using compiler optimization test your code speed and accuracy at different levels of optimization, and with different compilers. Each compiler has a different implementation of the standard, some are better for certain things than others. It is also worth reading the documentation for the compiler optimization levels to see what is included. A good exercise is to take each individual flag that makes up an optimization level and test to see if it speeds up your code and if it introduces numerical issues.

The standard code optimization (-O) levels are:

-O0: Not optimized at all.  The compiler just runs your code as is with zero work done. If you turn on debugging typically this is what your code will default to.

-O1: Numerically safe optimization. This level of optimization is guaranteed to be numerical stable and safe. No corners are cut, no compromises in numerical precision are made, nothing is reordered.

-O2: Mostly numerically safe optimization. This level of optimization is the default level for most compilers. At this level, in most cases, the optimizations made are numerically okay. Generally there is no sacrifice of numerical precision, though loops may be unrolled and reordered to make things more efficient.

-O3: Heavily optimized. This level of optimization takes the approach of trying to include every possible optimization whether numerically safe or not.

As you can see the various levels of optimization make certain assumptions about how numerically safe it is trying to be. Given this, you should always test your code to make sure that it runs as it should after compilation and does not produce errant results.

One other common optimization is to leverage special features found with different chipsets. Each generation of CPU has different features built into it that you can leverage. Some example features are SSE (Streaming SIMD Extensions), FMA (Fused Multiply Add), and AVX (Advance Vector Extensions). If your code is architected to use them, you can gain substantial speed by enabling these optimizations. There are three ways to do this:

  1. Turn on each feature individually: This allows you to pick and choose which you want and makes your compiled code portable across different chipsets.
  2. Specify chipset you are building for: Compilers include flags that allow you to target a specific type of chip and include all the relevant optimizations for it.  This approach works well if you have a uniform set of hardware you are running on, or if you are not sure what features your code will leverage. Note your code will not work on other chipsets.
  3. Have the compiler autodetect what chipset you are using: Compilers usually have a flag (i.e. -xHost) that will detect the chipset you are currently on and then build specifically for that. To do this properly you will need to make sure you are on the node that is of the same type that you will run your code on. In addition your code will not be portable.

It is worth noting again that not all optimizations are safe or beneficial. Heavily optimized code can lead to substantial bloat in memory usage with little material gain. Numerical issues may occur if the compiler makes bad assumptions about your code. You should only use up to the level of optimization that is stable and beneficial and no higher. If an optimization has no impact on your code performance, it is best to leave it off.

Important final note: Always remove debugging flags and options when running in production. Debugging flags will disable optimization even you tell the compiler to optimize, as the debugging flag overrides the optimization flag. Before going to production, remove debugging flags, recompile, and test your code for accuracy and performance.

Languages, Numerical Method, and Libraries

Selecting the correct language, numerical method, and libraries are important parts of code optimizations. You always want to select the right tool for the job. For some situations Python is good enough, for others you really need Fortran. An improved numerical method may give enormous speed ups but at the cost of increased memory, or vice versa. Swapping out code you wrote for a library maintained by a domain expert may be faster, or having a more integrated code may end up being quicker.

With languages, you are usually locked into a specific one unless you do a complete code rewrite. As such, you should learn the quirks of the language you are using and make sure your code conforms to the best standards and practices for that language. If you are looking to rewrite your code, then consider changing which language you are using. It may be that a different language may lead to more speed ups in the future. As a general rule, languages that are closer to the hardware (things like C or Fortran) can be made to go faster, but they also are trickier to use.

For numerical methods you will want to stay abreast of the current literature in your field and the field that generated the relevant numerical technique (e.g. matrix multiplication, sorting). Even small changes to a numerical technique can add up to large gains in speed. They can also dramatically impact memory utilization. Simplicity is also important, as in many cases a simpler method is faster just by dint of having to do less math and logic. This is not true in all cases though, so be sure to test and verify.

Libraries are another important tool in the toolbox. By using a library you leverage someone else’s time and experience to write optimized code. This saves you from having to debug the code and optimize it, you simply plug in the library and go. Libraries can still have flaws though, so you want to make sure you keep up to date and test. If you do find flaws you should contribute back to the community (i.e., report to the library’s developers) so that everyone benefits from the improvement you suggest. One other caution with libraries is that sometimes it is better to inline the code rather than go to a library, as the gains from using the library may not outweigh the cost of accessing the library. Libraries will not automatically make your code faster, but rather are a tool you can use to potentially get more speed and efficiency.

Containers

Probably the ultimate form of library is a well maintained container. Well optimized containers have the advantage of providing a highly customized stack of optimized libraries that will allow the code to get to near its peak performance. Containers are powerful tools for especially complex software stacks, as the container can provide optimization for each individual element of the stack and ensure that all the various versions interoperate properly. Containers are not free though and do have some performance overhead, due to having an abstraction layer between the software in the container and the system hardware. For peak performance you will want to build your software stack outside of the container in the native environment. In most cases though, the performance penalties are minimal and substantially outweighed by the performance gains of using a well maintained and well optimized software stack that you do not have to build yourself.  For best performance look for containers that are provided by the hardware vendor. For instance Nvidia provides a well curated list of containers built for its GPU’s. The vendor typically has the best knowledge of the internals of their hardware and thus will know how to get the most out of it. Containers provided by primary code authors are also good sources as the code author will have the best knowledge of the internals of their code base and how to best run it.

Containers can also be handy for users dealing with operations or code bases involving many files. By including these files in the container itself it effectively hides them from the underlying storage, the storage treats it as one large single file rather than lots of smaller files. Filesystems in general behave best when interacting with single large files as traversing between files is expense, especially when there are many files to deal with. Thus if your workflow either has a software that has many files, like a Python/Anaconda/Mamba environment, or your code is engaging in IO with many files, consider putting them inside a container.

Other General Rules

Here are some rules that did not fit into other sections but are things you can look for when optimizing your code.

  1. Remove Debugging Flags and Options: Cannot emphasize this enough. Production code should not be run in debug mode as it will slow things down substantially.
  2. Use the latest compilers and libraries: Implied above, but one of the first things to try is updating your compiler and library versions to see if the various improvements to those codes improve your code performance.
  3. Leave informative comments in your code: Comments are free and having good comments can help you understand your code and improve it. A very good practice is to cite the paper and specific equation or analysis you are using so you can find the original context.
  4. Make sure your loops are appropriately ordered for your arrays: Different languages have different array ordering as to which index is fastest to traverse in memory (for instance Fortran orders its arrays with the first index being fastest, in C it is the opposite). Be aware of this and arrange your arrays and loops appropriately.
  5. Avoid if statements buried in loops: if statements are not free and cost time to execute, thus it is best if you can execute it once rather than all the time.
  6. Use temporary variables to hold constants: Multiplications are faster than divisions or exponents. Thus, instead of pi/2, use 0.5*pi; instead of 5^2, use 5*5. In addition, if you have a complicated coefficient you are multiplying or dividing by repeatedly, consider calculating that coefficient once and storing that as a temporary variable. If your coefficients are related to each other by some constant value, also consider making that a constant. For instance if you are always using 4*pi/3, store that as a variable, and then use that in place of it where ever it appears.
  7. Use the right type, size, and level of precision for variables: Integer math is faster than floating point math. Single precision math is faster than double precision. 4 byte integers use up less space than 8 byte integers. Select the size and type necessary for the numerical precision and accuracy you require and no larger.
  8. If you have a heavy arithmetic section consider using small temporary arrays for the data you are manipulating: Long strings of math in a single line are hard for the compiler to optimize, and also trend towards mistakes. Consider breaking it up in to smaller chunks that eventually sum up to the total value you need. Be careful of round off error and order of operations issues with this.
  9. Lower your cache miss rate: CPU’s and GPU’s are built with onboard memory (typically called cache), you should try to keep your processing in this onboard memory and only go out to main memory when necessary. Cache is faster to access and generally small, so doing things in smaller chunks that reuse data will be more likely to drop in the cache layer.
  10. Be aware of first touch rule for memory allocation: Memory is typically allocated on an at-need basis, and the further the code needs to search in memory, the worse the performance. Allocate frequently used arrays and variables first.
  11. Reduce memory footprint: As a general rule you want to keep your memory usage to the bare minimum you need. The more temporary arrays and variables you keep the more memory bloat your code will have.
  12. Avoid over abstraction: Pointers are useful, but pointers to pointers to pointers are not. It makes it hard for the compiler to optimize and for you (and anyone that uses your code) to follow the code.
  13. Be specific and well defined: A well structured code is easier to optimize. Declare all your variables up front, allocate your arrays as soon as you can, do not leave the variable types ambiguous.
  14. Work in Memory and Not on Disk: Accessing storage, no matter how fast, takes far more time than accessing memory. Try to only read and write to storage when necessary. If possible spin off a separate process to handle reading and writing to disk so that your main process can continue work.
  15. Avoid large numbers of files: It is better from a IO performance standpoint to have lower numbers of large files on disk rather than many small files. Bundle your data together into larger files that you read from or write to all at once.
  16. Include Restarts/Checkpoints: Include the ability for your code to pick up from where it left off by writing restart/checkpoint data to disk. The restart data should be only what is sufficient to pick up from where your calculation left off. This will allow you to recover from crashes and leverage the requeue partitions. Restarts will also allow you to use partitions with shorter time limits to bridge yourself to a longer run (e.g. using ten 3 day runs to accumulate a 30 day run).

Parallelization

There are limits to how fast you can make any single code run in serial. Once this limit is hit, parallelization needs to be considered. Sometimes this parallelization is trivial, such as launching thousands of jobs at once each with different parameters to do a parameter sweep (this is known as an embarrasingly parallel workflow). However if your code needs to be tightly coupled then other methods of parallelism will need to be considered. The three main methods of parallelization are:

  1. SIMD: Singe Instruction Multiple Data
  2. Thread: Shared Memory
  3. Rank: Distributed Memory

Regardless of what method you use, the general rule is that you want to make sure as much of your code is parallelized as possible and that communications and computation are overlapped with each other.  It is also possible to use SIMD in conjunction with Threads in conjunction with Ranks, this is known as the hybrid approach. These can lead to very powerful codes that can scale up to the largest supercomputers in the world.

Some libraries and codes (for example MATLAB, PETSc, OpenFOAM, Python Multiprocessing, HDF5) will already have parallelization included. Check with the documentation and/or inquire of the developer as to if it is able to parallelize and what method it uses. Once you know that you will be able to get the most out of the built-in parallelization.

SIMD (Single Instruction Multiple Data)

Most processors have multiple channels that can execute a specific command simultaneously on a stream of data. This is built in to the chipset itself and compilers will automatically optimize code to leverage this behavior. You can intentionally design your code to better leverage it depending on which specific compiler and instruction set you are using (such as AVX).

Threads

Threading achieves parallelism by having a shared memory space but then running multiple computational streams (threads) across it to accomplish specific instructions. Thread based parallelism is typically fairly easy to accomplish as it requires no complex interprocess communications, all the changes to memory are readily visible to each thread. Typically all the coder needs to do is to indicate which loops and sections can be threaded, and the compiler takes care of the rest. Examples are OpenMP, OpenACC, Pthreads, and Cuda.

Rank

Rank based parallelism is the most powerful but also most technically demanding type of parallelism. Each process has its own memory space and the user has to manage inter process communications themselves. Key here is making sure that communications bottlenecks are minimal, and if they exist to overlap them with computation so they do not slow down code execution. The industry standard for doing this is called MPI (Message Passing Interface).

Profiling

Knowing where to focus your time for optimizing your code is important. You will gain the most speed by optimizing the part of your code that is currently occupying the most execution time, or using the most memory.  To figure this out you need to profile your code.

The easiest and most immediate way is to use print statements combined with printing how much time each section takes. Most languages have methods of printing out time stamps or calculating elapsed time, simply use those methods with judicious use of print statements and you can quickly find out where your code is spending most of its time. Generally you should instrument your code to give you overall timing estimates, especially if your code works on some sort of large loop (i.e. such as taking time steps for doing fluid dynamics). Print statements are the quickest and easiest way to get information on your code.

Besides print statements, various profilers exist that you can use to inspect your code. Profilers will give you far more information about your code, as well as suggestions as to where your code could be improved. They can give you super precise timing for your code as well as inform you what cache/memory level your code is touching. All of this rich information can be valuable for dialing in on particularly small sections of code or subtle issues that may be causing dramatic slow downs.

Below is a list of profilers you can use:

  • VTune: Intel’s profiler
  • NSight: Nvidia’s profiler
  • DCGM: Data Center GPU Manager from Nvidia
  • top: Not really a profiler but a useful system utility for monitoring live job performance.
  • nvtop: Similar to top but for gpus.