Job Efficiency and Optimization Best Practices

Overview

The art of High Performance Computing is really the art of getting the most out of the computational resources you have access to. This applies to working on a laptop, to working in the cloud, or working on a supercomputer. While this diversity of different systems and environments may seem intimidating, in reality there are some good general rules and best practices that you can use to get the most out of your code and the computer you are on.

Before we go further, though, we should define our terms. The term job has a broad and a narrow sense. In the broad sense, a job is an individual run of an application, code, or script, and may be used interchangeably with those terms. This holds whether you run it from the command line, from a cron job, or through a scheduler. In the narrow sense, a job is an individual allocation for a user by the scheduler. It is usually obvious from context which is meant.

By Job Efficiency, we mean that the parameters of the job in terms of cpu, gpus, memory, network, time, etc. (refer to Glossary for definitions) are accurately defined and match what the job actually uses. As an example, a job that asks for 100 cores but only uses 1 is not efficient. A job that asks for 100GB and uses 99GB is efficient.  Efficiency is a measure of how well the user has scoped their job so that it can run in the space defined.

Finally, Job Optimization means making the job run at the maximum speed possible with the least amount of resources used. For example, a poorly optimized code may only use 50% of the gpu it was allocated, whereas a well optimized code would use 100% and see speed ups commensurate with that improved usage. Similarly, a poorly optimized code may use 1TB of memory, but a well optimized code may only use 100GB. Optimization is a measure of how well structured a code is numerically, both in terms of algorithm and implementation, so that it can get to the solution in the fastest, most accurate, and most economical way.

Efficiency and Optimization are thus two sides of the same coin. Efficiency is about accurately defining the resources that you will use and optimization is about reducing that usage. Both have the goal of getting the most out of the resources the job is using.

Job Efficiency

The first step in improving job efficiency is understanding your job as it exists today. Once you have a good handle on the characteristics of your job, you should be able to accurately allocate resources to it, thereby improving efficiency. As a general rule, you should always understand the jobs you run regardless of size. This knowledge is beneficial both for right sizing your requested resources and for noticing any pitfalls that may occur when scaling the job up.

There are two ways of learning about your job. The first is to have a fundamental understanding of the job you are running. Based on your knowledge of the algorithm, code, job script, and cluster architecture, you know what you should request for core count, gpu count, memory, time, and storage. Knowing your code at this level will allow you to make the most accurate estimates for what you will use.

While a full understanding of your job is ideal, often it is not possible. You may not control the code base, you may just be getting started, or you may not have time to obtain a deep understanding of the job. In that case, the second method is to test your job empirically and find out what the best job parameters are. Simply take an example that you know will be akin to what you will run in production and run it as a test job. Once the test job is done, check to see how it performed. Then repeat, changing the job parameters, until you have a good understanding of how your job performs in different situations.

That’s the rough sketch, but the details are a bit different depending on what you want to understand. Below are some methods for finding out how much memory, cores, gpus, time, and storage your job will need. These may not cover every job but should work for most situations.

Memory

Memory on the cluster is doled out in two different ways, either by node (--mem) or by core (--mem-per-cpu). If your job exceeds its memory request, you will see an error either containing Out of Memory or oom. This indicates that the scheduler terminated your job for exceeding your memory request. You will need to increase your memory allocation to give your job more space.

Here is a test plan for figuring out how much memory you should request:

  1. Come up with an initial guess as to how much memory your job will require. A good first guess is usually 2-3 times the size of any data files you are reading in, or 2-3 times the size of the data you will be generating. If you do not know either of those, then a safe initial guess is 4GB. Most of the cluster has 4GB per core, so it's a good initial guess that will allow you to get through the scheduler quickly.
  2. Run a test job on the test partition with your guess.
  3. Check the result of your run using seff or sacct.
  4. If your job ran out of memory, then double the amount and return to step 2. If it ran properly (i.e., no out of memory error), then look at how much your job actually used and update your request to match, with an additional 10% buffer. The buffer is needed because the scheduler samples memory usage every 30 seconds and may have missed short term memory spikes.
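The update rule in step 4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool provided on the cluster: the function name next_memory_request_mb and its arguments are hypothetical, and the peak usage value is assumed to be one you read off seff or sacct yourself.

```python
import math

def next_memory_request_mb(requested_mb, peak_used_mb, ran_out_of_memory,
                           buffer_percent=10):
    """Return the memory (in MB) to request for the next run (hypothetical helper)."""
    if ran_out_of_memory:
        # Failure case from step 4: double the request and test again.
        return requested_mb * 2
    # Success case from step 4: match the observed peak plus a 10% buffer
    # to cover spikes the scheduler's 30 second sampling may have missed.
    return math.ceil(peak_used_mb * (100 + buffer_percent) / 100)

# A 4000 MB test that hit an out of memory error becomes an 8000 MB test.
print(next_memory_request_mb(4000, 0, True))
# A 16000 MB test that peaked at 10000 MB becomes an 11000 MB request.
print(next_memory_request_mb(16000, 10000, False))
```

Working in whole megabytes keeps the buffer arithmetic free of floating point surprises when the result is rounded up.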

Every time you change a parameter in your code, you should check to see how the memory changes. Some parameters will not change the memory usage at all. Others will change it dramatically. If you do not know whether a parameter will change the memory usage, run a test to see how it behaves.

If you are working to scale up a job, it's good to understand how your memory usage will scale as the job increases. For example, say you are running a three dimensional code and you increase the resolution of the box you are simulating by a factor of 2 in each dimension. Your memory usage will then grow by a factor of 8 (2 x 2 x 2), because each of the three dimensions grew by a factor of 2. Likewise, if you are running a simulation that ingests data, its memory usage will likely scale linearly with the amount of data you ingest. Testing by increments is the best way to validate how your memory usage will grow in your situation.
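As a rough sketch of that scaling argument, the following Python helper estimates memory after a resolution change. The function name and the assumption that memory is dominated by a regular grid are illustrative; real codes often carry extra overhead, so treat the result as a starting guess for a test job rather than a prediction.

```python
def scaled_memory_gb(current_gb, resolution_factor, dimensions=3):
    """Estimate memory after multiplying the resolution in every dimension.

    Assumes memory is proportional to the number of grid cells,
    which is an idealization.
    """
    return current_gb * resolution_factor ** dimensions

# Doubling the resolution of a 3D box that currently needs 10 GB.
print(scaled_memory_gb(10, 2))
# The same doubling for a code whose memory grows linearly with its input.
print(scaled_memory_gb(10, 2, dimensions=1))
```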

One important warning is to make sure to request the correct amount of memory for each type of job. When your job runs, the scheduler blocks off a segment of memory for you to use, regardless of whether you actually use it. If your job asks for 100GB but only uses 1GB, the scheduler will give you 100GB and your fairshare will also be charged for that 100GB. In addition, if you had asked for 1GB, your job may have been better able to fit into the gaps in the scheduler, as 1GB of memory is easier to find than 100GB. Efficient use of the cluster means selecting the right amount of memory for whatever job you are running at that time. A quick way to spot whether any of your jobs are incorrectly sized is to use the seff-account command, which will plot a histogram of your job performance over a specified period.

Cores

Slurm does not automatically parallelize jobs. Even if you ask for thousands of cores, if your job is not set up to run in parallel, your job will just run on a single core and the other cores will remain idle. Thus when in doubt about your code, err on the side of asking for a single core and then check the code’s documentation or contact the primary author to find out whether it is parallel and what method it uses.

Broadly, parallel applications fall into two categories: thread based and rank based. Thread based parallelism relies on a shared memory space and thus is constrained to a single node. This includes things like OpenMP, pthreads, and Python multiprocessing. Rank based parallelism relies on individual processes that have their own dedicated memory space and communicate with each other. The main example of this is MPI (Message Passing Interface). It is important to understand which method your job uses, as that will make a difference in how you ask for resources in Slurm and also in how many cores you can reasonably ask the scheduler for.

Once you figure out whether your code is thread based or rank based, you can do a scaling test to figure out how your code behaves as you add more cores. There are two types of scaling tests you can do, and each tests a slightly different aspect of your code. The first type is called strong scaling. In this test, you keep the size of the problem the same while increasing the number of cores you use. In an ideal world your job should go twice as fast every time you double the number of cores you use. Most codes, though, do not have ideal scaling. Instead, various inefficiencies in the algorithm or the size of the job itself mean that there is a point of diminishing returns where adding more cores does not gain any speed. Typically when you plot a chart of strong scaling you will see:

Figure: Strong Scaling Plot. Log-log plot with the ideal scaling line in red and the experimental data in black. The black line bends away from the red at the point where scaling becomes inefficient.

In this example, the user would not want to run their code with more than 256 cores because after that point adding more cores does not increase performance substantially.
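To turn a strong scaling test into numbers, you can compute the speedup and parallel efficiency at each core count. The sketch below assumes you have already collected wall-clock times yourself; the function name and the timings are made up for illustration and are not the data behind the plot above.

```python
def strong_scaling(timings):
    """Return (cores, speedup, efficiency) rows for a fixed problem size.

    timings maps core count -> wall-clock seconds. Speedup is measured
    against the smallest core count tested, and efficiency is the
    fraction of ideal (linear) speedup that was achieved.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        ideal = cores / base_cores
        rows.append((cores, speedup, speedup / ideal))
    return rows

# Hypothetical timings: scaling is ideal up to 4 cores, then falls off.
timings = {1: 1000.0, 2: 500.0, 4: 250.0, 8: 200.0}
for cores, speedup, efficiency in strong_scaling(timings):
    print(f"{cores:4d} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.0%}")
```

The core count where efficiency starts dropping sharply is the point of diminishing returns described above.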

The second type is called weak scaling. In this test you increase the size of the job proportional to the number of cores asked for. So if you double the cores, you would double the job size.  In an ideal world, your job should take the same amount of time to run if the job size grows linearly with the core count.  Most codes though do not have this ideal scaling. Instead various communications inefficiencies or nonlinear growth in processing time can impact the performance of the job and thus adding more cores would be inefficient beyond a certain point. A typical plot for weak scaling looks like:

 

Figure: Weak Scaling Plot. Log-linear plot with the ideal scaling line in red and the experimental data in black. The black line bends away from the red at the point where scaling becomes inefficient.

In this example, the user would not want to run this job with more than about 1000 cores as, after that point, the run time grows substantially from the ideal.
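Weak scaling results can be summarized in a similar way. Because the ideal run time stays constant as the problem grows with the core count, efficiency is simply the baseline time divided by the time at each core count. The function name and timings below are hypothetical.

```python
def weak_scaling_efficiency(timings):
    """Return {cores: efficiency} for a problem that grows with core count.

    timings maps core count -> wall-clock seconds, where the problem
    size was increased in proportion to the cores. An efficiency of 1.0
    means the run time did not grow at all.
    """
    base_time = timings[min(timings)]
    return {cores: base_time / timings[cores] for cores in sorted(timings)}

# Hypothetical timings: run time holds steady, then starts to climb.
for cores, eff in weak_scaling_efficiency({1: 100.0, 2: 100.0, 4: 125.0}).items():
    print(f"{cores:4d} cores  efficiency {eff:4.0%}")
```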

With these two tests you can figure out the maximum number of cores you should ask for. That said, even if your code scales perfectly you will probably not want to ask for the maximum number of cores you can. The reason for this is that the more cores you ask for, the longer your job will pend in the queue waiting for resources to come free on the cluster. Time to Science (TtS) is the sum of the amount of time your job pends for plus the amount of time your job runs for. You want to minimize both. Counterintuitively, it may be the case that asking for fewer cores will shorten your pending time substantially, enough to make up for the loss in run speed.

As an illustration, say your code scales perfectly and your job of 256 cores will take 1 day to run. However, it turns out that you will be spending 2 days pending in the queue waiting for your job to launch, thus your total TtS is 3 days. After more investigation you find out that if you ask for 128 cores, your job will take 2 days to run but the scheduler will be able to launch it in 4 hours, leading to a TtS of about 2.2 days. You can see that the 128 core job was “faster” than the 256 core job, simply due to the fact that the 128 core job fit better in the scheduler at that moment.
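The comparison in this illustration is just addition, but writing it down makes it easy to repeat for other core counts. A minimal sketch, with a hypothetical function name and the pend and run times taken from the example above:

```python
def time_to_science_days(pend_hours, run_hours):
    """Time to Science: time spent pending plus time spent running, in days."""
    return (pend_hours + run_hours) / 24

# 256 cores: pends 2 days, runs 1 day.
large = time_to_science_days(pend_hours=48, run_hours=24)
# 128 cores: pends 4 hours, runs 2 days.
small = time_to_science_days(pend_hours=4, run_hours=48)

print(f"256 cores: {large:.2f} days, 128 cores: {small:.2f} days")
```

The pend times are the uncertain inputs here, since they depend on the state of the queue at the moment you submit.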

It should be noted that the scheduler state is fluid, and thus one should inspect the queue before submitting. You can check when your job would run by adding the --test-only flag to your sbatch line. This causes the scheduler to print back when it thinks the job will execute, without actually submitting the job. This is a good way of right sizing your job.

Besides these more robust scaling tests, you can get a quick view of your job's core usage efficiency by using the seff or seff-account command. Those commands take the ratio of the CPU time the job actually used to its run time multiplied by the number of cores. If your job scales perfectly, your CPU efficiency will be 100%. If it is less than 100%, then multiplying your core count by that efficiency gives roughly the number of cores you should ask for. So say your job uses 8 cores, but you have an efficiency of 50% in seff; then you should reduce your ask to 4 cores instead. This is also a quick way to check whether your job is scaling at all: if you see your job using only one core, then either your job is not parallel or something is wrong, and you need to investigate why your job is not scaling.
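The efficiency figure and the resulting core suggestion can be reproduced by hand. The sketch below assumes the definition of CPU efficiency as CPU time divided by run time times cores; the function names are hypothetical and this is not how seff itself is implemented.

```python
def cpu_efficiency(cpu_time_seconds, elapsed_seconds, cores):
    """Fraction of the allocated core time that was spent doing work."""
    return cpu_time_seconds / (elapsed_seconds * cores)

def suggested_cores(cores, efficiency):
    """Scale the core request by the observed efficiency, never below 1."""
    return max(1, round(cores * efficiency))

# An 8 core job that ran for 1 hour but accumulated only 4 core-hours of CPU time.
eff = cpu_efficiency(cpu_time_seconds=4 * 3600, elapsed_seconds=3600, cores=8)
print(f"efficiency {eff:.0%}, suggested cores {suggested_cores(8, eff)}")
```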

Topology

For certain codes, layout on the node (i.e. which cores on which CPU) and cluster (i.e. where the nodes are located relative to each other) matters. In these cases the topology of the run is critical to getting the peak speed out of the job. Without deep knowledge of the code base, it's hard to know if your code is one of these codes, and in most cases your code is not.

In cases where the topology of the run matters, Slurm provides a number of options to require the scheduler to give you a certain layout for the job. Both the sbatch and srun commands have options to this effect. Note that the more constraints you add to a job, the longer it will take the scheduler to find resources for your job to use. One should set the minimum necessary restrictions on a job to give the scheduler maximum flexibility. As before, you may see a significant speed up if given the right topology, but if it comes at the cost of having to wait significantly longer to run, your TtS may not improve or may even get worse.

GPUs

Many of the same rules that applied to cores also apply to GPUs. For most codes, your job will use a single GPU. If your code uses multiple GPUs, then you can follow the same process as above for cores to see how your code scales. Note that currently GPU efficiency is not recorded in Slurm. As such, you will want to use other tools like DCGM and nvtop to get statistics on how your job is doing.

Time

It should be stated upfront that Slurm does not charge you fairshare for time you do not use. If you ask for 3 days and only use 2 hours, the scheduler will only charge you for the 2 hours you actually used. This is different from Memory, Cores, and GPUs, where you will be charged for allocating those resources whether you use them or not, as the scheduler had to block them off for you to use and could not give them to anyone else.

Accurately estimating time is important not for the sake of fairshare, but rather for the sake of scheduling. The scheduler only knows what you tell it; if you tell it that a job takes 3 days, it will assume it takes 3 days even if it really takes 2 hours. Thus when the scheduler goes to consider the job for scheduling, it will look for an allocation block as long as the time you requested. A more accurate time estimate means that the scheduler can fit your job into tighter spots in the giant game of Tetris it is playing. Taking our previous example, it may be that there are no spots right now for a 3 day job, but a 2 hour job may run immediately because there is a gap that the scheduler can fit you into while it waits to schedule a large high priority job. This behavior is called backfill, and is one of two loops the scheduler engages in when scheduling. Leveraging the backfill loop is important as it is the main method by which low priority jobs, even those with zero fairshare, get scheduled. You can leapfrog ahead of higher priority jobs because your job happens to fill a gap.

For similar job types, the run time is usually the same, with an important caveat being that you need to run on the same hardware. Different types of cores and GPUs have different capabilities and speeds. It is important to know how your job behaves as you switch between them. We have a table of relative speeds on the Fairshare page. It should be noted that the table only applies if your code is fully utilizing the hardware in question (more on that in the optimization section). You should always test your code to see how it actually performs, as certain CPU and GPU types may work better for your code than others despite what the officially advertised benchmarks say. While we generally validate vendor advertised performance numbers, they only apply to heavily optimized codes designed for those specific chips. As such, your code speed may vary substantially.

If you are running on the same hardware, then you can reliably predict the runtime for certain classes of jobs. Simply run a test job and then look at how long it took using sacct or seff. If you run a bunch of jobs, you can use seff-account to see the distribution of run times. Once you have the runtime, round it up to the nearest hour and that should cover most situations. Run times can vary for various reasons but typically not by more than 10%, so if your job takes 10 hours, you should ask for around 12 hours.

If you are submitting to gpu_requeue or serial_requeue, you will notice that your run times will vary quite a bit. This is because gpu_requeue and serial_requeue are mosaic partitions with a wide variety of hardware and thus a wide variety of performance. In cases like that you can either be very specific about which type of hardware you want using the --constraint option, or you can simply increase your time estimate to be the maximum you expect it to take on the slowest hardware available. A good rule of thumb is a factor of three variance in speed. So if your job takes 3 hours on most hardware, give it 9 hours on serial_requeue as you may end up on a substantially slower host.
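The two rules of thumb above (pad a measured run time by about 10% and round up to the hour, or multiply by three for the requeue partitions) can be sketched as a small helper. The function names and parameters are hypothetical; the output format days-hours:minutes:seconds is one of the forms Slurm accepts for a time limit.

```python
import math

def time_limit_hours(measured_minutes, buffer_percent=10, slowdown_factor=1):
    """Pad a measured run time and round it up to a whole number of hours."""
    padded_minutes = measured_minutes * slowdown_factor * (100 + buffer_percent) / 100
    return math.ceil(padded_minutes / 60)

def as_slurm_time(hours):
    """Format whole hours as a days-hours:minutes:seconds time limit string."""
    return f"{hours // 24}-{hours % 24:02d}:00:00"

# A 10 hour job on uniform hardware with a 10% buffer.
print(as_slurm_time(time_limit_hours(600)))
# A 3 hour job on a mixed partition, allowing for hosts up to 3x slower.
print(as_slurm_time(time_limit_hours(180, buffer_percent=0, slowdown_factor=3)))
```

With a strict 10% buffer the 10 hour job comes out at 11 hours; asking for 12, as suggested above, simply adds a little more margin.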

Finally a word about minimum run times. As described above your goal is to minimize Time to Science (TtS). You may naively think that asking for very short amounts of time would decrease TtS even more, but this is incorrect. The scheduler takes time to actually schedule jobs no matter how small your job is. To put it bluntly, you do not want the scheduler doing more work to schedule your job than your actual job is doing. For super short jobs the scheduler can get into a thrashing state where it schedules a job, the job exits immediately, and then the scheduler has to fill that slot again, similar to trying to fill a tub with the drain open. To prevent this, we require jobs to run for at least 10 minutes. Ideally jobs would last for an hour or longer. Thus when you are doing work on the cluster try to make sure you batch in increments of longer than 10 minutes and ideally longer than an hour. This will help the scheduler, and make sure your TtS is as short as possible.
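If your individual tasks are short, one way to meet these targets is to group several of them into each job. As a rough sketch, the helper below estimates how many tasks to run back to back in one job so that it lasts about an hour; the function name and the one hour default are illustrative choices based on the guidance above.

```python
import math

def tasks_per_job(task_seconds, target_minutes=60):
    """Number of tasks to run sequentially in one job to reach the target length."""
    return max(1, math.ceil(target_minutes * 60 / task_seconds))

# Tasks that take 30 seconds each: bundle 120 of them per job.
print(tasks_per_job(30))
# Tasks that already take 2 hours: one per job is fine.
print(tasks_per_job(7200))
```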

Storage

The final thing that can impact Job Efficiency is the storage you use. Nothing can drag down a fast code faster than slow IO speed (Input/Output). To select the right storage, please read our Data Management Best Practices page. In general, for jobs you will want to use either Global Scratch or Local Scratch. If your job is IO heavy (i.e. it is constantly talking to the storage), Local Scratch is strongly preferred. Please also see the Data Management page for how to best lay out your file structure, as file structure layout can impact job performance as well.

Job Optimization

Now that we have dealt with Job Efficiency, the next thing to look at is Job Optimization. After all, once your job is properly structured, the only way to improve your Time to Science and increase your code capability is to improve the code itself. Job optimization can be very beneficial but can also take significant time. There are in general three methods of optimizing your code, each taking different amounts of time.

  1. Compiler Version, Library Version, and Optimization: Compilers and Libraries are code as well and thus subject to improvement. Simply changing or updating your compiler and libraries can sometimes lead to dramatic increases in performance. In addition, compilers have different optimization flags you can use that will automatically optimize your code. This option is the fastest way to get optimized code as all the work is already done for you; you just need to select the right compiler, libraries, and options.
  2. Partial Code Rewrite: Looking through the code as it exists now and reworking portions or substituting in optimized libraries can create speed ups. This process can take a few weeks to months but can give substantial increases in speed. This cannot fix basic structural problems with the code though.
  3. Full Code Rewrite: This can take six months to a year to accomplish but is the best way to optimize your code. It will allow you to fundamentally understand how your code operates and fix any major structural problems with the code. Transformative increases in optimization can occur from this project, but it also takes significant time and effort and can turn into a permanent project of revamping your code. If you go this route you should try the other two options first, so you have a good understanding of the quirks of your code. You should also do a cost benefit analysis to figure out if the time spent is worth the potential gains. You will also want to make sure the project has a firm end goal in mind; the goal is to do research, not permanent code development. That said, if your code needs continual improvement, it may be time to hire a Research Software Engineer to do that very important and necessary code development work.

Regardless of method, you will need to grow more acquainted with your code, its numerical methods, and how it interacts with the underlying hardware. While there are some generalized rules and things to look for when optimizing code, in the end it will depend on you turning your code from a blackbox into something you understand at a fundamental level. This is also where learning how to use various debuggers and code inspectors can be very beneficial as they can help identify which portions of code to focus on. Below we are going to give some general rules regarding optimizing as well as suggestions as to different ways to go about it.

Compiler Optimization

Compiler optimization means letting the compiler look through the code for things it can improve automatically. Maybe it will change the memory layout to make it more optimal, maybe it will notice that you are doing a certain numerical technique and then substitute in a better one, maybe it will change the order of operations to improve numerical speed. Regardless of what it tries, compiler optimization relies on the authors of the compiler and their deep knowledge of numerical methods and the underlying hardware to get improved speed. Compiler optimization applies most directly to those compiling from C, C++, or Fortran, but higher level codes in Python and R lean on libraries that are written in C, C++, and Fortran. Thus if you want to really optimize your Python or R code, getting those underlying libraries built in an optimized way can lead to speed ups.

There is a generally agreed upon standard for most compilers with regard to level of optimization. After all not all optimizations are numerically safe, or will produce gains in speed. Some may in fact slow things down.  As such when using compiler optimization test your code speed and accuracy at different levels of optimization, and with different compilers. Each compiler has a different implementation of the standard, some are better for certain things than others. It is also worth reading the documentation for the compiler optimization levels to see what is included. A good exercise is to take each individual flag that makes up an optimization level and test to see if it speeds up your code and if it introduces numerical issues.

The standard code optimization (-O) levels are:

-O0: Not optimized at all. The compiler translates your code as is, with zero optimization work done. If you turn on debugging, this is typically what your code will default to. As a reminder, always remove debugging flags and options when running in production. With some compilers, debugging flags will disable optimization even if you tell the compiler to optimize, as the debugging flag overrides the optimization flag.

-O1: Numerically safe optimization. This level of optimization is guaranteed to be numerically stable and safe. No corners are cut, no compromises in numerical precision are made, nothing is reordered.

-O2: Mostly numerically safe optimization. This level of optimization is the default level for some compilers. At this level, in most cases, the optimizations made are numerically okay. Generally there is no sacrifice of numerical precision, though loops may be unrolled and reordered to make things more efficient.

-O3: Heavily optimized. This level of optimization takes the approach of trying to include every possible optimization whether numerically safe or not.

As you can see, each level of optimization makes certain assumptions about how numerically safe it is trying to be. Given this, you should always test your code to make sure that it runs as it should after compilation and does not produce errant results.
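To see why reordering arithmetic is not automatically safe, note that floating point addition is not associative: regrouping the same three numbers can change the last bits of the result. The short Python sketch below demonstrates the effect by hand. It does not involve a compiler at all; it only illustrates the kind of difference an aggressive optimization level may introduce when it reorders operations.

```python
a, b, c = 0.1, 0.2, 0.3

# The same three values summed with two different groupings.
left_to_right = (a + b) + c
regrouped = a + (b + c)

# The two results differ in the last bits of the mantissa.
print(left_to_right == regrouped)
print(left_to_right, regrouped)
```

A difference this small is harmless in many codes, but in long running or ill-conditioned calculations such differences can accumulate, which is why testing accuracy at each optimization level matters.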

One other common optimization is to leverage special features found in different chipsets. Each generation of CPU has different features built into it that you can leverage. Some example features are SSE (Streaming SIMD Extensions), FMA (Fused Multiply Add), and AVX (Advanced Vector Extensions). If your code is architected to use them, you can gain substantial speed by enabling these optimizations. There are three ways to do this:

  1. Turn on each feature individually: This allows you to pick and choose which you want and makes your compiled code portable across different chipsets.
  2. Specify the chipset you are building for: Compilers include flags that allow you to target a specific type of chip and include all the relevant optimizations for it. This approach works well if you have a uniform set of hardware you are running on, or if you are not sure what features your code will leverage. Note that your code will not work on other chipsets.
  3. Have the compiler autodetect what chipset you are using: Compilers usually have a flag (e.g. -xHost) that will detect the chipset you are currently on and then build specifically for that. To do this properly, you will need to make sure you are on a node of the same type that you will run your code on. In addition, your code will not be portable.

It is worth noting again that not all optimizations are safe or beneficial. Heavily optimized code can lead to substantial bloat in memory usage with little material gain. Numerical issues may occur if the compiler makes bad assumptions about your code. You should only use up to the level of optimization that is stable and beneficial and no higher. If an optimization has no impact on your code performance, it is best to leave it off.

Languages, Numerical Method, and Libraries

Selecting the correct language, numerical method, and libraries is an important part of code optimization. You always want to select the right tool for the job. For some situations Python is good enough, for others you really need Fortran. An improved numerical method may give enormous speed ups but at the cost of increased memory, or vice versa. Swapping out code you wrote for a library maintained by a domain expert may be faster, or having a more integrated code may end up being quicker.

With languages, you are usually locked into a language unless you do a complete code rewrite. As such, you should learn the quirks of the language you are using and make sure your code conforms to the best standards and practices for that language. If you are looking to rewrite your code, then consider changing which language you are using. It may be that a different language may lead to more speed ups in the future. As a general rule, languages that are closer to the hardware (things like C or Fortran) can be made to go fastest, but they also are trickier to use.

For numerical methods you will want to stay abreast of the current literature in your field and the field that generated the relevant numerical technique (e.g. matrix multiplication, sorting). Even small changes to a numerical technique can add up to large gains in speed. They can also dramatically impact memory utilization. Simplicity is also important, as in many cases a simpler method is faster just by dint of having to do less math and logic to make it go. This is not true in all cases though, so be sure to test and verify.

Libraries are another important tool in the toolbox. By using a library you leverage someone else’s time and experience to write optimized code. This saves you from having to debug the code and optimize it; you simply plug in the library and go. Libraries can still have flaws though, so you want to make sure you keep up to date and test. If you do find flaws you should contribute back to the community (i.e., report to the library’s developers) so that everyone benefits from the improvement you suggest. One other caution with libraries is that sometimes it is better to inline the code rather than go to a library, as the gains from using the library may not outweigh the cost of accessing the library. Libraries will not automatically make your code faster, but rather are a tool you can use to potentially get more speed and efficiency.

Other General Rules

Here are some rules that did not fit into other sections but are things you can look for when optimizing your code.

  1. Use the latest compilers and libraries: Implied above, but one of the first things to try is updating your compiler and library versions to see if the various improvements to those codes improve your code performance.
  2. Leave informative comments in your code: Comments are free and having good comments can help you understand your code and improve it. A very good practice is to cite the paper and specific equation or analysis you are using so you can find the original context.
  3. Make sure your loops are appropriately ordered for your arrays: Different languages have different array ordering as to which index is fastest to traverse in memory (for instance Fortran orders its arrays with the first index being fastest, in C it is the opposite). Be aware of this and arrange your arrays and loops appropriately.
  4. Avoid if statements buried in loops: if statements are not free and cost time to execute, thus it is best if you can execute it once rather than all the time.
  5. Use temporary variables to hold constants: Multiplications are faster than divisions or exponentials. Thus, instead of pi/2, use 0.5*pi; instead of 5^2, use 5*5. In addition, if you have a complicated coefficient you are multiplying or dividing by repeatedly, consider calculating that coefficient once and storing that as a temporary variable. If your coefficients are related to each other by some constant value, also consider making that a constant. For instance, if you are always using 4*pi/3, store that as a variable, and then use that variable wherever the expression appears.
  6. Use the right type, size, and level of precision for variables: Integer math is faster than floating point math. Single precision math is faster than double precision. 4 byte integers use up less space than 8 byte integers. Select the size and type necessary for the numerical precision and accuracy you require and no larger.
  7. If you have a heavy arithmetic section consider using small temporary arrays for the data you are manipulating: Long strings of math in a single line are hard for the compiler to optimize, and also trend towards mistakes. Consider breaking it up in to smaller chunks that eventually sum up to the total value you need. Be careful of round off error and order of operations issues with this.
  8. Lower your cache miss rate: CPUs and GPUs are built with onboard memory (typically called cache). Try to keep your processing in this onboard memory and only go out to main memory when necessary. Cache is faster to access than main memory but much smaller, so working in smaller chunks that reuse data makes it more likely that the data you need is already in cache.
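  9. Be aware of the first touch rule for memory allocation: Memory is typically allocated on an at-need basis: a page is usually only given a physical location when it is first written to (touched), and on nodes with more than one memory region (NUMA) it is typically placed near the core that touched it. The further the code has to reach in memory, the worse the performance. Allocate frequently used arrays and variables first, and in threaded code initialize arrays with the same threads that will later work on them.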
  10. Reduce memory footprint: As a general rule, keep your memory usage to the bare minimum you need. The more temporary arrays and variables you keep, the more memory bloat your code will have.
  11. Avoid over-abstraction: Pointers are useful, but pointers to pointers to pointers are not. Deep indirection makes it hard for the compiler to optimize and hard for you (and anyone who uses your code) to follow the code.
  12. Be specific and well defined: Well-structured code is easier to optimize. Declare all your variables up front, allocate your arrays as soon as you can, and do not leave variable types ambiguous.
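
As a minimal sketch of items 4 and 5 above, the Python example below computes sphere volumes two ways. The function names, the use_diameter flag, and the FOUR_THIRDS_PI constant are hypothetical and chosen only for illustration; the same restructuring applies in compiled languages such as C or Fortran, where the payoff is usually larger.

```python
import math

# Hypothetical constant for illustration: 4*pi/3 is computed once and reused.
FOUR_THIRDS_PI = 4.0 * math.pi / 3.0


def sphere_volumes_naive(radii, use_diameter):
    """Re-tests the flag and recomputes the coefficient on every iteration."""
    volumes = []
    for r in radii:
        if use_diameter:  # loop-invariant test, evaluated on every pass
            r = r / 2
        volumes.append(4 * math.pi / 3 * r ** 3)
    return volumes


def sphere_volumes_tuned(radii, use_diameter):
    """Tests the flag once and reuses a precomputed coefficient."""
    scale = 0.5 if use_diameter else 1.0  # decided once, outside the loop
    volumes = []
    for r in radii:
        r = scale * r  # multiplication instead of division
        volumes.append(FOUR_THIRDS_PI * r * r * r)  # products instead of a power
    return volumes
```

Both functions should agree to within round-off, so a quick comparison on a small input is a cheap way to confirm that a rewrite like this has not changed the answer.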

Parallelization

There are limits to how fast you can make any single code run in serial. Once this limit is hit, parallelization needs to be considered. Sometimes this parallelization is trivial, such as launching thousands of jobs at once, each with different parameters, to do a parameter sweep (this is known as an embarrassingly parallel workflow). However, if the parts of your code need to be tightly coupled, then other methods of parallelism will need to be considered. The three main methods of parallelization are:

  1. SIMD: Single Instruction Multiple Data
  2. Thread: Shared Memory
  3. Rank: Distributed Memory

Regardless of which method you use, the general rule is to parallelize as much of your code as possible and to overlap communication with computation. It is also possible to use SIMD in conjunction with Threads in conjunction with Ranks; this is known as the hybrid approach. Hybrid codes can be very powerful and can scale up to the largest supercomputers in the world.
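
One way to see why the parallelized fraction matters so much is Amdahl's law, which this guide does not name but which is the standard estimate of ideal speedup when only part of a code's runtime can be spread across workers. The sketch below is illustrative only: the function name and the example fractions are assumptions, and real codes will fall short of these numbers because of communication and load imbalance.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup when `parallel_fraction` of the runtime is split over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part shrinks by `workers`.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative fractions only, not measurements from any real code.
for fraction in (0.50, 0.90, 0.99):
    print(f"{fraction:.0%} parallel on 100 workers: "
          f"{amdahl_speedup(fraction, 100):.1f}x")
```

Even with 99% of the runtime parallelized, 100 workers give a speedup of only about 50x, which is why the remaining serial sections deserve attention before scaling up.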

SIMD (Single Instruction Multiple Data)

Most processors have multiple channels that can execute a single instruction simultaneously on a stream of data. This is built into the chipset itself, and compilers will often automatically optimize code to leverage this behavior (this is called vectorization). You can also intentionally design your code to leverage it better, depending on which specific compiler and instruction set (such as AVX) you are using.

Threads

Threading achieves parallelism by running multiple computational streams (threads) across a shared memory space. Thread-based parallelism is typically fairly easy to accomplish because it requires no complex interprocess communication: all changes to memory are readily visible to each thread. Typically all the coder needs to do is indicate which loops and sections can be threaded, and the compiler takes care of the rest. Examples are OpenMP, OpenACC, Pthreads, and CUDA.
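
The shared-memory model can be sketched with Python's standard thread pool, used here as a stand-in for the tools named above rather than as a replacement for them. Every thread reads the same list directly, and no data is copied or sent between threads. The function names and chunking scheme are assumptions for illustration. Note that in standard CPython builds the global interpreter lock keeps pure-Python arithmetic from running truly in parallel, so this shows the programming model, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(data, start, stop):
    """Sum one slice of the shared list; the list itself is never copied."""
    total = 0.0
    for i in range(start, stop):
        total += data[i]
    return total


def threaded_sum(data, n_threads=4):
    """Split the index range into chunks and sum each chunk in its own thread."""
    n = len(data)
    chunk = max(1, (n + n_threads - 1) // n_threads)  # ceiling division
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(partial_sum, data, s, e) for s, e in bounds]
        # Each thread returns its own partial result; combining them here
        # avoids several threads updating one shared total at the same time.
        return sum(f.result() for f in futures)
```

Having each thread produce a private partial result that is combined at the end is the same idea as a reduction in OpenMP.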

Rank

Rank-based parallelism is the most powerful but also the most technically demanding type of parallelism. Each process has its own memory space, and the user has to manage interprocess communication themselves. The key here is to keep communication bottlenecks minimal and, where they exist, to overlap them with computation so they do not slow down code execution. The industry standard for doing this is MPI (Message Passing Interface).

Profiling

Knowing where to focus your time for optimizing your code is important. You will gain the most speed by optimizing the part of your code that is currently occupying the most execution time, or using the most memory.  To figure this out you need to profile your code.

The easiest and most immediate way is to use print statements that report how much time each section takes. Most languages have methods for printing time stamps or calculating elapsed time; combine those with judicious print statements and you can quickly find out where your code is spending most of its time. Generally you should instrument your code to give you overall timing estimates, especially if your code is built around some sort of large loop (e.g. taking time steps in a fluid dynamics simulation).
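
A minimal sketch of this kind of instrumentation in Python is shown below. The timed helper, the timings dictionary, and the two-phase run loop are hypothetical stand-ins for the sections of a real code; only time.perf_counter and contextlib.contextmanager come from the standard library.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical accumulator: section name -> total seconds


@contextmanager
def timed(section):
    """Accumulate wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[section] = timings.get(section, 0.0) + elapsed


def run(n_steps=5, n_cells=20000):
    """Stand-in for a time-stepping loop with two phases per step."""
    state = [0.0] * n_cells
    total = 0.0
    for _ in range(n_steps):
        with timed("update"):
            state = [x + 1.0 for x in state]
        with timed("reduce"):
            total = sum(state)
    return total


if __name__ == "__main__":
    run()
    # Print the most expensive section first.
    for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{name:>8s}: {seconds:.6f} s")
```

Accumulating the time per named section, instead of printing on every pass, keeps the output short even when the loop runs for many steps.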

Besides print statements, various profilers exist that you can use to inspect your code. Profilers will give you far more information about your code, and some offer suggestions as to where it could be improved. They can give you very precise timings as well as tell you which cache or memory level your code is touching. All of this rich information can be valuable for zeroing in on particularly small sections of code or subtle issues that may be causing dramatic slowdowns.

Below is a list of profilers you can use:

  • VTune: Intel’s profiler
  • Nsight: Nvidia’s profiler
  • DCGM: Data Center GPU Manager from Nvidia
  • top: Not really a profiler but a useful system utility for monitoring live job performance.
  • nvtop: Similar to top but for GPUs.
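
To get a feel for what a profiler reports without installing any of the tools above, the sketch below uses cProfile, the profiler that ships with Python. It is not part of the list above and is used here only as an assumption-free way to show a per-function timing report; the workload functions are hypothetical.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Deliberately heavy: a quadratic number of multiplications."""
    return sum(i * j for i in range(n) for j in range(n))


def fast_part(n):
    """Cheap by comparison: linear work."""
    return sum(range(n))


def workload(n=200):
    return slow_part(n) + fast_part(n)


def profile_report(func, top=5):
    """Run `func` under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(workload))
```

The report lists call counts and time per function, which is the same kind of information the larger profilers provide in far more detail.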