Fairshare and Job Accounting
In order to ensure that all research labs get their fair share of the cluster and to account for differences in hardware being used, we utilize Slurm’s built-in job accounting and fairshare system. Every lab has a base Share of the community-wide system, which is governed by the Gratis Share purchased by the Faculty of Arts and Science and distributed equally to all labs. In addition, Shares purchased by individual labs by buying hardware are added to their base Share. The Fairshare score of a lab is then calculated based off of their Share versus the amount of the cluster they have actually used. This Fairshare score is then utilized to assign priority to their jobs relative to other users on the cluster. This keeps individual labs from monopolizing the resources, thus making it unfair to labs who have not used their fairshare for quite some time. Currently, we account for the fraction of the compute node used with CPU, GPU, and Memory usage using Slurm’s Trackable RESources (TRES).
Fairshare is a portmanteau that pretty much expresses what it is. Essentially fairshare is a way of ensuring that users get their appropriate portion of a system. Sadly this term is also used confusingly for different parts of fairshare. This includes what fraction of the system users get, the score that the system assigns for users based off of your usage, and the priority that users are assigned based off of their usage. For the sake of the discussion below, we will use the following terms. Share is the portion of the system users have been granted. Usage is the amount of the system users have actually used. Fairshare score is the value the system calculates based off of user’s usage. Priority score is the priority assigned based off of the user’s fairshare score.
While Fairshare may seem complex and confusing, it is actually quite logical once you think about it. The scheduler needs some way to adjudicate who gets what resources. Different groups on the cluster have been granted different resources for various reasons. In order to serve the great variety of groups and needs on the cluster a method of fairly adjudicating job priority is required. This is the goal of Fairshare. Fairshare allows those users who have not fully used their resource grant to get higher priority for their jobs on the cluster, while making sure that those groups that have used more than their resource grant do not overuse the cluster. The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.
Trackable RESources (TRES)
Slurm Trackable RESources (TRES) allows the scheduler to charge back users for how much they have used different features on the cluster. This is important as the usage of the cluster factors into the Fairshare calculation. These TRES charge backs vary from partition to partition. You can see what the TRES charge back is by running
scontrol show partition <partitionname> .
On Cannon we set TRES for CPU, GPU, and Memory usage. For most partitions we charge back for CPU’s and GPU’s based off of the type being used. We normalize TRES to 1.0 for Intel Cascade Lake chips. For other chips we calculate the TRES by taking the theoretical peak Floating Point OPerations (FLOPs) for a single core of that CPU (or entire GPU) and dividing it by the theoretic peak for the Intel Cascade Lake chips. With this weighting we end up with the following TRES per core:
|AMD Abu Dhabi||0.1|
|Intel Sandy/Ivy Bridge||0.2|
|Intel Cascade Lake||1.0|
|Intel Ice Lake||1.15|
It may seem to be a penalty to charge more for the Cascade Lake than the Abu Dhabi, but it really is not in the end. The reason being is that jobs running on the Cascade Lake cores will run roughly 10 times faster than the Abu Dhabi chips. Thus the actual charge back to the user should be the same on a per job basis, it’s just a question of picking the right resource for the job you are running.
In the case of memory we set the TRES based off of the following formula
NumCore*CoreTRES/TotalMem where NumCore is the number of cores per node, CoreTRES is the TRES score for that type of core, and TotalMem is the total available memory for the node. The reason we weight memory like this is that if a user uses up all the memory on the node the scheduler cannot schedule another job on that node even if there are available cores. The opposite is also true, if all the cores are used up the scheduler cannot schedule another job there even if there is free memory. Thus memory and CPU are exhaustible resources that impact each other. The above weighting allows us to ensure that memory costs the same as the CPU’s on a given node. For instance, lets say you have a node that has 128 GB of RAM and 32 Intel Cascade Lake cores. In this case every 4 GB of RAM used should be equivalent to a single core being used. Thus we should charge a TRES of 1.0 for 4 GB used, or 0.25 for every GB used. In the case of a AMD Abu Dhabi node with 64 cores and 256 GB of RAM, you have the same scenario but now the Abu Dhabi chips are worth 10 times less, thus the memory also is worth 10 times less as so it is 0.025 for every GB used.
There is one exception to the above TRES rules and those are the requeue partitions, such as
gpu_requeue. Since jobs in these partitions can be interrupted by higher priority jobs at any time, this means that there could be a loss of computation time. This is especially true for jobs who are not able to snapshot their progress and restart from where they left off. Studies have shown that to make this type of model break even in terms of cost you need to charge back roughly half of what you normally would. So for the requeue partitions we charge a flat rate of 0.5 for CPU, 37.5 for GPU, and 0.125 per GB for Memory. Since the requeue partitions contain all our hardware, users can get access to normally very high cost CPU’s and GPU’s for cheaper. Thus if a user needs to run a lot of jobs the best way to optimize throughput and usage is to build their jobs to leverage the cheap resources in the requeue partitions. One should be aware though that the available cores in this partition vary wildly depending on how active any given primary partition is.
On Cannon each user is associated with their primary group. This lab group is what is called an Account in Slurm. Users belong to Accounts, and Accounts have Shares granted to them. These Shares determine how much of the cluster that group has been granted. Users when they run are charged back for their runs against the Account (i.e. lab) they belong to.
Shares granted an Account come in three types that are summed together. The first type is the Gratis Share. This Gratis Share is the Share given to all labs that are part of the cluster owing to the investment that Research Computing, via the Faculty of Arts and Sciences, has made in Cannon. This Gratis Share is calculated by summing the CPU and GPU TRES for all the nodes in the public partitions, excepting the requeue partitions, and then dividing by the total number of Accounts on Cannon. Thus the Gratis Share roughly corresponds to the number of cores each group has been granted. Currently the Gratis Share is set to 120.
The second type of Share is Lab Share. This Share is the Share given to those Labs who have purchased hardware for their own lab. The CPU and GPU TRES from that purchased hardware is summed and added to the Gratis Share for that Lab’s Account.
The third type of Share is Communal Partition Share. This Communal Partition Share is the Share given to labs who have gone in with other labs and have purchased hardware to be used in common by the group of labs (e.g. a partition for the entire department, or for a school, or a collaboration of labs). In these cases the CPU and GPU TRES is summed and then divided amongst the labs, per their discretion, and added to the Lab’s Account.
Thus the total Share an Account has is simply the addition of all of these types of Share. This Share is global to the whole cluster. So whether the Lab is running on their own dedicated partitions or on the public partitions, their Share is the same. The Share is simply the portion of the entire system they have been granted, and can be moved around as needed by the Lab to any of the resources available to them on the cluster.
Probably the easiest way to walk through how a Lab’s Fairshare Score is calculated is to explain what the Slurm tool
sshare displays. This tool shows you all the components of your Fairshare calculation. Here is an example:
[root@holyitc01 ~]# sshare --account=test_lab -a
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- -----------
test_lab 244 0.001363 45566082 0.000572 0.747627
test_lab user1 parent 0.001363 8202875 0.000572 0.747627
test_lab user2 parent 0.001363 248820 0.000572 0.747627
test_lab user3 parent 0.001363 163318 0.000572 0.747627
test_lab user4 parent 0.001363 18901027 0.000572 0.747627
test_lab user5 parent 0.001363 18050039 0.000572 0.747627
The Account we are looking at is test_lab. The first line of the sshare output shows the summary for the whole lab, while the subsequent lines show the information for each user. test_lab has been granted 244 RawShares. Each user of that lab has a RawShare of parent, this means that all the users pull from the total Share of the Account and do not have their own individual subShares of the Account Share. Thus all users in this lab have full access to the full Share of the Account.
The next column after RawShares is NormShares. NormShares is simply the Account’s RawShares divided by the total number of RawShares given out to all Accounts on the cluster. Essentially NormShare is the fraction of the cluster the account has been granted, in this case about 0.136%. Given the way we set up giving out RawShares on Cannon, the total number of RawShares should be equivalent to the number of CPU TRES on Cannon, that is 244 Cascade Lake cores.
Following NormShares we have RawUsage. RawUsage is the amount of TRES-sec the Account/User has used. Thus if a user used a single Cascade Lake core for one second, the user’s account would be charged 1 TRES-sec in RawUsage. This RawUsage is also attenuated by the halflife that is set for the cluster, which is currently 4 weeks. Thus work done in the last 4 weeks counts at full cost, work done 8 weeks ago costs half, work done 12 weeks ago one fourth, and so on. So RawUsage is the aggregate of the Account’s past usage with this halflife weighting factor. The RawUsage for the Account is the sum of the RawUsage for each user, thus sshare is an effective way to figure out which users have contributed the most to the Account’s score.
A quick aside, it should be noted that RawUsage is the sum of all usage including: failed jobs, jobs that are requeued, jobs that ran on nodes that failed, etc. That usage is still counted as part of RawUsage. The reason for this is that it is up to the user to effectively use the time and resources allocated by the scheduler even if that time is cut short for various reasons. We highly recommend users test and verify their codes before running. Users should also ensure their code has checkpointing enabled so that jobs can restart from where they left off in case of node failure. These steps will minimize the effect of various failures on a user’s usage.
The next column is EffectvUsage. EffectvUsage is the Account’s RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the percentage of the cluster the Account has actually used. In this case, the test_lab has used 0.057% of the cluster.
Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula.
f = 2^(-EffectvUsage/NormShares) From this one can see that there are five basic regimes for this score which are as follows:
1.0: Unused. The Account has not run any jobs recently.
1.0 > f > 0.5: Underutilization. The Account is underutilizing their granted Share. For example, when f=0.75 a lab has recently underutilized their Share of the resources 1:2
0.5: Average utilization. The Account on average is using exactly as much as their granted Share.
0.5 > f > 0: Over-utilization. The Account has overused their granted Share. For example, when f=0.25 a lab has recently overutilized their Share of the resources 2:1
0: No share left. The Account has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
Since the usage of the cluster varies, the schedule does not stop Accounts from using more than their granted Share. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus an Account is essentially borrowing computing resource time in the future to use now. This will continue to drive down the Account’s Fairshare score, but allow jobs for the Account to still start. Eventually, another Account with a higher Fairshare score will start submitting jobs and that labs jobs will have a higher priority because they have not used their granted Share. Fairshare only recovers as a lab reduces the workload to allow other Accounts to run. The half-life helps to expedite this recovery.
Given this behavior of Fairshare, Accounts can also bank time for large computations that are beyond their average Share. For instance say the Lab knows it has a large parallel run to do, or alternatively a deadline to meet. The Lab can in preparation for this not run for several weeks. This will drive up their Fairshare as they will have not used their fraction of the cluster for that time period. This banked capacity can then be expended for a large run or series of runs. On the other hand, to continue the financial analogy, a group that has exhausted their Fairshare is in debt to the scheduler as they have used up far more than their granted Share. Thus they have to wait for that debt to be paid off by not running, which allows their Fairshare to recover. Again, when there is no contention for resources, even jobs with low Faishare scores will continue to start.
Now that we have discussed Fairshare we can now discuss how an individual job’s priority is calculated. Job Priority is an integer number that adjudicates the position of a job in the pending queue relative to other jobs. There are two components of Job Priority on Cannon. The first is the FairShare score multiplied by a weighting factor to turn it into an integer, in this case 20,000,000. A Fairshare of 1 would give a priority of 20,000,000, while a Fairshare of 0.5 would give a value of 10,000,000. We pick large numbers so we have resolution to break ties between Accounts that are close in Fairshare score. This Fairshare Priority evolves dynamically as the Fairshare of the Account changes over time.
The second component is Job Age. This priority accrues over time gaining a maximum value at 7 days. As the job sits in the queue waiting to be scheduled, its priority is gradually increasing due to the Job Age. The maximum possible value for Job Age is 10,000,000. Thus a job that has been sitting for 3.5 days would have a Job Age Priority of 5,000,000. We set the Job Age Priority to a maximum of 10,000,000 so that a job from an Account with a Fairshare of 0 but has been pending for 7 days would have the same priority as a job that was just submitted from an Account that has a Fairshare of 0.5. Thus even jobs from Accounts that have low Fairshare will schedule eventually due to the growth in their Job Age Priority.
These two components are summed together to make up an individual Job’s Priority. You can see this calculation for specific jobs by using the
sprio command. In addition you can see the Pending queue of a specific partition ordered by job priority by using
showq -o -p <partitionname>.
Slurm provides a way for users to adjust their own priority by defining a nice value. Similar to the unix nice command, this flag allows users to deprioritize certain jobs. Jobs that are deprioritized should have higher nice values than those that are more important. Values for nice can run between 0 and 2147483645, negative values are not allowed.
As discussed above users in the same account have the same Fairshare if their RawShares are set to parent. This can create a situations where the scheduling for a partition owned by a specific Account becomes FIFO (first in, first out) due to the inability to break ties between users at the same Fairshare. To solve this we have implemented a small adjustment to a user’s priority using the nice value. This small adjustment is weighted by the user’s individual RawUsage in order break ties internal to the group. The size of the adjustment is set to be negligible relative to the total priority score so it does not impact other aspects of scheduling or unduly penalize the user. Any user defined nice values are added to the value we define to preserve user defined job priority.
While most users are fine with having one Account they are associated with, some users do work for multiple Accounts. Slurm does have the ability to associate users with multiple Accounts, which allows users to charge back individual jobs to individual Accounts. Contact Research Computing if you are interested in this feature.
Research Computing keeps track of historic data for usage and Fairshare score. If you wish to see how your Account’s Fairshare score has evolved over time please contact Research Computing.
scalc is a calculator available on the cluster for figuring out various questions about fairshare. It includes a calculator for projecting a new Fairshare score based on a new RawShare, a calculator for figuring out how long it will take to restore fairshare, and a calculator for figuring out how much a set of jobs will cost in terms of cluster utilization and fairshare. When asked for to enter an account name, please enter your lab group name (e.g. – jharvard_lab). If you have additional calculations that you would like to see contact us.
There are several things that can be done when your fairshare is low:
- Do not run jobs: Fairshare recovers via two routes. The first is via your group not running any jobs and letting others use the resource. That allows your fractional usage to decrease which in turn increases your fairshare score. The second is via the half-life we apply to fairshare which ages out old usage over time. Both of these method require not action but inaction on the part of your group. Thus to recover your fairshare simply stop running jobs until your fairshare reaches the level you desire. Be warned this could take several weeks to accomplish depending on your current usage.
- Be patient: This is a corollary to the previous point but applies if you want to continue to run jobs. Even if your fairshare is low, your job gains priority by sitting the queue. The longer it sits the higher priority it gains. So even if you have very low fairshare your jobs will eventually run, it just may take several days to accomplish.
- Leverage Backfill: Slurm runs in two scheduling loops. The first loop is the main loop which simply looks at the top of the priority chain for the partition and tries to schedule that job. It will schedule jobs until it hits a job it cannot schedule and then it restarts the loop. The second loop is the backfill loop. This loop looks through jobs further down in the queue and asks can I schedule this job now and not interfere with the start time of the top priority job. Think of it as the scheduler playing giant game of three dimensional tetris, where the dimensions are number of cores, amount of memory, and amount of time. If your job will fit in the gaps that the scheduler has it will put your job in that spot even if it is low priority. This requires you to be very accurate in specifying the core, memory, and time usage of your job. The better constrained your job is the more likely the scheduler is to fit you in to these gaps. The
seffutility is a great way of figuring out your job performance.
- Leverage Requeue: The requeue partitions are cheaper to run in and have a lot of capacity. You are more likely to find your job pending for a shorter time, even with low fairshare, in those partitions than in the higher demand non-requeue partitions.
- Plan: Better planning and knowledge of your historic usage can help you better budget your time on the cluster. The cluster is not an infinite resource. You have been allocated a slice of the cluster, thus it is best to budget your usage so that you can run high priority jobs when you need to. We at FASRC are happy to consult with you as to how to best budget your usage. Tools like
seff-array, and the historic usage graphs are invaluable assets for this. Beyond that doing analysis of your code efficiency and memory usage will help dramatically. Most users vastly over estimate how much memory their job actually needs, dragging down their fairshare score over time. Trimming these excess requests makes for more efficient usage. Increasing code efficiency by taking time to optimize your code base can also be very beneficial as better, more efficient algorithms mean lower usage and therefore better fairshare.
- Purchase: If your group has persistent high demand that cannot be met with your current allocation, serious consideration should be given to purchasing hardware for the cluster. This is not an immediate solution to the problem as it takes time for hardware to be built and installed. That said once the hardware arrives your Share will be increased and your fairshare will improve commensurately. Please contact FASRC for more information if you wish to purchase hardware for the cluster.