<languages />

<translate>

<!--T:31-->
<i>Parent page: [[Job scheduling policies]]</i>

=Allocations for high-performance computing= <!--T:2-->

<!--T:3-->
<b>An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.</b> This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.

<!--T:4-->
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because they are meant to capture average use throughout the allocation period, typically a year, and this use occurs across a set of resources shared with other research groups.

<!--T:5-->
The time period of an allocation is a reference value, used to calculate the average; the average is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Similarly, if the allocation period were extended by a month, research groups affected by such a change would not see their resource access diminish during that month.

<!--T:6-->
Note that in the case of core year and GPU year allocations, both of which target average resource usage over time on shared resources, a research group is more likely to hit (or exceed) its targets if the resources are used evenly over the allocation period than if usage comes in bursts or is deferred until late in the allocation period.

==From compute allocations to job scheduling== <!--T:7-->

<!--T:8-->
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as <i>jobs</i> to a <i>scheduler</i>. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each submitted job and provides the needed resources based on each job’s priority and the available resources.

<!--T:9-->
The scheduler uses prioritization algorithms to meet the allocation targets of all groups; prioritization is based on a research group’s recent usage of the system compared to its allocated usage on that system. Usage earlier in the allocation period is taken into account, but the greatest weight is placed on recent usage (or non-usage). The point of this is to allow a research group that matches its actual usage to its allocated amount to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, making it theoretically possible for all research groups to hit their allocation targets.
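
To illustrate the idea, here is a deliberately simplified sketch; it is <i>not</i> the scheduler’s actual fair-share algorithm, and all names and constants in it are our own. Priority can be thought of as a comparison of decay-weighted recent usage against the allocation target:

<syntaxhighlight lang="python">
# Toy illustration only: NOT the scheduler's actual fair-share algorithm.
# Recent days weigh more than older days; running above your target rate
# lowers priority, running below it raises priority.

def toy_priority(daily_usage, target_rate, half_life_days=7.0):
    """Return a value in (0, 1]: 0.5 means usage exactly matches the target."""
    decay = 0.5 ** (1.0 / half_life_days)    # per-day decay factor
    weighted_usage = weighted_target = 0.0
    weight = 1.0
    for usage in reversed(daily_usage):      # most recent day first
        weighted_usage += weight * usage
        weighted_target += weight * target_rate
        weight *= decay                      # older days count for less
    return 2.0 ** (-weighted_usage / weighted_target)

print(toy_priority([0.0, 0.0, 0.0], 1.0))  # idle group   -> 1.0 (high priority)
print(toy_priority([1.0, 1.0, 1.0], 1.0))  # on target    -> 0.5
print(toy_priority([3.0, 3.0, 3.0], 1.0))  # recent burst -> 0.125 (low priority)
</syntaxhighlight>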

==Consequences of overusing a CPU or GPU allocation== <!--T:32-->

<!--T:33-->
If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that your subsequent jobs <i>may</i> have lower priority for a time while the scheduler prioritizes other groups which were below their targets. You are not prevented from submitting or running new jobs, and the average of your usage over time should still be close to your target, that is, your allocation.

<!--T:34-->
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.

=Reference GPU Units= <!--T:45-->
{{Note|This is a new unit that will be used from RAC 2024.}}

<!--T:46-->
The performance of GPUs has increased dramatically in recent years and is expected to do so again with the next generation of GPUs. Until RAC 2023, in order to reduce complexity, we treated all GPUs as equivalent to each other at allocation time and when accounting for the resources groups have consumed. This has raised issues of fairness, both in the allocation process and while running jobs. We can no longer treat all GPU types as the same.

<!--T:47-->
To overcome this fairness problem, we have defined a <i>reference GPU unit</i> (or <b>RGU</b>) in order to rank all GPU models in production. Because roughly half of our users primarily use single-precision floating-point operations ([https://en.wikipedia.org/wiki/Single-precision_floating-point_format FP32]), the other half use half-precision floating-point operations ([https://en.wikipedia.org/wiki/Half-precision_floating-point_format FP16]), and a significant portion of all users care about the memory on the GPU itself, we set the following evaluation criteria with their corresponding weights:

<!--T:48-->
{| class="wikitable" style="margin: auto;"
|-
! scope="col"| Evaluation Criteria
! scope="col"| Weight <br> (RGU)
|-
! scope="row"| FP32 score
| 40% * 4 = 1.6
|-
! scope="row"| FP16 score
| 40% * 4 = 1.6
|-
! scope="row"| GPU memory score
| 20% * 4 = 0.8
|}

<!--T:49-->
For convenience, the percentage weights are scaled up by a factor of 4, so that scores are expressed in <i>reference GPU units</i> (RGUs) and the reference model is worth exactly 4 RGUs. Then, using the <b>A100-40gb</b> as the reference GPU model, we get the following scores for each model:

<!--T:50-->
{| class="wikitable" style="margin: auto; text-align: center;"
|-
|
! scope="col"| FP32 score
! scope="col"| FP16 score
! scope="col"| Memory score
! scope="col"| Weighted Score
|-
! scope="col"| Weight:
! scope="col"| 1.6
! scope="col"| 1.6
! scope="col"| 0.8
| (RGU)
|-
! scope="row" style="text-decoration: underline;"| Model
|-
! scope="row"| P100-12gb
| 0.48
| 0.00
| 0.3
! 1.0
|-
! scope="row"| P100-16gb
| 0.48
| 0.00
| 0.4
! 1.1
|-
! scope="row"| T4-16gb
| 0.42
| 0.21
| 0.4
! 1.3
|-
! scope="row"| V100-16gb*
| 0.81
| 0.40
| 0.4
! 2.2
|-
! scope="row"| V100-32gb*
| 0.81
| 0.40
| 0.8
! 2.6
|-
! scope="row"| A100-40gb
| <b>1.00</b>
| <b>1.00</b>
| <b>1.0</b>
! 4.0
|-
! scope="row"| A100-80gb*
| 1.00
| 1.00
| 2.0
! 4.8
|}

<!--T:59-->
(*) On Graham, these GPU models are available only through a small number of contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.

<!--T:51-->
As an example, the oldest GPU model in production (P100-12gb) is worth 1.0 RGU. The next few generations of GPUs will be compared to the A100-40gb using the same formula.
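
As a sanity check, here is a minimal sketch (the variable names are ours, not part of CCDB) that recomputes the weighted scores in the table above. Note that the published table rounds the per-criterion scores, so a recomputed total can differ by roughly 0.1 in the last digit:

<syntaxhighlight lang="python">
# Recompute RGU values from the per-criterion scores in the table above.
# Weights are 40%/40%/20% scaled by 4, i.e. 1.6 / 1.6 / 0.8.

WEIGHTS = {"fp32": 1.6, "fp16": 1.6, "memory": 0.8}

MODELS = {
    "P100-12gb": {"fp32": 0.48, "fp16": 0.00, "memory": 0.3},
    "P100-16gb": {"fp32": 0.48, "fp16": 0.00, "memory": 0.4},
    "T4-16gb":   {"fp32": 0.42, "fp16": 0.21, "memory": 0.4},
    "V100-16gb": {"fp32": 0.81, "fp16": 0.40, "memory": 0.4},
    "V100-32gb": {"fp32": 0.81, "fp16": 0.40, "memory": 0.8},
    "A100-40gb": {"fp32": 1.00, "fp16": 1.00, "memory": 1.0},
    "A100-80gb": {"fp32": 1.00, "fp16": 1.00, "memory": 2.0},
}

for model, scores in MODELS.items():
    rgu = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    print(f"{model}: {rgu:.2f} RGU")  # e.g. "A100-40gb: 4.00 RGU"
</syntaxhighlight>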

==Choosing GPU models for your project== <!--T:52-->

<!--T:53-->
The relative scores in the table above should help you choose which GPU models to request. Here is an example with the two extremes (a worked comparison follows the list):

<!--T:54-->
* If your applications are doing primarily FP32 operations, an A100-40gb GPU is expected to be twice as fast as a P100-12gb GPU, but its recorded usage will be 4 times the resources. Consequently, for an equal amount of RGUs, P100-12gb GPUs should allow you to run double the computations.
* If your applications (typically AI-related) are doing primarily FP16 operations (including mixed-precision operations or other [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format floating-point formats]), an A100-40gb will be evaluated as using 4x the resources of a P100-12gb, but it can compute ~30x the calculations in the same amount of time, which would allow you to complete ~7.5x the computations.
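
The arithmetic behind both bullets is a ratio of speedup to RGU cost; the speedup figures below are the approximate ones quoted above:

<syntaxhighlight lang="python">
# Throughput per RGU, using the approximate speedups quoted above for the
# A100-40gb relative to the P100-12gb (~2x for FP32, ~30x for FP16).

RGU = {"P100-12gb": 1.0, "A100-40gb": 4.0}
A100_SPEEDUP = {"FP32": 2.0, "FP16": 30.0}   # relative to one P100-12gb

for workload, speedup in A100_SPEEDUP.items():
    per_rgu = speedup / RGU["A100-40gb"]     # the P100-12gb does 1.0 per RGU
    print(f"{workload}: A100-40gb delivers {per_rgu}x per RGU vs the P100-12gb")
# FP32: 0.5x per RGU -> P100s yield double the work for the same allocation.
# FP16: 7.5x per RGU -> the A100 completes ~7.5x the work for the same allocation.
</syntaxhighlight>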

==Starting from RAC 2024== <!--T:55-->

<!--T:56-->
* During the Resource Allocation Competition 2024 (RAC 2024), any proposal asking for GPUs will be required to specify the preferred GPU model for the project. Then, in the CCDB form, the amount of reference GPU units (RGUs) will automatically be calculated from the requested amount of gpu-years per year of project.
** For example, if you select the <i>narval-gpu</i> resource and request 13 gpu-years of the model A100-40gb, the corresponding amount of RGUs would be 13 * 4.0 = 52. The RAC committee would then allocate up to 52 RGUs, depending on the proposal score. If your allocation had to be moved to Cedar, the committee would instead allocate up to 20 gpu-years, because each V100-32gb GPU is worth 2.6 RGUs (and 52 / 2.6 = 20). See the conversion sketch after this list.

<!--T:57-->
* For job scheduling and for usage accounting on CCDB, the use of <i>reference GPU units</i> will take effect on April 1st, 2024, with the implementation of RAC 2024.
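
The conversion in the example above is just a multiplication or division by the model’s RGU value; here is a minimal sketch (the function names are ours):

<syntaxhighlight lang="python">
# Converting between gpu-years and RGUs, using RGU values from the table above.

RGU_PER_GPU = {"A100-40gb": 4.0, "V100-32gb": 2.6, "P100-12gb": 1.0}

def to_rgus(gpu_years, model):
    return gpu_years * RGU_PER_GPU[model]

def to_gpu_years(rgus, model):
    return rgus / RGU_PER_GPU[model]

rgus = to_rgus(13, "A100-40gb")               # 13 * 4.0 = 52.0 RGUs
print(rgus, to_gpu_years(rgus, "V100-32gb"))  # on Cedar: 52 / 2.6 = 20.0 gpu-years
</syntaxhighlight>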

=Detailed effect of resource usage on priority= <!--T:10-->

<!--T:11-->
The overarching principle governing the calculation of priority on our national clusters is that jobs are charged for the resources that others are prevented from using, not for the resources actually used.

<!--T:12-->
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others during the job.

<!--T:13-->
Another common case is a job that requests more memory than is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for a single core but 8GB of memory, the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core: no memory was left for it.

==Core equivalents used by the scheduler== <!--T:15-->

<!--T:16-->
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system.

<!--T:17-->
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]

<!--T:18-->
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.

<!--T:19-->
Allocation target tracking is straightforward when resource requests can be partitioned into whole core equivalents. Things become more complicated when jobs request fractions of a core equivalent, because a request can be charged for more core equivalents than its core count alone would suggest. In practice, the method used by the Alliance to account for system usage addresses fairness and perceptions of fairness, but the method is not initially intuitive.

<!--T:20-->
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory (both cases below reduce to the rule sketched after this list):
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) are charged by cores. For example, a research group requests two cores and 2GB per core, for a total of 4 GB of memory. The request requires 2 core equivalents worth of cores but only one bundle’s worth of memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all>

<!--T:21-->
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio are charged by memory. For example, a research group requests two cores and 5GB per core, for a total of 10 GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles’ worth of cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all>
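
In other words, the charge is simply the larger of the two resource counts once memory is expressed in cores. A minimal sketch, assuming the 4 GB-per-core bundle used above:

<syntaxhighlight lang="python">
# Core equivalents charged for a job request, assuming 4 GB per core
# (as on Cedar and Graham; Niagara would use 4.8).

def core_equivalents(cores, mem_gb, gb_per_core=4.0):
    return max(cores, mem_gb / gb_per_core)

print(core_equivalents(2, 4))   # 2.0 -> Figure 2: charged by cores
print(core_equivalents(2, 10))  # 2.5 -> Figure 3: charged by memory
print(core_equivalents(1, 8))   # 2.0 -> the single-core, 8 GB example above
</syntaxhighlight>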

==Reference GPU unit equivalent used by the scheduler== <!--T:22-->

<!--T:23-->
The use of GPUs and their associated resources follows the same principles as already described for core equivalents, except that a reference GPU unit (RGU) is added to the bundle alongside multiple cores and memory. This means that the accounting for GPU-based allocation targets must include the RGU. Just as a point system was used above to express the concept of core equivalence, a similar point system is used here to express RGU equivalence.

<!--T:25-->
Research groups are charged for the maximum number of RGU-core-memory bundles they use. Assuming a hypothetical bundle of 1 RGU, 3 cores, and 4 GB of memory (all of the cases below reduce to the rule sketched after this list):
[[File:GPU_equivalent_diagram.png|thumb|upright=1.1|center|Figure 4 - RGU equivalent diagram.]] <br clear=all>

<!--T:26-->
* Research groups using more RGUs than cores or memory per RGU-core-memory bundle will be charged by RGU. For example, a research group requests 2 P100-12gb GPUs (1 RGU each), 3 cores, and 4 GB of memory. The request is for 2 bundles worth of RGUs, but only one bundle for memory and cores. This job request will be counted as 2 RGU equivalents when the research group’s priority is calculated.
[[File:Two_GPU_equivalents.png|thumb|center|Figure 5 - Two RGU equivalents.]] <br clear=all>

<!--T:27-->
* Research groups using more cores than RGUs or memory per RGU-core-memory bundle will be charged by core. For example, a researcher requests 1 RGU, 5 cores, and 5 GB of memory. The request is for 1.66 bundles worth of cores, but only one bundle for RGUs and 1.25 bundles for memory. This job request will be counted as 1.66 RGU equivalents when the research group’s priority is calculated.
[[File:GPU_and_a_half_(cores).png|thumb|center|Figure 6 - 1.66 RGU equivalents, based on cores.]] <br clear=all>

<!--T:28-->
* Research groups using more memory than RGUs or cores per RGU-core-memory bundle will be charged by memory. For example, a researcher requests 1 RGU, 2 cores, and 6 GB of memory. The request is for 1.5 bundles worth of memory, but only one bundle for RGUs and 0.66 bundles for cores. This job request will be counted as 1.5 RGU equivalents when the research group’s priority is calculated.
[[File:GPU_and_a_half_(memory).png|thumb|center|Figure 7 - 1.5 RGU equivalents, based on memory.]] <br clear=all>

<!--T:60-->
* On the same hypothetical cluster, a bundle with one V100-32gb GPU, 7.8 CPU cores and 10.4 GB of memory is worth 2.6 RGU equivalents:
[[File:Two.Six_RGU_equivalents.png|thumb|upright=2.1|center|Figure 8 - 2.6 RGU equivalents, based on the V100-32gb GPU.]] <br clear=all>

<!--T:61-->
* On the same hypothetical cluster, a bundle with one A100-40gb GPU, 12 CPU cores and 16 GB of memory is worth 4.0 RGU equivalents:
[[File:Four_RGU_equivalents.png|thumb|upright=2.66|center|Figure 9 - 4.0 RGU equivalents, based on the A100-40gb GPU.]] <br clear=all>
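
All five figures follow from one rule: the charge is the maximum, over the three bundled resources, of the amount requested divided by the amount per bundle. A minimal sketch, assuming the hypothetical 1 RGU / 3 cores / 4 GB bundle above:

<syntaxhighlight lang="python">
# RGU equivalents charged for a job, assuming the hypothetical bundle above:
# 1 RGU, 3 cores, and 4 GB of memory per bundle.

def rgu_equivalents(rgus, cores, mem_gb, cores_per_bundle=3.0, gb_per_bundle=4.0):
    return max(rgus, cores / cores_per_bundle, mem_gb / gb_per_bundle)

print(rgu_equivalents(2.0, 3, 4))       # 2.0  -> Figure 5: charged by RGUs
print(rgu_equivalents(1.0, 5, 5))       # 1.67 -> Figure 6: charged by cores
print(rgu_equivalents(1.0, 2, 6))       # 1.5  -> Figure 7: charged by memory
print(rgu_equivalents(2.6, 7.8, 10.4))  # 2.6  -> Figure 8: one V100-32gb bundle
print(rgu_equivalents(4.0, 12, 16))     # 4.0  -> Figure 9: one A100-40gb bundle
</syntaxhighlight>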

===Ratios in bundles=== <!--T:29-->
Alliance systems have the following RGU-core-memory and GPU-core-memory bundle characteristics:

<!--T:58-->
{| class="wikitable" style="margin: auto; text-align: center;"
|-
! scope="col"| Cluster
! scope="col"| GPU model
! scope="col"| RGU per GPU
! scope="col"| Bundle per RGU
! scope="col"| Bundle per GPU
! scope="col"| Physical ratios
|-
! scope="row"| [[Béluga/en#Node_Characteristics|Béluga]]
| V100-16gb
| 2.2
| 4.5 cores / 21 GB
| 10 cores / 46.5 GB
| 10 cores / 46.5 GB
|-
! rowspan="3"| [[Cedar#Node_characteristics|Cedar]]
| P100-12gb
| 1.0
| rowspan="3"|3.1 cores / 25 GB
| 3.1 cores / 25 GB
| 6 cores / 31.2 GB
|-
| P100-16gb
| 1.1
| 3.4 cores / 27 GB
| 6 cores / 62.5 GB
|-
| V100-32gb
| 2.6
| 8.0 cores / 65 GB
| 8 cores / 46.5 GB
|-
! rowspan="5"| [[Graham#Node_characteristics|Graham]]
| P100-12gb
| 1.0
| rowspan="5"| 9.7 cores / 43 GB
| 9.7 cores / 43 GB
| 16 cores / 62 GB
|-
| T4-16gb
| 1.3
| 12.6 cores / 56 GB
| {4, 11} cores / 46.8 GB
|-
| V100-16gb*
| 2.2
| 21.3 cores / 95 GB
| 3.5 cores / 23.4 GB
|-
| V100-32gb*
| 2.6
| 25.2 cores / 112 GB
| 5 cores / 47.1 GB
|-
| A100-80gb*
| 4.8
| 46.6 cores / 206 GB
| {8, 16} c. / {62, 248} GB
|-
! scope="row"| [[Narval/en#Node_Characteristics|Narval]]
| A100-40gb
| 4.0
| 3.0 cores / 31 GB
| 12 cores / 124.5 GB
| 12 cores / 124.5 GB
|}

<!--T:62-->
(*) These GPU models are available only through a small number of contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.

<!--T:63-->
<b>Note:</b> While the scheduler computes priority based on usage calculated with the above bundles, users requesting multiple GPUs per node must also take the physical ratios into account.
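
The <i>Bundle per GPU</i> column is simply the <i>Bundle per RGU</i> column scaled by the model’s RGU value; small differences from the table come from rounding. For example:

<syntaxhighlight lang="python">
# "Bundle per GPU" = "RGU per GPU" x "Bundle per RGU"; minor differences
# from the table above are rounding.

def bundle_per_gpu(rgu_per_gpu, cores_per_rgu, gb_per_rgu):
    return rgu_per_gpu * cores_per_rgu, rgu_per_gpu * gb_per_rgu

print(bundle_per_gpu(2.2, 4.5, 21))  # Béluga V100-16gb -> (9.9, 46.2), listed as 10 cores / 46.5 GB
print(bundle_per_gpu(4.0, 3.0, 31))  # Narval A100-40gb -> (12.0, 124.0), listed as 12 cores / 124.5 GB
</syntaxhighlight>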

=Viewing group usage of compute resources= <!--T:35-->

<!--T:36-->
[[File:Select view group usage edit.png|thumb|Navigation to <i>View Group Usage</i>]]
Information on the usage of compute resources by your groups can be found by logging into the CCDB and navigating to <i>My Account > View Group Usage</i>.
<br clear=all>

<!--T:43-->
[[File:ccdb_view_use_by_compute_resource.png|thumb|CPU and GPU usage by compute resource]]
CPU and GPU core year values are calculated based on the quantity of the resources allocated to jobs on the clusters. Note that the values summarized in these pages do not represent core-equivalent measures; consequently, in the case of large-memory jobs, the usage values will not match the cluster scheduler’s representation of the account usage.

<!--T:37-->
The first tab bar offers these options:
: <b>By Compute Resource</b>: the cluster on which jobs are submitted;
: <b>By Resource Allocation Project</b>: the projects to which jobs are submitted;
: <b>By Submitter</b>: the user who submitted the jobs;
: <b>Storage usage</b> is discussed in [[Storage and file management]].
<br clear=all>

==Usage by compute resource== <!--T:38-->

<!--T:44-->
This view shows, for each cluster, the usage of compute resources by groups that you own or of which you are a member, for the current allocation year starting April 1st. The tables contain the total usage to date as well as the projected usage to the end of the current allocation period.
<br clear=all>

<!--T:39-->
[[File:Ccdb_view_use_by_compute_resource_monthly.png|thumb|Usage by compute resource with monthly breakdown]]
In the <i>Extra Info</i> column of the usage table, click <i>Show monthly usage</i> to display a further breakdown of the usage by month for that cluster row. Clicking <i>Show submitter usage</i> displays a similar breakdown for the specific users submitting jobs on that cluster.
<br clear=all>

==Usage by resource allocation project== <!--T:40-->
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown]]
Under this tab, a third tab bar displays the RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. The tables contain detailed information for each allocation project and the resources used by the projects on all of the clusters. The top of the page summarizes information such as the account name (e.g. def-*, rrg-* or rpp-*), the project title and ownership, as well as allocation and usage summaries.
<br clear=all>

==GPU usage and Reference GPU Units (RGUs)== <!--T:41-->
[[File:Rgu en.png|thumb|alt=GPU usage|GPU usage summary with Reference GPU Unit (RGU) breakdown table.]]
For resource allocation projects that have GPU usage, the table is broken down by GPU model and measured in RGUs.
<br clear=all>

==Usage by submitter== <!--T:42-->
[[File:Ccdb view use by submitter summary edit.png|thumb|CPU and GPU usage by submitter]]
Usage can also be displayed grouped by the users who submitted jobs from within the resource allocation projects (group accounts). The view shows the usage for each user aggregated across systems.
Selecting a user from the list displays that user’s usage broken down by cluster. Like the group summaries, these per-user summaries can be broken down monthly by clicking the <i>Show monthly usage</i> link in the <i>Extra Info</i> column of the <i>CPU/GPU Usage (in core/GPU years)</i> table for the specific resource row.
<br clear=all>

<!--T:30-->
[[Category:SLURM]]
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=144795Allocations and compute scheduling2023-10-02T17:40:56Z<p>Jdesjard: </p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
<i>Parent page: [[Job scheduling policies]]</i><br />
<br />
=Allocations for high-performance computing= <!--T:2--><br />
<br />
<!--T:3--><br />
<b>An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.</b> This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
==From compute allocations to job scheduling== <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as <i>jobs</i> to a <i>scheduler</i>. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups and it is based on a research group’s recent usage of the system as compared to their allocated usage on that system. The past of the allocation period is taken into account but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches their actual usage with their allocated amounts to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, allowing for it to be theoretically possible for all research groups to hit their allocation targets.<br />
<br />
==Consequences of overusing a CPU or GPU allocation== <!--T:32--><br />
<br />
<!--T:33--><br />
If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that succeeding jobs of yours <i>may</i> have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the average of your usage over time should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=Reference GPU Units= <!--T:45--><br />
{{Note|This is a new unit that will be used from RAC 2024.}}<br />
<br />
<!--T:46--><br />
As you may be aware, the performance of GPUs has dramatically increased in the recent years and is expected to do so again with the upcoming next generation of GPUs. Until RAC 2023, in order to reduce complexity, we have been treating all GPUs as equivalent to each other at allocation time and when considering how many resources groups have consumed. This has raised issues of fairness, both in the allocation process and while running jobs. We cannot continue to treat all GPU types as the same.<br />
<br />
<!--T:47--><br />
To overcome the fairness problem, we have defined a <i>reference GPU unit</i> (or <b>RGU</b>) in order to be able to rank all GPU models in production. Because roughly half of our users use primarily single-precision floating-point operations ([https://en.wikipedia.org/wiki/Single-precision_floating-point_format FP32]), the other half use half-precision floating-point operations ([https://en.wikipedia.org/wiki/Half-precision_floating-point_format FP16]), and a significant portion of all users care about the memory on the GPU itself, we set the following evaluation criteria with their corresponding weight:<br />
<br />
<!--T:48--><br />
{| class="wikitable" style="margin: auto;"<br />
|-<br />
! scope="col"| Evaluation Criteria<br />
! scope="col"| Weight <br> (RGU)<br />
|-<br />
! scope="row"| FP32 score<br />
| 40% * 4 = 1.6<br />
|-<br />
! scope="row"| FP16 score<br />
| 40% * 4 = 1.6<br />
|-<br />
! scope="row"| GPU memory score<br />
| 20% * 4 = 0.8<br />
|}<br />
<br />
<!--T:49--><br />
For convenience, weights are based on percentages up-scaled by a factor of 4 <i>reference GPU units</i> (RGUs). Then, by using the <b>A100-40gb</b> as the reference GPU model, we get the following scores for each model:<br />
<br />
<!--T:50--><br />
{| class="wikitable" style="margin: auto; text-align: center;"<br />
|-<br />
|<br />
! scope="col"| FP32 score<br />
! scope="col"| FP16 score<br />
! scope="col"| Memory score<br />
! scope="col"| Weighted Score<br />
|-<br />
! scope="col"| Weight:<br />
! scope="col"| 1.6<br />
! scope="col"| 1.6<br />
! scope="col"| 0.8<br />
| (RGU)<br />
|-<br />
! scope="row" style="text-decoration: underline;"| Model<br />
|-<br />
! scope="row"| P100-12gb<br />
| 0.48<br />
| 0.00<br />
| 0.3<br />
! 1.0<br />
|-<br />
! scope="row"| P100-16gb<br />
| 0.48<br />
| 0.00<br />
| 0.4<br />
! 1.1<br />
|-<br />
! scope="row"| T4-16gb<br />
| 0.42<br />
| 0.21<br />
| 0.4<br />
! 1.3<br />
|-<br />
! scope="row"| V100-16gb*<br />
| 0.81<br />
| 0.40<br />
| 0.4<br />
! 2.2<br />
|-<br />
! scope="row"| V100-32gb*<br />
| 0.81<br />
| 0.40<br />
| 0.8<br />
! 2.6<br />
|-<br />
! scope="row"| A100-40gb<br />
| <b>1.00</b><br />
| <b>1.00</b><br />
| <b>1.0</b><br />
! 4.0<br />
|-<br />
! scope="row"| A100-80gb*<br />
| 1.00<br />
| 1.00<br />
| 2.0<br />
! 4.8<br />
|}<br />
<br />
<!--T:59--><br />
(*) On Graham, these GPU models are available through a very few contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.<br />
<br />
<!--T:51--><br />
As an example, the oldest GPU model in production (P100-12gb) is worth 1.0 RGU. The next few generations of GPUs will be compared to the A100-40gb using the same formula.<br />
<br />
==Choosing GPU models for your project== <!--T:52--><br />
<br />
<!--T:53--><br />
The relative scores in the above table should give you a hint on the models to choose. Here is an example with the extremes:<br />
<br />
<!--T:54--><br />
* If your applications are doing primarily FP32 operations, an A100-40gb GPU is expected to be twice as fast as a P100-12gb GPU, but the recorded usage will be 4 times the resources. Consequently, for an equal amount of RGUs, P100-12gb GPUs should allow you to run double the computations.<br />
* If your applications (typically AI-related) are doing primarily FP16 operations (including mixed precision operations or using other [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format floating-point formats]), using an A100-40gb will result in getting evaluated as using 4x the resources of a P100-12gb, but it is capable of computing ~30x the calculations for the same amount of time, which would allow you to complete ~7.5x the computations.<br />
<br />
==Starting from RAC 2024== <!--T:55--><br />
<br />
<!--T:56--><br />
* During the Resource Allocation Competition 2024 (RAC 2024), any proposal asking for GPUs will require to specify the preferred GPU model for the project. Then, in the CCDB form, the amount of reference GPU units (RGUs) will automatically be calculated from the requested amount of gpu-years per year of project.<br />
** For example, if you select the <i>narval-gpu</i> resource and request 13 gpu-years of the model A100-40gb, the corresponding amount of RGUs would be 13 * 4.0 = 52. The RAC committee would then allocate up to 52 RGUs, depending on the proposal score. In case your allocation must be moved to Cedar, the committee would instead allocate up to 20 gpu-years, because each V100-32gb GPU is worth 2.6 RGUs (and 52 / 2.6 = 20).<br />
<br />
<!--T:57--><br />
* For job scheduling and for usage accounting on CCDB, the use of <i>reference GPU units</i> will take effect on April 1st, 2024, with the implementation of RAC 2024.<br />
<br />
=Detailed effect of resource usage on priority= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on our national clusters is that compute-based jobs are considered in the calculation based on the resources that others are prevented from using and not on the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
<br />
==Cores equivalent used by the scheduler== <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when requests to use resources on the clusters are made entirely of core and memory amounts that can be portioned only into complete equivalent cores. Things become more complicated when jobs request portions of a core equivalent because it is possible to have many points counted against a research group’s allocation, even when they are using only portions of core equivalents. In practice, the method used by the Alliance to account for system usage solves problems about fairness and perceptions of fairness but unfortunately the method is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio), will be charged by cores. For example, a research group requesting two cores and 2GB per core for a total of 4 GB of memory. The request requires 2 core equivalents worth of cores but only one bundle for memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core for a total of 10 GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles for cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
<br />
==Reference GPU unit equivalent used by the scheduler== <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follow the same principles as already described for core equivalents, except that a reference GPU unit (RGU) is added to the bundle alongside multiple cores and memory. This means that the accounting for GPU-based allocation targets must include the RGU. Similar to how the point system was used above when considering resource use as an expression of the concept of core equivalence, we use a similar point system here as an expression of RGU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of RGU-core-memory bundles they use. Assuming a fictive bundle of 1 RGU, 3 cores, and 4 GB of memory: <br />
[[File:GPU_equivalent_diagram.png|thumb|upright=1.1|center|Figure 4 - RGU equivalent diagram.]] <br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more RGUs than cores or memory per RGU-core-memory bundle will be charged by RGU. For example, a research group requests 2 P100-12gb GPUs (1 RGU each), 3 cores, and 4 GB of memory. The request is for 2 bundles worth of RGUs, but only one bundle for memory and cores. This job request will be counted as 2 RGU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|thumb|center|Figure 5 - Two RGU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than RGUs or memory per RGU-core-memory bundle will be charged by core. For example, a researcher requests 1 RGU, 5 cores, and 5 GB of memory. The request is for 1.66 bundles worth of cores, but only one bundle for RGUs and 1.25 bundles for memory. This job request will be counted as 1.66 RGU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|thumb|center|Figure 6 - 1.66 RGU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than RGUs or cores per RGU-core-memory bundle will be charged by memory. For example, a researcher requests 1 RGU, 2 cores, and 6 GB of memory. The request is for 1.5 bundles worth of memory, but only one bundle for GPUs and 0.66 bundle for cores. This job request will be counted as 1.5 RGU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|thumb|center|Figure 7 - 1.5 RGU equivalents, based on memory.]] <br clear=all><br />
<br />
<!--T:60--><br />
* On the same fictive cluster, a bundle with one V100-32gb GPU, 7.8 CPU cores and 10.4 GB of memory is worth 2.6 RGU equivalents:<br />
[[File:Two.Six_RGU_equivalents.png|thumb|upright=2.1|center|Figure 8 - 2.6 RGU equivalents, based on the V100-32gb GPU.]] <br clear=all><br />
<br />
<!--T:61--><br />
* On the same fictive cluster, a bundle with one A100-40gb GPU, 12 CPU cores and 16 GB of memory is worth 4.0 RGU equivalents:<br />
[[File:Four_RGU_equivalents.png|thumb|upright=2.66|center|Figure 9 - 4.0 RGU equivalents, based on the A100-40gb GPU.]] <br clear=all><br />
<br />
===Ratios in bundles=== <!--T:29--><br />
Alliance systems have the following RPU-core-memory and GPU-core-memory bundle characteristics:<br />
<br />
<!--T:58--><br />
{| class="wikitable" style="margin: auto; text-align: center;"<br />
|-<br />
! scope="col"| Cluster<br />
! scope="col"| GPU model<br />
! scope="col"| RGU per GPU<br />
! scope="col"| Bundle per RGU<br />
! scope="col"| Bundle per GPU<br />
! scope="col"| Physical ratios<br />
|-<br />
! scope="row"| [[Béluga/en#Node_Characteristics|Béluga]]<br />
| V100-16gb<br />
| 2.2<br />
| 4.5 cores / 21 GB<br />
| 10 cores / 46.5 GB<br />
| 10 cores / 46.5 GB<br />
|-<br />
! rowspan="3"| [[Cedar#Node_characteristics|Cedar]]<br />
| P100-12gb<br />
| 1.0<br />
| rowspan="3"|3.1 cores / 25 GB<br />
| 3.1 cores / 25 GB<br />
| 6 cores / 31.2 GB<br />
|-<br />
| P100-16gb<br />
| 1.1<br />
| 3.4 cores / 27 GB<br />
| 6 cores / 62.5 GB<br />
|-<br />
| V100-32gb<br />
| 2.6<br />
| 8.0 cores / 65 GB<br />
| 8 cores / 46.5 GB<br />
|-<br />
! rowspan="5"| [[Graham#Node_characteristics|Graham]]<br />
| P100-12gb<br />
| 1.0<br />
| rowspan="5"| 9.7 cores / 43 GB<br />
| 9.7 cores / 43 GB<br />
| 16 cores / 62 GB<br />
|-<br />
| T4-16gb<br />
| 1.3<br />
| 12.6 cores / 56 GB<br />
| {4, 11} cores / 46.8 GB<br />
|-<br />
| V100-16gb*<br />
| 2.2<br />
| 21.3 cores / 95 GB<br />
| 3.5 cores / 23.4 GB<br />
|-<br />
| V100-32gb*<br />
| 2.6<br />
| 25.2 cores / 112 GB<br />
| 5 cores / 47.1 GB<br />
|-<br />
| A100-80gb*<br />
| 4.8<br />
| 46.6 cores / 206 GB<br />
| {8, 16} c. / {62, 248} GB<br />
|-<br />
! scope="row"| [[Narval/en#Node_Characteristics|Narval]]<br />
| A100-40gb<br />
| 4.0<br />
| 3.0 cores / 31 GB<br />
| 12 cores / 124.5 GB<br />
| 12 cores / 124.5 GB<br />
|}<br />
<br />
<!--T:62--><br />
(*) These GPU models are available through a very few contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.<br />
<br />
<!--T:63--><br />
<b>Note:</b> While the scheduler will compute the priority based on the usage calculated with the above bundles, users requesting multiple GPUs per node also have to take into account the physical ratios.<br />
<br />
=Viewing group usage of compute resources= <!--T:35--><br />
<br />
<!--T:36--><br />
[[File:Select view group usage edit.png|thumb|Navigation to <i>View Group Usage</i>]]<br />
Information on the usage of compute resources by your groups can be found by logging into the CCDB and navigating to <i>My Account > View Group Usage</i>.<br />
<br clear=all><br />
<br />
<!--T:43--><br />
[[File:ccdb_view_use_by_compute_resource.png|thumb|CPU and GPU usage by compute resource]]<br />
CPU and GPU core year values are calculated based on the quantity of the resources allocated to jobs on the clusters. It is important to note that the values summarized in these pages do not represent core-equivalent measures such that, in the case of large memory jobs, the usage values will not match the cluster scheduler’s representation of the account usage.<br />
<br />
<!--T:37--><br />
The first tab bar offers these options:<br />
: <b>By Compute Resource</b>: cluster on which jobs are submitted; <br />
: <b>By Resource Allocation Project</b>: projects to which jobs are submitted;<br />
: <b>By Submitter</b>: user that submits the jobs;<br />
: <b>Storage usage</b> is discussed in [[Storage and file management]]. <br />
<br clear=all><br />
<br />
==Usage by compute resource== <!--T:38--><br />
<br />
<!--T:44--><br />
This view shows the usage of compute resources per cluster used by groups owned by you or of which you are a member for the current allocation year starting April 1st. The tables contain the total usage to date as well as the projected usage to the end of the current allocation period.<br />
<br clear=all><br />
<br />
<!--T:39--><br />
[[File:Ccdb_view_use_by_compute_resource_monthly.png|thumb|Usage by compute resource with monthly breakdown]]<br />
From the <i>Extra Info</i> column of the usage table <i>Show monthly usage</i> can be clicked to display a further breakdown of the usage by month for the specific cluster row in the table. By clicking <i>Show submitter usage</i>, a similar breakdown is displayed for the specific users submitting the jobs on the cluster.<br />
<br clear=all><br />
<br />
==Usage by resource allocation project== <!--T:40--><br />
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown]]<br />
Under this tab, a third tag bar displays the RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. The tables contain detailed information for each allocation project and the resources used by the projects on all of the clusters. The top of the page summarizes information such as the account name (e.g. def-, rrg- or rpp-*, etc.), the project title and ownership, as well as allocation and usage summaries.<br />
<br clear=all><br />
<br />
==Usage by resource allocation project== <!--T:41--><br />
[[File:Rgu en.png|thumb|alt=GPU usage|GPU usage summary with Reference GPU Unit (RGU) breakdown table.]]<br />
For resource allocation projects that have GPU usage the table is broken down into usage on various GPU models and measured in RGUs.<br />
<br clear=all><br />
<br />
==Usage by submitter== <!--T:42--><br />
[[File:Ccdb view use by submitter summary edit.png|thumb|CPU and GPU usage by submitter]]<br />
Usage can also be displayed grouped by the users that submitted jobs from within the resource allocation projects (group accounts). The view shows the usage for each user aggregated across systems.<br />
Selecting from the list of users will display that user’s usage broken down by cluster. Like the group summaries, these user summaries can then be broken down monthly by clicking the Show monthly usage link of the Extra Info column of the CPU/GPU Usage (in core/GPU] years) table for the specific Resource row. <br />
<br clear=all><br />
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=144794Allocations and compute scheduling2023-10-02T17:34:51Z<p>Jdesjard: </p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
<i>Parent page: [[Job scheduling policies]]</i><br />
<br />
=Allocations for high-performance computing= <!--T:2--><br />
<br />
<!--T:3--><br />
<b>An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.</b> This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
==From compute allocations to job scheduling== <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as <i>jobs</i> to a <i>scheduler</i>. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups and it is based on a research group’s recent usage of the system as compared to their allocated usage on that system. The past of the allocation period is taken into account but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches their actual usage with their allocated amounts to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, allowing for it to be theoretically possible for all research groups to hit their allocation targets.<br />
<br />
==Consequences of overusing a CPU or GPU allocation== <!--T:32--><br />
<br />
<!--T:33--><br />
If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that succeeding jobs of yours <i>may</i> have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the average of your usage over time should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=Reference GPU Units= <!--T:45--><br />
{{Note|This is a new unit that will be used from RAC 2024.}}<br />
<br />
<!--T:46--><br />
As you may be aware, the performance of GPUs has dramatically increased in the recent years and is expected to do so again with the upcoming next generation of GPUs. Until RAC 2023, in order to reduce complexity, we have been treating all GPUs as equivalent to each other at allocation time and when considering how many resources groups have consumed. This has raised issues of fairness, both in the allocation process and while running jobs. We cannot continue to treat all GPU types as the same.<br />
<br />
<!--T:47--><br />
To overcome the fairness problem, we have defined a <i>reference GPU unit</i> (or <b>RGU</b>) in order to be able to rank all GPU models in production. Because roughly half of our users use primarily single-precision floating-point operations ([https://en.wikipedia.org/wiki/Single-precision_floating-point_format FP32]), the other half use half-precision floating-point operations ([https://en.wikipedia.org/wiki/Half-precision_floating-point_format FP16]), and a significant portion of all users care about the memory on the GPU itself, we set the following evaluation criteria with their corresponding weight:<br />
<br />
<!--T:48--><br />
{| class="wikitable" style="margin: auto;"<br />
|-<br />
! scope="col"| Evaluation Criteria<br />
! scope="col"| Weight <br> (RGU)<br />
|-<br />
! scope="row"| FP32 score<br />
| 40% * 4 = 1.6<br />
|-<br />
! scope="row"| FP16 score<br />
| 40% * 4 = 1.6<br />
|-<br />
! scope="row"| GPU memory score<br />
| 20% * 4 = 0.8<br />
|}<br />
<br />
<!--T:49--><br />
For convenience, weights are based on percentages up-scaled by a factor of 4 <i>reference GPU units</i> (RGUs). Then, by using the <b>A100-40gb</b> as the reference GPU model, we get the following scores for each model:<br />
<br />
<!--T:50--><br />
{| class="wikitable" style="margin: auto; text-align: center;"<br />
|-<br />
|<br />
! scope="col"| FP32 score<br />
! scope="col"| FP16 score<br />
! scope="col"| Memory score<br />
! scope="col"| Weighted Score<br />
|-<br />
! scope="col"| Weight:<br />
! scope="col"| 1.6<br />
! scope="col"| 1.6<br />
! scope="col"| 0.8<br />
| (RGU)<br />
|-<br />
! scope="row" style="text-decoration: underline;"| Model<br />
|-<br />
! scope="row"| P100-12gb<br />
| 0.48<br />
| 0.00<br />
| 0.3<br />
! 1.0<br />
|-<br />
! scope="row"| P100-16gb<br />
| 0.48<br />
| 0.00<br />
| 0.4<br />
! 1.1<br />
|-<br />
! scope="row"| T4-16gb<br />
| 0.42<br />
| 0.21<br />
| 0.4<br />
! 1.3<br />
|-<br />
! scope="row"| V100-16gb*<br />
| 0.81<br />
| 0.40<br />
| 0.4<br />
! 2.2<br />
|-<br />
! scope="row"| V100-32gb*<br />
| 0.81<br />
| 0.40<br />
| 0.8<br />
! 2.6<br />
|-<br />
! scope="row"| A100-40gb<br />
| <b>1.00</b><br />
| <b>1.00</b><br />
| <b>1.0</b><br />
! 4.0<br />
|-<br />
! scope="row"| A100-80gb*<br />
| 1.00<br />
| 1.00<br />
| 2.0<br />
! 4.8<br />
|}<br />
<br />
<!--T:59--><br />
(*) On Graham, these GPU models are available through a very few contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.<br />
<br />
<!--T:51--><br />
As an example, the oldest GPU model in production (P100-12gb) is worth 1.0 RGU. The next few generations of GPUs will be compared to the A100-40gb using the same formula.<br />
<br />
==Choosing GPU models for your project== <!--T:52--><br />
<br />
<!--T:53--><br />
The relative scores in the above table should give you a hint on the models to choose. Here is an example with the extremes:<br />
<br />
<!--T:54--><br />
* If your applications are doing primarily FP32 operations, an A100-40gb GPU is expected to be twice as fast as a P100-12gb GPU, but the recorded usage will be 4 times the resources. Consequently, for an equal amount of RGUs, P100-12gb GPUs should allow you to run double the computations.<br />
* If your applications (typically AI-related) are doing primarily FP16 operations (including mixed precision operations or using other [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format floating-point formats]), using an A100-40gb will result in getting evaluated as using 4x the resources of a P100-12gb, but it is capable of computing ~30x the calculations for the same amount of time, which would allow you to complete ~7.5x the computations.<br />
<br />
==Starting from RAC 2024== <!--T:55--><br />
<br />
<!--T:56--><br />
* During the Resource Allocation Competition 2024 (RAC 2024), any proposal asking for GPUs will require to specify the preferred GPU model for the project. Then, in the CCDB form, the amount of reference GPU units (RGUs) will automatically be calculated from the requested amount of gpu-years per year of project.<br />
** For example, if you select the <i>narval-gpu</i> resource and request 13 gpu-years of the model A100-40gb, the corresponding amount of RGUs would be 13 * 4.0 = 52. The RAC committee would then allocate up to 52 RGUs, depending on the proposal score. In case your allocation must be moved to Cedar, the committee would instead allocate up to 20 gpu-years, because each V100-32gb GPU is worth 2.6 RGUs (and 52 / 2.6 = 20).<br />
<br />
<!--T:57--><br />
* For job scheduling and for usage accounting on CCDB, the use of <i>reference GPU units</i> will take effect on April 1st, 2024, with the implementation of RAC 2024.<br />
<br />
=Detailed effect of resource usage on priority= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on our national clusters is that compute-based jobs are considered in the calculation based on the resources that others are prevented from using and not on the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
<br />
==Cores equivalent used by the scheduler== <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when requests to use resources on the clusters are made entirely of core and memory amounts that can be portioned only into complete equivalent cores. Things become more complicated when jobs request portions of a core equivalent because it is possible to have many points counted against a research group’s allocation, even when they are using only portions of core equivalents. In practice, the method used by the Alliance to account for system usage solves problems about fairness and perceptions of fairness but unfortunately the method is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) will be charged by cores. For example, consider a research group requesting two cores and 2GB per core, for a total of 4 GB of memory. The request requires 2 core equivalents worth of cores but only one bundle for memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core for a total of 10 GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles for cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
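<br />
Both cases reduce to charging the larger of the two ratios. Here is a minimal shell sketch of that rule, assuming the 1 core / 4GB bundle used above (the numbers are those of Figure 3):<br />
<pre><br />
# Core equivalents charged = max(cores requested, memory requested / 4GB)<br />
cores=2; mem_gb=10<br />
awk -v c="$cores" -v m="$mem_gb" 'BEGIN {<br />
    ce = (c > m/4) ? c : m/4<br />
    printf "%.1f core equivalents\n", ce    # prints 2.5<br />
}'<br />
</pre><br />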
<br />
==Reference GPU unit equivalent used by the scheduler== <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follows the same principles as already described for core equivalents, except that a reference GPU unit (RGU) is added to the bundle alongside multiple cores and memory. This means that the accounting for GPU-based allocation targets must include the RGU. Just as a point system was used above to express the concept of core equivalence, we use a similar point system here to express RGU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of RGU-core-memory bundles they use. Assuming a hypothetical bundle of 1 RGU, 3 cores, and 4 GB of memory: <br />
[[File:GPU_equivalent_diagram.png|thumb|upright=1.1|center|Figure 4 - RGU equivalent diagram.]] <br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more RGUs than cores or memory per RGU-core-memory bundle will be charged by RGU. For example, a research group requests 2 P100-12gb GPUs (1 RGU each), 3 cores, and 4 GB of memory. The request is for 2 bundles worth of RGUs, but only one bundle for memory and cores. This job request will be counted as 2 RGU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|thumb|center|Figure 5 - Two RGU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than RGUs or memory per RGU-core-memory bundle will be charged by core. For example, a researcher requests 1 RGU, 5 cores, and 5 GB of memory. The request is for 1.66 bundles worth of cores, but only one bundle for RGUs and 1.25 bundles for memory. This job request will be counted as 1.66 RGU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|thumb|center|Figure 6 - 1.66 RGU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than RGUs or cores per RGU-core-memory bundle will be charged by memory. For example, a researcher requests 1 RGU, 2 cores, and 6 GB of memory. The request is for 1.5 bundles worth of memory, but only one bundle for RGUs and 0.66 bundle for cores. This job request will be counted as 1.5 RGU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|thumb|center|Figure 7 - 1.5 RGU equivalents, based on memory.]] <br clear=all><br />
<br />
<!--T:60--><br />
* On the same hypothetical cluster, a bundle with one V100-32gb GPU, 7.8 CPU cores and 10.4 GB of memory is worth 2.6 RGU equivalents:<br />
[[File:Two.Six_RGU_equivalents.png|thumb|upright=2.1|center|Figure 8 - 2.6 RGU equivalents, based on the V100-32gb GPU.]] <br clear=all><br />
<br />
<!--T:61--><br />
* On the same hypothetical cluster, a bundle with one A100-40gb GPU, 12 CPU cores and 16 GB of memory is worth 4.0 RGU equivalents:<br />
[[File:Four_RGU_equivalents.png|thumb|upright=2.66|center|Figure 9 - 4.0 RGU equivalents, based on the A100-40gb GPU.]] <br clear=all><br />
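<br />
The rule generalizes to three dimensions: the charge is the largest of the three bundle ratios. A minimal shell sketch, assuming the hypothetical 1 RGU / 3 cores / 4 GB bundle used in these examples (the numbers are those of Figure 7):<br />
<pre><br />
# RGU equivalents charged = max(RGUs, cores/3, memory/4)<br />
rgu=1; cores=2; mem_gb=6<br />
awk -v g="$rgu" -v c="$cores" -v m="$mem_gb" 'BEGIN {<br />
    b = g<br />
    if (c/3 > b) b = c/3<br />
    if (m/4 > b) b = m/4<br />
    printf "%.2f RGU equivalents\n", b    # prints 1.50<br />
}'<br />
</pre><br />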
<br />
===Ratios in bundles=== <!--T:29--><br />
Alliance systems have the following RGU-core-memory and GPU-core-memory bundle characteristics:<br />
<br />
<!--T:58--><br />
{| class="wikitable" style="margin: auto; text-align: center;"<br />
|-<br />
! scope="col"| Cluster<br />
! scope="col"| GPU model<br />
! scope="col"| RGU per GPU<br />
! scope="col"| Bundle per RGU<br />
! scope="col"| Bundle per GPU<br />
! scope="col"| Physical ratios<br />
|-<br />
! scope="row"| [[Béluga/en#Node_Characteristics|Béluga]]<br />
| V100-16gb<br />
| 2.2<br />
| 4.5 cores / 21 GB<br />
| 10 cores / 46.5 GB<br />
| 10 cores / 46.5 GB<br />
|-<br />
! rowspan="3"| [[Cedar#Node_characteristics|Cedar]]<br />
| P100-12gb<br />
| 1.0<br />
| rowspan="3"|3.1 cores / 25 GB<br />
| 3.1 cores / 25 GB<br />
| 6 cores / 31.2 GB<br />
|-<br />
| P100-16gb<br />
| 1.1<br />
| 3.4 cores / 27 GB<br />
| 6 cores / 62.5 GB<br />
|-<br />
| V100-32gb<br />
| 2.6<br />
| 8.0 cores / 65 GB<br />
| 8 cores / 46.5 GB<br />
|-<br />
! rowspan="5"| [[Graham#Node_characteristics|Graham]]<br />
| P100-12gb<br />
| 1.0<br />
| rowspan="5"| 9.7 cores / 43 GB<br />
| 9.7 cores / 43 GB<br />
| 16 cores / 62 GB<br />
|-<br />
| T4-16gb<br />
| 1.3<br />
| 12.6 cores / 56 GB<br />
| {4, 11} cores / 46.8 GB<br />
|-<br />
| V100-16gb*<br />
| 2.2<br />
| 21.3 cores / 95 GB<br />
| 3.5 cores / 23.4 GB<br />
|-<br />
| V100-32gb*<br />
| 2.6<br />
| 25.2 cores / 112 GB<br />
| 5 cores / 47.1 GB<br />
|-<br />
| A100-80gb*<br />
| 4.8<br />
| 46.6 cores / 206 GB<br />
| {8, 16} c. / {62, 248} GB<br />
|-<br />
! scope="row"| [[Narval/en#Node_Characteristics|Narval]]<br />
| A100-40gb<br />
| 4.0<br />
| 3.0 cores / 31 GB<br />
| 12 cores / 124.5 GB<br />
| 12 cores / 124.5 GB<br />
|}<br />
<br />
<!--T:62--><br />
(*) These GPU models are available only on a small number of contributed GPU nodes. While all users can use them, they are not allocatable through the RAC process.<br />
<br />
<!--T:63--><br />
<b>Note:</b> While the scheduler computes priority based on usage calculated with the above bundles, users requesting multiple GPUs per node must also take the physical ratios into account.<br />
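<br />
For example, here is a hedged sketch of a two-GPU request on Narval that stays within the physical ratio of 12 cores and 124.5 GB per A100-40gb listed above; the account name is a placeholder, and the <code>a100</code> gres label is assumed to follow the same pattern as the <code>v100</code> and <code>t4</code> labels used elsewhere in this documentation:<br />
<pre><br />
#!/bin/bash<br />
# placeholder account name:<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:a100:2<br />
# 12 cores and 124.5 GB per GPU, per the physical ratios above:<br />
#SBATCH --cpus-per-task=24<br />
#SBATCH --mem=249G<br />
#SBATCH --time=1:00:00<br />
nvidia-smi<br />
</pre><br />
Such a request would be counted as 2 x 4.0 = 8.0 RGU equivalents when priority is calculated.<br />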
<br />
=Viewing group usage of compute resources= <!--T:35--><br />
<br />
<!--T:36--><br />
[[File:Select view group usage edit.png|thumb|Navigation to <i>View Group Usage</i>]]<br />
Information on the usage of compute resources by your groups can be found by logging into the CCDB and navigating to <i>My Account > View Group Usage</i>.<br />
<br clear=all><br />
<br />
<!--T:43--><br />
[[File:ccdb_view_use_by_compute_resource.png|thumb|CPU and GPU usage by compute resource]]<br />
CPU and GPU core year values are calculated based on the quantity of the resources allocated to jobs on the clusters. It is important to note that the values summarized in these pages do not represent core-equivalent measures; in the case of large-memory jobs, the usage values will therefore not match the cluster scheduler’s representation of the account usage.<br />
<br />
<!--T:37--><br />
The first tab bar offers these options:<br />
: <b>By Compute Resource</b>: cluster on which jobs are submitted; <br />
: <b>By Resource Allocation Project</b>: projects to which jobs are submitted;<br />
: <b>By Submitter</b>: user that submits the jobs;<br />
: <b>Storage usage</b> is discussed in [[Storage and file management]]. <br />
<br clear=all><br />
<br />
==Usage by compute resource== <!--T:38--><br />
<br />
<!--T:44--><br />
This view shows the compute resource usage per cluster for groups that you own or of which you are a member, for the current allocation year starting April 1st. The tables contain the total usage to date as well as the projected usage to the end of the current allocation period.<br />
<br clear=all><br />
<br />
<!--T:39--><br />
[[File:Ccdb_view_use_by_compute_resource_monthly.png|thumb|Usage by compute resource with monthly breakdown]]<br />
In the <i>Extra Info</i> column of the usage table, click <i>Show monthly usage</i> to display a further breakdown of the usage by month for that cluster row in the table. Clicking <i>Show submitter usage</i> displays a similar breakdown for the specific users submitting jobs on the cluster.<br />
<br clear=all><br />
<br />
==Usage by resource allocation project== <!--T:40--><br />
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown]]<br />
Under this tab, a third tab bar displays the RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. The tables contain detailed information for each allocation project and the resources used by the projects on all of the clusters. The top of the page summarizes information such as the account name (e.g. def-, rrg- or rpp-*, etc.), the project title and ownership, as well as allocation and usage summaries.<br />
<br clear=all><br />
[[File:Rgu en.png|thumb|alt=GPU usage|GPU usage summary with Reference GPU Unit (RGU) breakdown table.]]<br />
<br />
==Usage by submitter== <!--T:41--><br />
[[File:Ccdb view use by submitter summary edit.png|thumb|CPU and GPU usage by submitter]]<br />
Usage can also be displayed grouped by the users that submitted jobs from within the resource allocation projects (group accounts). The view shows the usage for each user aggregated across systems.<br />
Selecting from the list of users will display that user’s usage broken down by cluster. Like the group summaries, these user summaries can then be broken down monthly by clicking the <i>Show monthly usage</i> link in the <i>Extra Info</i> column of the <i>CPU/GPU Usage (in core/GPU years)</i> table for the specific resource row. <br />
<br clear=all><br />
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Rgu_en.png&diff=144793File:Rgu en.png2023-10-02T17:33:25Z<p>Jdesjard: </p>
<hr />
<div>GPU project usage summary and RGU table. English.</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=104082Graham2021-09-28T18:18:47Z<p>Jdesjard: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-globus'''<br />
|-<br />
| Data transfer node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage (called "NDC-Waterloo" in some documents) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
* By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
On or after the Removal Date we will follow up with the Contact to confirm if the exception is still required.<br />
<br />
<!--T:41--><br />
* Crontab is not offered on Graham. <br />
* Each job on Graham should have a duration of at least one hour (five minutes for test jobs).<br />
* A user cannot have more than 1000 jobs, running and queued, at any given moment. An array job is counted as the number of tasks in the array.<br />
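<br />
To check how close you are to this limit, something like the following should work from a login node; <code>-r</code> expands job arrays so that each task is counted, matching the policy:<br />
<pre><br />
squeue -r -h -u "$USER" | wc -l    # -h drops the header so only jobs are counted<br />
</pre><br />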
<br />
=Storage= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space'''<br />64TB total volume ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />16PB total volume<br />External persistent storage<br />
||<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands the Graham system provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.<br />
<br />
=Node characteristics= <!--T:5--><br />
A total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types; note that Turbo Boost is enabled on all Graham nodes.<br />
<br />
<!--T:55--><br />
{| class="wikitable sortable"<br />
! nodes !! cores !! available memory !! CPU !! storage !! GPU<br />
|-<br />
| 903 || 32 || 125G or 128000M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 24 || 32 || 502G or 514500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 56 || 32 || 250G or 256500M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 3 || 64 || 3022G or 3095000M || 4 x Intel E7-4850 v4 Broadwell @ 2.1GHz || 960GB SATA SSD || -<br />
|-<br />
| 160 || 32 || 124G or 127518M || 2 x Intel E5-2683 v4 Broadwell @ 2.1GHz || 1.6TB NVMe SSD || 2 x NVIDIA P100 Pascal (12GB HBM2 memory)<br />
|-<br />
| 7 || 28 || 178G or 183105M || 2 x Intel Xeon Gold 5120 Skylake @ 2.2GHz || 4.0TB NVMe SSD || 8 x NVIDIA V100 Volta (16GB HBM2 memory). <br />
Note that one node is only populated with 6 GPUs.<br />
|-<br />
| 2 || 40 || 377G or 386048M || 2 x Intel Xeon Gold 6248 Cascade Lake @ 2.5GHz || 5.0TB NVMe SSD || 8 x NVIDIA V100 Volta (32GB HBM2 memory),NVLINK<br />
|-<br />
| 6 || 16 || 192G or 196608M || 2 x Intel Xeon Silver 4110 Skylake @ 2.10GHz || 11.0TB SATA SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 30 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 5.8TB NVMe SSD || 4 x NVIDIA T4 Turing (16GB GDDR6 memory)<br />
|-<br />
| 136 || 44 || 192G or 196608M || 2 x Intel Xeon Gold 6238 Cascade Lake @ 2.10GHz || 879GB SATA SSD || -<br />
|}<br />
<br />
<!--T:64--><br />
Most applications will run on either Broadwell, Skylake, or Cascade Lake nodes, and performance differences are expected to be small compared to job waiting times. We therefore recommend that you do not select a specific node type for your jobs. If it is necessary, only two constraints are available for CPU jobs: use either <code>--constraint=broadwell</code> or <code>--constraint=cascade</code>. See [[Running_jobs#Cluster_particularities|how to specify the CPU architecture]].<br />
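<br />
If you do need to pin a CPU job to one architecture, here is a minimal sketch; the account and application names are placeholders:<br />
<pre><br />
#!/bin/bash<br />
# placeholder account name:<br />
#SBATCH --account=def-someuser<br />
# or: --constraint=cascade<br />
#SBATCH --constraint=broadwell<br />
#SBATCH --cpus-per-task=4<br />
#SBATCH --mem=16G<br />
#SBATCH --time=1:00:00<br />
# placeholder application:<br />
./my_program<br />
</pre><br />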
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate a job whose memory requirements exceed the specified amount of available memory. Please also note that the memory allocated to a job must be sufficient for the IO buffering performed by the kernel and filesystem; this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
= GPUs on Graham = <!--T:56--><br />
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.<br />
* P100 Pascal GPUs<br />
* V100 Volta GPUs (including 2 nodes with NVLINK interconnect)<br />
* T4 Turing GPUs<br />
<br />
<!--T:57--><br />
The P100 is NVIDIA's all-purpose high-performance card. The V100 is its successor, with about double the performance for standard computation and about 8x the performance for deep learning computations that can use its tensor core units. The T4 is the most recent card; it targets deep learning workloads specifically, and while it does not support efficient double-precision computation, it offers good single-precision performance, tensor cores, and support for reduced-precision integer calculations.<br />
<br />
== Pascal GPU nodes on Graham == <!--T:58--><br />
<br />
<!--T:59--><br />
These are Graham's default GPU cards. Job submission for these cards is described on the [[Using GPUs with Slurm]] page. When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it may be assigned any type of available GPU; if you require a specific type of GPU, please request it explicitly. As all Pascal nodes have only 2 P100 GPUs, configuring jobs for these cards is relatively simple.<br />
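<br />
For instance, here is a minimal sketch of a job that requests a P100 explicitly, sized to the 16 cores and roughly 62 GB available per GPU on these nodes; the account name is a placeholder, and the <code>p100</code> gres label is assumed to follow the same pattern as the <code>v100</code> and <code>t4</code> labels used below:<br />
{{File<br />
|name=gpu_p100_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:p100:1<br />
#SBATCH --cpus-per-task=16<br />
#SBATCH --mem=60G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}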
<br />
==Volta GPU nodes on Graham== <!--T:46--><br />
Graham has a total of 9 Volta nodes.<br />
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is only populated with 6 GPUs, three per socket). The other 2 have high bandwidth NVLINK interconnect.<br />
<br />
<!--T:50--><br />
'''The nodes are available to all users with a maximum 7 days job runtime limit.''' <br />
<br />
<!--T:51--><br />
Below are example job scripts for these nodes; the full-node example requests all 8 GPUs and 28 cores, following the cores-per-GPU limit described next, with a memory request you should adjust to your needs. The module load command will ensure that modules compiled for the Skylake architecture are used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less on 28-core nodes. For example, if you want to run a job using 4 GPUs, you should request '''at most 14 CPU cores'''; for a job with 1 GPU, you should request '''at most 3 CPU cores'''. Users are allowed to run a few short test jobs (shorter than 1 hour) that break this rule to see how their code performs.<br />
<br />
<!--T:65--><br />
The two newest Volta nodes have 40 cores so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits for situations where memory bandwidth between GPUs is the bottleneck. To use one of these NVLINK nodes, it should be requested directly, by adding the option '''--nodelist=gra1337''' or '''--nodelist=gra1338''' to the job submission script.<br />
<br />
<!--T:53--><br />
Single-GPU example:<br />
{{File<br />
|name=gpu_single_GPU_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:v100:1<br />
#SBATCH --cpus-per-task=3<br />
#SBATCH --mem=12G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
Full-node example:<br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=1-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}
<br />
<!--T:54--><br />
The Volta nodes have a fast local disk, which should be used for jobs if the amount of I/O performed by your job is significant. Inside the job, the location of the temporary directory on fast local disk is specified by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script before you run your program and your output files out at the end of your job script. All the files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].<br />
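<br />
A minimal sketch of this staging pattern inside a job script; the file and directory names are hypothetical:<br />
<pre><br />
cp ~/scratch/input.dat "$SLURM_TMPDIR/"   # stage input to the fast local disk<br />
cd "$SLURM_TMPDIR"<br />
./my_program input.dat > output.log       # placeholder application<br />
cp output.log ~/scratch/results/          # copy results back before the job ends<br />
</pre><br />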
<br />
==Turing GPU nodes on Graham== <!--T:60--><br />
<br />
<!--T:61--><br />
The usage of these nodes is similar to using the Volta nodes, except when requesting them, you should specify: <br />
<br />
<!--T:62--><br />
--gres=gpu:t4:2<br />
<br />
<!--T:63--><br />
In this example 2 T4 cards per node are requested.<br />
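<br />
A hedged sketch of a complete script for two T4 cards, sized to half of a 44-core Turing node; the account name is a placeholder:<br />
{{File<br />
|name=gpu_t4_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:t4:2<br />
#SBATCH --cpus-per-task=22<br />
#SBATCH --mem=90G<br />
#SBATCH --time=1-00:00<br />
nvidia-smi<br />
}}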
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=103913Allocations and compute scheduling2021-09-21T20:00:35Z<p>Jdesjard: </p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
''Parent page: [[Job scheduling policies]]''<br />
<br />
= What is an allocation? = <!--T:2--><br />
<br />
<!--T:3--><br />
'''An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.''' This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
== Viewing group usage of compute resources == <!--T:35--><br />
<br />
<!--T:36--><br />
[[File:Select view group usage edit.png|thumb|Navigation to ''View Group Usage'']]<br />
Information on the usage of compute resources by your groups can be found by logging into the CCDB and navigating to ''My Account > View Group Usage''.<br />
<br clear=all><br />
<br />
<!--T:43--><br />
[[File:ccdb_view_use_by_compute_resource.png|thumb|CPU and GPU usage by compute resource]]<br />
CPU and GPU core year values are calculated based on the quantity of the resources allocated to jobs on the clusters. It is important to note that the values summarized in these pages do not represent core-equivalent measures; in the case of large-memory jobs, the usage values will therefore not match the cluster scheduler’s representation of the account usage.<br />
<br />
<!--T:37--><br />
The first tab bar offers these options:<br />
: '''By Compute Resource''': cluster on which jobs are submitted; <br />
: '''By Resource Allocation Project''': projects to which jobs are submitted;<br />
: '''By Submitter''': user that submits the jobs;<br />
: '''Storage usage''' is discussed in [[Storage and file management]]. <br />
<br />
=== Usage by compute resource=== <!--T:38--><br />
<br />
<!--T:44--><br />
This view shows the compute resource usage per cluster for groups that you own or of which you are a member, for the current allocation year starting April 1st. The tables contain the total usage to date as well as the projected usage to the end of the current allocation period.<br />
<br clear=all><br />
<br />
<br />
<!--T:39--><br />
[[File:ccdb_view_use_by_compute_resource_monthly.png|thumb|Usage by compute resource with monthly breakdown]]<br />
In the ''Extra Info'' column of the usage table, click ''Show monthly usage'' to display a further breakdown of the usage by month for that cluster row in the table. Clicking ''Show submitter usage'' displays a similar breakdown for the specific users submitting jobs on the cluster.<br />
<br clear=all><br />
<br />
<br />
===Usage by resource allocation project=== <!--T:40--><br />
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown]]<br />
Under this tab, a third tab bar displays the RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. The tables contain detailed information for each allocation project and the resources used by the projects on all of the clusters. The top of the page summarizes information such as the account name (e.g. def-, rrg- or rpp-*, etc.), the project title and ownership, as well as allocation and usage summaries.<br />
<br clear=all><br />
<br />
===Usage by submitter=== <!--T:41--><br />
[[File:Ccdb view use by submitter summary edit.png|thumb|CPU and GPU usage by submitter]]<br />
Usage can also be displayed grouped by the users that submitted jobs from within the resource allocation projects (group accounts). The view shows the usage for each user aggregated across systems.<br />
Selecting from the list of users will display that user’s usage broken down by cluster. Like the group summaries, these user summaries can then be broken down monthly by clicking the Show monthly usage link of the Extra Info column of the CPU/GPU Usage (in core/GPU years) table for the specific Resource row. <br />
<br clear=all><br />
<br />
== What happens if my group overuses my CPU or GPU allocation? == <!--T:32--><br />
<br />
<!--T:33--><br />
Nothing bad. Your CPU or GPU allocation is a target level, i.e., a target number of CPUs or GPUs. If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that succeeding jobs of yours ''may'' have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the time-average of your usage should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=How does scheduling work?= <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as “jobs” to a “scheduler”. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups and it is based on a research group’s recent usage of the system as compared to their allocated usage on that system. The past of the allocation period is taken into account but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches their actual usage with their allocated amounts to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, allowing for it to be theoretically possible for all research groups to hit their allocation targets.<br />
<br />
=How does resource use affect priority?= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on Compute Canada's new national clusters is that compute-based jobs are considered in the calculation based on the resources that others are prevented from using and not on the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
<br />
<!--T:14--><br />
The details of how resources are accounted for require a sound understanding of the core equivalent concept, which is discussed below.<ref>Further details about how priority is calculated are beyond the scope of this document. Additional documentation is in preparation. We also suggest that a [https://www.westgrid.ca/events/scheduling_job_management_how_get_most_cluster training course] might be valuable for anyone wishing to know more.</ref><br />
<br />
=What is a core equivalent and how is it used by the scheduler?= <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when requests to use resources on the clusters are made entirely of core and memory amounts that can be portioned only into complete equivalent cores. Things become more complicated when jobs request portions of a core equivalent because it is possible to have many points counted against a research group’s allocation, even when they are using only portions of core equivalents. In practice, the method used by Compute Canada to account for system usage solves problems about fairness and perceptions of fairness but unfortunately the method is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) will be charged by cores. For example, consider a research group requesting two cores and 2GB per core, for a total of 4 GB of memory. The request requires 2 core equivalents worth of cores but only one bundle for memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core for a total of 10 GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles for cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
<br />
=What is a GPU equivalent and how is it used by the scheduler?= <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follows the same principles as already described for core equivalents. The complication is that it is important to separate allocation targets for GPU-based research from allocation targets for non-GPU-based research to ensure that we can meet the allocation targets in each case. If these cases were not separated, then it would be possible for a non-GPU-based researcher to use their allocation targets in the GPU-based research pool, adding load that would effectively block GPU-based researchers from meeting their allocation targets and vice versa.<br />
<br />
<!--T:24--><br />
Given this separation, a distinction must be made between core equivalents and GPU equivalents. Core equivalents are as described above. The GPU-core-memory bundles that make up a GPU equivalent are similar to core-memory bundles except that a GPU is added to the bundle alongside multiple cores and memory. This means that accounting for GPU-based allocation targets must include the GPU. Similar to how the points system was used above when considering resource use as an expression of the concept of core equivalence, we will use a similar point system here as an expression of GPU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of GPU-core-memory bundles they use. Assuming a core-memory bundle of 1 GPU, 6 cores, and 32GB of memory: <br />
[[File:GPU_equivalent_diagram.png|frame|center|Figure 4 - GPU equivalent diagram.]]<br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more GPUs than cores or memory per GPU-core-memory bundle will be charged by GPU. For example, a research group requests 2 GPUs, 6 cores, and 32GB of memory. The request is for 2 GPU-core-memory bundles worth of GPUs but only one bundle for memory and cores. This job request will be counted as 2 GPU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|frame|center|Figure 5 - Two GPU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than GPUs or memory per GPU-core-memory bundle will be charged by core. For example, a researcher requests 1 GPU, 9 cores, and 32GB of memory. The request is for 1.5 GPU-core-memory bundles worth of cores, but only one bundle for GPUs and memory. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|frame|center|Figure 6 - 1.5 GPU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than GPUs or cores per GPU-core-memory bundle will be charged by memory. For example, a researcher requests 1 GPU, 6 cores, and 48GB of memory. The request is for 1.5 GPU-core-memory bundles worth of memory but only one bundle for GPUs and cores. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|frame|center|Figure 7 - 1.5 GPU equivalents, based on memory.]] <br clear=all><br />
<br />
== Ratios: GPU / CPU Cores / System-memory == <!--T:29--><br />
Compute Canada systems have the following GPU-core-memory bundle characteristics:<br />
* [[Béluga/en#Node_Characteristics|Béluga]]:<br />
** V100/16GB nodes: 1 GPU / 10 cores / 47000 MB<br />
* [[Cedar#Node_characteristics|Cedar]]:<br />
** P100/12GB nodes: 1 GPU / 6 cores / 32000 MB<br />
** P100/16GB nodes: 1 GPU / 6 cores / 64000 MB<br />
** V100/32GB nodes: 1 GPU / 8 cores / 48000 MB<br />
* [[Graham#Node_characteristics|Graham]]:<br />
** P100/12GB nodes: 1 GPU / 16 cores / 64000 MB<br />
** V100/16GB nodes: 1 GPU / 3.5 cores / 22500 MB<br />
** V100/32GB nodes: 1 GPU / 5 cores / 48000 MB<br />
** T4/16GB nodes: 1 GPU / {4,11} cores / 49000 MB<br />
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=103912Allocations and compute scheduling2021-09-21T20:00:01Z<p>Jdesjard: </p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
''Parent page: [[Job scheduling policies]]''<br />
<br />
= What is an allocation? = <!--T:2--><br />
<br />
<!--T:3--><br />
'''An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.''' This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
== Viewing group usage of compute resources == <!--T:35--><br />
<br />
<!--T:36--><br />
[[File:Select view group usage edit.png|thumb|Navigation to ''View Group Usage'']]<br />
Information on the usage of compute resources by your groups can be found by logging into the CCDB and navigating to ''My Account > View Group Usage''.<br />
<br clear=all><br />
<br />
<!--T:43--><br />
[[File:ccdb_view_use_by_compute_resource.png|thumb|CPU and GPU usage by compute resource]]<br />
CPU and GPU core year values are calculated based on the quantity of the resources allocated to jobs on the clusters. It is important to note that the values summarized in these pages do not represent core-equivalent measures such that, in the case of large memory jobs, the usage values will not match the cluster scheduler’s representation of the account usage.<br />
<br />
<!--T:37--><br />
The first tag bar offers these options:<br />
: '''By Compute Resource''': cluster on which jobs are submitted; <br />
: '''By Resource Allocation Project''': projects to which jobs are submitted;<br />
: '''By Submitter''': user that submits the jobs;<br />
: '''Storage usage''' is discussed in [[Storage and file management]]. <br />
<br />
=== Usage by compute resource=== <!--T:38--><br />
<br />
<!--T:44--><br />
This view shows the usage of compute resources per cluster used by groups owned by you or of which you are a member for the current allocation year starting April 1st. The tables contain the total usage to date as well as the projected usage to the end of the current allocation period.<br />
<br clear=all><br />
<br />
<br />
<!--T:39--><br />
[[File:ccdb_view_use_by_compute_resource_monthly.png|thumb|Usage by compute resource with monthly breakdown]]<br />
From the ''Extra Info'' column of the usage table ''Show monthly usage'' can be clicked to display a further breakdown of the usage by month for the specific cluster row in the table. By clicking ''Show submitter usage'', a similar breakdown is displayed for the specific users submitting the jobs on the cluster.<br />
<br clear=all><br />
<br />
<br />
===Usage by resource allocation project=== <!--T:40--><br />
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown]]<br />
Under this tab, a third tag bar displays the RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. The tables contain detailed information for each allocation project and the resources used by the projects on all of the clusters. The top of the page summarizes information such as the account name (e.g. def-, rrg- or rpp-*, etc), the project title and ownership, as well as allocation and usage summaries.<br />
<br clear=all><br />
<br />
===Usage by submitter=== <!--T:41--><br />
Usage can also be displayed grouped by the users that submitted jobs from within the resource allocation projects (group accounts). The view shows the usage for each user aggregated across systems.<br />
[[File:Ccdb view use by submitter summary edit.png|thumb|CPU and GPU usage by submitter]]<br />
Selecting from the list of users will display that user’s usage broken down by cluster. Like the group summaries, these user summaries can then be broken down monthly by clicking the Show monthly usage link of the Extra Info column of the CPU/GPU Usage (in core/GPU] years) table for the specific Resource row. <br />
<br clear=all><br />
<br />
== What happens if my group overuses my CPU or GPU allocation? == <!--T:32--><br />
<br />
<!--T:33--><br />
Nothing bad. Your CPU or GPU allocation is a target level, i.e., a target number of CPUs or GPUs. If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that succeeding jobs of yours ''may'' have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the time-average of your usage should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=How does scheduling work?= <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as “jobs” to a “scheduler”. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups and it is based on a research group’s recent usage of the system as compared to their allocated usage on that system. The past of the allocation period is taken into account but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches their actual usage with their allocated amounts to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, allowing for it to be theoretically possible for all research groups to hit their allocation targets.<br />
<br />
=How does resource use affect priority?= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on Compute Canada's new national clusters is that compute-based jobs are considered in the calculation based on the resources that others are prevented from using and not on the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
<br />
<!--T:14--><br />
The details of how resources are accounted for require a sound understanding of the core equivalent concept, which is discussed below.<ref>Further details about how priority is calculated are beyond the scope of this document. Additional documentation is in preparation. We also suggest that a [https://www.westgrid.ca/events/scheduling_job_management_how_get_most_cluster training course] might be valuable for anyone wishing to know more.</ref><br />
<br />
=What is a core equivalent and how is it used by the scheduler?= <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when requests to use resources on the clusters are made entirely of core and memory amounts that can be portioned only into complete equivalent cores. Things become more complicated when jobs request portions of a core equivalent because it is possible to have many points counted against a research group’s allocation, even when they are using only portions of core equivalents. In practice, the method used by Compute Canada to account for system usage solves problems about fairness and perceptions of fairness but unfortunately the method is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio), will be charged by cores. For example, a research group requesting two cores and 2GB per core for a total of 4 GB of memory. The request requires 2 core equivalents worth of cores but only one bundle for memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core for a total of 10 GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles for cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
<br />
=What is a GPU equivalent and how is it used by the scheduler?= <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follow the same principles as already described for core equivalents. The complication is that it is important to separate allocation targets for GPU-based research from allocation targets for non-GPU-based research to ensure that we can meet the allocation targets in each case. If these cases were not separated, then it would be possible for a non-GPU-based researcher to use their allocation targets in the GPU-based research pool, adding load that would effectively block GPU-based researchers from meeting their allocation targets and vice versa.<br />
<br />
<!--T:24--><br />
Given this separation, a distinction must be made between core equivalents and GPU equivalents. Core equivalents are as described above. The GPU-core-memory bundles that make up a GPU equivalent are similar to core-memory bundles except that a GPU is added to the bundle alongside multiple cores and memory. This means that accounting for GPU-based allocation targets must include the GPU. Similar to how the points system was used above when considering resource use as an expression of the concept of core equivalence, we will use a similar point system here as an expression of GPU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of GPU-core-memory bundles they use. Assuming a core-memory bundle of 1 GPU, 6 cores, and 32GB of memory: <br />
[[File:GPU_equivalent_diagram.png|frame|center|Figure 4 - GPU equivalent diagram.]]<br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more GPUs than cores or memory per GPU-core-memory bundle will be charged by GPU. For example, a research group requests 2 GPUs, 6 cores, and 32GB of memory. The request is for 2 GPU-core-memory bundles worth of GPUs but only one bundle for memory and cores. This job request will be counted as 2 GPU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|frame|center|Figure 5 - Two GPU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than GPUs or memory per GPU-core-memory bundle will be charged by core. For example, a researcher requests 1 GPU, 9 cores, and 32GB of memory. The request is for 1.5 GPU-core-memory bundles worth of cores, but only one bundle for GPUs and memory. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|frame|center|Figure 6 - 1.5 GPU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than GPUs or cores per GPU-core-memory bundle will be charged by memory. For example, a researcher requests 1 GPU, 6 cores, and 48GB of memory. The request is for 1.5 GPU-core-memory bundles worth of memory but only one bundle for GPUs and cores. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|frame|center|Figure 7 - 1.5 GPU equivalents, based on memory.]] <br clear=all><br />
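As promised above, here is the core-equivalent sketch extended to GPU bundles. It assumes the illustrative 1 GPU / 6 cores / 32GB bundle used in the figures; the real bundle sizes vary by cluster, as listed in the next section.<br />
<syntaxhighlight lang="python">
def gpu_equivalents(gpus, cores, mem_gb, cores_per_gpu=6, mem_per_gpu_gb=32.0):
    """GPU equivalents charged: the largest of the three resource
    dimensions, each expressed as a number of GPU-core-memory bundles."""
    return max(gpus, cores / cores_per_gpu, mem_gb / mem_per_gpu_gb)

assert gpu_equivalents(2, 6, 32) == 2    # Figure 5: charged by GPUs
assert gpu_equivalents(1, 9, 32) == 1.5  # Figure 6: charged by cores
assert gpu_equivalents(1, 6, 48) == 1.5  # Figure 7: charged by memory
</syntaxhighlight>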
<br />
== Ratios: GPU / CPU cores / system memory == <!--T:29--><br />
Compute Canada systems have the following GPU-core-memory bundle characteristics:<br />
* [[Béluga/en#Node_Characteristics|Béluga]]:<br />
** V100/16GB nodes: 1 GPU / 10 cores / 47000 MB<br />
* [[Cedar#Node_characteristics|Cedar]]:<br />
** P100/12GB nodes: 1 GPU / 6 cores / 32000 MB<br />
** P100/16GB nodes: 1 GPU / 6 cores / 64000 MB<br />
** V100/32GB nodes: 1 GPU / 8 cores / 48000 MB<br />
* [[Graham#Node_characteristics|Graham]]:<br />
** P100/12GB nodes: 1 GPU / 16 cores / 64000 MB<br />
** V100/16GB nodes: 1 GPU / 3.5 cores / 22500 MB<br />
** V100/32GB nodes: 1 GPU / 5 cores / 48000 MB<br />
** T4/16GB nodes: 1 GPU / {4,11} cores / 49000 MB<br />
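These ratios can be captured as data and combined with the sketch from the previous section to estimate GPU-equivalent charges per node type. This is a hypothetical representation for illustration only; for the T4 nodes, the smaller of the two listed core counts is used.<br />
<syntaxhighlight lang="python">
# (cores per GPU, memory in MB per GPU) for each bundle listed above
BUNDLES = {
    ("beluga", "v100-16gb"): (10, 47000),
    ("cedar", "p100-12gb"): (6, 32000),
    ("cedar", "p100-16gb"): (6, 64000),
    ("cedar", "v100-32gb"): (8, 48000),
    ("graham", "p100-12gb"): (16, 64000),
    ("graham", "v100-16gb"): (3.5, 22500),
    ("graham", "v100-32gb"): (5, 48000),
    ("graham", "t4-16gb"): (4, 49000),  # some T4 nodes provide 11 cores per GPU
}

def gpu_equivalents_on(cluster, node_type, gpus, cores, mem_mb):
    """GPU equivalents charged for a request on a given node type."""
    cores_per_gpu, mem_per_gpu_mb = BUNDLES[(cluster, node_type)]
    return max(gpus, cores / cores_per_gpu, mem_mb / mem_per_gpu_mb)

# 1 GPU, 20 cores and 47000 MB on a Béluga V100 node counts as 2 GPU equivalents:
print(gpu_equivalents_on("beluga", "v100-16gb", 1, 20, 47000))  # 2.0
</syntaxhighlight>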
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_submitter_user_fr_edit.png&diff=103691File:Ccdb view use by submitter user fr edit.png2021-09-20T20:35:49Z<p>Jdesjard: </p>
<hr />
<div>by submitter user</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_submitter_user_edit.png&diff=103690File:Ccdb view use by submitter user edit.png2021-09-20T20:34:51Z<p>Jdesjard: </p>
<hr />
<div>by submitter user</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_submitter_summary_fr_edit.png&diff=103689File:Ccdb view use by submitter summary fr edit.png2021-09-20T20:34:08Z<p>Jdesjard: </p>
<hr />
<div>by submitter summary</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_submitter_summary_edit.png&diff=103688File:Ccdb view use by submitter summary edit.png2021-09-20T20:33:26Z<p>Jdesjard: </p>
<hr />
<div>by submitter summary</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource_monthly_proj_fr_edit.png&diff=103687File:Ccdb view use by compute resource monthly proj fr edit.png2021-09-20T20:31:37Z<p>Jdesjard: </p>
<hr />
<div>Usage by resource allocation project with monthly breakdown</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource_monthly_fr.png&diff=103686File:Ccdb view use by compute resource monthly fr.png2021-09-20T20:30:06Z<p>Jdesjard: </p>
<hr />
<div>Usage by compute resource with monthly breakdown</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource.png&diff=103685File:Ccdb view use by compute resource.png2021-09-20T20:28:37Z<p>Jdesjard: Jdesjard uploaded a new version of File:Ccdb view use by compute resource.png</p>
<hr />
<div>Usage view By Compute Resource</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource_fr.png&diff=103684File:Ccdb view use by compute resource fr.png2021-09-20T20:26:23Z<p>Jdesjard: </p>
<hr />
<div>View of group usage by compute resource</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Select_view_group_usage_fr_edit.png&diff=103683File:Select view group usage fr edit.png2021-09-20T20:24:22Z<p>Jdesjard: </p>
<hr />
<div>Navigation to "View Group Usage" at the Compute Canada portal.</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=103322Allocations and compute scheduling2021-09-10T14:01:09Z<p>Jdesjard: Marked this version for translation</p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
''Parent page: [[Job scheduling policies]]''<br />
<br />
= What is an allocation? = <!--T:2--><br />
<br />
<!--T:3--><br />
'''An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.''' This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
== Exploring group usage summaries at the Compute Canada portal == <!--T:35--><br />
<br />
<!--T:36--><br />
[[File:Select view group usage edit.png|thumb|Navigation to "View Group Usage" at the Compute Canada portal. ]]<br />
Summaries of account usage can be explored by logging into ccdb.computecanada.ca and navigating to ‘My Account’ > ‘View Group Usage’. CPU and GPU usage values (in core years and GPU years) are calculated from the quantity of resources allocated to jobs on the clusters. Note that the values summarized on the View Group Usage pages are not core-equivalent measures, so for large-memory jobs the usage shown will not match the cluster scheduler’s representation of the account usage. <br clear=all><br />
<br />
<br />
<!--T:37--><br />
[[File:Ccdb view use by compute resource.png|thumb|View of group usage by compute resource.]]<br />
The initial view on the View Group Usage page is divided up “By Compute Resources” (specific compute clusters, e.g. Béluga, Cedar), summarizing the usage of all groups that the logged-in user owns or belongs to. The initial view covers the current allocation year (starting on April 1st) up to the current date. <br clear=all><br />
<br />
<br />
<!--T:38--><br />
[[File:Ccdb view use by compute resource monthly.png|thumb|Usage by compute resource with monthly breakdown.]]<br />
Using the top selection bar of the View Group Usage page, compute usage summaries can be grouped “By Compute Resources” (specific compute clusters, e.g. Béluga, Cedar), “By Resource Allocation Project” (the group account that jobs are submitted to, e.g. def-*, rrg-*), or “By Submitter” (the specific users that submitted the jobs). The top selection bar also provides an option to display “Storage Usage”.<br />
<br />
<!--T:39--><br />
The table summaries for the current year provide a row for each compute resource (cluster), showing the “Total CPU Usage (in core years)” to date as well as the “Projected CPU Usage (in core years)”, which assumes that the throughput achieved to date will be maintained for the remainder of the allocation period (a small illustration of this extrapolation follows).<br />
<br />
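The projection is a straight extrapolation: usage to date divided by the fraction of the allocation year that has elapsed. A minimal sketch follows; the function is hypothetical and not part of CCDB.<br />
<syntaxhighlight lang="python">
from datetime import date

def projected_core_years(used_core_years, today,
                         year_start=date(2021, 4, 1), year_end=date(2022, 4, 1)):
    """Extrapolate year-end usage assuming the throughput achieved so far
    is maintained for the remainder of the allocation period."""
    elapsed_fraction = (today - year_start).days / (year_end - year_start).days
    return used_core_years / elapsed_fraction

# 25 core years used by October 1 (about half the year) projects to roughly 50:
print(round(projected_core_years(25.0, date(2021, 10, 1)), 1))  # 49.9
</syntaxhighlight>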
<!--T:40--><br />
The “Year:” selection bar in the top panel of the View Group Usage page provides archival summaries of “Total CPU Usage (in core years)” for previous allocation years; for these, the “Projected CPU Usage” column is omitted.<br />
<br />
<!--T:41--><br />
The “Extra Info” columns of the summary tables provide further breakdowns of usage by month or by submitter. <br clear=all><br />
<br />
<!--T:42--><br />
[[File:Ccdb view use by compute resource monthly proj edit.png|thumb|Usage by Resource Allocation Project with monthly breakdown.]]<br />
The scheduling software implements the fair-share usage policy across all Resource Allocation Projects (group accounts) on each cluster. Selecting “By Resource Allocation Project” displays summaries of specific project allocations and their usage on the clusters. A new selection bar appears at the top of the View Group Usage page listing the available RAPIs (Resource Allocation Project Identifiers) for the selected allocation year. Selecting a specific RAPI displays summary fields including the account name (e.g. def-*, rrg-* or rpp-*), the project name and ownership, as well as allocation and usage summaries. A single Resource Allocation Project may have allocations and usage on several clusters; this view divides the project’s allocations and usage across the clusters. The summary for the current allocation year provides columns for the “Resource” containing the allocation, “Allocations” details, “Total Core-Years Allocated”, “Total CPU Usage (in core years)”, “Utilization” (the percentage of the allocation used), “Projected CPU Usage (in core years)”, and “Projected Utilization” (the projected percentage of the allocation used). <br clear=all><br />
<br />
== What happens if my group overuses my CPU or GPU allocation? == <!--T:32--><br />
<br />
<!--T:33--><br />
Nothing bad. Your CPU or GPU allocation is a target level, i.e., a target number of CPUs or GPUs. If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence is that subsequent jobs of yours ''may'' have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the time-average of your usage should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=How does scheduling work?= <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as “jobs” to a “scheduler”. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups and it is based on a research group’s recent usage of the system as compared to their allocated usage on that system. The past of the allocation period is taken into account but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches their actual usage with their allocated amounts to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, allowing for it to be theoretically possible for all research groups to hit their allocation targets.<br />
<br />
=How does resource use affect priority?= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on Compute Canada's national clusters is that jobs are charged for the resources that others are prevented from using, not for the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
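Expressing the rule above as a formula, with memory measured in the per-core amount for the cluster:<br />
<math>\text{cores charged} = \max\left(\text{cores requested},\ \frac{\text{memory requested}}{\text{memory per core}}\right)</math><br />
For the example above, a request for 1 core and 8GB on a 4GB-per-core cluster is charged <math>\max(1, 8/4) = 2</math> cores.<br />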
<br />
<!--T:14--><br />
The details of how resources are accounted for require a sound understanding of the core equivalent concept, which is discussed below.<ref>Further details about how priority is calculated are beyond the scope of this document. Additional documentation is in preparation. We also suggest that a [https://www.westgrid.ca/events/scheduling_job_management_how_get_most_cluster training course] might be valuable for anyone wishing to know more.</ref><br />
<br />
=What is a core equivalent and how is it used by the scheduler?= <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when resource requests can be partitioned into complete core equivalents. Things become more complicated when jobs request only portions of a core equivalent, because a request can then be charged more core equivalents than either of its parts would suggest on its own. In practice, the method Compute Canada uses to account for system usage addresses fairness and perceptions of fairness, but it is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) will be charged by cores. For example, a research group requests two cores and 2GB per core, for a total of 4GB of memory. The request requires 2 core equivalents worth of cores but only one bundle for memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core, for a total of 10GB of memory. The request requires 2.5 core equivalents worth of memory, but only two bundles for cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
<br />
=What is a GPU equivalent and how is it used by the scheduler?= <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follows the same principles as already described for core equivalents. The complication is that allocation targets for GPU-based research must be kept separate from allocation targets for non-GPU-based research, to ensure that the targets can be met in each case. If the two were not separated, a non-GPU-based research group could spend its allocation in the GPU pool, adding load that would effectively block GPU-based researchers from meeting their allocation targets, and vice versa.<br />
<br />
<!--T:24--><br />
Given this separation, a distinction must be made between core equivalents and GPU equivalents. Core equivalents are as described above. The GPU-core-memory bundles that make up a GPU equivalent are like core-memory bundles except that a GPU is added alongside multiple cores and memory, so accounting for GPU-based allocation targets must include the GPU. Just as a point system was used above to express resource use in terms of core equivalence, a similar point system is used here to express GPU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of GPU-core-memory bundles they use. Assuming a GPU-core-memory bundle of 1 GPU, 6 cores, and 32GB of memory: <br />
[[File:GPU_equivalent_diagram.png|frame|center|Figure 4 - GPU equivalent diagram.]]<br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more GPUs than cores or memory per GPU-core-memory bundle will be charged by GPU. For example, a research group requests 2 GPUs, 6 cores, and 32GB of memory. The request is for 2 GPU-core-memory bundles worth of GPUs but only one bundle for memory and cores. This job request will be counted as 2 GPU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|frame|center|Figure 5 - Two GPU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than GPUs or memory per GPU-core-memory bundle will be charged by core. For example, a researcher requests 1 GPU, 9 cores, and 32GB of memory. The request is for 1.5 GPU-core-memory bundles worth of cores, but only one bundle for GPUs and memory. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|frame|center|Figure 6 - 1.5 GPU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than GPUs or cores per GPU-core-memory bundle will be charged by memory. For example, a researcher requests 1 GPU, 6 cores, and 48GB of memory. The request is for 1.5 GPU-core-memory bundles worth of memory but only one bundle for GPUs and cores. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|frame|center|Figure 7 - 1.5 GPU equivalents, based on memory.]] <br clear=all><br />
<br />
== Ratios: GPU / CPU cores / system memory == <!--T:29--><br />
Compute Canada systems have the following GPU-core-memory bundle characteristics:<br />
* [[Béluga/en#Node_Characteristics|Béluga]]:<br />
** V100/16GB nodes: 1 GPU / 10 cores / 47000 MB<br />
* [[Cedar#Node_characteristics|Cedar]]:<br />
** P100/12GB nodes: 1 GPU / 6 cores / 32000 MB<br />
** P100/16GB nodes: 1 GPU / 6 cores / 64000 MB<br />
** V100/32GB nodes: 1 GPU / 8 cores / 48000 MB<br />
* [[Graham#Node_characteristics|Graham]]:<br />
** P100/12GB nodes: 1 GPU / 16 cores / 64000 MB<br />
** V100/16GB nodes: 1 GPU / 3.5 cores / 22500 MB<br />
** V100/32GB nodes: 1 GPU / 5 cores / 48000 MB<br />
** T4/16GB nodes: 1 GPU / {4,11} cores / 49000 MB<br />
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource_monthly_proj_edit.png&diff=103225File:Ccdb view use by compute resource monthly proj edit.png2021-09-09T17:42:13Z<p>Jdesjard: </p>
<hr />
<div>Usage by resource allocation project.</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Allocations_and_compute_scheduling&diff=103224Allocations and compute scheduling2021-09-09T17:23:49Z<p>Jdesjard: /* Exploring group usage summaries at the Compute Canada portal */</p>
<hr />
<div><languages /><br />
<br />
<translate><br />
<br />
<!--T:31--><br />
''Parent page: [[Job scheduling policies]]''<br />
<br />
= What is an allocation? = <!--T:2--><br />
<br />
<!--T:3--><br />
'''An allocation is an amount of resources that a research group can target for use for a period of time, usually a year.''' This amount is either a maximum amount, as is the case for storage, or an average amount of usage over the period, as is the case for shared resources like computation cores.<br />
<br />
<!--T:4--><br />
Allocations are usually made in terms of core years, GPU years, or storage space. Storage allocations are the most straightforward to understand: research groups will get a maximum amount of storage that they can use exclusively throughout the allocation period. Core year and GPU year allocations are more difficult to understand because these allocations are meant to capture average use throughout the allocation period---typically meant to be a year---and this use will occur across a set of resources shared with other research groups.<br />
<br />
<!--T:5--><br />
The time period of an allocation when it is granted is a reference value, used for the calculation of the average which is applied to the actual period during which the resources are available. This means that if the allocation period was a year and the clusters were down for a week of maintenance, a research group would not be entitled to an additional week of resource usage. Equally so, if the allocation period were to be extended by a month, research groups affected by such a change would not see their resource access diminish during this month.<br />
<br />
<!--T:6--><br />
It should be noted that in the case of core year and GPU year allocations, both of which target resource usage averages over time on shared resources, a research group is more likely to hit (or exceed) its target(s) if the resources are used evenly over the allocation period than if the resources are used in bursts or if use is put off until later in the allocation period.<br />
<br />
== Exploring group usage summaries at the Compute Canada portal ==<br />
<br />
Summaries of account usage can be explored at ccdb.computecanada.ca by logging into a user account and navigating to ‘My Account’ > ‘View Group Usage’. CPU and GPU year values are calculated based on the quantity of resources allocated to jobs on the clusters. Note that these values are not core-equivalent measures, so for large-memory jobs the usage displayed here will not match the cluster scheduler’s view of the account’s usage.<br />
<br />
[[File:Select view group usage edit.png|thumb|Navigation to "View Group Usage" at the Compute Canada portal. ]]<br />
<br />
The initial view of the usage is grouped “By Compute Resources” (specific compute clusters, e.g. Béluga, Cedar, etc.) and summarizes together the usage of all groups that the logged-in user either owns or belongs to. The default view covers the current allocation year (starting on April 1st) up to the current date.<br />
<br />
[[File:Ccdb view use by compute resource.png|thumb|View of group usage by compute resource.]]<br />
<br />
Using the top selection bar, compute usage summaries can be grouped “By Compute Resources” (specific compute clusters, e.g. Béluga, Cedar, etc.), “By Resource Allocation Project” (corresponding to the group account that jobs are submitted to, e.g. def-*, rrg-*, etc.), or “By Submitter” (corresponding to the specific users who submitted the jobs). The top selection bar also provides an option to display “Storage Usage”.<br />
<br />
The table summaries for the current year provide, for each compute resource (cluster), the “Total CPU Usage (in core years)” to date as well as the “Projected CPU Usage (in core years)”, which assumes that the throughput achieved to date will be maintained for the remainder of the allocation period.<br />
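<br />
In other words, the projection appears to be a simple linear extrapolation of the usage accumulated so far over the full allocation year:<br />
 Projected CPU Usage ≈ Total CPU Usage to date × ( days in allocation year / days elapsed so far )<br />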
<br />
The “Year:” selection bar in the top panel provides archival summaries of “Total CPU Usage (in core years)” for previous allocation years; for these, the “Projected CPU Usage” column is omitted.<br />
<br />
The “Extra Info” columns of the summary tables provide further breakdowns of the usage by month (“monthly”) or by submitting user (“submitter”), respectively.<br />
<br />
[[File:Ccdb view use by compute resource monthly.png|thumb|Usage by compute resource with monthly breakdown.]]<br />
<br />
== What happens if my group overuses my CPU or GPU allocation? == <!--T:32--><br />
<br />
<!--T:33--><br />
Nothing bad. Your CPU or GPU allocation is a target level, i.e., a target number of CPUs or GPUs. If you have jobs waiting to run, and competing demand is low enough, then the scheduler may allow more of your jobs to run than your target level. The only consequence of this is that succeeding jobs of yours ''may'' have lower priority for a time while the scheduler prioritizes other groups which were below their target. You are not prevented from submitting or running new jobs, and the time-average of your usage should still be close to your target, that is, your allocation.<br />
<br />
<!--T:34--><br />
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources.<br />
<br />
=How does scheduling work?= <!--T:7--><br />
<br />
<!--T:8--><br />
Compute-related resources granted by core-year and GPU-year allocations require research groups to submit what are referred to as “jobs” to a “scheduler”. A job is a combination of a computer program (an application) and a list of resources that the application is expected to use. The [[What is a scheduler?|scheduler]] is a program that calculates the priority of each job submitted and provides the needed resources based on the priority of each job and the available resources.<br />
<br />
<!--T:9--><br />
The scheduler uses prioritization algorithms to meet the allocation targets of all groups, based on each research group’s recent usage of the system as compared to its allocated usage on that system. The whole allocation period to date is taken into account, but the most weight is put on recent usage (or non-usage). The point of this is to allow a research group that matches its actual usage with its allocated amount to operate roughly continuously at that level. This smooths resource usage over time across all groups and resources, making it theoretically possible for all research groups to hit their allocation targets.<br />
<br />
=How does resource use affect priority?= <!--T:10--><br />
<br />
<!--T:11--><br />
The overarching principle governing the calculation of priority on Compute Canada's new national clusters is that jobs are charged for the resources that others are prevented from using, not for the resources actually used.<br />
<br />
<!--T:12--><br />
The most common example of unused cores contributing to a priority calculation occurs when a submitted job requests multiple cores but uses fewer cores than requested when run. The usage that will affect the priority of future jobs is the number of cores requested, not the number of cores the application actually used. This is because the unused cores were unavailable to others to use during the job.<br />
<br />
<!--T:13--><br />
Another common case is when a job requests memory beyond what is associated with the cores requested. If a cluster that has 4GB of memory associated with each core receives a job request for only a single core but 8GB of memory, then the job will be deemed to have used two cores. This is because other researchers were effectively prevented from using the second core because there was no memory available for it.<br />
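<br />
As a worked example of that case (4GB of memory per core; 1 core and 8GB requested):<br />
 charged core equivalents = max( cores requested, memory requested / 4GB ) = max( 1, 8GB / 4GB ) = 2<br />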
<br />
<!--T:14--><br />
The details of how resources are accounted for require a sound understanding of the core equivalent concept, which is discussed below.<ref>Further details about how priority is calculated are beyond the scope of this document. Additional documentation is in preparation. We also suggest that a [https://www.westgrid.ca/events/scheduling_job_management_how_get_most_cluster training course] might be valuable for anyone wishing to know more.</ref><br />
<br />
=What is a core equivalent and how is it used by the scheduler?= <!--T:15--><br />
<br />
<!--T:16--><br />
A core equivalent is a bundle made up of a single core and some amount of associated memory. In other words, a core equivalent is a core plus the amount of memory considered to be associated with each core on a given system. <br />
<br />
<!--T:17--><br />
[[File:Core_equivalent_diagram_GP.png|frame|Figure 1 - Core equivalent diagram for Cedar and Graham.]]<br />
<br />
<!--T:18--><br />
Cedar and Graham are considered to provide 4GB per core, since this corresponds to the most common node type in those clusters, making a core equivalent on these systems a core-memory bundle of 4GB per core. Niagara is considered to provide 4.8GB of memory per core, making a core equivalent on it a core-memory bundle of 4.8GB per core. Jobs are charged in terms of core equivalent usage at the rate of 4 or 4.8 GB per core, as explained above. See Figure 1.<br />
<br />
<!--T:19--><br />
Allocation target tracking is straightforward when resource requests on the clusters consist of core and memory amounts that divide evenly into complete core equivalents. Things become more complicated when jobs request portions of a core equivalent, because a research group can then be charged more points against its allocation than the portions of core equivalents it actually uses would suggest. In practice, the method used by Compute Canada to account for system usage solves problems of fairness and perceived fairness, but it is not initially intuitive.<br />
<br />
<!--T:20--><br />
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory (the two cases below are summarized by a general formula after the examples):<br />
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) will be charged by cores. For example, a research group requests two cores and 2GB per core, for a total of 4GB of memory. The request requires 2 core equivalents’ worth of cores but only one bundle’s worth of memory. This job request will be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all><br />
<br />
<!--T:21--><br />
* [[File:Two_and_a_half_core_equivalents.png|frame|Figure 3 - 2.5 core equivalents.]] Research groups using more memory than the 1 core/4GB ratio will be charged by memory. For example, a research group requests two cores and 5GB per core, for a total of 10GB of memory. The request requires 2.5 core equivalents’ worth of memory, but only two bundles’ worth of cores. This job request will be counted as 2.5 core equivalents when priority is calculated. See Figure 3. <br clear=all><br />
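<br />
The general formula promised above, applied to the two examples:<br />
 core equivalents = max( cores requested, memory requested / memory per core )<br />
 Figure 2: max( 2, 4GB / 4GB ) = 2 core equivalents<br />
 Figure 3: max( 2, 10GB / 4GB ) = 2.5 core equivalents<br />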
<br />
=What is a GPU equivalent and how is it used by the scheduler?= <!--T:22--><br />
<br />
<!--T:23--><br />
Use of GPUs and their associated resources follows the same principles as already described for core equivalents. The complication is that allocation targets for GPU-based research must be kept separate from allocation targets for non-GPU-based research, to ensure that the targets can be met in each case. If these cases were not separated, a non-GPU-based research group could spend its allocation in the GPU-based research pool, adding load that would effectively block GPU-based researchers from meeting their allocation targets, and vice versa.<br />
<br />
<!--T:24--><br />
Given this separation, a distinction must be made between core equivalents and GPU equivalents. Core equivalents are as described above. The GPU-core-memory bundles that make up a GPU equivalent are similar to core-memory bundles, except that a GPU is added to the bundle alongside multiple cores and memory; accounting for GPU-based allocation targets must therefore include the GPU. Just as a point system was used above to express resource use in terms of core equivalence, a similar point system is used here to express GPU equivalence.<br />
<br />
<!--T:25--><br />
Research groups are charged for the maximum number of GPU-core-memory bundles they use. Assuming a GPU-core-memory bundle of 1 GPU, 6 cores, and 32GB of memory (the three cases below are summarized by a general formula after the examples): <br />
[[File:GPU_equivalent_diagram.png|frame|center|Figure 4 - GPU equivalent diagram.]]<br clear=all><br />
<br />
<!--T:26--><br />
* Research groups using more GPUs than cores or memory per GPU-core-memory bundle will be charged by GPU. For example, a research group requests 2 GPUs, 6 cores, and 32GB of memory. The request is for 2 GPU-core-memory bundles’ worth of GPUs but only one bundle’s worth of memory and cores. This job request will be counted as 2 GPU equivalents when the research group’s priority is calculated.<br />
[[File:Two_GPU_equivalents.png|frame|center|Figure 5 - Two GPU equivalents.]] <br clear=all><br />
<br />
<!--T:27--><br />
* Research groups using more cores than GPUs or memory per GPU-core-memory bundle will be charged by core. For example, a researcher requests 1 GPU, 9 cores, and 32GB of memory. The request is for 1.5 GPU-core-memory bundles’ worth of cores, but only one bundle’s worth of GPUs and memory. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(cores).png|frame|center|Figure 6 - 1.5 GPU equivalents, based on cores.]] <br clear=all><br />
<br />
<!--T:28--><br />
* Research groups using more memory than GPUs or cores per GPU-core-memory bundle will be charged by memory. For example, a researcher requests 1 GPU, 6 cores, and 48GB of memory. The request is for 1.5 GPU-core-memory bundles’ worth of memory but only one bundle’s worth of GPUs and cores. This job request will be counted as 1.5 GPU equivalents when the research group’s priority is calculated.<br />
[[File:GPU_and_a_half_(memory).png|frame|center|Figure 7 - 1.5 GPU equivalents, based on memory.]] <br clear=all><br />
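<br />
For the example bundle above (1 GPU, 6 cores, 32GB of memory), the general formula promised earlier is the largest of the three ratios:<br />
 GPU equivalents = max( GPUs requested / 1, cores requested / 6, memory requested / 32GB )<br />
 Figure 5: max( 2/1, 6/6, 32/32 ) = 2<br />
 Figure 6: max( 1/1, 9/6, 32/32 ) = 1.5<br />
 Figure 7: max( 1/1, 6/6, 48/32 ) = 1.5<br />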
<br />
== Ratios: GPU / CPU Cores / System-memory == <!--T:29--><br />
Compute Canada systems have the following GPU-core-memory bundle characteristics (an example of a Slurm request matching one of these bundles follows the list):<br />
* [[Béluga/en#Node_Characteristics|Béluga]]:<br />
** V100/16GB nodes: 1 GPU / 10 cores / 47000 MB<br />
* [[Cedar#Node_characteristics|Cedar]]:<br />
** P100/12GB nodes: 1 GPU / 6 cores / 32000 MB<br />
** P100/16GB nodes: 1 GPU / 6 cores / 64000 MB<br />
** V100/32GB nodes: 1 GPU / 8 cores / 48000 MB<br />
* [[Graham#Node_characteristics|Graham]]:<br />
** P100/12GB nodes: 1 GPU / 16 cores / 64000 MB<br />
** V100/16GB nodes: 1 GPU / 3.5 cores / 22500 MB<br />
** V100/32GB nodes: 1 GPU / 5 cores / 48000 MB<br />
** T4/16GB nodes: 1 GPU / {4,11} cores / 49000 MB<br />
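<br />
For example, a job that wants to occupy exactly one of Béluga’s bundles could request resources in the same 1 GPU / 10 cores / 47000 MB ratio. This is only a sketch using standard Slurm options; adjust the values to the cluster and bundle you are targeting:<br />
 #SBATCH --gres=gpu:1<br />
 #SBATCH --cpus-per-task=10<br />
 #SBATCH --mem=47000M<br />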
<br />
<!--T:30--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource_monthly.png&diff=103223File:Ccdb view use by compute resource monthly.png2021-09-09T17:22:00Z<p>Jdesjard: </p>
<hr />
<div>Compute usage by resource with monthly breakdown.</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Ccdb_view_use_by_compute_resource.png&diff=103219File:Ccdb view use by compute resource.png2021-09-09T17:14:01Z<p>Jdesjard: </p>
<hr />
<div>Usage view By Compute Resource</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=File:Select_view_group_usage_edit.png&diff=103218File:Select view group usage edit.png2021-09-09T17:10:37Z<p>Jdesjard: </p>
<hr />
<div>Screen capture of ccdb.ca navigation menu to View Group Usage.</div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=89923Using GPUs with Slurm2020-09-22T19:15:06Z<p>Jdesjard: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:15--><br />
For general advice on job scheduling, see [[Running jobs]].<br />
<br />
== Available hardware == <!--T:1--><br />
These are the GPUs currently available:<br />
<br />
<!--T:2--><br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !! Cluster !! Type specifier !! CPU cores !! CPU memory !! GPUs per node !! GPU model !! Topology<br />
|-<br />
| 172 || Béluga || - || 40 || 191000M || 4 || V100-SXM2-16GB || All GPUs associated with the same CPU socket, connected via NVLink<br />
|-<br />
| 114 || Cedar || p100 || 24 || 128000M || 4 || P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar || p100l || 24 || 257000M || 4 || P100-PCIE-16GB || All GPUs associated with the same CPU socket<br />
|-<br />
| 192 || Cedar || v100l || 32 || 192000M || 4 || V100-PCIE-32GB || Two GPUs per CPU socket; all GPUs connected via NVLink<br />
|-<br />
| 160 || Graham || p100 || 32 || 127518M || 2 || P100-PCIE-12GB || One GPU per CPU socket<br />
|-<br />
| 7 || Graham || v100 || 28 || 183105M || 8 || V100-PCIE-16GB || See [[Graham#Volta_GPU_nodes_on_Graham|Graham: Volta GPU nodes]]<br />
|-<br />
| 30 || Graham || t4 || 44 || 192000M || 4 || Tesla T4 16GB || Two GPUs per CPU socket<br />
|-<br />
| 15 || Hélios || k20 || 20 || 110000M || 8 || K20 5GB || Four GPUs per CPU socket<br />
|- <br />
| 6 || Hélios || k80 || 24 || 257000M || 16 || K80 12GB || Eight GPUs per CPU socket<br />
|- <br />
| 54 || Mist || - || 32 || 256GB || 4 || V100-SXM2-32GB || See [https://docs.scinet.utoronto.ca/index.php/Mist#Specifications Mist specifications]<br />
|}<br />
<br />
== Specifying the type of GPU to use == <!--T:16--><br />
<br />
<!--T:37--><br />
Some clusters have more than one GPU type available ([[Cedar]], [[Graham]], [[Hélios/en|Hélios]]), and some clusters only have GPUs on certain nodes ([[Béluga/en|Béluga]], [[Cedar]], [[Graham]]). You can choose the type of GPU to use by supplying the type specifier to Slurm. The following options are available: <br />
<br />
=== On Béluga === <!--T:29--><br />
Béluga has only one type of GPU, so no type specification is required.<br />
<br />
=== On Cedar === <!--T:17--><br />
You can request a 12G P100 using<br />
<br />
<!--T:18--><br />
#SBATCH --gres=gpu:p100:1<br />
<br />
<!--T:19--><br />
or a 16G P100 using <br />
<br />
<!--T:20--><br />
#SBATCH --gres=gpu:p100l:4<br />
<br />
<!--T:21--><br />
or a 32G V100 using <br />
<br />
<!--T:34--><br />
#SBATCH --gres=gpu:v100l:1<br />
<br />
<!--T:35--><br />
If no type is specified, GPU jobs will run on 12G P100s.<br />
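<br />
For example, a minimal sketch of a complete submission script requesting one 32G V100 on Cedar; the account name and resource values are placeholders to adapt to your own job:<br />
{{File<br />
|name=cedar_v100l_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser   # placeholder account name<br />
#SBATCH --gres=gpu:v100l:1       # one 32G V100<br />
#SBATCH --cpus-per-task=8        # recommended maximum per V100 on Cedar<br />
#SBATCH --mem=48000M             # memory (per node)<br />
#SBATCH --time=0-03:00           # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />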
<br />
=== On Graham === <!--T:22--><br />
You can request a P100 using<br />
<br />
<!--T:23--><br />
#SBATCH --gres=gpu:p100:1<br />
<br />
<!--T:24--><br />
or a V100 using <br />
<br />
<!--T:25--><br />
#SBATCH --gres=gpu:v100:1<br />
<br />
<!--T:26--><br />
or a T4 using <br />
<br />
<!--T:27--><br />
#SBATCH --gres=gpu:t4:1<br />
<br />
<!--T:28--><br />
If no type is specified, a GPU job will run on a P100.<br />
<br />
=== On Hélios === <!--T:30--><br />
You can request a K20 using<br />
<br />
<!--T:31--><br />
#SBATCH --gres=gpu:k20:1<br />
<br />
<!--T:32--><br />
or a K80 using <br />
<br />
<!--T:33--><br />
#SBATCH --gres=gpu:k80:1<br />
<br />
=== Mist === <!--T:38--><br />
[https://docs.scinet.utoronto.ca/index.php/Mist Mist] is a cluster composed of IBM Power9 CPUs (not Intel x86!) and NVIDIA V100 GPUs. <br />
Users with access to Niagara can also access Mist. To specify job requirements on Mist, <br />
please see the specific instructions on the [https://docs.scinet.utoronto.ca/index.php/Mist#Submitting_jobs SciNet web site].<br />
<br />
== Single-core job == <!--T:3--><br />
If you need only a single CPU core and one GPU:<br />
{{File<br />
|name=gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPUs (per node)<br />
#SBATCH --mem=4000M # memory (per node)<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
./program # you can use 'nvidia-smi' for a test<br />
}}<br />
<br />
== Multi-threaded job == <!--T:4--><br />
For GPU jobs asking for multiple CPUs in a single node:<br />
{{File<br />
|name=gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
For each GPU requested on:<br />
* Béluga, we recommend no more than 10 CPU cores (see the sketch following this list).<br />
* Cedar, we recommend no more than 6 CPU cores per P100 GPU (p100 and p100l) and no more than 8 CPU cores per V100 GPU (v100l).<br />
* Graham, we recommend no more than 16 CPU cores.<br />
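<br />
A minimal sketch of a Béluga request that respects this ratio; the account name is a placeholder:<br />
{{File<br />
|name=beluga_gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1             # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=10       # Maximum recommended per GPU on Béluga<br />
#SBATCH --mem=47000M             # Matches the per-GPU bundle on Béluga<br />
#SBATCH --time=0-03:00           # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />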
<br />
== MPI job == <!--T:5--><br />
{{File<br />
|name=gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of nodes<br />
#SBATCH --ntasks=48              # Number of MPI processes<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI process<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Whole nodes == <!--T:6--><br />
If your application can efficiently use an entire node and its associated GPUs, you will probably experience shorter wait times if you ask Slurm for a whole node. Use one of the following job scripts as a template. <br />
<br />
=== Requesting a GPU node on Graham === <!--T:7--><br />
{{File<br />
|name=graham_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=127000M<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Requesting a P100 GPU node on Cedar === <!--T:8--><br />
{{File<br />
|name=cedar_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:p100:4<br />
#SBATCH --ntasks-per-node=24<br />
#SBATCH --exclusive<br />
#SBATCH --mem=125G<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Requesting a P100-16G GPU node on Cedar === <!--T:9--><br />
<br />
<!--T:10--><br />
There is a special group of GPU nodes on [[Cedar]] which have four Tesla P100 16GB cards each. (Other P100 GPUs in the cluster have 12GB and the V100 GPUs have 32GB.) The GPUs in a P100L node all use the same PCI switch, so inter-GPU communication latency is lower, but bandwidth between CPU and GPU is lower than on the regular GPU nodes. The nodes also have 256GB RAM. You may only request these nodes as whole nodes; therefore, you must specify <code>--gres=gpu:p100l:4</code>. P100L GPU jobs of up to 28 days can be run on Cedar.<br />
<br />
<!--T:11--><br />
{{File<br />
|name=p100l_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --gres=gpu:p100l:4 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on P100 Cedar GPU nodes<br />
#SBATCH --mem=0 # Request the full memory of the node<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
===Packing single-GPU jobs within one SLURM job=== <!--T:12--><br />
<br />
<!--T:13--><br />
If you need to run four single-GPU programs or two 2-GPU programs for longer than 24 hours, [[GNU Parallel]] is recommended. A simple example is given below:<br />
<pre><br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'<br />
</pre><br />
In this example, the GPU ID is calculated by subtracting 1 from the slot ID {%}; {#} is the GNU Parallel job number, starting from 1.<br />
<br />
<!--T:14--><br />
A params.input file should include input parameters in each line, like this:<br />
<pre><br />
code1.py<br />
code2.py<br />
code3.py<br />
code4.py<br />
...<br />
</pre><br />
With this method, users can run multiple tasks in one submission. The <code>-j4</code> parameter means GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time.<br />
<br />
<!--T:36--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=71078Graham2019-04-09T18:27:03Z<p>Jdesjard: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Availability: In production since June 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|-<br />
| Data mover node (rsync, scp, sftp,...): '''gra-dtn1.computecanada.ca'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[[Getting started with the new national systems|Getting started with Graham]]<br />
<br />
<!--T:36--><br />
[[Running_jobs|How to run jobs]]<br />
<br />
<!--T:37--><br />
[[Transferring_data|Transferring data]]<br />
<br />
= Site-specific policies = <!--T:39--><br />
<br />
<!--T:40--><br />
By policy, Graham's compute nodes cannot access the internet. If you need an exception to this rule, <br />
contact [[Technical Support|technical support]] with the following information:<br />
<br />
<!--T:42--><br />
<pre><br />
IP: <br />
Port/s: <br />
Protocol: TCP or UDP<br />
Contact: <br />
Removal Date: <br />
</pre><br />
<br />
<!--T:43--><br />
We will follow up with the contact before removing the exception, to confirm whether it is still required.<br />
<br />
<!--T:41--><br />
Crontab is not offered on Graham.<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_quotas_and_policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_quotas_and_policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running across multiple islands, Graham provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Visualization on Graham= <!--T:44--><br />
<br />
<!--T:45--><br />
Graham has dedicated visualization nodes available at '''gra-vdi.computecanada.ca''' that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page. We are working on making the complete Graham software stack available on the visualization nodes.<br />
<br />
=Node types and characteristics= <!--T:5--><br />
A total of 36,160 cores and 320 GPU devices, spread across 1,127 nodes of different types.<br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 V4 CPUs, running at 2.1 GHz<br />
<br />
<!--T:26--><br />
''GPU type:'' P100 12g<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
! count !! Node type !! cores !! available memory !! hardware detail<br />
|-<br />
| 903 || base "128G" || 32 || 125G or 128000M || two Intel E5-2683 v4 "Broadwell" at 2.1Ghz; 960GB SATA SSD<br />
|-<br />
| 24 || large "512G" || 32 || 502G or 514500M || (same as base nodes)<br />
|-<br />
| 56 || large/cloud || 32 || 250G or 256500M || (same as base nodes) may be reserved for cloud use<br />
|-<br />
| 3 || bigmem3000 "3T" || 64 || 3022G or 3095000M || like base nodes but four E7-4850 v4 "Broadwell" CPUs at 2.1Ghz<br />
|-<br />
| 160 || GPU || 32 || 124G or 127518M || like base nodes but also two NVIDIA P100 Pascal GPUs (12GB HBM2 memory) and a 1.6TB NVMe SSD<br />
<br />
<!--T:35--><br />
|}<br />
<br />
<!--T:7--><br />
Best practice for local on-node storage is to use the temporary directory generated by [[Running jobs|Slurm]], <tt>$SLURM_TMPDIR</tt>. Note that this directory and its contents will disappear upon job completion.<br />
<br />
<!--T:38--><br />
Note that the amount of available memory is less than the "round number" suggested by the hardware configuration. For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS. To avoid wasting time on swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory. Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem; this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of its processes.<br />
<br />
=Volta GPU nodes on Graham= <!--T:46--><br />
In the winter of 2019, seven new Volta GPU nodes were added.<br />
<br />
<!--T:47--><br />
The node specifications are as follows:<br />
<br />
<!--T:48--><br />
'''gra[1148-1153]'''<br />
* 2 Sockets<br />
* 14 CoresPerSocket<br />
* 192 GB<br />
* 3.6 TB NVMe /local disk<br />
* 8 x V100 (16 GB) GPUs (4 per socket)<br />
<br />
<!--T:49--><br />
'''gra1147'''<br><br />
* Same as above, except for 6 x V100 (16 GB) GPUs (3 per socket).<br />
<br />
<!--T:50--><br />
The nodes are still in testing and access can be granted to Ontario researchers by request. <br />
<br />
<!--T:51--><br />
Here is an example job script to submit a job to one of the nodes (with 8 GPUs). The module load command will ensure that modules compiled for Skylake architecture will be used. Replace nvidia-smi with the command you want to run.<br />
<br />
<!--T:52--><br />
'''Important''': You should scale the number of CPUs requested, keeping the ratio of CPUs to GPUs at 3.5 or less. For example, if you want to run a job using 4 GPUs, you should request at most 14 CPU cores; for a job with 1 GPU, you should request at most 3 CPU cores. As the nodes are in testing, users are for now allowed to run a few short test jobs (shorter than 1 hour) that break this rule, to see how their code performs.<br />
<br />
<!--T:53--><br />
{{File<br />
|name=gpu_single_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=ctb-ontario<br />
#SBATCH --partition=c-ontario<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:v100:8<br />
#SBATCH --exclusive<br />
#SBATCH --cpus-per-task=28<br />
#SBATCH --mem=150G<br />
#SBATCH --time=3-00:00<br />
module load arch/avx512 StdEnv/2018.3<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:54--><br />
The Graham Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable $SLURM_TMPDIR. You can copy your input files there at the start of your job script, before you run your program, and copy your output files out at the end of your job script. All files in $SLURM_TMPDIR will be removed once the job ends, so you do not have to clean up that directory yourself. You can even create Python virtual environments in this temporary space for greater efficiency. Please see [[Python#Creating_virtual_environments_inside_of_your_jobs]] for information on how to do that.<br />
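<br />
A minimal sketch of this staging pattern, to be placed in the body of a job script such as the one above; the file and program names are hypothetical:<br />
<source lang="bash"><br />
cp input.dat $SLURM_TMPDIR/        # stage input onto the fast local disk<br />
cd $SLURM_TMPDIR<br />
./program input.dat                # hypothetical application<br />
cp output.dat $SLURM_SUBMIT_DIR/   # copy results back before the job ends<br />
</source><br />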
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Managing_Slurm_accounts&diff=69366Managing Slurm accounts2019-03-08T19:36:52Z<p>Jdesjard: </p>
<hr />
<div>{{Draft}}<br />
<br />
Each job submitted to the [[Running jobs|job scheduler Slurm]] has an associated<br />
Resource Allocation Project (RAP) which is selected with the <code>--account</code> option to <code>sbatch</code>.<br />
The scheduling priority of the job will be determined by the target share of the account relative to the account's recent usage, as described at [[Job scheduling policies]].<br />
<br />
A research group may have many individual users submitting jobs to a given RAP account. The usage of all the users within the RAP is charged to a single account; thus the usage of each user affects the priority of jobs submitted by all the users in the group. Because of this shared accounting structure within a RAP account, there are circumstances when active coordination among users may have a substantial impact on the project's throughput. <br />
<br />
== When can managing usage within a RAP account be useful? ==<br />
A user who has not used many resources may submit a job and find that it has very low priority because other users in the group have run a lot of work recently. In this case all users of that account will have to wait for the account's fairshare (LevelFS) to return to a competitive value. Because the fairshare principle is applied ''within'' a group as well as ''between'' groups, jobs belonging to the underserved user will have the highest priority and, all else being equal, will run first when the group's fairshare recovers.<br />
<br />
This may not happen, however, if different users in the group have jobs with drastically different requirements. For example, if one user runs a lot of small jobs which fit into scheduling gaps ("back filling" or "cycle scavenging") they may be able to achieve substantial throughput even while the group priority remains low. This will make it difficult for other users within the RAP to run jobs with greater resource requirements. <br />
<br />
== What strategies are there for managing usage within accounts? ==<br />
Several of the strategies are things that can be discussed by the group in lab meetings. <br />
* If various users have distinct deadlines that require bursts of computation, it can be valuable to schedule their usage on the system so that they do not affect each other's priority at critical times in their projects.<br />
* Use different clusters. The national general purpose systems are largely identical in capability, and each RAP is independently accounted on each cluster. User X of account Y on Graham will not affect the priority of jobs submitted by user Z to account Y on Cedar. <br />
* Use multiple accounts. A group that has a RAC award can submit jobs to both the RAC account and the default account; jobs running under one account will not affect the fairshare of the other account. <br />
* For collaborative work each faculty member within a project can obtain their own account and divide users' work appropriately among all of the PI accounts collaborating on a project.<br />
<br />
== Augmented account coordination privileges by request ==<br />
If the above suggestions are insufficient and the PI wishes to manage configuration settings of the account, the PI can ask to be added as a "coordinator" of the account. With coordinator privileges one can manage user-specific settings such as ''share'', ''maxjobs'', ''maxcpus'', ''maxsubmit'', etc. For example, a specific user may be limited to have no more than 20 jobs in the system like so:<br />
<br />
<source lang="bash"><br />
[someuser@host ~]$ sacctmgr modify user where account=$account name=$membername set maxjobs=20<br />
</source></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Managing_Slurm_accounts&diff=64177Managing Slurm accounts2018-12-06T03:48:04Z<p>Jdesjard: </p>
<hr />
<div>{{Draft}}<br />
<br />
Each job submitted to the [[Running jobs|job scheduler Slurm]] has an associated<br />
Resource Allocation Project (RAP) which is selected with the <code>--account</code> option to <code>sbatch</code>.<br />
The scheduling priority of the job will be determined by the target share of the account relative to the account's recent usage, as described at [[Job scheduling policies]].<br />
<br />
A research group may have many individual users submitting jobs to a given RAP account. The usage of all the users within the RAP is charged to a single account; thus the usage of each user affects the priority of jobs submitted by all the users in the group. Because of this shared accounting structure within a RAP account, there are circumstances when active coordination among users may have a substantial impact on the project's throughput. <br />
<br />
==When can managing usage within a RAP account be useful?==<br />
A user who has not used many resources may submit a job and find that it has very low priority because other users in the group submitting jobs to the same account have achieved high throughput recently. In this case all users within the account will have to wait for the account's fair share to return to a competitive level compared to other accounts submitting work in the queue. In many cases this impact within an accounting group is increased by various users requiring different types of jobs. For example, if one user has large workloads of numerous small jobs which fit into scheduling gaps (e.g. back-filling), that user may be able to achieve substantial throughput even though the job priorities are very low (cycle scavenging). This type of high throughput at low priorities will make it very difficult for other users within the RAP to run jobs that require high priority in order to obtain resources. <br />
<br />
==What strategies are recommended for managing usage within accounts?==<br />
Several of the strategies for managing usage within a RAP account are the kind of things that can be discussed by the group in lab meetings. For example, if various users have overlapping deadlines that require bursts of computation, it can be valuable to schedule their usage on the system so that the users do not affect each other's priority at critical times in their projects. Another strategy is to use multiple machines. The national general purpose systems are largely identical in usage, and all RAPs are independently accounted on each machine. Specifically, user X of account Y on Graham will not affect the priority of jobs submitted by user Z to account Y on Cedar. Another recommendation is to use multiple accounts. Beyond using multiple systems, there are opportunities for research groups to use multiple accounts. For example, a group that has a RAC award can submit jobs to both the RAC account and the default account, which will not influence the priority across accounts. Further, for collaborative work, each faculty member within the project can obtain their own account and divide users' work appropriately among all of the PI accounts collaborating on a project.<br />
<br />
==Augmented account coordination privileges by request==<br />
In cases where coordinating usage within a RAP account would benefit from configuration changes, the PI of the account can request to be added as a "coordinator" of the account. With coordinator privileges the PI of the account can manage user-specific access settings (such as `share`, `maxjobs`, `maxcpus`, `maxsubmit`, etc.) by performing e.g.:<br />
<br />
<source lang="bash"><br />
[someuser@host ~]$ sacctmgr modify user where name=username set maxjobs=20<br />
</source></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=53605Running jobs2018-06-15T15:52:22Z<p>Jdesjard: /* Use sbatch to submit jobs */Added recommendation to specify job memory requests as megabytes rather than gigabytes.</p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:112--><br />
{{box|'''All jobs must be submitted via the scheduler!'''<br />
<br><br />
Exceptions are made for compilation and other tasks not expected to consume more than about 10 CPU-minutes and about 4 gigabytes of RAM. Such tasks may be run on a login node. In no case should you run processes on compute nodes except via the scheduler.}}<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) for each job. You may also need to supply an account name (<code>--account</code>). See [[#Accounts and projects|Accounts and projecs]] below.<br />
<br />
<!--T:106--><br />
A default memory amount of 256 MB per core will be allocated unless you make some other memory request with <code>--mem-per-cpu</code> (memory per core) or <code>--mem</code> (memory per node). It is recommended that you specify memory in megabytes (e.g. 8000M) rather than gigabytes (e.g. 8G); in many circumstances this results in shorter queue wait times. For example, requesting --mem=128G (equivalent to 131072M) asks for more memory than is available for jobs on nodes with a nominal 128G of RAM, so such a job can only be scheduled on the smaller subset of nodes with more than 128G of memory in total.<br />
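<br />
For example, the first directive below fits on the nominal 128G nodes while the second does not:<br />
 #SBATCH --mem=128000M   # 125 GiB; can be scheduled on a "128G" node<br />
 #SBATCH --mem=128G      # 131072M; excludes the "128G" nodes<br />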
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
<!--T:114--><br />
Please be cautious if you use a script to submit multiple Slurm jobs in a short time. Submitting thousands of jobs at a time can cause Slurm to become [[Frequently_Asked_Questions#.22sbatch:_error:_Batch_job_submission_failed:_Socket_timed_out_on_send.2Frecv_operation.22|unresponsive]] to other users. Consider using an [[Running jobs#Array job|array job]] instead, or use <code>sleep</code> to space out calls to <code>sbatch</code> by one second or more.<br />
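<br />
A minimal sketch of spacing out submissions, assuming hypothetical script names:<br />
<source lang="bash"><br />
for script in job*.sh; do<br />
    sbatch "$script"<br />
    sleep 1   # space out calls to sbatch to avoid overloading Slurm<br />
done<br />
</source><br />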
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" or "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
<!--T:115--><br />
Please ''do not'' run <code>squeue</code> from a script or program at high frequency, ''e.g.,'' every few seconds. Responding to <code>squeue</code> adds load to Slurm, and may interfere with its performance or correct operation.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
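<br />
For example, to split the two streams while keeping the naming pattern shown above:<br />
 #SBATCH --output=%x-%j.out<br />
 #SBATCH --error=%x-%j.err<br />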
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project] (RAP).<br />
<br />
<!--T:107--><br />
If you try to submit a job with <code>sbatch</code> and receive one of these messages:<br />
<pre><br />
You are associated with multiple _cpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
<br />
<!--T:108--><br />
You are associated with multiple _gpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
</pre> <br />
then you have more than one valid account, and you will have to specify one<br />
using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
To find out which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and click on "My Account -> Account Details". You will see a list of all the projects <br />
you are a member of. The string you should use with the <code>--account</code> for <br />
a given project is under the column '''Group Name'''. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs submitted with <code>--account=def-rdickson</code> will be accounted against RAP wnp-003-aa.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the following three environment variables in your <code>~/.bashrc</code> file:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
Large MPI jobs, specifically those which can efficiently use whole nodes, should use <code>--nodes</code> and <code>--ntasks-per-node</code> instead of <code>--ntasks</code>. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with eight CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=8<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
There are many options involved in requesting GPUs because <br />
* the GPU-equipped nodes at [[Cedar]] and [[Graham]] have different configurations,<br />
* there are two different configurations at Cedar, and <br />
*there are different policies for the different Cedar GPU nodes. <br />
Please see [[Using GPUs with Slurm]] for a discussion and examples of how to schedule various job types on the available GPU resources.<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
sbatch --array=1-7:2 ... # Step-size of 2, does the same as the previous example<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
<br />
<!--T:142--><br />
For examples, see [[Job arrays]]. See [https://slurm.schedmd.com/job_array.html Job Array Support] at SchedMD.com for detailed documentation.<br />
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and iPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:113--><br />
The maximum run time of an interactive job is 3 hours.<br />
<br />
<!--T:129--><br />
It is also possible to run graphical programs interactively on a compute node by adding the '''--x11''' flag to your ''salloc'' command. In order for this to work, you must first connect to the cluster with X11 forwarding enabled (see the [[SSH]] page for instructions on how to do that).<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
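<br />
For example:<br />
 sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode,Elapsed<br />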
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
<br />
=== Attaching to a running job === <!--T:130--><br />
It is possible to connect to the node running a job and execute new processes there. You might want to do this for troubleshooting or to monitor the progress of a job.<br />
<br />
<!--T:131--><br />
Suppose you want to run the utility [https://developer.nvidia.com/nvidia-system-management-interface <code>nvidia-smi</code>] to monitor GPU usage on a node where you have a job running. The following command runs <code>watch</code> on the node assigned to the given job, which in turn runs <code>nvidia-smi</code> every 30 seconds, displaying the output on your terminal.<br />
<br />
<!--T:132--><br />
{{Command2<br />
|srun --jobid 123456 --pty watch -n 30 nvidia-smi}}<br />
<br />
<!--T:133--><br />
It is possible to launch multiple monitoring commands using [https://en.wikipedia.org/wiki/Tmux <code>tmux</code>]. The following command launches <code>htop</code> and <code>nvidia-smi</code> in separate panes to monitor the activity on a node assigned to the given job.<br />
<br />
<!--T:134--><br />
{{Command2<br />
|srun --jobid 123456 --pty tmux new-session -d 'htop -u $USER' \; split-window -h 'watch nvidia-smi' \; attach}}<br />
<br />
<!--T:135--><br />
Processes launched with <code>srun</code> share the resources with the job specified. You should therefore be careful not to launch processes that would use a significant portion of the resources allocated for the job. Using too much memory, for example, might result in the job being killed; using too many CPU cycles will slow down the job.<br />
<br />
<!--T:136--><br />
'''Noteː''' The <code>srun</code> commands shown above work only to monitor a job submitted with <code>sbatch</code>. To monitor an interactive job, create multiple panes with <code>tmux</code> and start each process in its own pane.<br />
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system, <br />
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a ''checkpoint file'', and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
For many users restarting a calculation will be rare and may be done manually, <br />
but some workflows require frequent restarts. <br />
In this case some kind of automation technique may be employed. <br />
<br />
<!--T:77--><br />
Here are two recommended methods of automatic restarting:<br />
* Using SLURM '''job arrays'''.<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Restarting using job arrays === <!--T:90--><br />
<br />
<!--T:91--><br />
Using the <code>--array</code> syntax mentioned earlier with a concurrency limit of one (e.g. <code>--array=1-10%1</code>), one can submit a collection of identical jobs such that only one of them will run at any given time.<br />
The script should be written to ensure that the last checkpoint is always used for the next job. The number of restarts is fixed by the <code>--array</code> argument.<br />
<br />
<!--T:78--><br />
Consider, for example, a molecular dynamics simulation that has to be run for 1 000 000 steps, where such a long simulation does not fit into the time limit on the cluster. <br />
We can split the simulation into 10 smaller jobs of 100 000 steps each, one after another. <br />
<br />
<!--T:79--><br />
An example of using a job array to restart a simulation:<br />
{{File<br />
|name=job_array_restart.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for a multi-step job on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:92--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:93--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
=== Resubmission from the job script === <!--T:94--><br />
<br />
<!--T:95--><br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. <br />
Once the chunk is done but before the allocated run-time of the job has elapsed,<br />
the script checks if the end of the calculation has been reached.<br />
If the calculation is not yet finished, the script submits a copy of itself to continue working.<br />
<br />
<!--T:96--><br />
An example of a job script with resubmission:<br />
{{File<br />
|name=job_resubmission.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:100--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:101--><br />
# Resubmit if not all work has been done yet.<br />
# You must define the function work_should_continue().<br />
if work_should_continue; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
<!--T:102--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
<!--T:143--><br />
'''Please note:''' The test to determine whether to submit a followup job, abbreviated as <code>work_should_continue</code> in the above example, should be a ''positive test''. There may be a temptation to test for a stopping condition (e.g. is some convergence criterion met?) and submit a new job if the condition is ''not'' detected. But if some error arises that you didn't foresee, the stopping condition might never be met and your chain of jobs may continue indefinitely, doing nothing useful.<br />
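<br />
A minimal sketch of such a positive test, assuming the hypothetical case of a simulation that records its completed step count in a file named <code>steps.done</code>:<br />
<source lang="bash"><br />
work_should_continue() {<br />
    # Positive test: resubmit only if the step counter exists and is<br />
    # verifiably below the target of 1 000 000 steps.<br />
    [ -f steps.done ] && [ "$(cat steps.done)" -lt 1000000 ]<br />
}<br />
</source><br />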
<br />
== Other considerations == <!--T:137--> <br />
<br />
=== Specifying a CPU architecture === <!--T:138--><br />
<br />
<!--T:139--><br />
Cedar has two distinct CPU architectures available: [https://en.wikipedia.org/wiki/Broadwell_(microarchitecture) Broadwell] and [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake]. Users requiring a specific architecture can request it when submitting a job using the <code>--constraint</code> flag. Note that the names should be written all in lower-case, <code>skylake</code> or <code>broadwell</code>. <br />
<br />
<!--T:140--><br />
An example job requesting the <code>skylake</code> feature on Cedar:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --constraint=skylake<br />
#SBATCH --time=0:5:0<br />
# Display CPU-specific information with 'lscpu'. <br />
# Skylake CPUs will have 'avx512f' in the 'Flags' section of the output.<br />
lscpu<br />
</pre><br />
<br />
<!--T:141--><br />
Keep in mind that a job which would have obtained an entire node for itself by specifying for example <tt>#SBATCH --cpus-per-task=32</tt> will now share the remaining 16 CPU cores with another job if it happens to use a Skylake node; if you wish to reserve the entire node you will need to request all 48 cores or add the <tt>#SBATCH --constraint=broadwell</tt> option to your job script. <br />
<br />
<!--T:144--><br />
''If you are unsure if your job requires a specific architecture, do not use this option.'' Jobs that do not specify a CPU architecture can be scheduled on either Broadwell or Skylake nodes, and will therefore generally start earlier.<br />
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters, as shown after this list. <br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
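<br />
For example (the script name is a placeholder):<br />
 [someuser@host ~]$ dos2unix simple_job.sh<br />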
<br />
==== Cancellation of jobs with dependency conditions which cannot be met ==== <!--T:109--><br />
A job submitted with <code>dependency=afterok:<jobid></code> is a "dependent job". A dependent job will wait for the parent job to be completed. If the parent job fails (that is, ends with a non-zero exit code) the dependent job can never be scheduled and so will be automatically cancelled. See [https://slurm.schedmd.com/sbatch.html sbatch] for more on dependency.<br />
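<br />
A minimal sketch of submitting a dependent job, with hypothetical script names; <code>sbatch --parsable</code> prints only the job ID:<br />
<source lang="bash"><br />
[someuser@host ~]$ jobid=$(sbatch --parsable first_step.sh)<br />
[someuser@host ~]$ sbatch --dependency=afterok:$jobid second_step.sh<br />
</source><br />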
<br />
==== Job cannot load a module ==== <!--T:116--><br />
It is possible to see an error such as:<br />
<br />
<!--T:117--><br />
<pre><br />
Lmod has detected the following error: These module(s) exist but cannot be<br />
loaded as requested: "<module-name>/<version>"<br />
Try: "module spider <module-name>/<version>" to see how to load the module(s).<br />
</pre><br />
<br />
<!--T:118--><br />
This can occur if the particular module has an unsatisfied prerequisite. For example:<br />
<br />
<!--T:119--><br />
<source lang="console"><br />
[name@server]$ module load gcc<br />
[name@server]$ module load quantumespresso/6.1<br />
Lmod has detected the following error: These module(s) exist but cannot be loaded as requested: "quantumespresso/6.1"<br />
Try: "module spider quantumespresso/6.1" to see how to load the module(s).<br />
[name@server]$ module spider quantumespresso/6.1<br />
<br />
<!--T:120--><br />
-----------------------------------------<br />
quantumespresso: quantumespresso/6.1<br />
------------------------------------------<br />
Description:<br />
Quantum ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials (both<br />
norm-conserving and ultrasoft).<br />
<br />
<!--T:121--><br />
Properties:<br />
Chemistry libraries/apps / Logiciels de chimie<br />
<br />
<!--T:122--><br />
You will need to load all module(s) on any one of the lines below before the "quantumespresso/6.1" module is available to load.<br />
<br />
<!--T:123--><br />
nixpkgs/16.09 intel/2016.4 openmpi/2.1.1<br />
<br />
<!--T:124--><br />
Help:<br />
<br />
<!--T:125--><br />
Description<br />
===========<br />
Quantum ESPRESSO is an integrated suite of computer codes<br />
for electronic-structure calculations and materials modeling at the nanoscale.<br />
It is based on density-functional theory, plane waves, and pseudopotentials<br />
(both norm-conserving and ultrasoft).<br />
<br />
<br />
<!--T:126--><br />
More information<br />
================<br />
- Homepage: http://www.pwscf.org/<br />
</source><br />
<br />
<!--T:127--><br />
In this case, adding the line <code>module load nixpkgs/16.09 intel/2016.4 openmpi/2.1.1</code> to your job script before loading <code>quantumespresso/6.1</code> will solve the problem, as shown below.<br />
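That is, the relevant portion of the job script would read:<br />
<pre><br />
# Load the prerequisites first, then the module that depends on them.<br />
module load nixpkgs/16.09 intel/2016.4 openmpi/2.1.1<br />
module load quantumespresso/6.1<br />
</pre><br />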
<br />
=== Jobs inherit environment variables === <!--T:128--><br />
By default a job inherits the environment variables of the shell from which it was submitted. The [[Using modules|module]] command, which is used to make various software packages available, works by setting and changing environment variables. Those changes propagate to any job submitted from that shell, and can therefore affect the job's ability to load modules if prerequisites are missing. It is best to include the line <code>module purge</code> in your job script before loading all the required modules; this ensures a consistent state for each job submission and prevents changes made in your shell from affecting your jobs.<br />
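A minimal sketch of this pattern (the module name, program, and account are illustrative; load whatever your job actually needs):<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0:5:0<br />
module purge          # discard any modules inherited from the submitting shell<br />
module load gcc       # then load only what this job requires<br />
./my_program<br />
</pre><br />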
<br />
== Job status and priority == <!--T:103--><br />
* For a discussion of how job priority is determined and how things like time limits may affect the scheduling of your jobs at Cedar and Graham, see [[Job scheduling policies]].<br />
<br />
== Further reading == <!--T:44--><br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* A text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium.<br />
* A rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing].<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Cedar&diff=26886Cedar2017-06-16T20:14:39Z<p>Jdesjard: </p>
<hr />
<div><noinclude><languages /><br />
<br />
<translate><br />
</noinclude><br />
===Cedar (GP2)=== <!--T:1--><br />
<br />
<!--T:23--><br />
{| class="wikitable"<br />
|-<br />
| Expected availability: '''June 2017''' for opportunistic use<br />
|-<br />
| Login node: '''cedar.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#cedar'''<br />
|}<br />
<br />
<!--T:2--><br />
Cedar is a heterogeneous cluster suitable for a variety of workloads; it is located at Simon Fraser University. It is named for the [https://en.wikipedia.org/wiki/Thuja_plicata Western Red Cedar], B.C.’s official tree, which is of great spiritual significance to the region's First Nations people. It was previously known as "GP2" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation. <br />
<br />
<!--T:3--><br />
Cedar is sold and supported by Scalar Decisions, Inc. The node manufacturer is Dell, the high-performance temporary storage is from DDN, and the interconnect is from Intel. It is entirely liquid-cooled, using rear-door heat exchangers.<br />
<br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Cedar]<br />
<br />
====Attached storage==== <!--T:4--><br />
<br />
<!--T:5--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Standard home directory.<br />
* Small, standard quota.<br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
|-<br />
| '''Scratch space'''<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Available to all nodes.<br />
* Not allocated.<br />
* Inactive data will be purged.<br />
* [http://www.ddn.com/ DDN] storage with approximately 4PB usable capacity.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Available to all nodes.<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
|}<br />
<br />
====High-performance interconnect==== <!--T:19--><br />
<br />
<!--T:20--><br />
''Intel Omni-Path (version 1) interconnect (100 Gbit/s bandwidth).''<br />
<br />
<!--T:21--><br />
A low-latency high-performance fabric connecting all nodes and temporary storage.<br />
<br />
<!--T:22--><br />
By design, Cedar supports multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. For larger jobs the interconnect has a 2:1 blocking factor; that is, even jobs running on several thousand cores still see a high-performance interconnect.<br />
<br />
====Node types and characteristics==== <!--T:6--><br />
<br />
<!--T:17--><br />
Cedar has a total of 27,696 CPU cores for computation and 584 GPU devices. The theoretical peak double-precision performance is 936 teraflops for the CPUs plus 2,744 teraflops for the GPUs, for a total of over 3.6 petaflops. Twenty-two fully connected "islands" of 32 base or large nodes each have 1024 cores in a fully non-blocking topology (Omni-Path fabric), with each island designed to yield over 30 teraflops of double-precision performance (measured with high-performance LINPACK). There is a 2:1 blocking factor between the 1024-core islands.<br />
<br />
<!--T:7--><br />
{| class="wikitable sortable"<br />
|-<br />
| "Base" compute nodes: || 576 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4.<br />
|-<br />
| "Large" compute nodes: || 128 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4.<br />
|-<br />
| "Bigmem500" || 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4.<br />
|-<br />
|"Bigmem1500" nodes || 24 nodes || 1.5 TB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4.<br />
|-<br />
| "GPU base" nodes: || 114 nodes || 128 GB of memory, 12 cores/socket, 2 sockets/node, 4 NVIDIA P100 Pascal GPUs/node (12GB HBM2 memory), 2 GPUs/PCI root. Intel "Broadwell" CPUs at 2.2Ghz, model E5-2650 v4<br />
|-<br />
| "GPU large" nodes. || 32 nodes || 256 GB of memory, 12 cores/socket, 2 sockets/node, 4 NVIDIA P100 Pascal GPUs/node (16GB HBM2 memory), All GPUs on the same PCI root. E5-2650 v4<br />
|-<br />
| "Bigmem3000" nodes || 4 nodes || 3 TB of memory, 8 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E7-4809 v4.<br />
|}<br />
<br />
<!--T:10--><br />
All of the above nodes have local (on-node) temporary storage. GPU nodes have a single 800GB SSD drive. All other compute nodes have two 480GB SSD drives, for a total raw capacity of 960GB.<br />
<br />
<!--T:18--><br />
Scratch storage is a Lustre filesystem based on DDN model ES14K technology. It includes 640 8TB NL-SAS disk drives and dual redundant metadata controllers with SSD-based storage.<br />
<br />
<!--T:16--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Jdesjardhttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=26885Graham2017-06-16T20:12:24Z<p>Jdesjard: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
</noinclude><br />
===Graham (GP3)=== <!--T:1--><br />
<br />
<!--T:27--><br />
{| class="wikitable"<br />
|-<br />
| Expected availability: '''June 2017''' for opportunistic use<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster suitable for a variety of workloads; it is located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:29--><br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Graham]<br />
<br />
====Attached storage systems==== <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Standard home directory.<br />
* Small, standard quota. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
|-<br />
| '''Scratch space'''<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Available to all nodes.<br />
* Not allocated.<br />
* Inactive data will be purged.<br />
* [http://e.huawei.com/en/products/cloud-computing-dc/storage Huawei OceanStor] storage system with approximately 3.6PB usable capacity and aggregate performance of approximately 30GB/s.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Available to all nodes.<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
|}<br />
<br />
====High-performance interconnect==== <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner.<br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; that is, even jobs running across multiple islands still see a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
====Node types and characteristics==== <!--T:5--><br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except the Bigmem3000 nodes have Intel E5-2683 v4 CPUs running at 2.1 GHz.<br />
<br />
<!--T:26--><br />
''GPU type:'' NVIDIA P100 Pascal (12 GB HBM2 memory).<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| "Base" compute nodes || 800 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Large" nodes (cloud configuration) || 56 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem500" nodes|| 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem3000" nodes || 3 nodes || 3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1Ghz, model E7-4850 v4. 960GB SATA SSD.<br />
|-<br />
| "GPU" nodes || 160 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12GB HBM2 memory). Intel "Broadwell" CPUs at 2.1Ghz, model E5-2683 v4. 1.6TB NVMe SSD.<br />
|}<br />
<br />
<!--T:7--><br />
Local (on-node) storage in the above nodes will be available as <code>/tmp</code>.<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Jdesjard