Graham: Difference between revisions

(Marked this version for translation)
m (fix heading levels)
 
Line 34: Line 34:
[[Transferring_data|Transferring data]]


= Site-specific policies = <!--T:39-->
==Site-specific policies== <!--T:39-->


<!--T:40-->
Line 56: Line 56:
* A user cannot have more than 1000 jobs, running and queued, at any given moment.  An array job is counted as the number of tasks in the array.


=Storage= <!--T:23-->
==Storage== <!--T:23-->


<!--T:24-->
Line 81: Line 81:
|}


=High-performance interconnect= <!--T:19-->
==High-performance interconnect== <!--T:19-->


<!--T:21-->
Line 101: Line 101:
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]


=Visualization on Graham= <!--T:44-->
==Visualization on Graham== <!--T:44-->


<!--T:45-->
Graham has dedicated visualization nodes available at <b>gra-vdi.alliancecan.ca</b> that allow only VNC connections. For instructions on how to use them, see the [[VNC]] page.


=Node characteristics= <!--T:5-->
==Node characteristics== <!--T:5-->
Graham has a total of 41,548 cores and 520 GPU devices, spread across 1,185 nodes of different types. Turbo Boost is enabled on all Graham nodes.


Line 150: Line 150:
Note that the amount of available memory is less than the "round number" suggested by hardware configuration.  For instance, "base" nodes do have 128 GiB of RAM, but some of it is permanently occupied by the kernel and OS.  To avoid wasting time by swapping/paging, the scheduler will never allocate jobs whose memory requirements exceed the specified amount of "available" memory.  Please also note that the memory allocated to the job must be sufficient for IO buffering performed by the kernel and filesystem - this means that an IO-intensive job will often benefit from requesting somewhat more memory than the aggregate size of processes.


= GPUs on Graham = <!--T:56-->
==GPUs on Graham== <!--T:56-->
Graham contains Tesla GPUs from three different generations, listed here in order of age, from oldest to newest.
* P100 Pascal GPUs
Line 159: Line 159:
P100 is NVIDIA's all-purpose high performance card.  V100 is its successor, with about double the performance for standard computation and about 8 times the performance for deep learning computations that can use its tensor cores.  T4 Turing is the latest card, targeted specifically at deep learning workloads: it does not support efficient double precision computations, but it has good single precision performance, and it also has tensor cores plus support for reduced precision integer calculations.


== Pascal GPU nodes on Graham == <!--T:58-->
===Pascal GPU nodes on Graham=== <!--T:58-->


<!--T:59-->
These are Graham's default GPU cards.  Job submission for these cards is described on the [[Using GPUs with Slurm]] page.  When a job simply requests a GPU with <code>--gres=gpu:1</code> or <code>--gres=gpu:2</code>, it will be assigned to any type of available GPU. If you require a specific type of GPU, please request it explicitly. As all Pascal nodes have only 2 P100 GPUs, configuring jobs using these cards is relatively simple.


==Volta GPU nodes on Graham== <!--T:46-->
===Volta GPU nodes on Graham=== <!--T:46-->
Graham has a total of 9 Volta nodes.
In 7 of these, four GPUs are connected to each CPU socket (except for one node, which is populated with only 6 GPUs, three per socket).  The other 2 have a high-bandwidth NVLINK interconnect.
Line 178: Line 178:


<!--T:65-->
The two newest Volta nodes have 40 cores, so the number of cores requested per GPU should be adjusted upwards accordingly, i.e. you can use 5 CPU cores per GPU. They also have NVLINK, which can provide huge benefits in situations where memory bandwidth between GPUs is the bottleneck. If you want to use one of these NVLINK nodes, you should request it directly by adding the <code>--constraint=cascade,v100</code> parameter to the job submission script.
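A request for one of the NVLINK nodes might look like the following minimal sketch. Only <code>--constraint=cascade,v100</code>, the <code>--gres=gpu:N</code> form and the ratio of 5 CPU cores per GPU come from the text; the GPU count of 2, the time limit and the final <code>echo</code> are assumptions made for illustration, and your job may need further options (such as an account) that are not shown here.

```shell
#!/bin/bash
# Select one of the two NVLINK Volta nodes (constraint taken from the text).
#SBATCH --constraint=cascade,v100
# Two GPUs; the count is an arbitrary choice for this sketch.
#SBATCH --gres=gpu:2
# 5 CPU cores per GPU x 2 GPUs = 10 cores.
#SBATCH --cpus-per-task=10
# Hypothetical time limit.
#SBATCH --time=01:00:00

# Recompute the core count so the ratio used above stays visible.
gpus=2
cores_per_gpu=5
expected_cores=$((gpus * cores_per_gpu))

# Inside a job, Slurm exports SLURM_CPUS_PER_TASK; the fallback only lets
# this sketch run outside the scheduler.
allocated_cores="${SLURM_CPUS_PER_TASK:-$expected_cores}"
echo "Using $allocated_cores CPU cores for $gpus GPUs"
```

If you change the number of GPUs, scale <code>--cpus-per-task</code> by the same factor to keep 5 cores per GPU.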


<!--T:53-->
Line 214: Line 214:
The Volta nodes have a fast local disk, which should be used if your job performs a significant amount of I/O. Inside the job, the location of the temporary directory on the fast local disk is given by the environment variable <code>$SLURM_TMPDIR</code>. Copy your input files there at the start of your job script, before you run your program, and copy your output files back out at the end of the script. All files in <code>$SLURM_TMPDIR</code> are removed once the job ends, so you do not have to clean up that directory yourself.  You can even create Python virtual environments in this temporary space for greater efficiency.  Please see the [[Python#Creating_virtual_environments_inside_of_your_jobs|information on how to do this]].
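The stage-in, run, stage-out pattern described above could be sketched as follows. This is an illustration under stated assumptions, not a recommended template: the file names <code>input.dat</code> and <code>output.dat</code> are hypothetical, <code>tr</code> stands in for your real program, and the <code>mktemp</code> fallbacks exist only so that the sketch also runs outside a Slurm job.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
# Hypothetical time limit.
#SBATCH --time=00:10:00
set -euo pipefail

# Inside a job, Slurm sets both variables; the fallbacks are for trying the
# sketch outside the scheduler.
scratch="${SLURM_TMPDIR:-$(mktemp -d)}"
submit_dir="${SLURM_SUBMIT_DIR:-$(mktemp -d)}"

# Hypothetical input file, created here only to keep the sketch self-contained.
printf 'sample input\n' > "$submit_dir/input.dat"

# 1. Copy the input to the fast local disk before running the program.
cp "$submit_dir/input.dat" "$scratch/"

# 2. Run from the local disk (tr is a stand-in for the real application).
cd "$scratch"
tr 'a-z' 'A-Z' < input.dat > output.dat

# 3. Copy the results back before the job ends, because the contents of
#    $SLURM_TMPDIR are deleted afterwards.
cp "$scratch/output.dat" "$submit_dir/"
```

Copying results back as the last step of the script matters: anything left only in <code>$SLURM_TMPDIR</code> is lost when the job finishes.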


==Turing GPU nodes on Graham== <!--T:60-->
===Turing GPU nodes on Graham=== <!--T:60-->


<!--T:61-->
Line 225: Line 225:
In this example, two T4 cards per node are requested.


==Ampere GPU nodes on Graham== <!--T:66-->  
===Ampere GPU nodes on Graham=== <!--T:66-->  


<!--T:67-->