<languages />
[[Category:Software]]
<translate>
<!--T:18-->
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".

==Installing TensorFlow== <!--T:1-->

<!--T:2-->
These instructions install TensorFlow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel, we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2, but you can also install other Python versions by loading a different Python module.
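To see which TensorFlow wheels are available before installing, you can list the wheelhouse directory (a simple check using standard shell tools; the wheelhouse may be organized into subdirectories, so adjust the path as needed):
<pre>
ls /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/ | grep -i tensorflow
</pre>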

<!--T:3-->
Load modules required by TensorFlow:
{{Command2|module load python/3.5.2}}

<!--T:17-->
Create a new Python virtual environment:
{{Command2|virtualenv tensorflow}}

<!--T:4-->
Activate your newly created Python virtual environment:
{{Command2|source tensorflow/bin/activate}}

Install TensorFlow into your newly created virtual environment using the command from one of the two following subsections.
=== CPU-only === <!--T:8-->
{{Command2|prompt=(tensorflow)_[name@server ~]$
|pip install tensorflow-cpu}}

=== GPU === <!--T:9-->

<!--T:10-->
{{Command2|prompt=(tensorflow)_[name@server ~]$
|pip install tensorflow-gpu}}
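In either case, you can verify the installation from inside the virtual environment with a quick import check (a minimal sketch; the version printed depends on the wheel you installed):
<pre>
python -c 'import tensorflow as tf; print(tf.__version__)'
</pre>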

==Submitting a TensorFlow job with a GPU== <!--T:5-->
Once you have completed the setup above, you can submit a TensorFlow job with
{{Command2|sbatch tensorflow-test.sh}}
The job submission script has the following content:
</translate>
{{File
|name=tensorflow-test.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --gres=gpu:1        # request GPU "generic resource"
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham
#SBATCH --mem=32000M        # memory per node
#SBATCH --time=0-03:00      # time (DD-HH:MM)
#SBATCH --output=%N-%j.out  # %N for node name, %j for jobID

module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
}}
<translate>
<!--T:6-->
while the Python script has the form:
</translate>
{{File
|name=tensorflow-test.py
|lang="python"
|contents=
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0)  # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()  # TensorFlow 1.x session-based execution
print(sess.run([node1, node2]))
}}
<translate>
<!--T:7-->
Once the above job has completed (it should take less than a minute), you should see an output file with a name like <tt>cdr116-122907.out</tt> and contents similar to the following example:
</translate>
{{File
|name=cdr116-122907.out
|lang="text"
|contents=
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla P100-PCIE-12GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
[3.0, 4.0]
}}
<translate>

<!--T:16-->
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB cards with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning or machine learning research. See [[Using GPUs with SLURM]] for more information.
</translate>
==Monitoring==

It is possible to connect to the node running a job and execute processes there. This can be used to monitor the resources used by TensorFlow and to visualize the progress of the training. See [[Running jobs#Attaching to a running job|Attaching to a running job]] for examples.
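For example, once you know the job ID, you can run a process on the job's node to watch GPU utilization (a sketch assuming the attach mechanism described in the page linked above; replace <code>JOBID</code> with your own job ID):
<pre>
srun --jobid JOBID --pty watch -n 30 nvidia-smi
</pre>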

===TensorBoard===

TensorFlow comes with a suite of visualization tools called [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard TensorBoard]. TensorBoard operates by reading TensorFlow events and model files. To learn how to create these files, read the [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard#serializing_the_data TensorBoard tutorial on summaries]. The event files are created in a directory specified by the user, referred to as the '''logdir'''.
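As a minimal illustration of how event files end up in the logdir (a sketch using the TensorFlow 1.x summary API matching the version used on this page; the logdir path <code>logs/demo</code> is arbitrary):
<pre>
import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")
loss = tf.square(x, name="loss")
tf.summary.scalar("loss", loss)    # declare a scalar summary
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # The FileWriter creates the event files that TensorBoard reads.
    writer = tf.summary.FileWriter("logs/demo", sess.graph)
    for step in range(10):
        summary, _ = sess.run([merged, loss], feed_dict={x: float(step)})
        writer.add_summary(summary, step)  # one data point per step
    writer.close()
</pre>
Pointing TensorBoard's <code>--logdir</code> at <code>logs/demo</code> would then plot the recorded loss values.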

The following command will launch TensorBoard:
{{Command2
|tensorboard --logdir{{=}}path/to/logdir --host localhost
}}

Note, however, that TensorBoard requires too much processing power to be run on a login node. Users are strongly encouraged to execute it in parallel with their TensorFlow job. The following submission script gives an example. The source code of <code>mnist_with_summaries.py</code> is available [https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py here].
{{File
|name=tensorboard.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --gres=gpu:1        # request GPU "generic resource"
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham
#SBATCH --mem=32000M        # memory per node
#SBATCH --time=0-01:00      # time (DD-HH:MM)

source tensorflow/bin/activate
tensorboard --logdir=/tmp/tensorflow/mnist/logs/mnist_with_summaries --host localhost &
python mnist_with_summaries.py
}}

Once the job is running, to access TensorBoard with a web browser you need to create a connection between your computer and the compute node running TensorFlow and TensorBoard. To create that connection, use the following command:
{{Command2|prompt=[name@my_computer ~]$
|ssh -J userid@cluster.computecanada.ca -N -f -L localhost:6006:localhost:6006 userid@compute_node}}
Replace <code>userid</code> with your Compute Canada username, <code>cluster</code> with the cluster hostname (e.g. cedar, graham, etc.), and <code>compute_node</code> with the compute node hostname. To retrieve the compute node hostname associated with your <code>JOBID</code>, use the following command:
{{Command2
|squeue --job JOBID -o %N
}}
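For example, if <code>squeue --job 123456 -o %N</code> reported the node <code>cdr116</code> on Cedar, the tunnel command would be (the job ID and node name here are hypothetical):
<pre>
ssh -J userid@cedar.computecanada.ca -N -f -L localhost:6006:localhost:6006 userid@cdr116
</pre>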

Once the connection is created, go to [http://localhost:6006 http://localhost:6006].

==TensorFlow with Multi-GPUs==
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the two most common methods.
*In this section, the [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks] code will be used as an example to explain the different methods. Users can refer to the TensorFlow Benchmarks code to implement their own.
===Parameter Server===
Variables are stored on a parameter server that holds the master copy of each variable. In distributed training, the parameter servers are separate processes on the different devices. At each step, each tower gets a copy of the variables from the parameter server and sends its gradients to the parameter server.

Parameters can be stored on the CPU:
<pre>
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=cpu
</pre>
or on the GPU:
<pre>
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=gpu
</pre>
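A complete invocation on a multi-GPU node might look like the following (a sketch; <code>--num_gpus</code>, <code>--model</code>, and <code>--batch_size</code> are tf_cnn_benchmarks options, and the values shown are illustrative):
<pre>
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --batch_size=32 --variable_update=parameter_server --local_parameter_device=gpu
</pre>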

===Replicated===
With this method, each GPU has its own copy of the variables. To apply gradients, either an all_reduce algorithm or regular cross-device aggregation is used to replicate the combined gradients to all towers, depending on the all_reduce_spec parameter.

The all-reduce method can be the default:
<pre>
python tf_cnn_benchmarks.py --variable_update=replicated
</pre>
xring --- use one global ring reduction for all tensors:
<pre>
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=xring
</pre>
pscpu --- use the CPU at worker 0 to reduce all tensors:
<pre>
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=pscpu
</pre>
nccl --- use NCCL to locally reduce all tensors:
<pre>
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=nccl
</pre>
Different variable management methods perform differently with different models. Users are strongly encouraged to test their own models with all methods on the different types of GPU nodes.

===Benchmarks===
This section presents ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs, using the different methods of managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark code can be found on GitHub: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].
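The configurations in the tables below can be reproduced with commands of the following form (a sketch; this example runs the replicated method with NCCL on 4 GPUs):
<pre>
python tf_cnn_benchmarks.py --num_gpus=4 --model=vgg16 --batch_size=32 --variable_update=replicated --all_reduce_spec=nccl
</pre>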
*ResNet-50
Batch size is 32 per GPU. Data parallelism is used. (Results are in images per second; higher is better.)
{| class="wikitable"
|-
! Node type !! Single-GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl
|-
| Graham GPU node || 171.23 || 2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99
|-
| Cedar GPU Base || 172.99 || 4 || '''662.65''' || 595.43 || 616.02 || 490.03 || 645.04 || 608.95
|-
| Cedar GPU Large || 205.71 || 4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25
|}

*VGG-16
Batch size is 32 per GPU. Data parallelism is used. (Results are in images per second; higher is better.)

{| class="wikitable"
|-
! Node type !! Single-GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl
|-
| Graham GPU node || 115.89 || 2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''
|-
| Cedar GPU Base || 114.77 || 4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''
|-
| Cedar GPU Large || 137.16 || 4 || 175.20 || 379.80 || 336.72 || 417.46 || 225.37 || '''490.52'''
|}
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command2|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command2|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command2|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command2|prompt=(tensorflow)_[name@server ~]$<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command2|prompt=(tensorflow)_[name@server ~]$<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command2|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
</translate><br />
==Monitoring==<br />
<br />
It is possible to connect to the node running a job and execute processes. This can be used to monitor resources used by TensorFlow and to visualize the progress of the training. See [[Running jobs#Attaching to a running job|Attaching to a running job]] for examples.<br />
<br />
===TensorBoard===<br />
<br />
TensorFlow comes with a suite of visualization tools called [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard TensorBoard]. TensorBoard operates by reading TensorFlow events and model files. To know how to create these files, read [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard#serializing_the_data TensorBoard tutorial on summaries]. The event files are created in a directory specified by the user referred to as '''logdir'''.<br />
<br />
The following command will launch TensorBoard:<br />
{{Command2<br />
|tensorboard --logdir{{=}}path/to/logdir --host localhost<br />
}}<br />
<br />
Note, however, thatTensorBoard requires too much processing power to be run on a login node. Users are strongly encouraged to execute it in parallel with their TensorFlow job. The following submit script gives an example. The source code of <code>mnist_with_summaries.py</code> is available [https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py here].<br />
{{File<br />
|name=tensorboard.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=01:00 # time (DD-HH:MM)<br />
<br />
source tensorflow/bin/activate<br />
tensorboard --logdir=/tmp/tensorflow/mnist/logs/mnist_with_summaries --host localhost &<br />
python mnist_with_summaries.py<br />
}}<br />
<br />
Once the job is running, to access TensorBoard with a web browser, you need to create a connection between your computer and the compute node running TensorFlow and TensorBoard. To create that connection, use the following command.<br />
{{Command2|prompt=[name@my_computer ~]$<br />
|ssh -J userid@cluster.computecanada.ca -N -f -L localhost:6006:localhost:6006 userid@compute_node}}<br />
Replace <code>userid</code> by your Compute Canada username, <code>cluster</code> by the cluster hostname (i.e.: Cedar, Graham, etc.), and <code>computenode</code> by the compute node hostname. To retrieve the compute node hostname associated with your <code>JOBID</code> use the following command <br />
{{Command2<br />
|squeue --job JOBID -o %N<br />
}}<br />
<br />
Once the connection is created, go to [http://localhost:6006 http://localhost:6006].<br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods. <br />
*In this section, [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks] code will be used as an example to explain the different methods. Users can reference the TensorFlow Benchmarks code to implement their own.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce method can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=47729TensorFlow2018-03-02T16:08:16Z<p>Feimao: /* Benchmarks */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command2|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command2|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command2|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command2|prompt=(tensorflow)_[name@server ~]$<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command2|prompt=(tensorflow)_[name@server ~]$<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command2|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
</translate><br />
==Monitoring==<br />
<br />
It is possible to connect to the node running a job and execute processes. This can be used to monitor resources used by TensorFlow and to visualize the progress of the training. See [[Running jobs#Attaching to a running job|Attaching to a running job]] for examples.<br />
<br />
===TensorBoard===<br />
<br />
TensorFlow comes with a suite of visualization tools called [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard TensorBoard]. TensorBoard operates by reading TensorFlow events and model files. To know how to create these files, read [https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard#serializing_the_data TensorBoard tutorial on summaries]. The event files are created in a directory specified by the user referred to as '''logdir'''.<br />
<br />
The following command will launch TensorBoard:<br />
{{Command2<br />
|tensorboard --logdir{{=}}path/to/logdir --host localhost<br />
}}<br />
<br />
Note, however, thatTensorBoard requires too much processing power to be run on a login node. Users are strongly encouraged to execute it in parallel with their TensorFlow job. The following submit script gives an example. The source code of <code>mnist_with_summaries.py</code> is available [https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py here].<br />
{{File<br />
|name=tensorboard.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=01:00 # time (DD-HH:MM)<br />
<br />
source tensorflow/bin/activate<br />
tensorboard --logdir=/tmp/tensorflow/mnist/logs/mnist_with_summaries --host localhost &<br />
python mnist_with_summaries.py<br />
}}<br />
<br />
Once the job is running, to access TensorBoard with a web browser, you need to create a connection between your computer and the compute node running TensorFlow and TensorBoard. To create that connection, use the following command.<br />
{{Command2|prompt=[name@my_computer ~]$<br />
|ssh -J userid@cluster.computecanada.ca -N -f -L localhost:6006:localhost:6006 userid@compute_node}}<br />
Replace <code>userid</code> by your Compute Canada username, <code>cluster</code> by the cluster hostname (i.e.: Cedar, Graham, etc.), and <code>computenode</code> by the compute node hostname. To retrieve the compute node hostname associated with your <code>JOBID</code> use the following command <br />
{{Command2<br />
|squeue --job JOBID -o %N<br />
}}<br />
<br />
Once the connection is created, go to [http://localhost:6006 http://localhost:6006].<br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods. <br />
*In this section, [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks] code will be used as an example to explain the different methods. Users can reference the TensorFlow Benchmarks code to implement their own.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce method can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46442TensorFlow2018-02-12T21:01:30Z<p>Feimao: /* TensorFlow with Multi-GPUs */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods. <br />
*In this section, [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks] code will be used as an example to explain the different methods. Users can reference the TensorFlow Benchmarks code to implement their own.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce method can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46440TensorFlow2018-02-12T20:31:20Z<p>Feimao: /* Replicated */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce method can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46439TensorFlow2018-02-12T20:30:32Z<p>Feimao: /* Parameter Server */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
--variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce methods can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU and data parallelism is used. (Results are in images per second; higher is better.)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimao
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices. For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce methods can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46437TensorFlow2018-02-12T20:29:49Z<p>Feimao: /* TensorFlow with Multi-GPUs */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods.<br />
===Parameter Server===<br />
Variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices (GPUs). For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce methods can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46436TensorFlow2018-02-12T20:27:15Z<p>Feimao: </p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install a TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2 but you can also install other Python versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==TensorFlow with Multi-GPUs==<br />
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the most two common methods.<br />
===Parameter Server===<br />
variables are stored on a parameter server that holds the master copy of the variable. In distributed training, the parameter servers are separate processes in the different devices (GPUs). For each step, each tower gets a copy of the variables from the parameter server, and sends its gradients to the param server.<br />
<br />
Parameters can be stored in CPU:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
</pre><br />
or GPU:<br />
<pre><br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
</pre><br />
<br />
===Replicated===<br />
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on all_reduce_spec parameter setting).<br />
<br />
All reduce methods can be default:<br />
<pre><br />
--variable_update=replicated<br />
</pre><br />
Xring --- use one global ring reduction for all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=xring<br />
</pre><br />
Pscpu --- use CPU at worker 0 to reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=pscpu<br />
</pre><br />
NCCL --- use NCCL to locally reduce all tensors:<br />
<pre><br />
--variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
<br />
===Benchmarks===<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
*ResNet-50<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
*VGG-16<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46074TensorFlow2018-02-05T20:47:15Z<p>Feimao: /* ResNet-50 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For other devices, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || 172.99|| 4 || '''662.65''' ||595.43 || 616.02 || 490.03|| 645.04 || 608.95<br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46069TensorFlow2018-02-05T20:20:52Z<p>Feimao: /* Data Format */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For other devices, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46068TensorFlow2018-02-05T19:32:58Z<p>Feimao: /* VGG-16 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46067TensorFlow2018-02-05T19:32:46Z<p>Feimao: /* ResNet-50 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second", higher the better)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46066TensorFlow2018-02-05T19:32:30Z<p>Feimao: /* ResNet-50 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB cards with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning or machine learning research; a sketch of a job script requesting such a node is shown below. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
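The job script below is a minimal sketch of such a whole-node request. The exact <code>--gres</code> string, the core count, and the script names are assumptions for illustration; check [[Using GPUs with SLURM]] for the current syntax.<br />
{{File<br />
|name=tensorflow-multigpu.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:lgpu:4    # assumed gres string for a whole Cedar GPU large node (4 x P100-16GB)<br />
#SBATCH --cpus-per-task=24   # assumed whole-node core count on this node type<br />
#SBATCH --mem=0              # request all of the memory on the node<br />
#SBATCH --time=0-03:00       # time (DD-HH:MM)<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./my-multigpu-script.py  # hypothetical multi-GPU training script<br />
}}<br />
<br />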
<br />
==Benchmarks==<br />
This section presents ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar, using single and multiple GPUs and different methods of managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) was used. The benchmark code can be found on GitHub: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable management methods perform differently with different models. Users are strongly encouraged to test their own models with all of these methods on the different types of GPU nodes; an example invocation is shown at the end of the next subsection.<br />
===Data Format===<br />
For GPUs used with cuDNN, NCHW should be used; it is the default data format for this benchmark code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For other devices, NHWC (the TensorFlow-native format) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
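As a concrete example, a single-node ResNet-50 run on a 4-GPU node with the ''replicated'' method and NCCL all-reduce might be launched as follows (a sketch: the script lives in the benchmark repository linked above, and the flag set shown here assumes the options documented there):<br />
<pre><br />
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=4 \<br />
  --variable_update=replicated --all_reduce_spec=nccl --data_format=NCHW<br />
</pre><br />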
<br />
===ResNet-50===<br />
Batch size is 32 per GPU and data parallelism is used. (Results are in images per second; larger is better.)<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single-GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23 || 2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || ||<br />
|-<br />
| Cedar GPU Large || 205.71 || 4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
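As a rough way to read these numbers, scaling efficiency can be estimated as (multi-GPU throughput) / (number of GPUs × single-GPU baseline); for example, the best Cedar GPU Large result above gives 754.35 / (4 × 205.71) ≈ 92%.<br />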
<br />
===VGG-16===<br />
Batch size is 32 per GPU and data parallelism is used. (Results are in images per second; larger is better.)<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single-GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89 || 2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 || 4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 || 4 || 175.20 || 379.80 || 336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46064TensorFlow2018-02-05T19:27:07Z<p>Feimao: /* Benchmarks */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. TensorFlow v1.5 (built with CUDA9 and cuDNN 7) is used. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks]. <br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46063TensorFlow2018-02-05T19:25:52Z<p>Feimao: /* VGG-16 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46062TensorFlow2018-02-05T19:25:01Z<p>Feimao: /* Benchmarks */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
*Different variable managing methods perform differently with different models. Users are highly recommended to test their own models with all methods on different types of GPU node.<br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. NCCL runs the best for all kinds of node types. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46061TensorFlow2018-02-05T19:19:42Z<p>Feimao: /* ResNet-50 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. NCCL runs the best for all kinds of node types. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46060TensorFlow2018-02-05T19:19:25Z<p>Feimao: /* VGG-16 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. Results in "images per second".<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. NCCL runs the best for all kinds of node types. (Results in "images per second")<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46059TensorFlow2018-02-05T19:19:11Z<p>Feimao: /* ResNet-50 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section will give ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs using different methods for managing variables. The benchmark can be found on github: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
===Data Format===<br />
For GPU which uses cuDNN, NCHW should be used. This is the default one for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For others, NHWC (TF native) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used. Results in "images per second".<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. NCCL runs the best for all kinds of node types.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps,cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=46058TensorFlow2018-02-05T19:16:25Z<p>Feimao: /* VGG-16 */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
<!--T:18--><br />
[https://www.tensorflow.org/ TensorFlow] is "an open-source software library for Machine Intelligence".<br />
<br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install Tensorflow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install TensorFlow's wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The below instructions install for Python 3.5.2 but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load modules required by TensorFlow:<br />
{{Command|module load python/3.5.2}}<br />
<br />
<!--T:17--><br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
<br />
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections. <br />
=== CPU-only === <!--T:8--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-cpu}}<br />
<br />
=== GPU === <!--T:9--><br />
<br />
<!--T:10--><br />
{{Command|prompt=(tensorflow) [name@server $]<br />
|pip install tensorflow-gpu}}<br />
<br />
==Submitting a TensorFlow job with a GPU== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the content<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<translate><br />
<br />
<!--T:16--><br />
TensorFlow can run on all GPU node types. Cedar's ''GPU large'' node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale Deep Learning or Machine Learning research. See [[Using GPUs with SLURM]] for more information.<br />
<br />
</translate><br />
<br />
==Benchmarks==<br />
This section gives ResNet-50 and VGG-16 benchmarking results (throughput, in images per second) on both Graham and Cedar, with single and multiple GPUs, using different methods for managing variables. The benchmark code can be found on GitHub: [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].<br />
<br />
Methods of managing variables:<br />
<pre><br />
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu<br />
<br />
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu<br />
<br />
replicated: --variable_update=replicated<br />
<br />
replicated, xring: --variable_update=replicated --all_reduce_spec=xring<br />
<br />
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu<br />
<br />
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />
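For example, a typical invocation of the benchmark script with one of these methods might look as follows; this is a sketch, with the script taken from the repository linked above and the model and batch size matching the tables below:<br />
<pre><br />
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 \<br />
       --num_gpus=2 --variable_update=replicated --all_reduce_spec=nccl<br />
</pre><br />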
===Data Format===<br />
For GPUs, which use cuDNN, NCHW should be used; it is the default for this benchmarking code.<br />
<pre><br />
--data_format=NCHW<br />
</pre><br />
For other devices, NHWC (the TensorFlow-native format) is suggested.<br />
<pre><br />
--data_format=NHWC<br />
</pre><br />
<br />
===ResNet-50===<br />
Batch size is 32 per GPU. Data parallelism is used.<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 171.23||2 || 93.31 || '''324.04''' || 318.33 || 316.01 || 109.82 || 315.99<br />
|-<br />
| Cedar GPU Base || || 4 || || || || || || <br />
|-<br />
| Cedar GPU Large || 205.71 ||4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25<br />
|}<br />
<br />
===VGG-16===<br />
Batch size is 32 per GPU. Data parallelism is used. NCCL performs best on all node types.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Node type !! Single GPU baseline !! Number of GPUs !! ps, cpu !! ps, gpu !! replicated !! replicated, xring !! replicated, pscpu !! replicated, nccl<br />
|-<br />
| Graham GPU node || 115.89||2 || 91.29 || 194.46 || 194.43 || 203.83 || 132.19 || '''219.72'''<br />
|-<br />
| Cedar GPU Base || 114.77 ||4 || 232.85 || 280.69 || 274.41 || 341.29 || 330.04 || '''388.53'''<br />
|-<br />
| Cedar GPU Large || 137.16 ||4 || 175.20 || 379.80 ||336.72 || 417.46 || 225.37 || '''490.52'''<br />
|}</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=44279Graham2017-12-13T16:51:17Z<p>Feimao: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
<!--T:27--><br />
</noinclude><br />
{| class="wikitable"<br />
|-<br />
| Expected availability: In production; RAC 2017 allocations implemented June 30, 2017<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham-dtn'''<br />
|}<br />
<br />
<!--T:2--><br />
Graham is a heterogeneous cluster, suitable for a variety of workloads, located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
<!--T:33--><br />
[https://docs.computecanada.ca/wiki/Getting_Started_with_the_new_National_Systems Getting started with Graham]<br />
<br />
<!--T:34--><br />
[https://docs.computecanada.ca/wiki/Running_jobs How to run jobs]<br />
<br />
=Attached storage systems= <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Location of home directories.<br />
* Each home directory has a small, fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]]. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
* Has daily backup.<br />
|-<br />
| '''Scratch space'''<br />3.6PB total volume<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Not allocated.<br />
* Large fixed [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per user.<br />
* Inactive data will be purged.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
* Large adjustable [[Storage and file management#Filesystem_Quotas_and_Policies|quota]] per project.<br />
* Has daily backup.<br />
|}<br />
<br />
=High-performance interconnect= <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency, high-bandwidth InfiniBand fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; i.e., even for jobs running on multiple islands, Graham provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
=Node types and characteristics= <!--T:5--><br />
A total of 35,520 cores and 320 GPU devices, spread across 1,107 nodes of different types.<br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz<br />
<br />
<!--T:26--><br />
''GPU type:'' NVIDIA P100 (12GB)<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| base nodes || 864 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| large nodes (cloud configuration) || 56 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| GPU nodes || 160 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12GB HBM2 memory). Intel "Broadwell" CPUs at 2.1GHz, model E5-2683 v4. 1.6TB NVMe SSD.<br />
|-<br />
| bigmem500 nodes || 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| bigmem3000 nodes || 3 nodes || 3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1GHz, model E7-4850 v4. 960GB SATA SSD.<br />
<br />
<!--T:35--><br />
|}<br />
<br />
<!--T:7--><br />
Local (on-node) storage in the above nodes is in /tmp. Best practice is to use the temporary directory generated by [[Running jobs|Slurm]], $SLURM_TMPDIR.<br />
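For example, a minimal sketch of a job script that stages data through the node-local directory; the file and program names are illustrative:<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --time=0-01:00<br />
#SBATCH --account=def-someuser<br />
cp ~/scratch/input.dat $SLURM_TMPDIR/   # stage input to fast node-local storage<br />
cd $SLURM_TMPDIR<br />
./program input.dat                     # program name is illustrative<br />
cp output.dat ~/scratch/                # copy results back before the job ends<br />
</pre><br />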
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=40230Using GPUs with Slurm2017-10-25T01:03:00Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
== Available hardware == <!--T:1--><br />
These are the node types containing GPUs currently available on [[Cedar]] and [[Graham]]:<br />
<br />
<!--T:2--><br />
{| class="wikitable"<br />
|-<br />
! # of nodes !! Node type !! CPU cores !! CPU memory !! # of GPUs !! GPU type !! PCIe bus topology<br />
|-<br />
| 114 || Cedar Base GPU || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
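You can confirm the PCIe bus topology yourself from within a job running on one of these nodes:<br />
<pre><br />
nvidia-smi topo -m<br />
</pre><br />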
== Single-core job == <!--T:3--><br />
If you need only a single CPU core and one GPU:<br />
{{File<br />
|name=gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPUs (per node)<br />
#SBATCH --mem=4000M # memory (per node)<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
./program<br />
}}<br />
<br />
== Multi-threaded job == <!--T:4--><br />
For GPU jobs that use multiple CPU cores on a single node:<br />
{{File<br />
|name=gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
On Cedar, we recommend that multi-threaded jobs use no more than 6 CPU cores for each GPU requested. On Graham, we recommend no more than 16 CPU cores for each GPU.<br />
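On Graham, for instance, the resource request in the script above might become the following; this is a sketch of the changed directives only:<br />
<pre><br />
#SBATCH --gres=gpu:1        # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=16  # Up to 16 CPU cores per GPU on Graham<br />
</pre><br />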
<br />
== MPI job == <!--T:5--><br />
{{File<br />
|name=gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of nodes<br />
#SBATCH --ntasks=48 # Number of MPI processes (48 = 2 nodes x 24 cores on Cedar GPU nodes)<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI process<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Whole nodes == <!--T:6--><br />
If your application can efficiently use an entire node and its associated GPUs, you will probably experience shorter wait times if you ask Slurm for a whole node. Use one of the following job scripts as a template. <br />
<br />
=== Scheduling a GPU node at Graham === <!--T:7--><br />
{{File<br />
|name=graham_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:2<br />
#SBATCH --ntasks-per-node=32<br />
#SBATCH --mem=128000M<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Scheduling a Base GPU node at Cedar === <!--T:8--><br />
{{File<br />
|name=cedar_gpu_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --gres=gpu:4<br />
#SBATCH --exclusive<br />
#SBATCH --mem=125G<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
nvidia-smi<br />
}}<br />
<br />
=== Scheduling a Large GPU node at Cedar === <!--T:9--><br />
<br />
<!--T:10--><br />
There is a special group of large-memory GPU nodes at [[Cedar]] which have four Tesla P100 16GB cards each. (Other GPUs in the cluster have 12GB.) These GPUs all use the same PCI switch so the inter-GPU communication latency is lower, but bandwidth between CPU and GPU is lower than on the regular GPU nodes. The nodes also have 256 GB RAM instead of 128GB. In order to use these nodes you must specify <code>lgpu</code>. By-gpu requests can '''only run up to 24 hours'''.<br />
<br />
<!--T:11--><br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --gres=gpu:lgpu:4 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on Cedar GPU nodes<br />
#SBATCH --time=3:00<br />
#SBATCH --account=def-someuser<br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=39081Running jobs2017-10-04T15:25:26Z<p>Feimao: Marked this version for translation</p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)<br />
<br />
<!--T:106--><br />
A default memory amount of 256 MB per core will be allocated unless you make some other memory request with <code>--mem-per-cpu</code> (memory per core) or <code>--mem</code> (memory per node).<br />
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project] (RAP).<br />
<br />
<!--T:107--><br />
If you try to submit a job with <code>sbatch</code> and receive one of these messages:<br />
<pre><br />
You are associated with multiple _cpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
<br />
<!--T:108--><br />
You are associated with multiple _gpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
</pre> <br />
then you have more than one valid account, and you will have to specify one<br />
using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
To find out which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and click on "My Account -> Account Details". You will see a list of all the projects <br />
you are a member of. The string you should use with the <code>--account</code> directive for <br />
a given project is under the column '''Group Name'''. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs submitted with <code>--account=def-rdickson</code> will be accounted against RAP wnp-003-aa.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the following three environment variables in your <code>~/.bashrc</code> file:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
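For example, to charge a single job to a different project without editing your <code>~/.bashrc</code> (the account name here is illustrative):<br />
 [someuser@host ~]$ sbatch --account=def-otheruser simple_job.sh<br />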
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
Large MPI jobs, specifically those which can efficiently use a multiple of 32 cores, should use <code>--nodes</code> and <code>--ntasks-per-node</code> instead of <code>--ntasks</code>. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
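For example, a sketch of a whole-node variant of the same job on a cluster with 32-core nodes; the file name is illustrative:<br />
{{File<br />
|name=mpi_whole_node_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --nodes=2 # number of whole nodes<br />
#SBATCH --ntasks-per-node=32 # one MPI process per core<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program<br />
}}<br />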
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with eight CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=8<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
This example is a serial job with one [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units GPU] allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.<br />
<br />
<!--T:25--><br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:49--><br />
Because no node count is specified in the above example, one node will be allocated. If you were to add <code>--nodes=3</code>, the total memory allocated would be 12000M. The same goes for <code>--gres</code>: If you request three nodes, you will get one GPU per node, for a total of three.<br />
<br />
<!--T:86--><br />
All GPU resources on [[Cedar]] have four GPUs per node, [[Graham]] GPU nodes have two GPUs per node. The following example requests all the GPUs on one node.<br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # request GPU "generic resource", 4 on Cedar, 2 on Graham<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
==== Cedar large-memory GPUs ==== <!--T:104--><br />
<br />
<!--T:87--><br />
There is a special group of large-memory GPU nodes at [[Cedar]] which have four Tesla P100 16GB cards each. (Other GPUs in the cluster have 12GB.) These GPUs all use the same PCI switch so the inter-GPU communication latency is lower, but bandwidth between CPU and GPU is lower than on the regular GPU nodes. The nodes also have 256 GB RAM instead of 128GB. In order to use these nodes you must specify <code>lgpu</code>. By-gpu requests can '''only run up to 24 hours'''.<br />
<br />
<!--T:110--><br />
The job submission script for a by-gpu job should have the contents<br />
<br />
<!--T:105--><br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=6 # There are 24 CPU cores on Cedar GPU nodes, up to 6 per GPU<br />
#SBATCH --gres=gpu:lgpu:1 # Ask for 1 GPU per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 <br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:111--><br />
The job submission script for a whole-node (4 GPUs) job should have the contents<br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on Cedar GPU nodes<br />
#SBATCH --gres=gpu:lgpu:4 # Ask for 4 GPUs per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 <br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:26--><br />
For more on running GPU jobs, see [[Using GPUs with SLURM]].<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
sbatch --array=1-7:2 ... # Step-size of 2, does the same as the previous example<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
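<br />
As a minimal sketch of how this variable might be used inside a job script (the program name and input-file naming scheme are assumptions for illustration):<br />
{{File<br />
|name=array_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-00:05<br />
#SBATCH --array=0-7<br />
# Each instance of the array processes a different input file.<br />
./my_program input_${SLURM_ARRAY_TASK_ID}.dat<br />
}}<br />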
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and iPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:30--><br />
For more details see [[Interactive jobs]].<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
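<br />
For example:<br />
 sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode<br />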
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
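<br />
For example, to check the memory use of a running job's batch step (the <code>.batch</code> suffix selects the step started by the job script):<br />
 sstat -j <jobid>.batch --format=JobID,MaxRSS<br />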
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system, <br />
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a ''checkpoint file'', and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
For many users restarting a calculation will be rare and may be done manually, <br />
but some workflows require frequent restarts. <br />
In this case some kind of automation technique may be employed. <br />
<br />
<!--T:77--><br />
Here are two recommended methods of automatic restarting:<br />
* Using SLURM '''job arrays'''.<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Restarting using job arrays === <!--T:90--><br />
<br />
<!--T:91--><br />
Using the <code>--array=1-100%10</code> syntax mentioned earlier, one can submit a collection of identical jobs such that only one of them will run at any given time.<br />
The script should be written to ensure that the last checkpoint is always used for the next job. The number of restarts is fixed by the <code>--array</code> argument.<br />
<br />
<!--T:78--><br />
Consider, for example, a molecular dynamics simulation that has to run for 1 000 000 steps, but does not fit into the time limit on the cluster. <br />
We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another. <br />
<br />
<!--T:79--><br />
An example of using a job array to restart a simulation:<br />
{{File<br />
|name=job_array_restart.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for a multi-step job on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:92--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:93--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
=== Resubmission from the job script === <!--T:94--><br />
<br />
<!--T:95--><br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. <br />
Once the chunk is done but before the allocated run-time of the job has elapsed,<br />
the script checks if the end of the calculation has been reached.<br />
If the calculation is not yet finished, the script submits a copy of itself to continue working.<br />
<br />
<!--T:96--><br />
An example of a job script with resubmission:<br />
{{File<br />
|name=job_resubmission.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:100--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:101--><br />
# Resubmit if not all work has been done yet.<br />
# You must define the function end_is_not_reached().<br />
if end_is_not_reached; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
<!--T:102--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
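<br />
The test for completion is entirely application-specific; the function below is a minimal sketch, assuming (purely for illustration) that the simulation appends the last completed step number to a file named <code>steps.log</code> and that 1 000 000 steps are required in total.<br />
<source lang="bash"><br />
# Hypothetical helper for the script above: returns success (0)<br />
# while more work remains, so that the job resubmits itself.<br />
end_is_not_reached () {<br />
    local last_step=0<br />
    if [ -f steps.log ]; then<br />
        last_step=$(tail -n 1 steps.log)<br />
    fi<br />
    # More work remains while fewer than 1000000 steps are done.<br />
    [ "$last_step" -lt 1000000 ]<br />
}<br />
</source><br />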
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters. <br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
<br />
==== Cancellation of jobs with dependency conditions which cannot be met ==== <!--T:109--><br />
A job submitted with <code>--dependency=afterok:<jobid></code> is a "dependent job". A dependent job will wait for the parent job to complete. If the parent job fails (that is, ends with a non-zero exit code), the dependent job can never be scheduled and so will be automatically cancelled. See [https://slurm.schedmd.com/sbatch.html sbatch] for more on dependency.<br />
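<br />
For example, assuming job 123456 is the parent job (the script name is illustrative):<br />
 sbatch --dependency=afterok:123456 dependent_job.sh<br />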
<br />
== Job status and priority == <!--T:103--><br />
* For a discussion of how job priority is determined and how things like time limits may affect the scheduling of your jobs at Cedar and Graham, see [[Job scheduling policies]].<br />
<br />
== Further reading == <!--T:44--><br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* Here is a text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium<br />
* Here is a rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing]<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=39080Running jobs2017-10-04T15:23:58Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)<br />
<br />
<!--T:106--><br />
A default memory amount of 256 MB per core will be allocated unless you make some other memory request with <code>--mem-per-cpu</code> (memory per core) or <code>--mem</code> (memory per node).<br />
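<br />
For example, to request 2 GB of memory per core (the value is only an illustration):<br />
 #SBATCH --mem-per-cpu=2048M<br />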
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file whose name begins with "slurm-" and ends with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
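<br />
For example (the filename patterns here are illustrative):<br />
 #SBATCH --output=%x-%j.out<br />
 #SBATCH --error=%x-%j.err<br />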
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project] (RAP).<br />
<br />
<!--T:107--><br />
If you try to submit a job with <code>sbatch</code> and receive one of these messages:<br />
<pre><br />
You are associated with multiple _cpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
<br />
<!--T:108--><br />
You are associated with multiple _gpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
</pre> <br />
then you have more than one valid account, and you will have to specify one<br />
using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
To find out which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and click on "My Account -> Account Details". You will see a list of all the projects <br />
you are a member of. The string you should use with the <code>--account</code> for <br />
a given project is under the column '''Group Name'''. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs submitted with <code>--account=def-rdickson</code> will be accounted against RAP wnp-003-aa.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the following three environment variables in your <code>~/.bashrc</code> file:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
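<br />
For example, to override the environment variable for a single submission (the account name is illustrative):<br />
 [someuser@host ~]$ sbatch --account=def-otheruser simple_job.sh<br />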
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
Large MPI jobs, specifically those which can efficiently use a multiple of 32 cores, should use <code>--nodes</code> and <code>--ntasks-per-node</code> instead of <code>--ntasks</code>. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with eight CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=8<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
This example is a serial job with one [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units GPU] allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.<br />
<br />
<!--T:25--><br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:49--><br />
Because no node count is specified in the above example, one node will be allocated. If you were to add <code>--nodes=3</code>, the total memory allocated would be 12000M. The same goes for <code>--gres</code>: If you request three nodes, you will get one GPU per node, for a total of three.<br />
<br />
<!--T:86--><br />
All GPU nodes on [[Cedar]] have four GPUs per node, while [[Graham]] GPU nodes have two GPUs per node. The following example requests all the GPUs on one node.<br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # request GPU "generic resource", 4 on Cedar, 2 on Graham<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
==== Cedar large-memory GPUs ==== <!--T:104--><br />
<br />
<!--T:87--><br />
There is a special group of large-memory GPU nodes at [[Cedar]] which have four Tesla P100 16GB cards each. (Other GPUs in the cluster have 12GB.) These GPUs all use the same PCI switch, so the inter-GPU communication latency is lower, but the bandwidth between CPU and GPU is lower than on the regular GPU nodes. The nodes also have 256 GB of RAM instead of 128 GB. In order to use these nodes you must specify <code>lgpu</code>. Single-GPU requests on these nodes can '''only run for up to 24 hours'''.<br />
<br />
The job submission script for a single-GPU job should have the contents<br />
<br />
<!--T:105--><br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=6 # There are 24 CPU cores on Cedar GPU nodes, up to 6 per GPU<br />
#SBATCH --gres=gpu:lgpu:1 # Ask for 1 GPU per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 <br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
The job submission script for a whole-node (4 GPUs) job should have the contents<br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on Cedar GPU nodes<br />
#SBATCH --gres=gpu:lgpu:4 # Ask for 4 GPUs per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 <br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:26--><br />
For more on running GPU jobs, see [[Using GPUs with SLURM]].<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
sbatch --array=1-7:2 ... # Step-size of 2, does the same as the previous example<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and iPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:30--><br />
For more details see [[Interactive jobs]].<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system, <br />
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a ''checkpoint file'', and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
For many users restarting a calculation will be rare and may be done manually, <br />
but some workflows require frequent restarts. <br />
In this case some kind of automation technique may be employed. <br />
<br />
<!--T:77--><br />
Here are two recommended methods of automatic restarting:<br />
* Using SLURM '''job arrays'''.<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Restarting using job arrays === <!--T:90--><br />
<br />
<!--T:91--><br />
Using the <code>--array=1-100%10</code> syntax mentioned earlier, one can submit a collection of identical jobs such that only one of them will run at any given time.<br />
The script should be written to ensure that the last checkpoint is always used for the next job. The number of restarts is fixed by the <code>--array</code> argument.<br />
<br />
<!--T:78--><br />
Consider, for example, a molecular dynamics simulation that has to run for 1 000 000 steps, but does not fit into the time limit on the cluster. <br />
We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another. <br />
<br />
<!--T:79--><br />
An example of using a job array to restart a simulation:<br />
{{File<br />
|name=job_array_restart.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for a multi-step job on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:92--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:93--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
=== Resubmission from the job script === <!--T:94--><br />
<br />
<!--T:95--><br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. <br />
Once the chunk is done but before the allocated run-time of the job has elapsed,<br />
the script checks if the end of the calculation has been reached.<br />
If the calculation is not yet finished, the script submits a copy of itself to continue working.<br />
<br />
<!--T:96--><br />
An example of a job script with resubmission:<br />
{{File<br />
|name=job_resubmission.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:100--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:101--><br />
# Resubmit if not all work has been done yet.<br />
# You must define the function end_is_not_reached().<br />
if end_is_not_reached; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
<!--T:102--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters. <br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
<br />
==== Cancellation of jobs with dependency conditions which cannot be met ==== <!--T:109--><br />
A job submitted with <code>--dependency=afterok:<jobid></code> is a "dependent job". A dependent job will wait for the parent job to complete. If the parent job fails (that is, ends with a non-zero exit code), the dependent job can never be scheduled and so will be automatically cancelled. See [https://slurm.schedmd.com/sbatch.html sbatch] for more on dependency.<br />
<br />
== Job status and priority == <!--T:103--><br />
* For a discussion of how job priority is determined and how things like time limits may affect the scheduling of your jobs at Cedar and Graham, see [[Job scheduling policies]].<br />
<br />
== Further reading == <!--T:44--><br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* Here is a text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium<br />
* Here is a rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing]<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=38600TensorFlow2017-09-21T15:16:41Z<p>Feimao: /* Using Cedar's large GPU nodes */</p>
<hr />
<div><languages /><br />
[[Category:Software]]<br />
<translate><br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install TensorFlow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install the TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2, but you can also install for Python 3.5.Y or 2.7.X by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load the modules required by TensorFlow:<br />
{{Command|module load cuda cudnn python/3.5.2}}<br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
Install TensorFlow into your newly created virtual environment:<br />
{{Command|pip install tensorflow}}<br />
<br />
==Submitting a TensorFlow job== <!--T:5--><br />
Once you have the above setup completed you can submit a TensorFlow job as<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the contents<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<br />
==Using Cedar's large GPU nodes== <br />
TensorFlow can run on all GPU node types on Cedar and Graham. Cedar's large GPU node type, which is equipped with 4 x P100-PCIE-16GB cards with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning and machine learning research.<br />
<br />
Large GPU nodes on Cedar accept whole-node jobs and single-GPU jobs, but single-GPU requests can '''only run for up to 24 hours'''. The job submission script for a single-GPU job should have the contents<br />
{{File<br />
|name=tensorflow-lgpu-single.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # request one node<br />
#SBATCH --ntasks-per-node=1 <br />
#SBATCH --cpus-per-task=6 # the node has 24 CPU cores; each GPU should use up to 6 of them <br />
#SBATCH --gres=gpu:lgpu:1 # lgpu is required for using large GPU nodes<br />
#SBATCH --mem=60G # total memory per node is about 250GB; each single-GPU job can request up to 60G<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<br />
The job submission script for a whole node (4 GPUs) job should have the contents<br />
{{File<br />
|name=tensorflow-lgpu-whole-node.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # request number of whole nodes<br />
#SBATCH --ntasks-per-node=1 <br />
#SBATCH --cpus-per-task=24 # Total CPU cores should be 24.<br />
#SBATCH --gres=gpu:lgpu:4 # lgpu is required for using large GPU nodes<br />
#SBATCH --mem=250G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
===Packing single-GPU jobs within one SLURM job===<br />
Cedar's large GPU nodes are primarily recommended for running Deep Learning models which can be accelerated by multiple GPUs. If a user instead needs to run 4 single-GPU programs or 2 two-GPU programs on a node for '''longer than 24 hours''', [https://www.gnu.org/software/parallel/ GNU Parallel] is recommended. A simple example is given below:<br />
<pre><br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'<br />
</pre><br />
The GPU id is calculated as the slot id {%} minus 1; {#} is the sequential job number, starting from 1.<br />
<br />
The params.input file should include one input parameter per line, for example:<br />
<pre><br />
code1.py<br />
code2.py<br />
code3.py<br />
code4.py<br />
...<br />
</pre><br />
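<br />
As a sketch, a whole-node submission script wrapping this command could look like the following; the resource requests mirror the whole-node example above, and the module versions, time limit, and file names are assumptions for illustration.<br />
<pre><br />
#!/bin/bash<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks-per-node=1<br />
#SBATCH --cpus-per-task=24       # all 24 CPU cores on the node<br />
#SBATCH --gres=gpu:lgpu:4        # all 4 GPUs on a large GPU node<br />
#SBATCH --mem=250G<br />
#SBATCH --time=2-00:00           # whole-node jobs may run longer than 24 hours<br />
#SBATCH --output=%N-%j.out<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
# Run up to 4 single-GPU programs at a time, one GPU each.<br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'<br />
</pre><br />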
With this method, a user can run multiple programs in one submission. In this case, GNU Parallel runs at most 4 jobs at a time, launching the next job as soon as one finishes. CUDA_VISIBLE_DEVICES is used to restrict each program to a single GPU.</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=37875Running jobs2017-08-30T15:45:58Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)<br />
<br />
<!--T:106--><br />
A default memory amount of 256 MB per core will be allocated unless you make some other memory request with <code>--mem-per-cpu</code> (memory per core) or <code>--mem</code> (memory per node).<br />
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file whose name begins with "slurm-" and ends with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project] (RAP).<br />
<br />
<!--T:107--><br />
If you try to submit a job with <code>sbatch</code> and receive one of these messages:<br />
<pre><br />
You are associated with multiple _cpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
<br />
<!--T:108--><br />
You are associated with multiple _gpu allocations...<br />
Please specify one of the following accounts to submit this job:<br />
</pre> <br />
then you have more than one valid account, and you will have to specify one<br />
using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
To find out which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and click on "My Account -> Account Details". You will see a list of all the projects <br />
you are a member of. The string you should use with the <code>--account</code> for <br />
a given project is under the column '''Group Name'''. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs submitted with <code>--account=def-rdickson</code> will be accounted against RAP wnp-003-aa.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the following three environment variables in your <code>~/.bashrc</code> file:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
One can have detailed control over the location of MPI processes by, for example, requesting a specific number of processes per node. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with eight CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=8<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
This example is a serial job with one [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units GPU] allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.<br />
<br />
<!--T:25--><br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:49--><br />
Because no node count is specified in the above example, one node will be allocated. If you were to add <code>--nodes=3</code>, the total memory allocated would be 12000M. The same goes for <code>--gres</code>: If you request three nodes, you will get one GPU per node, for a total of three.<br />
<br />
<!--T:86--><br />
All GPU nodes on [[Cedar]] have four GPUs per node, while [[Graham]] GPU nodes have two GPUs per node. The following example requests all the GPUs on one node.<br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # request GPU "generic resource", 4 on Cedar, 2 on Graham<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
nvidia-smi<br />
}}<br />
<br />
==== Cedar large-memory GPUs ==== <!--T:104--><br />
<br />
<!--T:87--><br />
There is a special group of large-memory GPU nodes at [[Cedar]] which have four Tesla P100 16GB cards each. (Other GPUs in the cluster have 12GB.) These GPUs all use the same PCI switch, so the inter-GPU communication latency is lower, but the bandwidth is also lower than on the regular GPU nodes. The nodes also have 256 GB of RAM instead of 128 GB. In order to use these nodes you must request all four GPUs, that is, the whole node, and you must specify <code>lgpu</code>, as shown in this example:<br />
<br />
<!--T:105--><br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1<br />
#SBATCH --cpus-per-task=24 # There are 24 CPU cores on Cedar GPU nodes<br />
#SBATCH --gres=gpu:lgpu:4 # Ask for 4 GPUs per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 <br />
hostname<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:26--><br />
For more on running GPU jobs, see [[Using GPUs with SLURM]].<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
sbatch --array=1-7:2 ... # Step-size of 2, does the same as the previous example<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and iPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:30--><br />
For more details see [[Interactive jobs]].<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
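<br />
For example:<br />
 sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode<br />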
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
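<br />
For example, to check the memory use of a running job:<br />
 sstat -j <jobid> --format=JobID,MaxRSS<br />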
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation requires more time than the job time limits on the system allow,<br />
the application you are running must support [[Points de contrôle/en|checkpointing]]. The application should be able to save its state to a file, called a ''checkpoint file'', and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
For many users, restarting a calculation will be rare and may be done manually,<br />
but some workflows require frequent restarts.<br />
In such cases, some kind of automation technique may be employed.<br />
<br />
<!--T:77--><br />
Here are two recommended methods of automatic restarting:<br />
* Using SLURM '''job arrays'''.<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Restarting using job arrays === <!--T:90--><br />
<br />
<!--T:91--><br />
Using the <code>--array</code> syntax mentioned earlier with a concurrency limit of one, e.g. <code>--array=1-10%1</code>, one can submit a collection of identical jobs with the condition that only one of them will run at any given time.<br />
The script should be written to ensure that the last checkpoint is always used for the next job. The number of restarts is fixed by the <code>--array</code> argument.<br />
<br />
<!--T:78--><br />
Consider, for example, a molecular dynamics simulation that has to be run for 1 000 000 steps but does not fit into the time limit on the cluster.<br />
We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another.<br />
<br />
<!--T:79--><br />
An example of using a job array to restart a simulation:<br />
{{File<br />
|name=job_array_restart.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for a multi-step job on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:92--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:93--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
<br />
=== Resubmission from the job script === <!--T:94--><br />
<br />
<!--T:95--><br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. <br />
Once the chunk is done but before the allocated run-time of the job has elapsed,<br />
the script checks if the end of the calculation has been reached.<br />
If the calculation is not yet finished, the script submits a copy of itself to continue working.<br />
<br />
<!--T:96--><br />
An example of a job script with resubmission:<br />
{{File<br />
|name=job_resubmission.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
#SBATCH --account=def-someuser<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
<!--T:100--><br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
<!--T:101--><br />
# Resubmit if not all work has been done yet.<br />
# You must define the function end_is_not_reached().<br />
if end_is_not_reached; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
<!--T:102--><br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
}}<br />
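<br />
A possible definition of <code>end_is_not_reached()</code>, as a minimal sketch only: it assumes the simulation writes the number of completed steps to a file named <code>current_step</code> and that 1 000 000 steps are needed in total (both the file name and the step count are placeholders to adapt to your own workflow):<br />
 end_is_not_reached() {<br />
     # Hypothetical bookkeeping: the simulation updates "current_step" as it runs.<br />
     steps_done=$(cat current_step 2>/dev/null || echo 0)<br />
     # Return success (exit code 0) while fewer than 1000000 steps have completed.<br />
     [ "$steps_done" -lt 1000000 ]<br />
 }<br />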
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters, as shown in the example below.<br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
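<br />
For example, assuming the uploaded script is named <code>myscript.sh</code> (a placeholder):<br />
 dos2unix myscript.sh<br />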
<br />
== Job status and priority == <!--T:103--><br />
* For a discussion of how job priority is determined and how things like time limits may affect the scheduling of your jobs at Cedar and Graham, see [[Job scheduling policies]].<br />
<br />
== Further reading == <!--T:44--><br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* Here is a text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium<br />
* Here is a rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing]<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=AI_and_Machine_Learning&diff=37216AI and Machine Learning2017-08-24T21:01:06Z<p>Feimao: /* Software Packages */</p>
<hr />
<div>{{Draft}}<br />
== General ==<br />
<br />
A discipline guide on AI and Machine Learning will be created at this location.<br />
<br />
== Software Packages ==<br />
<br />
The following software packages are available on Compute Canada's HPC resources:<br />
<br />
* [[Caffe2]]<br />
* [[Tensorflow]]<br />
* [[Torch]]</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=AI_and_Machine_Learning&diff=37215AI and Machine Learning2017-08-24T21:00:51Z<p>Feimao: </p>
<hr />
<div>{{Draft}}<br />
== General ==<br />
<br />
A discipline guide on AI and Machine Learning will be created at this location.<br />
<br />
== Software Packages ==<br />
<br />
The following software packages are available on Compute Canada's HPC resources:<br />
<br />
* [[Tensorflow]]<br />
* [[Torch]]<br />
* [[Caffe2]]</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36034Using GPUs with Slurm2017-08-10T18:36:55Z<p>Feimao: /* Multi-threaded Jobs */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Serial Jobs ==<br />
For GPU jobs requesting only a single CPU core:<br />
{{File<br />
|name=gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
./program<br />
}}<br />
<br />
== Multi-threaded Jobs ==<br />
For GPU jobs requesting multiple CPU cores on a single node:<br />
{{File<br />
|name=gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
On Cedar, multi-threaded jobs should use at most 6 CPU cores for each GPU requested. On Graham, users can use up to 16 CPU cores for each GPU requested.<br />
<br />
== MPI Jobs ==<br />
{{File<br />
|name=gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of Nodes<br />
#SBATCH --ntasks=48 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
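<br />
For Cedar base GPU nodes, the arithmetic in this example works out as follows: 2 nodes × 24 CPU cores per node supplies the 48 MPI ranks requested, and each node's 4 GPUs are shared among the 24 ranks on that node.<br />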
<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36033Using GPUs with Slurm2017-08-10T18:33:01Z<p>Feimao: </p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Serial Jobs ==<br />
For GPU jobs requesting only a single CPU core:<br />
{{File<br />
|name=gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
./program<br />
}}<br />
<br />
== Multi-threaded Jobs ==<br />
For GPU jobs requesting multiple CPU cores on a single node:<br />
{{File<br />
|name=gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
<br />
== MPI Jobs ==<br />
{{File<br />
|name=gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of Nodes<br />
#SBATCH --ntasks=48 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36032Using GPUs with Slurm2017-08-10T18:32:22Z<p>Feimao: /* Multi-threaded Jobs */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Serial Jobs ==<br />
For GPU jobs requesting only a single CPU core:<br />
{{File<br />
|name=one_gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
./program<br />
}}<br />
<br />
== Multi-threaded Jobs ==<br />
For GPU jobs requesting multiple CPU cores on a single node:<br />
{{File<br />
|name=one_gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./program<br />
}}<br />
<br />
== MPI Jobs ==<br />
{{File<br />
|name=one_gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of Nodes<br />
#SBATCH --ntasks=48 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36031Using GPUs with Slurm2017-08-10T18:32:08Z<p>Feimao: /* Serial Jobs */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Serial Jobs ==<br />
For GPU jobs requesting only a single CPU core:<br />
{{File<br />
|name=one_gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
./program<br />
}}<br />
<br />
== Multi-threaded Jobs ==<br />
For GPU jobs requesting multiple CPU cores on a single node:<br />
{{File<br />
|name=one_gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
nvidia-smi<br />
}}<br />
== MPI Jobs ==<br />
{{File<br />
|name=one_gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of Nodes<br />
#SBATCH --ntasks=48 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36030Using GPUs with Slurm2017-08-10T18:31:51Z<p>Feimao: </p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Serial Jobs ==<br />
For GPU jobs requesting only a single CPU core:<br />
{{File<br />
|name=one_gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
== Multi-threaded Jobs ==<br />
For GPU jobs requesting multiple CPU cores on a single node:<br />
{{File<br />
|name=one_gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPU(s) per node<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
nvidia-smi<br />
}}<br />
== MPI Jobs ==<br />
{{File<br />
|name=one_gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # Number of GPUs per node<br />
#SBATCH --nodes=2 # Number of Nodes<br />
#SBATCH --ntasks=48 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=120G # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
srun ./program<br />
}}<br />
<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36029Using GPUs with Slurm2017-08-10T18:19:55Z<p>Feimao: /* Single GPU Jobs */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Single GPU Jobs ==<br />
=== Serial Jobs ===<br />
{{File<br />
|name=one_gpu_serial_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
=== Multi-threaded Jobs===<br />
{{File<br />
|name=one_gpu_threaded_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6 # CPU cores/threads<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
nvidia-smi<br />
}}<br />
=== MPI Jobs ===<br />
{{File<br />
|name=one_gpu_mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # Number of GPUs per node<br />
#SBATCH --nodes=1<br />
#SBATCH --ntasks=6 # Number of MPI ranks<br />
#SBATCH --cpus-per-task=1 # CPU cores per MPI rank<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
nvidia-smi<br />
}}<br />
<br />
== Whole Node(s) GPU Jobs ==<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=36028Running jobs2017-08-10T18:08:50Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)<br />
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" or "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project], <br />
specified using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
If you try to submit a job with <code>sbatch</code> without supplying an account name, <br />
you will be shown a list of valid account names to chose from. If you have access to <br />
several Resource Allocation Projects and want to know which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and visit the page for that project. The second field in the display, the '''group name''',<br />
is the string you should use with the <code>--account</code> directive. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs which are to be accounted against RAP wnp-003-ac<br />
should be submitted with <code>--account=def-rdickson-ac</code>.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the <code>SLURM_ACCOUNT</code> and <code>SBATCH_ACCOUNT</code> environment variables in your <code>~/.bashrc</code> file, like so:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
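<br />
For example, to override the environment variable for a single submission (the account name here is a placeholder):<br />
 sbatch --account=def-someuser simple_job.sh<br />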
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
One can have detailed control over the location of MPI processes by, for example, requesting a specific number of processes per node. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with six CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=6<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
This example is a serial job with one GPU allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.<br />
<br />
<!--T:25--><br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:49--><br />
Because no node count is specified in the above example, one node will be allocated. If you were to add <code>--nodes=3</code>, the total memory allocated would be 12000M. The same goes for <code>--gres</code>: If you request three nodes, you will get one GPU per node, for a total of three.<br />
<br />
<br />
This example is a parallel job with 4 GPUs allocated on the same node.<br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
<br />
<br />
This example is a whole-node GPU job on one of Cedar's large GPU nodes, with all 4 GPUs allocated. You must request all 4 GPUs to use these resources.<br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks=1 # Number of tasks<br />
#SBATCH --cpus-per-task=24 # Number of CPU cores per task<br />
#SBATCH --nodes=1 # Number of nodes, ensure that all cores are on one machine<br />
#SBATCH --gres=gpu:lgpu:4 # ask for 4 GPUs per node of the large-gpu node variety<br />
#SBATCH --time=0-00:10 # Runtime in D-HH:MM<br />
#SBATCH -o large_gpu-%j.out # File to which STDOUT will be written<br />
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL<br />
<br />
# The large GPU nodes have 4 Tesla P100 16GB cards (as opposed to 12GB cards for the rest of the cluster)<br />
# These GPUs are sitting on the same PCI switch so the inter-GPU communication is faster<br />
# These nodes have 256 GB RAM as opposed to 128GB for the rest of the cluster<br />
# You have to specify lgpu and use all 4 on a node to be able to submit jobs to these resources<br />
<br />
hostname<br />
sleep 500<br />
}}<br />
<br />
<br />
<br />
<!--T:26--><br />
For more on running GPU jobs, see [[Using GPUs with SLURM]].<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
 sbatch --array=1-7:2 ... # Step size of 2; does the same as the previous example<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and iPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which correspond to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:30--><br />
For more details see [[Interactive jobs]].<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation requires more time than the job time limits on the system allow,<br />
the software has to support checkpointing. The software should be able to save its complete state to a file, called a checkpoint, and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
If only a few such restarts are required, this can easily be done manually, but sometimes multiple running simulations may require numerous restarts.<br />
In that case, some kind of automation technique may be employed to simplify the resubmission of multi-step jobs.<br />
<br />
<!--T:77--><br />
Currently, on Compute Canada systems there are two recommended methods of resubmission:<br />
* Using SLURM '''job arrays''';<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Resubmission using job arrays ===<br />
<br />
This way, one can submit several jobs with the same parameters, an array of jobs, with the condition that only one job of the array will run at any given time.<br />
The same job script will be executed a predefined number of times, so the script should include all the necessary commands to ensure that the last checkpoint is used<br />
for the next job.<br />
<br />
<!--T:78--><br />
For example, suppose a molecular dynamics simulation has to be run for 1 000 000 steps, and such a simulation does not fit into the time limit on the cluster.<br />
We can split the simulation into 10 smaller jobs of 100 000 steps, one after another. <br />
<br />
<!--T:79--><br />
An example of a job script using a job array for restarts:<br />
<pre><br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_array<br />
<br />
<!--T:80--><br />
#SBATCH --account=def-rozmanov<br />
<br />
<!--T:81--><br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
<br />
<!--T:82--><br />
# Run a 10 job array, one job at a time.<br />
#SBATCH --array=1-10%1<br />
<br />
<!--T:83--><br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
</pre><br />
<br />
=== Resubmission from the job script ===<br />
<br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint.<br />
Once the chunk is done, but before the allocated run time has elapsed, the job script checks whether the end of the simulation has been reached;<br />
if not, a new job is submitted to work on the next chunk.<br />
<br />
An example of a job script with resubmission:<br />
<pre><br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
<br />
#SBATCH --account=def-rozmanov<br />
<br />
<!--T:81--><br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
if test -e state.cpt; then <br />
# There is a checkpoint file, restart;<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
# Resubmit if not all work has been done yet.<br />
# You must define the function end_is_not_reached().<br />
if end_is_not_reached; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
</pre><br />
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters. <br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
<br />
== Further reading == <!--T:44--><br />
* Details on [[Job scheduling]] policies at Cedar and Graham.<br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* Here is a text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium<br />
* Here is a rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing]<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Running_jobs&diff=36027Running jobs2017-08-10T18:07:09Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
<br />
<!--T:54--><br />
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.<br />
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.<br />
<br />
<!--T:55--><br />
On Compute Canada clusters, the job scheduler is the <br />
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].<br />
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.<br />
<br />
==Use <code>sbatch</code> to submit jobs== <!--T:56--><br />
The command to submit a job is [https://slurm.schedmd.com/sbatch.html <code>sbatch</code>]:<br />
<source lang="bash"><br />
[someuser@host ~]$ sbatch simple_job.sh<br />
Submitted batch job 123456<br />
</source><br />
<br />
<!--T:57--><br />
A minimal Slurm job script looks like this:<br />
{{File<br />
|name=simple_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --time=00:01:00<br />
#SBATCH --account=def-someuser<br />
echo 'Hello, world!'<br />
sleep 30 <br />
}}<br />
<br />
<!--T:58--><br />
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) and an account name (<code>--account</code>) for each job. (See [[#Accounts and projects]] below.)<br />
<br />
<!--T:59--><br />
You can also specify directives as command-line arguments to <code>sbatch</code>. So for example,<br />
[someuser@host ~]$ sbatch --time=00:30:00 simple_job.sh <br />
will submit the above job script with a time limit of 30 minutes.<br />
<br />
==Use <code>squeue</code> to list jobs== <!--T:60--><br />
<br />
<!--T:61--><br />
The [https://slurm.schedmd.com/squeue.html <code>squeue</code>] command lists pending and running jobs. Supply your username as an argument with <code>-u</code> to list only your own jobs:<br />
<br />
<!--T:62--><br />
<source lang="bash"><br />
[someuser@host ~]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br />
123456 cpubase_b simple_j someuser R 0:03 1 cdr234<br />
123457 cpubase_b simple_j someuser PD 1 (Priority)<br />
</source><br />
<br />
<!--T:12--><br />
The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" or "R" for "running". See the [https://slurm.schedmd.com/squeue.html squeue page]<br />
for more on selecting, formatting, and interpreting the <code>squeue</code> output.<br />
<br />
==Where does the output go?== <!--T:63--><br />
<br />
<!--T:64--><br />
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", ''e.g.'' <code>slurm-123456.out</code>, in the directory from which the job was submitted.<br />
You can use <code>--output</code> to specify a different name or location. <br />
Certain replacement symbols can be used in the filename, ''e.g.'' <code>%j</code> will be replaced <br />
by the job ID number. See [https://slurm.schedmd.com/sbatch.html sbatch] for a complete list.<br />
<br />
<!--T:65--><br />
The following sample script sets a ''job name'' (which appears in <code>squeue</code> output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j). <br />
<br />
<!--T:15--><br />
{{File<br />
|name=name_output.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=00:01:00<br />
#SBATCH --job-name=test<br />
#SBATCH --output=%x-%j.out<br />
echo 'Hello, world!'<br />
}}<br />
<br />
<!--T:16--><br />
Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use <code>--error</code>.<br />
<br />
==Accounts and projects== <!--T:66--><br />
<br />
<!--T:67--><br />
Every job must have an associated ''account name'' corresponding to a Compute Canada [https://ccdb.computecanada.ca/me/faq#what_is_rap Resource Allocation Project], <br />
specified using the <code>--account</code> directive:<br />
#SBATCH --account=def-user-ab<br />
<br />
<!--T:68--><br />
If you try to submit a job with <code>sbatch</code> without supplying an account name, <br />
you will be shown a list of valid account names to chose from. If you have access to <br />
several Resource Allocation Projects and want to know which account name corresponds<br />
to a given Resource Allocation Project, log in to [https://ccdb.computecanada.ca CCDB] <br />
and visit the page for that project. The second field in the display, the '''group name''',<br />
is the string you should use with the <code>--account</code> directive. Note that a Resource <br />
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore<br />
may not be transferable from one cluster to another. <br />
<br />
<!--T:69--><br />
In the illustration below, jobs which are to be accounted against RAP wnp-003-ac<br />
should be submitted with <code>--account=def-rdickson-ac</code>.<br />
<br />
<!--T:70--><br />
[[File:Find-group-name-annotated.png|750px|frame|left| Finding the group name for a Resource Allocation Project (RAP)]]<br />
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. --><br />
<br />
<!--T:71--><br />
If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the <code>SLURM_ACCOUNT</code> and <code>SBATCH_ACCOUNT</code> environment variables in your <code>~/.bashrc</code> file, like so:<br />
export SLURM_ACCOUNT=def-someuser<br />
export SBATCH_ACCOUNT=$SLURM_ACCOUNT<br />
export SALLOC_ACCOUNT=$SLURM_ACCOUNT<br />
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, ''the environment variable takes priority.'' In order to override the environment variable you must supply an account name as a command-line argument to <code>sbatch</code>.<br />
<br />
<!--T:72--><br />
<code>SLURM_ACCOUNT</code> plays the same role as <code>SBATCH_ACCOUNT</code>, but for the <code>srun</code> command instead of <code>sbatch</code>. The same idea holds for <code>SALLOC_ACCOUNT</code>.<br />
<br />
== Examples of job scripts == <!--T:17--><br />
<br />
=== MPI job === <!--T:18--><br />
<br />
<!--T:51--><br />
This example script launches four MPI processes, each with 1024 MB of memory. The run time is limited to 5 minutes. <br />
<br />
<!--T:19--><br />
{{File<br />
|name=mpi_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --ntasks=4 # number of MPI processes<br />
#SBATCH --mem-per-cpu=1024M # memory; default unit is megabytes<br />
#SBATCH --time=0-00:05 # time (DD-HH:MM)<br />
srun ./mpi_program # mpirun or mpiexec also work<br />
}}<br />
<br />
<!--T:20--><br />
One can have detailed control over the location of MPI processes by, for example, requesting a specific number of processes per node. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see [[Advanced MPI scheduling]].<br />
<br />
=== Threaded or OpenMP job === <!--T:21--><br />
This example script launches a single process with six CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. <code>gcc -fopenmp ...</code> or <code>icc -openmp ...</code><br />
<br />
<!--T:22--><br />
{{File<br />
|name=openmp_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-0:5<br />
#SBATCH --cpus-per-task=6<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
./ompHello<br />
}}<br />
<br />
<!--T:23--><br />
For more on writing and running parallel programs with OpenMP, see [[OpenMP]].<br />
<br />
=== GPU job === <!--T:24--><br />
This example is a serial job with one GPU allocated, a memory limit of 4000 MB per node, and a run-time limit of 5 hours. The output filename will include the name of the first node used and the job ID number.<br />
<br />
<!--T:25--><br />
{{File<br />
|name=simple_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
<br />
<!--T:49--><br />
Because no node count is specified in the above example, one node will be allocated. If you were to add <code>--nodes=3</code>, the total memory allocated would be 12000M. The same goes for <code>--gres</code>: If you request three nodes, you will get one GPU per node, for a total of three.<br />
<br />
This example is a parallel job with 4 GPUs allocated on the same node.<br />
{{File<br />
|name=multi_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --gres=gpu:4 # request GPU "generic resource"<br />
#SBATCH --mem=4000M # memory per node<br />
#SBATCH --time=0-05:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
nvidia-smi<br />
}}<br />
<br />
This example is a whole-node GPU job on one of Cedar's large GPU nodes, with all 4 GPUs allocated. You must request all 4 GPUs to use these nodes.<br />
{{File<br />
|name=large_gpu_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --ntasks=1 # Number of tasks<br />
#SBATCH --cpus-per-task=24 # Number of CPU cores per task<br />
#SBATCH --nodes=1 # Number of nodes, ensure that all cores are on one machine<br />
#SBATCH --gres=gpu:lgpu:4   # ask for 4 GPUs per node of the large-GPU node variety<br />
#SBATCH --time=0-00:10 # Runtime in D-HH:MM<br />
#SBATCH -o large_gpu-%j.out # File to which STDOUT will be written<br />
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL<br />
<br />
# The large GPU nodes have 4 Tesla P100 16GB cards (as opposed to the 12GB cards in the rest of the cluster).<br />
# These GPUs sit on the same PCIe switch, so inter-GPU communication is faster.<br />
# These nodes have 256GB of RAM, as opposed to 128GB in the rest of the cluster.<br />
# You have to specify lgpu and use all 4 GPUs on a node to be able to submit jobs to these resources.<br />
<br />
hostname<br />
sleep 500<br />
}}<br />
<br />
<!--T:26--><br />
For more on running GPU jobs, see [[Using GPUs with SLURM]].<br />
<br />
=== Array job === <!--T:27--><br />
Also known as a ''task array'', an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, <code>$SLURM_ARRAY_TASK_ID</code>, which is set to a different value for each instance of the job. <br />
sbatch --array=0-7 ... # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive<br />
sbatch --array=1,3,5,7 ... # $SLURM_ARRAY_TASK_ID will take the listed values<br />
sbatch --array=1-7:2 ... # Another way to do the same thing<br />
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously<br />
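<br />
As a sketch of how the variable might be used, this hypothetical script (the program and input file names are illustrative) processes a different input file in each of eight array tasks:<br />
{{File<br />
|name=array_job.sh<br />
|lang="sh"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --account=def-someuser<br />
#SBATCH --time=0-00:05<br />
#SBATCH --array=0-7<br />
# Each array task picks its own input file based on its index.<br />
./my_program input_$SLURM_ARRAY_TASK_ID.dat<br />
}}<br />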
<br />
== Interactive jobs == <!--T:28--><br />
Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:<br />
* Data exploration at the command line<br />
* Interactive "console tools" like R and IPython<br />
* Significant software development, debugging, or compiling<br />
<br />
<!--T:29--><br />
You can start an interactive session on a compute node with [https://slurm.schedmd.com/salloc.html salloc]. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:<br />
[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser<br />
salloc: Granted job allocation 1234567<br />
[name@node01 ~]$ ... # do some work<br />
[name@node01 ~]$ exit # terminate the allocation<br />
salloc: Relinquishing job allocation 1234567<br />
<br />
<!--T:30--><br />
For more details see [[Interactive jobs]].<br />
<br />
== Monitoring jobs == <!--T:31--><br />
<br />
<!--T:32--><br />
By default [https://slurm.schedmd.com/squeue.html squeue] will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with<br />
squeue -u <username><br />
<br />
<!--T:33--><br />
You can show only running jobs, or only pending jobs:<br />
squeue -u <username> -t RUNNING<br />
squeue -u <username> -t PENDING<br />
<br />
<!--T:34--><br />
You can show detailed information for a specific job with [https://slurm.schedmd.com/scontrol.html scontrol]:<br />
scontrol show job -dd <jobid><br />
<br />
<!--T:35--><br />
Find information about a completed job with [https://slurm.schedmd.com/sacct.html sacct], and optionally, control what it prints using <code>--format</code>:<br />
sacct -j <jobid><br />
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed<br />
<br />
<!--T:73--><br />
If a node fails while running a job, the job may be restarted. <code>sacct</code> will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the <code>--duplicates</code> option.<br />
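<br />
For example:<br />
 sacct -j <jobid> --duplicates<br />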
<br />
<!--T:52--><br />
Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest [https://en.wikipedia.org/wiki/Resident_set_size resident set size] for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.<br />
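<br />
For example:<br />
 sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode<br />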
<br />
<!--T:53--><br />
The [https://slurm.schedmd.com/sstat.html sstat] command works on a running job much the same way that [https://slurm.schedmd.com/sacct.html sacct] works on a completed job.<br />
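<br />
For example, to check the memory use of a running job (depending on how the job was launched, you may need to name a job step, e.g. <jobid>.batch):<br />
 sstat -j <jobid> --format=JobID,MaxRSS,MaxVMSize<br />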
<br />
<!--T:36--><br />
You can ask to be notified by email of certain job conditions by supplying options to <br />
[https://slurm.schedmd.com/sbatch.html sbatch]:<br />
#SBATCH --mail-user=<email_address><br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --mail-type=FAIL<br />
#SBATCH --mail-type=REQUEUE<br />
#SBATCH --mail-type=ALL<br />
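<br />
Several types can also be combined in a single directive:<br />
 #SBATCH --mail-type=BEGIN,END,FAIL<br />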
<br />
==Cancelling jobs== <!--T:37--><br />
<br />
<!--T:38--><br />
Use [https://slurm.schedmd.com/scancel.html scancel] with the job ID to cancel a job:<br />
<br />
<!--T:39--><br />
scancel <jobid><br />
<br />
<!--T:40--><br />
You can also use it to cancel all your jobs, or all your pending jobs:<br />
<br />
<!--T:41--><br />
scancel -u <username><br />
scancel -t PENDING -u <username><br />
<br />
== Resubmitting jobs for long running computations == <!--T:74--><br />
<br />
<!--T:75--><br />
When a computation requires a long time to complete, longer than the time limits on the system permit, <br />
the software has to support checkpointing. The software should be able to save its complete state to a file, called a checkpoint, and<br />
then it should be able to restart and continue the computation from that saved state. <br />
<br />
<!--T:76--><br />
If only a few such restarts are required, this can easily be done manually, but multiple running simulations can require numerous restarts. <br />
In that case, some form of automation can be employed to simplify the resubmission of multi-step jobs. <br />
<br />
<!--T:77--><br />
Currently, on Compute Canada systems there are two recommended methods of resubmission:<br />
* Using SLURM '''job arrays''';<br />
* Resubmitting from the end of the job script.<br />
<br />
=== Resubmission using job arrays ===<br />
<br />
With this method, one submits an array of several jobs with the same parameters, under the condition that only one job of the array runs at any given time.<br />
The same job script is executed a predefined number of times, so the script must include all the commands needed to ensure that the latest checkpoint is used <br />
for the next job.<br />
<br />
<!--T:78--><br />
For example, suppose a molecular dynamics simulation has to be run for 1 000 000 steps, but such a simulation does not fit within the time limit on the cluster. <br />
We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another. <br />
<br />
<!--T:79--><br />
An example of a job script with resubmission:<br />
<pre><br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_array<br />
<br />
#SBATCH --account=def-rozmanov<br />
<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
<br />
# Run a 10-job array, one job at a time.<br />
#SBATCH --array=1-10%1<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
echo ""<br />
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"<br />
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."<br />
echo ""<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
if test -e state.cpt; then <br />
# There is a checkpoint file; restart from it.<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
</pre><br />
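<br />
Assuming the script above is saved as <code>job_array.sh</code> (a name chosen here for illustration), a single submission starts the whole ten-step chain:<br />
<pre><br />
sbatch job_array.sh<br />
</pre><br />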
<br />
=== Resubmission from the job script ===<br />
<br />
In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. <br />
When the chunk is done, before the allocated time expires, the end of the job script checks whether the end of the simulation has been reached,<br />
and if not, submits a new job to work on the next chunk. <br />
<br />
An example of a job script with resubmission:<br />
<pre><br />
#!/bin/bash<br />
# ---------------------------------------------------------------------<br />
# SLURM script for job resubmission on a Compute Canada cluster. <br />
# ---------------------------------------------------------------------<br />
#SBATCH --job-name=job_chain<br />
<br />
#SBATCH --account=def-rozmanov<br />
<br />
#SBATCH --cpus-per-task=1<br />
#SBATCH --time=0-10:00<br />
#SBATCH --mem=100M<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Current working directory: `pwd`"<br />
echo "Starting run at: `date`"<br />
# ---------------------------------------------------------------------<br />
# Run your simulation step here...<br />
<br />
if test -e state.cpt; then <br />
# There is a checkpoint file; restart from it.<br />
mdrun --restart state.cpt<br />
else<br />
# There is no checkpoint file, start a new simulation.<br />
mdrun<br />
fi<br />
<br />
# Resubmit if not all work has been done yet.<br />
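# NOTE: 'end_is_not_reached' below is a placeholder, not a real command;<br />
# replace it with a test appropriate to your software, e.g. comparing a<br />
# step counter from the checkpoint file against the target step count.<br />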
if end_is_not_reached; then<br />
sbatch ${BASH_SOURCE[0]}<br />
fi<br />
<br />
# ---------------------------------------------------------------------<br />
echo "Job finished with exit code $? at: `date`"<br />
# ---------------------------------------------------------------------<br />
</pre><br />
<br />
== Troubleshooting == <!--T:42--><br />
<br />
==== Avoid hidden characters in job scripts ==== <!--T:43--><br />
Preparing a job script with a ''word processor'' instead of a ''text editor'' is a common cause of trouble. Best practice is to prepare your job script on the cluster using an [[Editors|editor]] such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:<br />
* '''Windows users:''' <br />
** Use a text editor such as Notepad or [https://notepad-plus-plus.org/ Notepad++].<br />
** After uploading the script, use <code>dos2unix</code> to change Windows end-of-line characters to Linux end-of-line characters, as shown in the example after this list. <br />
* '''Mac users:'''<br />
** Open a terminal window and use an [[Editors|editor]] such as nano, vim, or emacs.<br />
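<br />
For example, assuming the uploaded script is named <code>job_script.sh</code>:<br />
 dos2unix job_script.sh<br />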
<br />
== Further reading == <!--T:44--><br />
* Details on [[Job scheduling]] policies at Cedar and Graham.<br />
* Comprehensive [https://slurm.schedmd.com/documentation.html documentation] is maintained by SchedMD, as well as some [https://slurm.schedmd.com/tutorials.html tutorials].<br />
** [https://slurm.schedmd.com/sbatch.html sbatch] command options<br />
* There is also a [https://slurm.schedmd.com/rosetta.pdf "Rosetta stone"] mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM. NERSC also offers some [http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/ tables comparing Torque and SLURM].<br />
* A text tutorial from [http://www.ceci-hpc.be/slurm_tutorial.html CÉCI], Belgium.<br />
* A rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing].<br />
<br />
<!--T:48--><br />
[[Category:SLURM]]<br />
</translate></div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36026Using GPUs with Slurm2017-08-10T17:42:01Z<p>Feimao: /* GPU Hardwares and Node Types */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! PCIe Bus Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Single GPU Jobs ==<br />
== Whole Node(s) GPU Jobs ==<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36025Using GPUs with Slurm2017-08-10T17:41:44Z<p>Feimao: /* GPU Hardwares and Node Types */</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
{| class="wikitable"<br />
|-<br />
! # of Nodes !!Node Type !! CPU Cores !! CPU Memory !! # of GPUs !! GPU Type !! Topology<br />
|-<br />
| 114 || Cedar Base GPU Node || 24 || 128GB || 4 || NVIDIA P100-PCIE-12GB || Two GPUs per CPU socket<br />
|-<br />
| 32 || Cedar Large GPU Node || 24|| 256GB || 4 || NVIDIA P100-PCIE-16GB || All GPUs under same CPU socket<br />
|-<br />
| 160 || Graham Base GPU Node || 32|| 128GB || 2 || NVIDIA P100-PCIE-12GB || One GPU per CPU socket<br />
|}<br />
<br />
== Single GPU Jobs ==<br />
== Whole Node(s) GPU Jobs ==<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Using_GPUs_with_Slurm&diff=36024Using GPUs with Slurm2017-08-10T17:07:54Z<p>Feimao: Created page with "{{Draft}} == GPU Hardwares and Node Types == == Single GPU Jobs == == Whole Node(s) GPU Jobs == == Using Cedar's Large GPU nodes =="</p>
<hr />
<div>{{Draft}}<br />
== GPU Hardware and Node Types ==<br />
== Single GPU Jobs ==<br />
== Whole Node(s) GPU Jobs ==<br />
== Using Cedar's Large GPU nodes ==</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=35780TensorFlow2017-08-08T20:26:38Z<p>Feimao: /* Packing single-GPU jobs within one SLURM job */</p>
<hr />
<div><languages /><br />
<translate><br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install TensorFlow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install the TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2, but you can also install for other Python 3.5.x or 2.7.x versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load the modules required by TensorFlow:<br />
{{Command|module load cuda cudnn python/3.5.2}}<br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
Install TensorFlow into your newly created virtual environment:<br />
{{Command|pip install tensorflow}}<br />
<br />
==Submitting a TensorFlow job== <!--T:5--><br />
Once you have completed the setup above, you can submit a TensorFlow job with<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the contents<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<br />
==Using Cedar's large GPU nodes== <br />
TensorFlow can run on all GPU node types on Cedar and Graham. Cedar's large GPU nodes, each equipped with 4 x P100-PCIE-16GB GPUs with GPUDirect P2P enabled between each pair, are highly recommended for large-scale deep learning / machine learning research.<br />
<br />
Currently, the large GPU nodes on Cedar accept whole-node jobs only. Users should run code that supports multiple GPUs, or pack several programs into one job. The job submission script should have the contents<br />
{{File<br />
|name=tensorflow-test-lgpu.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # request number of whole nodes<br />
#SBATCH --ntasks-per-node=1 <br />
#SBATCH --cpus-per-task=24 # must use a combination of --ntasks-per-node and --cpus-per-task. Total CPU cores should be 24.<br />
#SBATCH --gres=gpu:lgpu:4 # lgpu is required for using large GPU nodes<br />
#SBATCH --mem=250G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<br />
===Packing single-GPU jobs within one SLURM job===<br />
Cedar's large GPU nodes are highly recommended for running deep learning models that can be accelerated by multiple GPUs. If a user has to run 4 single-GPU programs or 2 two-GPU programs on one node, [https://www.gnu.org/software/parallel/ GNU Parallel] is recommended. A simple example is given below:<br />
<pre><br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {}'<br />
</pre><br />
The params.input file should include the input parameters, one per line, like:<br />
<pre><br />
code1.py<br />
code2.py<br />
code3.py<br />
code4.py<br />
...<br />
</pre><br />
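<br />
For the 2 x 2-GPU case mentioned above, a similar sketch (assuming each line of params.input names a program that itself uses two GPUs) can assign a pair of devices to each of two slots:<br />
<pre><br />
cat params.input | parallel -j2 'CUDA_VISIBLE_DEVICES=$((2*({%} - 1))),$((2*{%} - 1)) python {}'<br />
</pre><br />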
With this method, users can run multiple programs in one submission. GNU Parallel launches at most the given number of programs at a time (-j4 above), starting the next one as soon as one finishes. CUDA_VISIBLE_DEVICES is used to restrict each program to its own GPU(s).</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=TensorFlow&diff=35779TensorFlow2017-08-08T20:25:12Z<p>Feimao: </p>
<hr />
<div><languages /><br />
<translate><br />
==Installing TensorFlow== <!--T:1--><br />
<br />
<!--T:2--><br />
These instructions install TensorFlow into your home directory using Compute Canada's pre-built [http://pythonwheels.com/ Python wheels]. Custom Python wheels are stored in <code>/cvmfs/soft.computecanada.ca/custom/python/wheelhouse/</code>. To install the TensorFlow wheel we will use the <code>pip</code> command and install it into a [[Python#Creating_and_using_a_virtual_environment | Python virtual environment]]. The instructions below are for Python 3.5.2, but you can also install for other Python 3.5.x or 2.7.x versions by loading a different Python module.<br />
<br />
<!--T:3--><br />
Load the modules required by TensorFlow:<br />
{{Command|module load cuda cudnn python/3.5.2}}<br />
Create a new Python virtual environment:<br />
{{Command|virtualenv tensorflow}}<br />
<br />
<!--T:4--><br />
Activate your newly created Python virtual environment:<br />
{{Command|source tensorflow/bin/activate}}<br />
Install TensorFlow into your newly created virtual environment:<br />
{{Command|pip install tensorflow}}<br />
<br />
==Submitting a TensorFlow job== <!--T:5--><br />
Once you have completed the setup above, you can submit a TensorFlow job with<br />
{{Command|sbatch tensorflow-test.sh}}<br />
The job submission script has the contents<br />
</translate><br />
{{File<br />
|name=tensorflow-test.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --gres=gpu:1 # request GPU "generic resource"<br />
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.<br />
#SBATCH --mem=32000M # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<translate><br />
<!--T:6--><br />
while the Python script has the form,<br />
</translate><br />
{{File<br />
|name=tensorflow-test.py<br />
|lang="python"<br />
|contents=<br />
import tensorflow as tf<br />
node1 = tf.constant(3.0, dtype=tf.float32)<br />
node2 = tf.constant(4.0) # also tf.float32 implicitly<br />
print(node1, node2)<br />
sess = tf.Session()<br />
print(sess.run([node1, node2]))<br />
}}<br />
<translate><br />
<!--T:7--><br />
Once the above job has completed (should take less than a minute) you should see an output file called something like <tt>cdr116-122907.out</tt> with contents similar to the following example,<br />
</translate><br />
{{File<br />
|name=cdr116-122907.out<br />
|lang="text"<br />
|contents=<br />
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:<br />
name: Tesla P100-PCIE-12GB<br />
major: 6 minor: 0 memoryClockRate (GHz) 1.3285<br />
pciBusID 0000:82:00.0<br />
Total memory: 11.91GiB<br />
Free memory: 11.63GiB<br />
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0<br />
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y<br />
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)<br />
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)<br />
[3.0, 4.0]<br />
}}<br />
<br />
==Using Cedar's large GPU nodes== <br />
TensorFlow can run on all GPU node types on Cedar and Graham. Cedar's large GPU nodes, each equipped with 4 x P100-PCIE-16GB GPUs with GPUDirect P2P enabled between each pair, are highly recommended for large-scale deep learning / machine learning research.<br />
<br />
Currently, the large GPU nodes on Cedar accept whole-node jobs only. Users should run code that supports multiple GPUs, or pack several programs into one job. The job submission script should have the contents<br />
{{File<br />
|name=tensorflow-test-lgpu.sh<br />
|lang="bash"<br />
|contents=<br />
#!/bin/bash<br />
#SBATCH --nodes=1 # request number of whole nodes<br />
#SBATCH --ntasks-per-node=1 <br />
#SBATCH --cpus-per-task=24 # must use a combination of --ntasks-per-node and --cpus-per-task. Total CPU cores should be 24.<br />
#SBATCH --gres=gpu:lgpu:4 # lgpu is required for using large GPU nodes<br />
#SBATCH --mem=250G # memory per node<br />
#SBATCH --time=0-03:00 # time (DD-HH:MM)<br />
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID<br />
<br />
module load cuda cudnn python/3.5.2<br />
source tensorflow/bin/activate<br />
python ./tensorflow-test.py<br />
}}<br />
<br />
===Packing single-GPU jobs within one SLURM job===<br />
Cedar's large GPU nodes are highly recommended for running deep learning models that can be accelerated by multiple GPUs. If a user has to run 4 single-GPU programs or 2 two-GPU programs on one node, [https://www.gnu.org/software/parallel/ GNU Parallel] is recommended. A simple example is given below:<br />
<pre><br />
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {}'<br />
</pre><br />
The params.input file should include the input parameters, one per line, like:<br />
<pre><br />
code1.py<br />
code2.py<br />
code3.py<br />
code4.py<br />
...<br />
</pre><br />
With this method, users can run multiple programs in one submission. GNU Parallel runs at most 4 programs at a time, launching the next one as soon as one finishes. CUDA_VISIBLE_DEVICES is used to restrict each program to a single GPU.</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=AI_and_Machine_Learning&diff=35120AI and Machine Learning2017-07-27T14:24:40Z<p>Feimao: Created page with "{{Draft}} == General == At this location a discipline guide on AI and Machine Learning will be created. == Software Packages == The following software packages are availabl..."</p>
<hr />
<div>{{Draft}}<br />
== General ==<br />
<br />
At this location a discipline guide on AI and Machine Learning will be created.<br />
<br />
== Software Packages ==<br />
<br />
The following software packages are available on Compute Canada's HPC resources:<br />
<br />
* [[Tensorflow]]<br />
* [[Torch]]</div>Feimaohttps://docs.alliancecan.ca/mediawiki/index.php?title=Graham&diff=14766Graham2017-05-18T13:55:19Z<p>Feimao: </p>
<hr />
<div><noinclude><br />
<languages /><br />
<br />
<translate><br />
</noinclude><br />
===Graham (GP3)=== <!--T:1--><br />
<br />
<!--T:27--><br />
{| class="wikitable"<br />
|-<br />
| Expected availability: '''June 2017''' for opportunistic use<br />
|-<br />
| Login node: '''graham.computecanada.ca'''<br />
|-<br />
| Globus endpoint: '''computecanada#graham'''<br />
|}<br />
<br />
<!--T:2--><br />
GRAHAM is a heterogeneous cluster, suitable for a variety of workloads, and located at the University of Waterloo. It is named after [https://en.wikipedia.org/wiki/Wes_Graham Wes Graham], the first director of the Computing Centre at Waterloo. It was previously known as "GP3" and is still identified as such in the [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ 2017 RAC] documentation.<br />
<br />
<!--T:4--><br />
The parallel filesystem and external persistent storage ([[National Data Cyberinfrastructure|NDC-Waterloo]]) are similar to [[Cedar|Cedar's]]. The interconnect is different and there is a slightly different mix of compute nodes.<br />
<br />
<!--T:28--><br />
The Graham system is sold and supported by Huawei Canada, Inc. It is entirely liquid cooled, using rear-door heat exchangers.<br />
<br />
====Attached storage systems==== <!--T:23--><br />
<br />
<!--T:24--><br />
{| class="wikitable sortable"<br />
|-<br />
| '''Home space''' ||<br />
* Standard home directory.<br />
* Small, standard quota. <br />
* Not allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC]. Larger requests go to Project space.<br />
|-<br />
| '''Scratch space'''<br />Parallel high-performance filesystem ||<br />
* For active or temporary (<code>/scratch</code>) storage.<br />
* Available to all nodes.<br />
* Not allocated.<br />
* Inactive data will be purged.<br />
* [http://e.huawei.com/en/products/cloud-computing-dc/storage Huawei OceanStor] storage system with approximately 3.6PB usable capacity and aggregate performance of approximately 30GB/s.<br />
|-<br />
|'''Project space'''<br />External persistent storage<br />
||<br />
* Part of the [[National Data Cyberinfrastructure]].<br />
* Allocated via [https://www.computecanada.ca/research-portal/accessing-resources/rapid-access-service/ RAS] or [https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/ RAC].<br />
* Available to all nodes.<br />
* Not designed for parallel I/O workloads. Use Scratch space instead.<br />
|}<br />
<br />
====High-performance interconnect==== <!--T:19--><br />
<br />
<!--T:21--><br />
Mellanox FDR (56Gb/s) and EDR (100Gb/s) InfiniBand interconnect. FDR is used for GPU and cloud nodes, EDR for other node types. A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes. The 56 cloud nodes are a variation on CPU nodes, and are on a single larger island sharing 8 FDR uplinks to the director switch.<br />
<br />
<!--T:29--><br />
A low-latency high-bandwidth Infiniband fabric connects all nodes and scratch storage.<br />
<br />
<!--T:30--><br />
Nodes configurable for cloud provisioning also have a 10Gb/s Ethernet network, with 40Gb/s uplinks to scratch storage.<br />
<br />
<!--T:22--><br />
Graham is designed to support multiple simultaneous parallel jobs of up to 1024 cores in a fully non-blocking manner. <br />
<br />
<!--T:31--><br />
For larger jobs the interconnect has an 8:1 blocking factor; even for jobs running across multiple islands, Graham provides a high-performance interconnect.<br />
<br />
<!--T:32--><br />
[https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png Graham high performance interconnect diagram]<br />
<br />
====Node types and characteristics==== <!--T:5--><br />
<br />
<!--T:25--><br />
''Processor type:'' All nodes except bigmem3000 have Intel E5-2683 v4 CPUs, running at 2.1 GHz.<br />
<br />
<!--T:26--><br />
''GPU type:'' NVIDIA P100 (12GB)<br />
<br />
<!--T:6--><br />
{| class="wikitable sortable"<br />
|-<br />
| "Base" compute nodes || 800 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Large" nodes (cloud configuration) || 56 nodes || 256 GB of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem500" nodes || 24 nodes || 0.5 TB (512 GB) of memory, 16 cores/socket, 2 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 960GB SATA SSD.<br />
|-<br />
| "Bigmem3000" nodes || 3 nodes || 3 TB of memory, 16 cores/socket, 4 sockets/node. Intel "Broadwell" CPUs at 2.1 GHz, model E7-4850 v4. 960GB SATA SSD.<br />
|-<br />
| "GPU" nodes || 160 nodes || 128 GB of memory, 16 cores/socket, 2 sockets/node, 2 NVIDIA P100 Pascal GPUs/node (12GB HBM2 memory). Intel "Broadwell" CPUs at 2.1 GHz, model E5-2683 v4. 1.6TB NVMe SSD.<br />
|}<br />
<br />
<!--T:7--><br />
Local (on-node) storage in the above nodes will be available as /tmp.<br />
<br />
<!--T:14--><br />
<noinclude><br />
</translate><br />
</noinclude></div>Feimao