Running jobs: Difference between revisions

Line 3: Line 3:


<!--T:54-->
<!--T:54-->
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters.
This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to our clusters.
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.
If you have not worked on a large shared computer cluster before, you should probably read [[What is a scheduler?]] first.


Line 12: Line 12:


<!--T:55-->
<!--T:55-->
On Compute Canada clusters, the job scheduler is the  
On our clusters, the job scheduler is the  
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].
[https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm Workload Manager].
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.
Comprehensive [https://slurm.schedmd.com/documentation.html documentation for Slurm] is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of [https://slurm.schedmd.com/rosetta.pdf corresponding commands] useful.
Line 38: Line 38:
<!--T:58-->
<!--T:58-->
On general-purpose (GP) clusters this job reserves 1 core and 256MB of memory for 15 minutes. On [[Niagara]] this job reserves the whole node with all its memory.
On general-purpose (GP) clusters this job reserves 1 core and 256MB of memory for 15 minutes. On [[Niagara]] this job reserves the whole node with all its memory.
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Compute Canada policies require that you supply at least a time limit (<code>--time</code>) for each job. You may also need to supply an account name (<code>--account</code>). See [[#Accounts and projects|Accounts and projects]] below.
Directives (or "options") in the job script are prefixed with <code>#SBATCH</code> and must precede all executable commands. All available directives are described on the [https://slurm.schedmd.com/sbatch.html sbatch page]. Our policies require that you supply at least a time limit (<code>--time</code>) for each job. You may also need to supply an account name (<code>--account</code>). See [[#Accounts and projects|Accounts and projects]] below.
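As a minimal sketch of the directives described above, the following writes a small job script that supplies a time limit and an account. The file name <code>simple_job.sh</code> and the account <code>def-someuser</code> are placeholders chosen for illustration; substitute your own values.

```shell
# Write a minimal job script to the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > simple_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00          # time limit (hh:mm:ss), required by policy
#SBATCH --account=def-someuser   # placeholder account name
# All #SBATCH directives appear before the first executable command.
echo 'Hello, world!'
EOF
```

On a cluster, the script would then be submitted with <code>sbatch simple_job.sh</code>.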


<!--T:59-->
<!--T:59-->
Line 76: Line 76:


<!--T:167-->
<!--T:167-->
If you want to know more about the output of <code>sq</code> or <code>squeue</code>, or learn how to change the output, see the  [https://slurm.schedmd.com/squeue.html online manual page for squeue].  <code>sq</code> is a Compute Canada customization.
If you want to know more about the output of <code>sq</code> or <code>squeue</code>, or learn how to change the output, see the  [https://slurm.schedmd.com/squeue.html online manual page for squeue].  <code>sq</code> is a local customization.
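One way to change the output, sketched here under assumptions, is to set the <code>SQUEUE_FORMAT</code> environment variable, which <code>squeue</code> reads as the default for its <code>--format</code> option. The particular columns and widths below are only an illustration, not the layout that <code>sq</code> uses.

```shell
# Illustrative column layout: job ID, partition, job name, state,
# elapsed time, node count, and reason or node list.
export SQUEUE_FORMAT="%.10i %.9P %.20j %.8T %.10M %.6D %R"

# Query the scheduler only where squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$(id -un)"
fi
```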


<!--T:115-->
<!--T:115-->
Line 98: Line 98:


<!--T:67-->
<!--T:67-->
Every job must have an associated ''account name'' corresponding to a Compute Canada [[Frequently_Asked_Questions_about_the_CCDB#What_is_a_RAP.3F|Resource Allocation Project]] (RAP). If you are a member of only one account, the scheduler will automatically associate your jobs with that account.
Every job must have an associated ''account name'' corresponding to a [[Frequently_Asked_Questions_about_the_CCDB#What_is_a_RAP.3F|Resource Allocation Project]] (RAP). If you are a member of only one account, the scheduler will automatically associate your jobs with that account.
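If you belong to more than one account, a possible alternative to repeating <code>--account</code> in every script is to set the account through the environment variables that Slurm's <code>srun</code>, <code>sbatch</code>, and <code>salloc</code> commands read. This is a sketch only, and <code>def-someuser</code> is a placeholder account name.

```shell
# Placeholder account; replace with one of your own account names.
export SLURM_ACCOUNT=def-someuser
# sbatch and salloc read their own variables, so mirror the value for each.
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
```

Placing these lines in a shell startup file would apply them to every session; an explicit <code>--account</code> on the command line still takes precedence.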


<!--T:107-->
<!--T:107-->
Line 377: Line 377:
#!/bin/bash
#!/bin/bash
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
# SLURM script for a multi-step job on a Compute Canada cluster.  
# SLURM script for a multi-step job on our clusters.  
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
#SBATCH --account=def-someuser
#SBATCH --account=def-someuser
Line 426: Line 426:
#!/bin/bash
#!/bin/bash
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
# SLURM script for job resubmission on a Compute Canada cluster.  
# SLURM script for job resubmission on our clusters.  
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
#SBATCH --job-name=job_chain
#SBATCH --job-name=job_chain
Line 465: Line 465:


== Automating job submission == <!--T:174-->
== Automating job submission == <!--T:174-->
As described earlier, [[#Array job|array jobs]] can be used to automate job submission. Compute Canada provides a few other (more advanced) tools designed to facilitate running a large number of related serial, parallel, or GPU calculations. This practice is sometimes called "farming", "serial farming", or "task farming". In addition to automating the workflow, these tools can also improve computational efficiency by bundling up many short computations into fewer tasks of longer duration.
As described earlier, [[#Array job|array jobs]] can be used to automate job submission. We provide a few other (more advanced) tools designed to facilitate running a large number of related serial, parallel, or GPU calculations. This practice is sometimes called "farming", "serial farming", or "task farming". In addition to automating the workflow, these tools can also improve computational efficiency by bundling up many short computations into fewer tasks of longer duration.
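For reference, the array-job approach mentioned above can be sketched as follows. The file name <code>array_job.sh</code>, the account <code>def-someuser</code>, the index range, and the input file naming are all assumptions made for illustration.

```shell
# Write an illustrative array job script to the current directory.
cat > array_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-someuser   # placeholder account name
#SBATCH --array=1-10             # run ten tasks, indexed 1 to 10
# Slurm sets SLURM_ARRAY_TASK_ID to a different index in each task.
echo "Processing input_${SLURM_ARRAY_TASK_ID}.dat"
EOF
```

Each task sees its own value of <code>SLURM_ARRAY_TASK_ID</code>, so one submission with <code>sbatch array_job.sh</code> stands in for ten separate ones.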


<!--T:175-->
<!--T:175-->
Line 476: Line 476:


<!--T:178-->
<!--T:178-->
Certain software packages such as [https://github.com/alekseyzimin/masurca Masurca] operate by submitting jobs to Slurm automatically, and expect a partition to be specified for each job.  This is in conflict with the best practice at Compute Canada, which is that you should allow the scheduler to assign a partition to your job based on the resources it requests.  If you are using such a piece of software, you may configure the software to use <code>--partition=default</code>, which the script treats the same as not specifying a partition.
Certain software packages such as [https://github.com/alekseyzimin/masurca Masurca] operate by submitting jobs to Slurm automatically, and expect a partition to be specified for each job.  This is in conflict with what we recommend, which is that you should allow the scheduler to assign a partition to your job based on the resources it requests.  If you are using such a piece of software, you may configure the software to use <code>--partition=default</code>, which the script treats the same as not specifying a partition.


== Cluster particularities == <!--T:154-->
== Cluster particularities == <!--T:154-->


<!--T:155-->
<!--T:155-->
There are certain differences in the job scheduling policies from one Compute Canada cluster to another and these are summarized by tab in the following section:
There are certain differences in the job scheduling policies from one of our clusters to another, and these are summarized by tab in the following section:


<!--T:156-->
<!--T:156-->