Running jobs

<!--T:106-->
Memory may be requested with <code>--mem-per-cpu</code> (memory per core) or <code>--mem</code> (memory per node).  On general-purpose (GP) clusters a default memory amount of 256 MB per core will be allocated unless you make some other request.  On [[Niagara]], only whole nodes are allocated along with all available memory, so a memory specification is not required there.
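For example, either of the following directives could appear in a job script (the values here are illustrative, and the two options are mutually exclusive):
 #SBATCH --mem-per-cpu=4G
to request 4 GiB per allocated core, or
 #SBATCH --mem=16G
to request 16 GiB per node.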


<!--T:162-->
A common source of confusion comes from the fact that some memory on a node is not available to the job (reserved for the OS, etc.).  The effect of this is that each node type has a maximum amount available to jobs; for instance, nominally "128G" nodes are typically configured to permit 125G of memory to user jobs.  If you request more memory than a node type provides, your job will be constrained to run on higher-memory nodes, which may be fewer in number.


<!--T:163-->
Adding to this confusion, Slurm interprets K, M, G, etc., as [https://en.wikipedia.org/wiki/Binary_prefix binary prefixes], so <code>--mem=125G</code> is equivalent to <code>--mem=128000M</code>.  See the <i>Available memory</i> column in the "Node characteristics" table for each GP cluster for the Slurm specification of the maximum memory you can request on each node: [[Béluga/en#Node_characteristics|Béluga]], [[Cedar#Node_characteristics|Cedar]], [[Graham#Node_characteristics|Graham]], [[Narval/en#Node_characteristics|Narval]].
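Continuing the example above, the largest request that still fits on a nominally "128G" node is
 #SBATCH --mem=125G
which Slurm reads as 125 × 1024 MiB = 128000 MiB, i.e. the same as <code>--mem=128000M</code>.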


==Use <code>squeue</code> or <code>sq</code> to list jobs== <!--T:60-->


<!--T:61-->
The general command for checking the status of Slurm jobs is <code>squeue</code>, but by default it supplies information about <b>all</b> jobs in the system, not just your own.  You can use the shorter <code>sq</code> to list only your own jobs:


<!--T:62-->
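A minimal invocation (assuming the <code>sq</code> wrapper is installed, as on the Alliance clusters):
 sq
If <code>sq</code> is not available (it is a local wrapper, not part of Slurm itself), the standard equivalent is <code>squeue -u $USER</code>, which likewise restricts the listing to your own jobs.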


<!--T:115-->
<b>Do not</b> run <code>sq</code> or <code>squeue</code> from a script or program at high frequency (e.g. every few seconds). Responding to <code>squeue</code> adds load to Slurm, and may interfere with its performance or correct operation.  See [[#Email_notification|Email notification]] below for a much better way to learn when your job starts or ends.


==Where does the output go?== <!--T:63-->


<!--T:64-->
By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out" (e.g. <code>slurm-123456.out</code>), in the directory from which the job was submitted.
Having the job ID as part of the file name is convenient for troubleshooting.
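If you prefer a different name or location, the <code>--output</code> directive accepts a filename pattern in which <code>%j</code> expands to the job ID (the name below is illustrative):
 #SBATCH --output=myjob-%j.out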




<!--T:67-->
Every job must have an associated account name corresponding to a [[Frequently_Asked_Questions_about_the_CCDB#What_is_a_RAP.3F|Resource Allocation Project]] (RAP). If you are a member of only one account, the scheduler will automatically associate your jobs with that account.
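If you belong to more than one account, you can select one explicitly with the <code>--account</code> directive in your job script (the account name below is a placeholder):
 #SBATCH --account=def-someuser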


<!--T:107-->
and click on "My Account -> Account Details". You will see a list of all the projects
you are a member of. The string you should use with the <code>--account</code> option for
a given project is under the column <i>Group Name</i>. Note that a Resource
Allocation Project may only apply to a specific cluster (or set of clusters) and therefore
may not be transferable from one cluster to another.
 export SBATCH_ACCOUNT=$SLURM_ACCOUNT
 export SALLOC_ACCOUNT=$SLURM_ACCOUNT
Slurm will use the value of <code>SBATCH_ACCOUNT</code> in place of the <code>--account</code> directive in the job script. Note that even if you supply an account name inside the job script, <i>the environment variable takes priority.</i> In order to override the environment variable, you must supply an account name as a command-line argument to <code>sbatch</code>.
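For instance, with those variables set, a single job can still be charged to a different account by overriding on the command line (the account and script names below are illustrative):
 sbatch --account=rrg-someuser job_script.sh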

