Running jobs

From CC Doc
Revision as of 18:28, 8 January 2018 by Rdickson (talk | contribs)

This page is intended for the user who is already familiar with the concepts of job scheduling and job scripts, and who wants guidance on submitting jobs to Compute Canada clusters. If you have not worked on a large shared computer cluster before, you should probably read What is a scheduler? first.

All jobs must be submitted via the scheduler!
Exceptions are made for compilation and other tasks not expected to consume more than about 10 CPU-minutes or about 4 gigabytes of RAM. Such tasks may be run on a login node. In no case should you run processes on compute nodes except via the scheduler.

On Compute Canada clusters, the job scheduler is the Slurm Workload Manager. Comprehensive documentation for Slurm is maintained by SchedMD. If you are coming to Slurm from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of corresponding commands useful.

Use sbatch to submit jobs

The command to submit a job is sbatch:

[someuser@host ~]$ sbatch
Submitted batch job 123456

A minimal Slurm job script looks like this:

File :

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --account=def-someuser
echo 'Hello, world!'
sleep 30

Directives (or "options") in the job script are prefixed with #SBATCH and must precede all executable commands. All available directives are described on the sbatch page. Compute Canada policies require that you supply at least a time limit (--time) and an account name (--account) for each job. (See #Accounts and projects below.)

A default memory amount of 256 MB per core will be allocated unless you make some other memory request with --mem-per-cpu (memory per core) or --mem (memory per node).

You can also specify directives as command-line arguments to sbatch. So for example,

[someuser@host ~]$ sbatch --time=00:30:00 

will submit the above job script with a time limit of 30 minutes.

Please be cautious if you use a script to submit multiple Slurm jobs in a short time. Submitting thousands of jobs at a time can cause Slurm to become unresponsive to other users. Consider using an array job instead, or use sleep to space out calls to sbatch by one second or more.
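As a rough illustration of spacing out submissions, the following sketch submits each job script named on the command line and pauses between calls. The names submit_spaced, SUBMIT_CMD and SUBMIT_DELAY are made up for this example and are not Slurm settings. An array job is usually still the better choice when the jobs are similar.

```shell
#!/bin/bash
# Sketch only: SUBMIT_CMD and SUBMIT_DELAY are illustrative variable names.
SUBMIT_CMD=${SUBMIT_CMD:-sbatch}     # command used to submit each script
SUBMIT_DELAY=${SUBMIT_DELAY:-1}      # seconds to wait between submissions

# Submit every script given as an argument, one at a time,
# sleeping after each call so the scheduler is not flooded.
submit_spaced() {
    local script
    for script in "$@"; do
        "$SUBMIT_CMD" "$script"
        sleep "$SUBMIT_DELAY"
    done
}
```

After sourcing this file, a call such as submit_spaced job_a.sh job_b.sh (hypothetical script names) would submit the two scripts about one second apart.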

Use squeue to list jobs

The squeue command lists pending and running jobs. Supply your username as an argument with -u to list only your own jobs:

[someuser@host ~]$ squeue -u $USER
      JOBID PARTITION      NAME     USER ST   TIME  NODES NODELIST(REASON)
     123456 cpubase_b  simple_j someuser  R   0:03      1 cdr234
     123457 cpubase_b  simple_j someuser PD             1 (Priority)

The ST column of the output shows the status of each job. The two most common states are "PD" for "pending" and "R" for "running". See the squeue page for more on selecting, formatting, and interpreting the squeue output.

Please do not run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm, and may interfere with its performance or correct operation.

Where does the output go?

By default the output is placed in a file named "slurm-", suffixed with the job ID number and ".out", e.g. slurm-123456.out, in the directory from which the job was submitted. You can use --output to specify a different name or location. Certain replacement symbols can be used in the filename, e.g. %j will be replaced by the job ID number. See sbatch for a complete list.

The following sample script sets a job name (which appears in squeue output) and sends the output to a file with a name constructed from the job name (%x) and the job ID number (%j).

File :

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --job-name=test
#SBATCH --output=%x-%j.out
echo 'Hello, world!'

Error output will normally appear in the same file as standard output, just as it would if you were typing commands interactively. If you want to send the standard error channel (stderr) to a separate file, use --error.

Accounts and projects

Every job must have an associated account name corresponding to a Compute Canada Resource Allocation Project (RAP).

If you try to submit a job with sbatch and receive one of these messages:

 You are associated with multiple _cpu allocations...
 Please specify one of the following accounts to submit this job:

 You are associated with multiple _gpu allocations...
 Please specify one of the following accounts to submit this job:

then you have more than one valid account, and you will have to specify one using the --account directive:

#SBATCH --account=def-user-ab

To find out which account name corresponds to a given Resource Allocation Project, log in to CCDB and click on "My Account -> Account Details". You will see a list of all the projects you are a member of. The string you should use with the --account directive for a given project is under the column Group Name. Note that a Resource Allocation Project may only apply to a specific cluster (or set of clusters) and therefore may not be transferable from one cluster to another.

In the illustration below, jobs submitted with --account=def-rdickson will be accounted against RAP wnp-003-aa.

Finding the group name for a Resource Allocation Project (RAP)

If you plan to use one account consistently for all jobs, once you have determined the right account name you may find it convenient to set the following three environment variables in your ~/.bashrc file:

export SLURM_ACCOUNT=def-someuser
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT

Slurm will use the value of SBATCH_ACCOUNT in place of the --account directive in the job script. Note that even if you supply an account name inside the job script, the environment variable takes priority. In order to override the environment variable you must supply an account name as a command-line argument to sbatch.

SLURM_ACCOUNT plays the same role as SBATCH_ACCOUNT, but for the srun command instead of sbatch. The same idea holds for SALLOC_ACCOUNT.

Examples of job scripts

MPI job

This example script launches four MPI (Message Passing Interface) processes, each with 1024 MB of memory. The run time is limited to 5 minutes.

File :

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --ntasks=4               # number of MPI processes
#SBATCH --mem-per-cpu=1024M      # memory; default unit is megabytes
#SBATCH --time=0-00:05           # time (DD-HH:MM)
srun ./mpi_program               # mpirun or mpiexec also work

Large MPI jobs, specifically those which can efficiently use a multiple of 32 cores, should use --nodes and --ntasks-per-node instead of --ntasks. Hybrid MPI/threaded jobs are also possible. For more on these and other options relating to distributed parallel jobs, see Advanced MPI scheduling.

Threaded or OpenMP job

This example script launches a single process with eight CPU cores. Bear in mind that for an application to use OpenMP it must be compiled with the appropriate flag, e.g. gcc -fopenmp ... or icc -openmp ...

File :

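#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # one thread per allocated core
./openmp_program                              # placeholder name for your OpenMP executable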

For more on writing and running parallel programs with OpenMP, see OpenMP.
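A threaded program normally needs to be told how many threads to start. One common approach, sketched here, is to take the thread count from SLURM_CPUS_PER_TASK, which Slurm sets when --cpus-per-task is requested. The fallback to 1 is an assumption added so that the same lines also behave sensibly outside a job.

```shell
#!/bin/bash
# Use the core count granted by Slurm; fall back to a single thread
# when SLURM_CPUS_PER_TASK is not set (for example on a login node).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $OMP_NUM_THREADS OpenMP thread(s)"
```

Matching the thread count to the allocation avoids oversubscribing the cores assigned to the job.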

GPU job

There are many options involved in requesting GPUs because

  • the GPU-equipped nodes at Cedar and Graham have different configurations,
  • there are two different configurations at Cedar, and
  • there are different policies for the different Cedar GPU nodes.

Please see Using GPUs with SLURM for a discussion and examples of how to schedule various job types on the available GPU resources.

Array job

Also known as a task array, an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, $SLURM_ARRAY_TASK_ID, which is set to a different value for each instance of the job.

sbatch --array=0-7 ...      # $SLURM_ARRAY_TASK_ID will take values from 0 to 7 inclusive
sbatch --array=1,3,5,7 ...  # $SLURM_ARRAY_TASK_ID will take the listed values
sbatch --array=1-7:2 ...    # Step-size of 2, does the same as the previous example
sbatch --array=1-100%10 ... # Allow no more than 10 of the jobs to run simultaneously
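Inside the job script, $SLURM_ARRAY_TASK_ID is typically used to select the input for each instance. The sketch below assumes a hypothetical naming scheme of input_0.dat to input_7.dat, and defaults the task ID to 0 so the script can be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-00:10
#SBATCH --array=0-7
# Default to task 0 when not running as part of an array job.
task_id=${SLURM_ARRAY_TASK_ID:-0}
# Hypothetical naming scheme: one input file per array task.
input_file="input_${task_id}.dat"
echo "Task $task_id would process $input_file"
```

Replace the echo line with the command that processes $input_file.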

Interactive jobs

Though batch submission is the most common and most efficient way to take advantage of our clusters, interactive jobs are also supported. These can be useful for things like:

  • Data exploration at the command line
  • Interactive "console tools" like R and IPython
  • Significant software development, debugging, or compiling

You can start an interactive session on a compute node with salloc. In the following example we request two tasks, which corresponds to two CPU cores, for an hour:

[name@login ~]$ salloc --time=1:0:0 --ntasks=2 --account=def-someuser
salloc: Granted job allocation 1234567
[name@node01 ~]$ ...             # do some work
[name@node01 ~]$ exit            # terminate the allocation
salloc: Relinquishing job allocation 1234567

Interactive jobs of up to 24 hours are possible, but we strongly recommend that you restrict your interactive job requests to 3 hours or less.

Monitoring jobs

By default squeue will show all the jobs the scheduler is managing at the moment. It may run much faster if you ask only about your own jobs with

squeue -u <username>

You can show only running jobs, or only pending jobs:

squeue -u <username> -t RUNNING
squeue -u <username> -t PENDING
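If you only need a summary, you can count your jobs by state instead of reading the full listing. The helper below is a sketch: count_states is a made-up name, and it expects one state name per line on standard input, which squeue can produce with the -h (no header) and -o "%T" (state only) options.

```shell
#!/bin/bash
# Read one job state per line on stdin and print "STATE COUNT" pairs,
# sorted alphabetically by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on a cluster (run it occasionally, not in a tight loop):
#   squeue -u "$USER" -h -o "%T" | count_states
```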

You can show detailed information for a specific job with scontrol:

scontrol show job -dd <jobid>

Find information about a completed job with sacct, and optionally, control what it prints using --format:

sacct -j <jobid>
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

If a node fails while running a job, the job may be restarted. sacct will normally show you only the record for the last (presumably successful) run. If you wish to see all records related to a given job, add the --duplicates option.

Use the MaxRSS accounting field to determine how much memory a job needed. The value returned will be the largest resident set size for any of the tasks. If you want to know which task and node this occurred on, print the MaxRSSTask and MaxRSSNode fields also.
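sacct usually reports MaxRSS with a unit suffix such as K, M or G. To compare it with a request given in megabytes, a small conversion helper can be handy. The function below is only a sketch: to_mb is a made-up name, and it assumes the suffixes denote binary multiples and that a value without a suffix is in bytes.

```shell
#!/bin/bash
# Convert a memory value such as "208592K", "1.5G" or "512M" to megabytes.
# Assumption: K/M/G/T are binary multiples; a bare number means bytes.
to_mb() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                    # numeric prefix of the value
        u = substr(v, length(v))     # last character, possibly a unit
        if (u == "K")      n /= 1024
        else if (u == "G") n *= 1024
        else if (u == "T") n *= 1024 * 1024
        else if (u != "M") n /= 1024 * 1024
        printf "%.1f\n", n
    }'
}

# Intended use on a cluster, with a real job ID in place of <jobid>:
#   sacct -j <jobid> --format=JobID,MaxRSS --noheader --parsable2
# then pass the MaxRSS field of the line of interest to to_mb.
```

When choosing --mem or --mem-per-cpu for the next run, leave some margin above the measured value.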

The sstat command works on a running job much the same way that sacct works on a completed job.

You can ask to be notified by email of certain job conditions by supplying options to sbatch:

#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=REQUEUE
#SBATCH --mail-type=ALL

Cancelling jobs

Use scancel with the job ID to cancel a job:

 scancel <jobid>

You can also use it to cancel all your jobs, or all your pending jobs:

scancel -u <username>
scancel -t PENDING -u <username>

Resubmitting jobs for long running computations

When a computation is going to require a long time to complete, so long that it cannot be done within the time limits on the system, the application you are running must support checkpointing. The application should be able to save its state to a file, called a checkpoint file, and then it should be able to restart and continue the computation from that saved state.

For many users restarting a calculation will be rare and may be done manually, but some workflows require frequent restarts. In this case some kind of automation technique may be employed.

Here are two recommended methods of automatic restarting:

  • Using SLURM job arrays.
  • Resubmitting from the end of the job script.

Restarting using job arrays

Using the --array=1-100%10 syntax mentioned earlier, but with the limit after the % sign set to 1 (for example --array=1-10%1), one can submit a collection of identical jobs with the condition that only one of them will run at any given time. The script should be written to ensure that the last checkpoint is always used for the next job. The number of restarts is fixed by the --array argument.

Consider, for example, a molecular dynamics simulation that has to be run for 1 000 000 steps and does not fit into the time limit on the cluster. We can split the simulation into 10 smaller jobs of 100 000 steps each, run one after another.
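The arithmetic of this split can be sketched in a few lines of shell. The function chunk_range is a made-up helper. It assumes the total number of steps divides evenly by the number of jobs and that array task IDs start at 1.

```shell
#!/bin/bash
# Print the first and last step handled by a given array task.
# Arguments: total steps, number of jobs, task ID (counted from 1).
# Assumes the total is an exact multiple of the number of jobs.
chunk_range() {
    local total=$1 n_jobs=$2 task=$3
    local per_job=$(( total / n_jobs ))
    local first=$(( (task - 1) * per_job + 1 ))
    local last=$(( task * per_job ))
    echo "$first $last"
}

# Inside an array job the task ID comes from Slurm; default to 1 elsewhere.
chunk_range 1000000 10 "${SLURM_ARRAY_TASK_ID:-1}"
```

For task 3 of 10 this gives steps 200001 to 300000. In practice most applications resume from the checkpoint file and only need to be told how many more steps to run.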

An example of using a job array to restart a simulation:

File :

#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for a multi-step job on a Compute Canada cluster. 
# ---------------------------------------------------------------------
#SBATCH --account=def-someuser
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00
#SBATCH --mem=100M
#SBATCH --array=1-10%1   # Run a 10-job array, one job at a time.
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
echo ""
echo "Job Array ID / Job ID: $SLURM_ARRAY_JOB_ID / $SLURM_JOB_ID"
echo "This is job $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_COUNT jobs."
echo ""
# ---------------------------------------------------------------------
# Run your simulation step here...

if test -e state.cpt; then
     # There is a checkpoint file, restart;
     mdrun --restart state.cpt
else
     # There is no checkpoint file, start a new simulation.
     mdrun
fi

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------

Resubmission from the job script

In this case one submits a job that runs the first chunk of the calculation and saves a checkpoint. Once the chunk is done but before the allocated run-time of the job has elapsed, the script checks if the end of the calculation has been reached. If the calculation is not yet finished, the script submits a copy of itself to continue working.

An example of a job script with resubmission:

File :

#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for job resubmission on a Compute Canada cluster. 
# ---------------------------------------------------------------------
#SBATCH --job-name=job_chain
#SBATCH --account=def-someuser
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00
#SBATCH --mem=100M
# ---------------------------------------------------------------------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
# Run your simulation step here...

if test -e state.cpt; then
     # There is a checkpoint file, restart;
     mdrun --restart state.cpt
else
     # There is no checkpoint file, start a new simulation.
     mdrun
fi

# Resubmit if not all work has been done yet.
# You must define the function end_is_not_reached().
if end_is_not_reached; then
     sbatch "${BASH_SOURCE[0]}"
fi

# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: `date`"
# ---------------------------------------------------------------------
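How end_is_not_reached() is written depends entirely on your application. As one possible sketch, suppose the application writes the number of completed steps to a small text file. The file name steps_done.txt, the variables PROGRESS_FILE and TARGET_STEPS, and that behaviour are all assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical setup: the application records the number of completed
# steps (a single integer) in PROGRESS_FILE after each chunk of work.
TARGET_STEPS=${TARGET_STEPS:-1000000}
PROGRESS_FILE=${PROGRESS_FILE:-steps_done.txt}

# Succeed (return 0) while fewer than TARGET_STEPS steps are done.
# A missing progress file is treated as zero steps completed.
end_is_not_reached() {
    local done_steps=0
    [ -r "$PROGRESS_FILE" ] && done_steps=$(<"$PROGRESS_FILE")
    [ "$done_steps" -lt "$TARGET_STEPS" ]
}
```

Because a missing file counts as "not finished", a job that fails before writing any progress would resubmit itself indefinitely. Consider adding a limit on the number of resubmissions.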


Avoid hidden characters in job scripts

Preparing a job script with a word processor instead of a text editor is a common cause of trouble. Best practice is to prepare your job script on the cluster using an editor such as nano, vim, or emacs. If you prefer to prepare or alter the script off-line, then:

  • Windows users:
    • Use a text editor such as Notepad or Notepad++.
    • After uploading the script, use dos2unix to change Windows end-of-line characters to Linux end-of-line characters.
  • Mac users:
    • Open a terminal window and use an editor such as nano, vim, or emacs.
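If you are not sure whether an uploaded script still contains Windows end-of-line characters, you can check before submitting it. The helper has_crlf below is a made-up name for a one-line test with grep. It only detects carriage returns, not other hidden characters a word processor might insert.

```shell
#!/bin/bash
# Succeed (return 0) if the file contains at least one carriage return,
# which indicates Windows-style line endings.
has_crlf() {
    grep -q $'\r' "$1"
}

# Intended use, with your own script name in place of job.sh:
#   if has_crlf job.sh; then dos2unix job.sh; fi
```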

Cancellation of jobs with dependency conditions which cannot be met

A job submitted with --dependency=afterok:<jobid> is a "dependent job". A dependent job will wait for the parent job to be completed. If the parent job fails (that is, ends with a non-zero exit code) the dependent job can never be scheduled and so will be automatically cancelled. See sbatch for more on dependency.
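To chain two jobs from the command line, the job ID of the parent has to be captured when it is submitted. The sketch below uses the --parsable option of sbatch, which prints the job ID (optionally followed by a semicolon and the cluster name) instead of the usual message. The names submit_chain and SUBMIT_CMD are made up for this example.

```shell
#!/bin/bash
# SUBMIT_CMD is an illustrative variable, not a Slurm setting.
SUBMIT_CMD=${SUBMIT_CMD:-sbatch}

# Submit the first script, then submit the second so that it starts
# only if the first finishes successfully (afterok).
submit_chain() {
    local first=$1 second=$2 parent_id
    parent_id=$("$SUBMIT_CMD" --parsable "$first") || return 1
    parent_id=${parent_id%%;*}       # drop ";cluster" if present
    "$SUBMIT_CMD" --dependency="afterok:${parent_id}" "$second"
}
```

After sourcing this file, submit_chain first.sh second.sh (hypothetical script names) would queue both jobs. If the first job fails, the second is cancelled as described above.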

Job cannot load a module

It is possible to see an error such as:

Lmod has detected the following error: These module(s) exist but cannot be
loaded as requested: "<module-name>/<version>"
   Try: "module spider <module-name>/<version>" to see how to load the module(s).

This can occur if the particular module has an unsatisfied prerequisite. For example

[name@server]$ module load gcc
[name@server]$ module load quantumespresso/6.1
Lmod has detected the following error:  These module(s) exist but cannot be loaded as requested: "quantumespresso/6.1"
   Try: "module spider quantumespresso/6.1" to see how to load the module(s).
[name@server]$ module spider quantumespresso/6.1

  quantumespresso: quantumespresso/6.1
      Quantum ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials (both
      norm-conserving and ultrasoft).

      Chemistry libraries/apps / Logiciels de chimie

    You will need to load all module(s) on any one of the lines below before the "quantumespresso/6.1" module is available to load.

      nixpkgs/16.09  intel/2016.4  openmpi/2.1.1


      Quantum ESPRESSO  is an integrated suite of computer codes
       for electronic-structure calculations and materials modeling at the nanoscale.
       It is based on density-functional theory, plane waves, and pseudopotentials
        (both norm-conserving and ultrasoft).

      More information
       - Homepage:

In this case adding the line module load nixpkgs/16.09 intel/2016.4 openmpi/2.1.1 to your job script before loading the "quantumespresso/6.1" will solve this problem.

Jobs inherit environment variables

By default a job will inherit the environment variables of the shell from which it was submitted. The module command, which is used to make various software packages available, changes and sets environment variables. Those changes propagate to any job submitted from the shell and can therefore affect the job's ability to load modules if prerequisites are missing. It is best to include the line module purge in your job script before loading the required modules. This ensures a consistent state for each job submission and prevents changes made in your shell from affecting your jobs.

Job status and priority

  • For a discussion of how job priority is determined and how things like time limits may affect the scheduling of your jobs at Cedar and Graham, see Job scheduling policies.

Further reading