Advanced MPI scheduling
Most users should submit MPI (Message Passing Interface) or distributed-memory parallel jobs following the example given at Running jobs. Simply request a number of processes with --ntasks or -n and trust the scheduler to allocate those processes in a way that balances the efficiency of your job with the overall efficiency of the cluster.
If you want more control over how your job is allocated, then SchedMD's page on multicore support is a good place to begin. It describes how many of the options to the sbatch command interact to constrain the placement of processes.
You may find this discussion of What exactly is considered a CPU? in Slurm to be useful.
Examples of common MPI scenarios
Few cores, any number of nodes
In addition to the time limit needed for any Slurm job, an MPI job requires that you specify how many MPI processes Slurm should start. The simplest way to do this is with --ntasks. Since the default memory allocation of 256MB per core is often insufficient, you may also wish to specify how much memory is needed. With --ntasks alone you cannot know in advance how many cores will reside on each node, so you should request memory with --mem-per-cpu. For example:
#SBATCH --ntasks=15
#SBATCH --mem-per-cpu=3G
srun application.exe
This will run 15 MPI processes. The cores could be allocated on one node, on 15 nodes, or on any number in between.
Whole nodes
Most nodes in Cedar and Graham have 32 cores and 128GB or more of memory. If you have a large parallel job that can use 32 cores, or a multiple of 32, efficiently, you should request whole nodes like so:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128000M
srun application.exe
The above job will probably start sooner than an equivalent one that requests --ntasks=64.
You should use --mem=128000M rather than --mem=128G when requesting whole nodes because a small amount of memory is reserved for the operating system, and requesting precisely 128GB means the job cannot be scheduled on the plentiful 128GB nodes. Slurm will not reject the job; it will just wait much longer to start than it needs to.
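The size of the gap between the two requests can be sketched with a little shell arithmetic. This assumes that Slurm reads the G suffix as 1024M; the variable names are illustrative only.

```shell
# Assumption: Slurm treats 1G as 1024M when parsing --mem.
mem_128g_in_m=$(( 128 * 1024 ))   # what --mem=128G asks for, in M
mem_safe_in_m=128000              # what --mem=128000M asks for
extra_m=$(( mem_128g_in_m - mem_safe_in_m ))
echo "--mem=128G requests ${mem_128g_in_m}M, ${extra_m}M more than --mem=128000M"
```

Under that assumption the 128G request is about 3GB larger, which is enough to exceed what a 128GB node can offer once the operating system's share is set aside.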
Few cores, single node
If for some reason you need fewer than 32 cores but they must all be on a single node, then you can request:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=15
#SBATCH --mem=45G
srun application.exe
In this case you could also say
--mem-per-cpu=3G. The advantage of
--mem=45G is that the memory consumed by each individual process doesn't matter, as long as all of them together don’t use more than 45GB. With
--mem-per-cpu=3G, the job will be canceled if any of the processes exceeds 3GB.
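The difference between the two limits can be illustrated with a small sketch. The per-process memory figures below are hypothetical, chosen so that the total stays under 45GB while one process exceeds 3GB; the checks only mimic the limits described above and are not how Slurm itself enforces them.

```shell
# Hypothetical memory use in GB for 15 processes:
# fourteen use 2 GB each and one uses 5 GB.
usage_gb="2 2 2 2 2 2 2 2 2 2 2 2 2 2 5"

total_gb=0
max_gb=0
for u in $usage_gb; do
    total_gb=$(( total_gb + u ))
    if [ "$u" -gt "$max_gb" ]; then max_gb=$u; fi
done

# --mem=45G limits the sum over the node.
if [ "$total_gb" -le 45 ]; then fits_mem=yes; else fits_mem=no; fi
# --mem-per-cpu=3G limits each process (one core per process here).
if [ "$max_gb" -le 3 ]; then fits_mem_per_cpu=yes; else fits_mem_per_cpu=no; fi

echo "total=${total_gb}G max=${max_gb}G fits --mem=45G: ${fits_mem}, fits --mem-per-cpu=3G: ${fits_mem_per_cpu}"
```

With these figures the job stays within --mem=45G but would be canceled under --mem-per-cpu=3G.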
Large parallel job, not a multiple of 32 cores
Not every application runs with maximum efficiency on a multiple of 32 cores. Choosing the number of cores to request, and whether or not to request whole nodes, may be a trade-off between running time (or efficient use of the computer) and waiting time (or efficient use of your time). If you want help evaluating these factors, please contact Technical support.
Hybrid jobs: MPI and OpenMP, or MPI and threads
It is important to understand that the number of tasks requested of Slurm is the number of processes that will be started by srun. So for a hybrid job that will use both MPI processes and OpenMP threads or POSIX threads, you should set the MPI process count with --ntasks or --ntasks-per-node, and set the thread count with --cpus-per-task. For example:
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=3G
srun application.exe
In this example a total of 64 cores will be allocated, but only 16 MPI processes (tasks) will be started. If the application also uses OpenMP, then each process will spawn 4 threads, one per core. Each process will be allocated 12GB of memory. The tasks, with 4 cores each, could be placed on anywhere from 2 to 16 nodes.
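For an OpenMP application, the thread count is commonly taken from the environment. A minimal sketch, assuming the application honours OMP_NUM_THREADS and that Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; the fallback of 1 is a choice made here so that the lines also work outside a job:

```shell
# Match the OpenMP thread count to the cores Slurm gave each task.
# Falls back to 1 thread if SLURM_CPUS_PER_TASK is unset (assumption).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Each MPI process will use ${OMP_NUM_THREADS} OpenMP thread(s)"
```

In a job script these lines would typically go before the srun line.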
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH --mem=96G
srun application.exe
This job is the same size as the last one: 16 tasks (that is, 16 MPI processes), each with 4 threads. The difference here is that we are sure of getting exactly 2 whole nodes. Recall that --mem requests memory per node, so we use it instead of --mem-per-cpu for the reason described earlier.
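The claim that the two hybrid requests are the same size can be checked with shell arithmetic. The numbers come from the two examples above; the variable names are illustrative only.

```shell
# First request: --ntasks=16 --cpus-per-task=4 --mem-per-cpu=3G
tasks=16; cpus_per_task=4; mem_per_cpu_g=3
cores_a=$(( tasks * cpus_per_task ))                 # total cores
mem_per_task_g=$(( cpus_per_task * mem_per_cpu_g ))  # memory per MPI process

# Second request: --nodes=2 --ntasks-per-node=8 --cpus-per-task=4 --mem=96G
nodes=2; tasks_per_node=8; mem_per_node_g=96
cores_per_node=$(( tasks_per_node * cpus_per_task ))
cores_b=$(( nodes * cores_per_node ))                # total cores
mem_per_core_g=$(( mem_per_node_g / cores_per_node ))

echo "first: ${cores_a} cores, ${mem_per_task_g}G per task"
echo "second: ${cores_b} cores, ${cores_per_node} per node, ${mem_per_core_g}G per core"
```

Both requests come to 64 cores and 3GB per core; the second one fills 32 cores on each of its 2 nodes.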
Why srun instead of mpiexec or mpirun?
mpirun is a wrapper that enables communication between processes running on different machines. Modern schedulers already provide many things that
mpirun needs. With Torque/Moab, for example, there is no need to pass to
mpirun the list of nodes on which to run, or the number of processes to launch; this is done automatically by the scheduler. With Slurm, the task affinity is also resolved by the scheduler, so there is no need to specify things like
mpirun --map-by node:pe=4 -n 16 application.exe
As implied in the examples above,
srun application.exe will automatically distribute the processes to precisely the resources allocated to the job.
In programming terminology, srun is a higher level of abstraction than mpirun. Anything that can be done with mpirun can be done with srun, and more. It is the tool in Slurm for distributing any kind of computation. It replaces Torque’s pbsdsh, for example, and much more. Think of srun as the Slurm "all-around parallel-tasks distributor"; once a particular set of resources is allocated, the nature of your application doesn't matter (MPI, OpenMP, hybrid, serial farming, pipelining, multi-program, etc.), you just have to srun it.
Also, as you would expect, srun is fully coupled to Slurm. When you srun an application, a "job step" is started, environment variables such as SLURM_PROCID are initialized correctly, and correct accounting information is recorded.
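As a small illustration of the per-task environment, the sketch below prints the rank of the current task. SLURM_PROCID and SLURM_NTASKS are variables Slurm sets for tasks started by srun; the function name and the defaults of 0 and 1 are assumptions added here so that the snippet also runs outside a job.

```shell
# Report which task this is within the job step.
# Defaults (0 and 1) apply only when the Slurm variables are unset.
report_task() {
    echo "task ${SLURM_PROCID:-0} of ${SLURM_NTASKS:-1}"
}
report_task
```

Saved as a script and launched with srun, each task would be expected to print its own rank.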
For an example of some differences between srun and mpiexec, see this discussion on the Open MPI support forum. Better performance might be achievable with mpiexec than with srun under certain circumstances, but using srun minimizes the risk of a mismatch between the resources allocated by Slurm and those used by Open MPI.