GROMACS

General

GROMACS is a versatile package to perform molecular dynamics for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.

Strengths

  • GROMACS provides extremely high performance compared to all other programs.
  • Since GROMACS 4.6, we have excellent CUDA-based GPU acceleration on GPUs that have Nvidia compute capability >= 2.0 (e.g. Fermi or later).
  • GROMACS comes with a large selection of flexible tools for trajectory analysis.
  • GROMACS can be run in parallel, using either the standard MPI communication protocol, or via our own "Thread MPI" library for single-node workstations.
  • GROMACS is Free Software, available under the GNU Lesser General Public License (LGPL), version 2.1.

Weak points

  • To achieve very high simulation speed, GROMACS does little additional analysis or data collection on the fly. It may be a challenge to obtain somewhat non-standard information about the simulated system from a GROMACS simulation.
  • Different versions of GROMACS may differ significantly in simulation methods and default parameters. Reproducing results of older versions with a newer version may not be straightforward.
  • The additional tools and utilities that come with GROMACS are not always of the highest quality: they may contain bugs and may implement poorly documented methods. Reconfirming the results of such tools with independent methods is always a good idea.

GPU support

The top part of any GROMACS log file describes the configuration, and in particular whether your version has GPU support compiled in. GROMACS will automatically use any GPUs it finds.

GROMACS uses both CPUs and GPUs; it relies on a reasonable balance between CPU and GPU performance.

The new neighbor structure required the introduction of a new variable called "cutoff-scheme" in the mdp file. The old GROMACS settings correspond to the value "group"; you must switch this to "verlet" to use GPU acceleration.
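
Before submitting a GPU job, it can be worth checking which cutoff-scheme an mdp file actually sets. The following is a minimal sketch: the helper name mdp_cutoff_scheme is hypothetical, and it assumes the usual mdp layout of "option = value" lines with ";" starting a comment.

```bash
#!/bin/bash
# Hypothetical helper: print the cutoff-scheme set in an mdp file in lower
# case, or "unset" if the option does not appear in the file.
mdp_cutoff_scheme() {
    local mdp="$1" value
    # Strip ';' comments, then look for "cutoff-scheme = <value>"
    # (the option name is matched with either a dash or an underscore).
    value=$(sed -e 's/;.*$//' "$mdp" \
        | awk -F= 'tolower($1) ~ /^[ \t]*cutoff[-_]scheme[ \t]*$/ {
              gsub(/[ \t]/, "", $2); print tolower($2) }' \
        | tail -n 1)
    echo "${value:-unset}"
}
```

For example, mdp_cutoff_scheme md.mdp prints "verlet" for a file that is ready for GPU runs; "unset" means the file relies on the default of the installed GROMACS version.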

Quickstart Guide

This section summarizes configuration details.

Environment Modules

The following GROMACS versions have been installed:

GROMACS version | Modules needed for CPU version | Modules needed for GPU (CUDA) version | Notes
gromacs/2018.1 | gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1 | gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.1 | GCC & FFTW
gromacs/2018 | gromacs/2018 | cuda/9.0.176 gromacs/2018 | Intel & MKL
gromacs/2016.5 | gcc/6.4.0 openmpi/2.1.1 gromacs/2016.5 | gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2016.5 | GCC & FFTW
gromacs/2016.3 | gromacs/2016.3 | cuda/8.0.44 gromacs/2016.3 | Intel & MKL
gromacs/5.1.4 | gromacs/5.1.4 | cuda/8.0.44 gromacs/5.1.4 | Intel & MKL
gromacs/5.0.7 | gromacs/5.0.7 | cuda/8.0.44 gromacs/5.0.7 | Intel & MKL
gromacs/4.6.7 | gromacs/4.6.7 | cuda/8.0.44 gromacs/4.6.7 | Intel & MKL

GROMACS versions 2016.5, 2018.1 and newer have been compiled with the GCC compilers and the FFTW and Open MPI 2.1.1 libraries, as they result in slightly better performance. Older versions have been compiled with the Intel compilers, using the Intel MKL and Open MPI 2.1.1 libraries from the default environment. CPU (non-GPU) versions are available in both single and double precision.

These modules can be loaded by using a module load command with the modules listed in the second column of the table above. For example:

$ module load  gcc/6.4.0  openmpi/2.1.1  gromacs/2018.1

These versions are also available with GPU support, albeit only in single precision. To load the GPU-enabled version of GROMACS, the cuda module needs to be loaded first. The modules needed are listed in the third column of the table above, e.g.:

$ module load  gcc/6.4.0  cuda/9.0.176  openmpi/2.1.1  gromacs/2018.1
or
$ module load  cuda/8.0.44  gromacs/2016.3 

For more information on Environment Modules, please refer to the Using modules page.

Suffixes

GROMACS 5.x, 2016.x and 2018.x

GROMACS 5 and newer releases consist of only four binaries that contain the full functionality. All GROMACS tools from previous versions have been implemented as sub-commands of the gmx binaries. Please refer to GROMACS 5.0 Tool Changes and the GROMACS documentation manuals for your version.

  • gmx - single precision GROMACS with OpenMP (threading) but without MPI.
  • gmx_mpi - single precision GROMACS with OpenMP and MPI.
  • gmx_d - double precision GROMACS with OpenMP but without MPI.
  • gmx_mpi_d - double precision GROMACS with OpenMP and MPI.

GROMACS 4.6.7

  • The double precision binaries have the suffix _d.
  • The parallel single and double precision mdrun binaries are:
  • mdrun_mpi
  • mdrun_mpi_d

Submission Scripts

Please refer to the page "Running jobs" for help on using the SLURM workload manager.

Serial Job

Here's a simple job script for serial mdrun:


File : serial_gromacs_job.sh

#!/bin/bash
#SBATCH --time 0-00:30        # time limit (D-HH:MM)
module purge  
module load gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

gmx mdrun -deffnm em


This will run the simulation of the molecular system in the file em.tpr.

MPI Job

A job script for mdrun using 4 MPI processes:


File : mpi_gromacs_job.sh

#!/bin/bash
#SBATCH --ntasks 4               # number of MPI processes
#SBATCH --mem 4000               # memory limit per node (megabytes)
#SBATCH --time 0:30:00           # time limit (D-HH:MM:ss)
module purge  
module load gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

srun gmx_mpi mdrun -deffnm md



Hybrid MPI/OpenMP Job

A job script for mdrun using 8 MPI tasks and 2 OpenMP threads per MPI task:


File : hybrid_gromacs_job.sh

#!/bin/bash
#SBATCH --ntasks 8               # number of MPI processes
#SBATCH --cpus-per-task 2        # number of OpenMP threads per MPI process
#SBATCH --mem-per-cpu 1000       # memory limit per CPU core (megabytes)
#SBATCH --time 0:30:00           # time limit (D-HH:MM:ss)
module purge  
module load gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

srun gmx_mpi mdrun -deffnm md


GPU Job

A job script for mdrun using 4 OpenMP threads and one GPU:

File : gpu_gromacs_job.sh

#!/bin/bash
#SBATCH --gres=gpu:1             # request 1 GPU as "generic resource"
#SBATCH --cpus-per-task 4        # number of OpenMP threads per MPI process
#SBATCH --mem-per-cpu 1000       # memory limit per CPU core (megabytes)
#SBATCH --time 0:30:00           # time limit (D-HH:MM:ss)
module purge  
module load gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

gmx mdrun -ntomp "${SLURM_CPUS_PER_TASK:-1}" -deffnm md


GPU-MPI Job

A job script for mdrun using 1 node, 2 GPUs, 6 MPI tasks per node and 2 OpenMP threads per MPI task:

File : gpu_mpi_gromacs_job.sh

#!/bin/bash
#SBATCH --nodes=1                # number of nodes
#SBATCH --gres=gpu:2             # request 2 GPUs per node
#SBATCH --ntasks-per-node=6      # request 6 MPI tasks per node
#SBATCH --cpus-per-task=2        # 2 OpenMP threads per MPI process
#SBATCH --mem-per-cpu 1000       # memory limit per CPU core (megabytes)
#SBATCH --time 1:00:00           # time limit (D-HH:MM:ss)
module purge  
module load gcc/6.4.0 cuda/9.0.176 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

mpiexec gmx_mpi mdrun -deffnm md


Notes on running GROMACS on GPUs
  • The new national systems (Cedar and Graham) have differently configured GPU nodes:
      • Cedar has 4 GPUs and 24 CPU cores per node.
      • Graham has 2 GPUs and 32 CPU cores per node.
    Therefore one needs to use different settings to make use of all GPUs and CPU cores in a node:
      • Cedar: --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=3
      • Graham: --gres=gpu:2 --ntasks-per-node=8 --cpus-per-task=4
    Of course the simulated system needs to be large enough to utilize the resources.
  • GROMACS imposes a number of constraints on the choice of the number of GPUs, tasks (MPI ranks) and OpenMP threads.
    For GROMACS 2016.3 the constraints are:
      • The number of tasks per node (--ntasks-per-node) always needs to be a multiple of the number of GPUs (--gres=gpu:).
      • GROMACS will not run GPU runs with only 1 OpenMP thread, unless forced by setting the -ntomp option.
    According to the developers, the optimum number of --cpus-per-task is between 2 and 6.
  • Avoid using a larger fraction of the CPUs and memory of a node than the fraction of its GPUs you have requested.
  • According to the developers of the SLURM scheduler, using srun as a replacement for mpiexec/mpirun is the preferred way to start MPI jobs. However, we have seen evidence of jobs failing on startup when two jobs using srun are started on the same compute node.
    At this time we therefore recommend using mpiexec, especially when utilizing only partial nodes.
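
The constraints above can be checked before submitting a job. The sketch below is only an illustration: the function name check_gpu_layout is hypothetical, and it encodes just the two rules stated on this page (tasks per node a multiple of the GPU count, 2 to 6 OpenMP threads per task as the suggested range).

```bash
#!/bin/bash
# Hypothetical helper: sanity-check a GPU job layout.
# Arguments: GPUs per node, MPI tasks per node, OpenMP threads per task.
check_gpu_layout() {
    local gpus="$1" tasks="$2" threads="$3"
    # Rule 1: tasks per node must be a multiple of the number of GPUs.
    if [ $((tasks % gpus)) -ne 0 ]; then
        echo "error: $tasks tasks per node is not a multiple of $gpus GPUs"
        return 1
    fi
    # Rule 2: the developers suggest 2 to 6 OpenMP threads per task.
    if [ "$threads" -lt 2 ] || [ "$threads" -gt 6 ]; then
        echo "warning: $threads threads per task is outside the suggested 2-6 range"
    fi
    echo "ok: $tasks tasks x $threads threads = $((tasks * threads)) cores, $((tasks / gpus)) tasks per GPU"
}
```

With the Cedar settings listed above, check_gpu_layout 4 8 3 reports 24 cores and 2 tasks per GPU; with the Graham settings, check_gpu_layout 2 8 4 reports 32 cores and 4 tasks per GPU.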

Usage

More content for this section will be added at a later time.

System Preparation

In order to run a GROMACS simulation, one needs to create a tpr file (portable binary run input file). This file contains the starting structure of the simulation, the molecular topology and all the simulation parameters.

Tpr files are created with the gmx grompp command (or simply grompp for GROMACS versions older than 5.0). It needs the following files:

  • The coordinate file with the starting structure. GROMACS can read the starting structure from various file formats, such as .gro, .pdb or .cpt (GROMACS checkpoint).
  • The (system) topology (.top) file. It defines which force field is used and how the force-field parameters are applied to the simulated system. Often the topologies for individual parts of the simulated system (e.g. molecules) are placed in separate .itp files and included in the .top file using a #include directive.
  • The run-parameter (.mdp) file. See the GROMACS user guide for a detailed description of the options.

Tpr files are portable, that is, they can be grompp'ed on one machine, copied over to a different machine and used as an input file for mdrun. One should always use the same GROMACS version for both grompp and mdrun. Although mdrun is able to use tpr files that have been created with an older version of grompp, this can lead to unexpected simulation results.
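
As an illustration of how the three input files come together, here is a minimal sketch of a grompp call for GROMACS 5 and newer. The wrapper name make_tpr and the file names used with it are placeholders; the -f, -c, -p and -o options select the mdp file, the coordinate file, the topology and the output tpr file respectively.

```bash
#!/bin/bash
# Hypothetical wrapper: build a tpr file from the three grompp inputs.
# Arguments: mdp file, coordinate file, topology file, output tpr file.
make_tpr() {
    local mdp="$1" coords="$2" top="$3" tpr="$4" f
    # Fail early with a clear message if an input file is missing.
    for f in "$mdp" "$coords" "$top"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    gmx grompp -f "$mdp" -c "$coords" -p "$top" -o "$tpr"
}
```

A call such as make_tpr md.mdp conf.gro topol.top md.tpr would then produce the md.tpr file used by the job scripts on this page, provided a GROMACS module is loaded.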

Running Simulations

MD simulations often take much longer to complete than the maximum walltime of a job and therefore need to be restarted. To minimize the time a job needs to wait before it starts, you should maximise the number of nodes you have access to by choosing a shorter running time for your job. Requesting a walltime of 24 hours or 72 hours (three days) is often a good trade-off between waiting time and running time.

You should use the mdrun parameter -maxh to tell the program the requested walltime so that it gracefully finishes the current timestep when reaching 99% of this walltime. This causes mdrun to create a new checkpoint file at this final timestep and gives it the chance to properly close all output-files (trajectories, energy- and log-files, etc.).

For example, use #SBATCH --time=24:00:00 along with gmx mdrun -maxh 24 ... or #SBATCH --time=3-00:00 along with gmx mdrun -maxh 72 ....
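
To keep the two values in sync, the -maxh value can be derived from the Slurm time limit. The sketch below assumes the time formats documented for sbatch (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds); the function name slurm_time_to_hours is hypothetical. Note that Slurm reads a value with a single colon and no day prefix as minutes:seconds, so "24:00" means 24 minutes, not 24 hours.

```bash
#!/bin/bash
# Hypothetical helper: convert a Slurm --time value to hours for mdrun -maxh.
slurm_time_to_hours() {
    awk -v t="$1" 'BEGIN {
        days = 0; h = 0; m = 0; s = 0
        if (index(t, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split(t, dh, "-"); days = dh[1]
            n = split(dh[2], f, ":")
            h = f[1]; if (n >= 2) m = f[2]; if (n >= 3) s = f[3]
        } else {
            n = split(t, f, ":")
            if (n == 1)      { m = f[1] }                        # minutes
            else if (n == 2) { m = f[1]; s = f[2] }              # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] }    # hours:minutes:seconds
        }
        printf "%.2f\n", days * 24 + h + m / 60 + s / 3600
    }'
}
```

In a job script this could be used as gmx mdrun -deffnm md -maxh "$(slurm_time_to_hours 3-00:00)", which passes 72.00 to mdrun.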


File : gromacs_job.sh

#!/bin/bash
#SBATCH --nodes=1                # number of Nodes
#SBATCH --ntasks-per-node=32     # number of MPI processes per node
#SBATCH --mem-per-cpu=4000       # memory limit per CPU (megabytes)
#SBATCH --time=24:00:00          # time limit (D-HH:MM:ss)
module purge
module load gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

srun  gmx_mpi  mdrun  -deffnm md  -maxh 24



Restarting Simulations

You can restart a GROMACS simulation by using the same mdrun command as for the original simulation and adding the -cpi state.cpt parameter, where state.cpt is the filename of the most recent checkpoint file. By default (since version 4.5), mdrun will try to append to the existing files (trajectories, energy and log files, etc.). GROMACS will check the consistency of the output files and, if needed, discard timesteps that are newer than that of the checkpoint file.

Using the -maxh parameter ensures that the checkpoint and output files are written in a consistent state when the simulation reaches the time limit.

The GROMACS manual contains more detailed information [1][2].


File : gromacs_job_restart.sh

#!/bin/bash
#SBATCH --nodes=1                # number of Nodes
#SBATCH --ntasks-per-node=32     # number of MPI processes per node
#SBATCH --mem-per-cpu=4000       # memory limit per CPU (megabytes)
#SBATCH --time=24:00:00          # time limit (D-HH:MM:ss)
module purge
module load gcc/6.4.0 openmpi/2.1.1 gromacs/2018.1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

srun  gmx_mpi  mdrun  -deffnm md  -maxh 24.0  -cpi md.cpt
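
The first run and the restarts can also share one job script that adds -cpi only when a checkpoint file is present. This is a minimal sketch, not part of the scripts above; the function name mdrun_args is hypothetical, and it assumes the -deffnm naming used on this page, where the checkpoint of a run called md is md.cpt.

```bash
#!/bin/bash
# Hypothetical helper: print the mdrun arguments for a first run or a restart.
# Arguments: default file name (as for -deffnm), walltime in hours (for -maxh).
mdrun_args() {
    local deffnm="$1" maxh="$2"
    if [ -f "${deffnm}.cpt" ]; then
        # A checkpoint exists: continue from it.
        echo "-deffnm $deffnm -maxh $maxh -cpi ${deffnm}.cpt"
    else
        # No checkpoint yet: start from the tpr file.
        echo "-deffnm $deffnm -maxh $maxh"
    fi
}
```

The job script would then end with a line such as srun gmx_mpi mdrun $(mdrun_args md 24), left unquoted so that the shell splits the printed arguments.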


Performance Considerations

Getting the best mdrun performance with GROMACS is not a straightforward task. The GROMACS developers maintain a long section in their user guide dedicated to mdrun performance[3], which explains all relevant options, parameters and strategies.

There is no "one size fits all": the best parameters depend strongly on the size of the system (number of particles as well as size and shape of the simulation box) and on the simulation parameters (cut-offs, use of the Particle-Mesh-Ewald[4] (PME) method for long-range electrostatics).

GROMACS prints performance information and statistics at the end of the md.log file, which is helpful in identifying bottlenecks. This section often contains notes on how to further improve the performance.

The simulation performance is typically quantified by the number of nanoseconds of MD-trajectory that can be simulated within a day (ns/day).

Parallel scaling is a measure of how effectively the compute resources are used. It is defined as:

S = pN / ( N * p1 )

where pN is the performance using N CPU cores and p1 is the performance using a single CPU core.

Ideally, the performance increases linearly with the number of CPU cores ("linear scaling"; S = 1).
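
The definition above can be evaluated directly from two benchmark runs. The following sketch uses a hypothetical function name and made-up performance numbers purely for illustration.

```bash
#!/bin/bash
# Hypothetical helper: parallel scaling S = pN / (N * p1).
# Arguments: p1 (ns/day on 1 core), pN (ns/day on N cores), N (number of cores).
parallel_scaling() {
    awk -v p1="$1" -v pN="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", pN / (n * p1) }'
}
```

For instance, if a system reaches 1.5 ns/day on one core and 36 ns/day on 32 cores, parallel_scaling 1.5 36 32 gives 0.75, i.e. the 32 cores deliver 75% of the performance that linear scaling would give.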


MPI processes / Slurm tasks / Domain Decomposition

The most straightforward way to increase performance is to increase the number of MPI processes (called MPI ranks in the GROMACS documentation), which is done by using Slurm's --ntasks or --ntasks-per-node in the job script.

GROMACS uses Domain Decomposition[4] (DD) to distribute the work of solving the non-bonded Particle-Particle (PP) interactions across multiple CPU cores. This is done by effectively cutting the simulation box along the X, Y and/or Z axes into domains and assigning each domain to one MPI process.

This works well until the time needed for communication becomes large relative to the size of the domain (in terms of both the number of particles and the volume). In that case the parallel scaling will drop significantly below 1, and in extreme cases the performance drops when increasing the number of domains.

GROMACS can use Dynamic Load Balancing to shift the boundaries between domains to some extent, in order to avoid certain domains taking significantly longer to solve than others. The mdrun parameter -dlb auto is the default.

Domains cannot be smaller, in any direction, than the longest cut-off radius.
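
This limit gives a quick upper bound on how many domains fit along one box axis. The sketch below is a rough estimate under that single rule only; the function name max_domains is hypothetical, and GROMACS may apply further restrictions when it chooses the actual decomposition.

```bash
#!/bin/bash
# Hypothetical helper: upper bound on the number of domains along one axis.
# Arguments: box length along that axis (nm), longest cut-off radius (nm).
max_domains() {
    # A domain must be at least one cut-off long, so round down box / cut-off.
    awk -v box="$1" -v rc="$2" 'BEGIN { print int(box / rc) }'
}
```

For a box edge of 10 nm and a longest cut-off of 1.2 nm, max_domains 10 1.2 prints 8, so requesting more than 8 domains along that axis cannot work.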


Long-Range Interactions with PME

The Particle-Mesh-Ewald method (PME) is often used to calculate the long-range non-bonded interactions (interactions beyond the cut-off radius). As PME requires global communication, the performance can degrade quickly when many MPI processes are involved that calculate both the short-range (PP) and the long-range (PME) interactions. This is avoided by having dedicated MPI processes that only perform PME (PME ranks).

By default, GROMACS mdrun uses heuristics to dedicate a number of MPI processes to PME when the total number of MPI processes is 12 or greater. The mdrun parameter -npme can be used to select the number of PME ranks manually.

In case there is a significant "Load Imbalance" between the PP and PME ranks (e.g. the PME ranks have more work per timestep than the PP ranks), one can shift work from the PME ranks to the PP ranks by increasing the cut-off radius together with the PME grid spacing. This will not affect the result, as the sum of short-range + long-range forces (or energies) will be the same for a given timestep. Since GROMACS version 4.6, mdrun will attempt to do that automatically unless the mdrun parameter -notunepme is used.

Since GROMACS version 2018, PME can be offloaded to the GPU (see below); however, the implementation as of version 2018.1 still has several limitations[5], among them that only a single GPU rank can be dedicated to PME.


OpenMP threads / CPUs-per-task

Once Domain Decomposition with MPI processes reaches the scaling limit (parallel scaling starts dropping), performance can be further improved by using OpenMP threads to spread the work of an MPI process (rank) over more than one CPU core. To use OpenMP threads, use Slurm's --cpus-per-task parameter in the job script and either set the OMP_NUM_THREADS variable with export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}" (recommended) or use the mdrun parameter -ntomp ${SLURM_CPUS_PER_TASK:-1}.

According to the GROMACS developers, the optimum is usually between 2 and 6 OpenMP threads per MPI process (cpus-per-task).


GPUs

Tips how to use GPUs efficiently will be added soon.

Analyzing Results

Common pitfalls

Links

Biomolecular simulation

References

  1. GROMACS User-Guide: Managing long simulations.
  2. GROMACS Manual page: gmx mdrun
  3. GROMACS User-Guide: Getting good performance from mdrun
  4. GROMACS User-Guide: Performance background information
  5. GROMACS User-Guide: GPU accelerated calculation of PME