AlphaFold



This article is a draft

This is not a complete article: This is a Draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.



AlphaFold is a machine-learning model for the prediction of protein folding.

This page discusses how to use AlphaFold v2.0, the version that was entered in CASP14 and published in Nature.

Source code and documentation for AlphaFold can be found at their GitHub page. Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper.

Usage in Compute Canada systems

The AlphaFold documentation explains how to run the software using Docker. Compute Canada does not provide Docker, but provides Singularity instead. Using AlphaFold with Singularity is described at the end of this page, but we recommend that you instead use a virtual environment and a Python wheel available from the Compute Canada "wheelhouse".

AlphaFold in Python environment

1. AlphaFold has a number of dependencies that need to be loaded first. These include CUDA, kalign, hmmer, and openmm, all of which are available in the Compute Canada software stack. Load these modules like so:

[name@cluster ~]$ module load gcc/9 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

2. Clone the AlphaFold repository in $SCRATCH:

[name@cluster ~]$ cd $SCRATCH
[name@cluster ~]$ git clone https://github.com/deepmind/alphafold.git
[name@cluster ~]$ wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt -P alphafold/common/

3. Create a Python virtual environment and activate it:

[name@cluster ~]$ virtualenv --no-download ~/my_env && source ~/my_env/bin/activate

4. Install AlphaFold and its dependencies:

(my_env)[name@cluster ~]$ pip install --no-index pdbfixer alphafold

Now AlphaFold is ready to be used. Note that to use AlphaFold outside a container, you need to use the run_alphafold.py script that is provided in the repository.
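
Before running run_alphafold.py it can be useful to confirm that the external programs it calls are on your PATH once the modules are loaded. The following is a minimal sketch, not part of AlphaFold or of the Compute Canada software stack; the function name check_tools is hypothetical.

```bash
#!/bin/bash
# check_tools (hypothetical helper): report which of the named programs
# can be found on PATH. Returns 0 when all are found, 1 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, to be run after "module load ..." and after activating
# the virtual environment:
#   check_tools jackhmmer hhblits hhsearch kalign
```

A non-zero return value suggests that a module was not loaded in the current shell.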

Creating the virtual environment in the job script

As discussed on the Python page, your job may run faster if you create the virtual environment on node-local storage during the job. If you do so, your job script should look something like this:


File : my_alphafoldjob.sh

#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-08:00         # adjust this to match the walltime of your job
#SBATCH --nodes=1      
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1           # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --cpus-per-task=8      # adjust this if you are using parallel commands
#SBATCH --mem=4000             # adjust this according to the memory you need

# Load your modules as before
module load gcc/9 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

cd $SCRATCH 

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/my_env && source ${SLURM_TMPDIR}/my_env/bin/activate

# Install alphafold and dependencies
pip install --no-index scipy==1.4.1 pdbfixer alphafold --upgrade

# Run your commands
python $SCRATCH/alphafold/run_alphafold.py --help


Databases

AlphaFold requires a set of datasets/databases that need to be downloaded into $SCRATCH. We also prefer that you avoid using aria2c, which the download scripts call; the steps below replace it with wget.

Important: The databases must reside in $SCRATCH.

1. Move to the AlphaFold repository and create a folder for the data:

[name@cluster ~]$ cd $SCRATCH/alphafold
[name@cluster ~]$ mkdir data

2. Replace aria2c with wget in all the download scripts with the following command, run from the top of the repository:

[name@cluster alphafold]$ sed -i -e 's/aria2c/wget/g' -e 's/--dir=/-P /g' -e 's/--preserve-permissions//g' scripts/*.sh

3. Use the scripts to download the data:

[name@cluster ~]$ bash scripts/download_all_data.sh $SCRATCH/alphafold/data
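
To see what the substitution in step 2 does before applying it, the same sed expressions can be tried on a single line. The sample line below is an assumption, modelled on the kind of aria2c call the download scripts contain; it is not copied from a specific script, and no file is modified.

```bash
#!/bin/bash
# Hypothetical sample line, modelled on an aria2c call in a download script.
# Single quotes keep the ${...} text literal.
sample='aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"'

# Apply the same three substitutions to the sample only.
rewritten=$(printf '%s\n' "$sample" |
    sed -e 's/aria2c/wget/g' -e 's/--dir=/-P /g' -e 's/--preserve-permissions//g')

printf '%s\n' "$rewritten"
# prints: wget "${SOURCE_URL}" -P "${ROOT_DIR}"
```

The -P option of wget sets the directory prefix for downloaded files, which plays the role that --dir plays for aria2c.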

Note that this might take a while and SHOULD NOT BE DONE ON THE COMPUTE NODES. Instead, use the data transfer nodes or the login nodes. Because the download takes a long time, we recommend running it in a screen or tmux session. If the path to your download directory is stored in $DOWNLOAD_DIR, the structure of your data should be:

$DOWNLOAD_DIR/                             # Total: ~ 2.2 TB (download: 428 GB)
    bfd/                                   # ~ 1.8 TB (download: 271.6 GB)
        # 6 files.
    mgnify/                                # ~ 64 GB (download: 32.9 GB)
        mgy_clusters.fa
    params/                                # ~ 3.5 GB (download: 3.5 GB)
        # 5 CASP14 models,
        # 5 pTM models,
        # LICENSE,
        # = 11 files.
    pdb70/                                 # ~ 56 GB (download: 19.5 GB)
        # 9 files.
    pdb_mmcif/                             # ~ 206 GB (download: 46 GB)
        mmcif_files/
            # About 180,000 .cif files.
        obsolete.dat
    uniclust30/                            # ~ 87 GB (download: 24.9 GB)
        uniclust30_2018_08/
            # 13 files.
    uniref90/                              # ~ 59 GB (download: 29.7 GB)
        uniref90.fasta

This layout is important because the database paths passed to AlphaFold on the command line must match it.
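
A quick way to confirm that a download finished with the layout shown above is to test for the main entries. This is a minimal sketch under the assumption that the directory names match the listing; check_db_layout is a hypothetical helper, and it does not verify file sizes or checksums.

```bash
#!/bin/bash
# check_db_layout DIR (hypothetical helper): verify that the main entries
# of the listing above exist under DIR. Prints each missing entry and
# returns 1 if any is missing, 0 otherwise.
check_db_layout() {
    local root=$1
    local status=0
    local entry
    for entry in bfd mgnify/mgy_clusters.fa params pdb70 \
                 pdb_mmcif/mmcif_files pdb_mmcif/obsolete.dat \
                 uniclust30/uniclust30_2018_08 uniref90/uniref90.fasta; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry"
            status=1
        fi
    done
    return "$status"
}

# Example call:
#   check_db_layout "$SCRATCH/alphafold/data"
```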

Running AlphaFold

AlphaFold2 has the number of CPUs hardcoded. Please do not request any number other than 8, as this is the number of CPUs it requires.
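
If you want a job to stop early when it was submitted with a different CPU count, a guard along these lines could be placed near the top of the job script. This is a sketch only; require_cpus is a hypothetical helper, and it relies on the SLURM_CPUS_PER_TASK variable that Slurm sets when --cpus-per-task is given.

```bash
#!/bin/bash
# require_cpus EXPECTED (hypothetical helper): return 1 unless Slurm
# allocated exactly EXPECTED CPUs per task.
require_cpus() {
    local expected=$1
    local actual=${SLURM_CPUS_PER_TASK:-unset}
    if [ "$actual" != "$expected" ]; then
        echo "expected --cpus-per-task=$expected, got $actual" >&2
        return 1
    fi
}

# Example use inside a job script:
#   require_cpus 8 || exit 1
```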

Once you have everything set up, you can launch a production run of AlphaFold with a job script like this:


File : my_alphafoldjob.sh

#!/bin/bash
#SBATCH --job-name=alphafold_run
#SBATCH --account=def-someprof # adjust this to match the accounting group you are using to submit jobs
#SBATCH --time=0-12:00:00      # adjust this to match the walltime of your job
#SBATCH --nodes=1      
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1           # You need to request one GPU to be able to run AlphaFold properly
#SBATCH --cpus-per-task=8      # DO NOT INCREASE THIS AS ALPHAFOLD CANNOT TAKE ADVANTAGE OF MORE
#SBATCH --mem=32G              # adjust this according to the memory requirement per node you need
#SBATCH --mail-user=you@youruniversity.ca # adjust this to match your email address
#SBATCH --mail-type=ALL

# Set the path to download dir
DOWNLOAD_DIR=$SCRATCH/alphafold/data  # Set the appropriate path to your downloaded data
INPUT_DIR=$SCRATCH/alphafold/input     # Set the appropriate path to your supporting data
REPO_DIR=$SCRATCH/alphafold # Set the appropriate path to AlphaFold's cloned repo

# Load your modules as before
module load gcc/9 openmpi/4.0.3 cuda/11.2.2 cudnn/8.2.0 hdf5 kalign/2.03 hmmer/3.2.1 openmm-alphafold/7.5.1 hh-suite/3.3.0 python/3.8

cd $SCRATCH # Set the appropriate folder where the repo is contained

# Generate your virtual environment in $SLURM_TMPDIR
virtualenv --no-download ${SLURM_TMPDIR}/my_env && source ${SLURM_TMPDIR}/my_env/bin/activate

# Install alphafold and dependencies
pip install --no-index scipy==1.4.1 pdbfixer alphafold --upgrade

# Run your commands
python ${REPO_DIR}/run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --fasta_paths=${INPUT_DIR}/YourSequence.fasta,${INPUT_DIR}/AnotherSequence.fasta \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --pdb70_database_path=${DOWNLOAD_DIR}/pdb70/pdb70 \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --uniclust30_database_path=${DOWNLOAD_DIR}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
   --hhblits_binary_path=${EBROOTHHMINSUITE}/bin/hhblits \
   --hhsearch_binary_path=${EBROOTHHMINSUITE}/bin/hhsearch \
   --jackhmmer_binary_path=${EBROOTHMMER}/bin/jackhmmer \
   --kalign_binary_path=${EBROOTKALIGN}/bin/kalign \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters.fa \
   --model_names=model_1,model_2,model_3,model_4,model_5 \
   --output_dir=${SCRATCH}/alphafold_output \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --max_template_date=2020-05-14 \
   --preset=casp14
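
The --fasta_paths option takes a comma-separated list of files. When the input directory holds many sequences, the list can be built rather than typed. The following is a minimal sketch; join_fasta_paths is a hypothetical helper and it assumes that every input file ends in .fasta.

```bash
#!/bin/bash
# join_fasta_paths DIR (hypothetical helper): print the *.fasta files
# found in DIR as a single comma-separated line.
join_fasta_paths() {
    local dir=$1
    local joined=""
    local f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
        joined="${joined:+$joined,}$f"
    done
    printf '%s\n' "$joined"
}

# Example use in the job script:
#   --fasta_paths="$(join_fasta_paths "${INPUT_DIR}")"
```

The files are listed in the shell's glob order, which is normally alphabetical.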


Using singularity

If you want to try the containerized version (NOT our preferred option), you can build a Singularity container:

[name@cluster ~]$ module load singularity
[name@cluster ~]$ singularity build alphaFold.sif docker://uvarc/alphafold:2.0.0

Before building or running the container, check our Singularity documentation, as each system has particularities that need to be taken into account.

Running AlphaFold within Singularity

Here is an example of running the containerized version of AlphaFold2 on a given protein sequence. The protein sequence is saved in FASTA format, as shown below:

[name@cluster ~]$ cat input.fasta
>5ZE6_1
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
WREALIGLAHIAVQRDR
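
Because resource needs depend on protein size, it can help to check the length of the sequence before choosing job parameters. The following is a minimal sketch; fasta_length is a hypothetical helper that counts every non-header character, so for a file with several records it reports their combined length.

```bash
#!/bin/bash
# fasta_length FILE (hypothetical helper): print the number of residues
# in FILE, ignoring header lines (those starting with ">") and whitespace.
fasta_length() {
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Example call:
#   fasta_length input.fasta
```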

The reference databases and models were downloaded to predict the structure of the above protein sequence.

[name@cluster ~]$ tree databases/
databases/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters.fa
├── params
│   ├── LICENSE
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   │   ├── 100d.cif
│   │   ├── 101d.cif
│   │   ├── 101m.cif
│   │   ├── ...
│   │   ├── ...
│   │   ├── 9wga.cif
│   │   ├── 9xia.cif
│   │   └── 9xim.cif
│   └── obsolete.dat
├── uniclust30
│   └── uniclust30_2018_08
│       ├── uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m_db.index
│       ├── uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m.ffindex
│       ├── uniclust30_2018_08.cs219
│       ├── uniclust30_2018_08_cs219.ffdata
│       ├── uniclust30_2018_08_cs219.ffindex
│       ├── uniclust30_2018_08.cs219.sizes
│       ├── uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm_db.index
│       ├── uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm.ffindex
│       └── uniclust30_2018_08_md5sum
└── uniref90
    └── uniref90.fasta

Suppose we want to run AlphaFold2 from the directory scratch/run_alphafold2:

[name@cluster ~]$ cd scratch/run_alphafold2
[name@cluster run_alphafold2]$ mkdir alphafold_output # create directory for the output files
[name@cluster run_alphafold2]$ ls # list the directory contents to ensure the singularity image file (.sif) is available
alphafold_output alphaFold.sif input.fasta

AlphaFold2 launches a couple of multithreaded analyses using up to 8 CPUs before running model inference on the GPU. Memory requirements vary with the size of the protein. The following job script was created for the protein sequence above:

#!/bin/bash
#SBATCH --job-name alphafold-run
#SBATCH --account=def-someuser
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=20G

#set the environment PATH
export PYTHONNOUSERSITE=True
module load singularity
ALPHAFOLD_DATA_PATH=/path/to/alphafold/databases
ALPHAFOLD_MODELS=/path/to/alphafold/databases/params

#Run the command
singularity run --nv \
 -B $ALPHAFOLD_DATA_PATH:/data \
 -B $ALPHAFOLD_MODELS \
 -B .:/etc \
 --pwd  /app/alphafold alphaFold.sif \
 --fasta_paths=input.fasta  \
 --uniref90_database_path=/data/uniref90/uniref90.fasta  \
 --data_dir=/data \
 --mgnify_database_path=/data/mgnify/mgy_clusters.fa   \
 --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --pdb70_database_path=/data/pdb70/pdb70  \
 --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
 --max_template_date=2020-05-14   \
 --output_dir=alphafold_output  \
 --model_names='model_1' \
 --preset=casp14

The option -B .:/etc bind-mounts the current working directory to /etc inside the container, for the cache file ld.so.cache. The --nv flag enables GPU support. Submit this job script (alpharun_jobscript.sh) using the Slurm sbatch command:

[name@cluster run_alphafold2]$ sbatch alpharun_jobscript.sh 

On successful completion, the output directory should contain the following files:

[name@cluster run_alphafold2]$ tree alphafold_output
alphafold_output
└── input
   ├── features.pkl
   ├── msas
   │   ├── bfd_uniclust_hits.a3m
   │   ├── mgnify_hits.sto
   │   └── uniref90_hits.sto
   ├── ranked_0.pdb
   ├── ranking_debug.json
   ├── relaxed_model_1.pdb
   ├── result_model_1.pkl
   ├── timings.json
   └── unrelaxed_model_1.pdb
2 directories, 10 files