Caffe2

General

Caffe2 is a lightweight, modular, and scalable deep learning framework. It aims to provide an easy and straightforward way for you to experiment with deep learning and leverage community contributions of new models and algorithms. There is a home page and a GitHub repository.

Quickstart guide

Environment module

The following Caffe2 version is installed centrally:

  • caffe2/0.8.1

It was compiled with the GNU Compiler Collection (GCC) 5.4.0 compilers, together with the Python 2.7, OpenCV 2.4.13.3, CUDA 8.0.44 and cuDNN 7.0 libraries. In order to load the Caffe2 module, the gcc, cuda and cudnn modules need to be loaded first:

$ module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
$ module load caffe2/0.8.1

For more information on environment modules, please refer to the Using modules page.
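To confirm that the stack works, a quick check (not part of the original instructions) is to import Caffe2 and count the visible GPUs from Python; this assumes the workspace API of Caffe2 0.8.x and should be run on a GPU node (for example in an interactive job), since a login node will report zero devices:

$ module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
$ module load caffe2/0.8.1
$ python -c "from caffe2.python import core, workspace; print(workspace.NumCudaDevices())"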

Submission scripts

The following instructions use the Distributed Training example. The resnet50_trainer.py script can be found on GitHub. Please refer to the page Running jobs if you want more information about using the Slurm workload manager.
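If you need a local copy of the script, one way to obtain it (assuming it still sits at its historical location, caffe2/python/examples/, in the Caffe2 source tree; the project has since been merged into PyTorch, so the path may have moved) is:

$ git clone --depth 1 https://github.com/caffe2/caffe2.git
$ cp caffe2/caffe2/python/examples/resnet50_trainer.py .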

Single-GPU job

Here is an example job submission script for a single-GPU job. See Using GPUs with Slurm for guidance on choosing a value for --cpus-per-task appropriate to the GPU nodes you are using.

File : caffe2_single-gpu_job.sh

#!/bin/bash
#SBATCH --mem=30g                  
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6   # 6 per GPU at Cedar, 16 at Graham.
#SBATCH --gres=gpu:1

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0 --batch_size=32 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
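Once saved, the script is submitted with sbatch; note that the --train_data path (an ImageNet LMDB under /scratch) is only an example and should point at your own dataset:

$ sbatch caffe2_single-gpu_job.sh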


Single-node multi-GPU job

Here is a job submission script for a single-node, multi-GPU job. Cedar's large GPU node type, equipped with four P100-PCIE-16GB GPUs, is highly recommended for multi-GPU Caffe2 jobs.

File : caffe2_single-node_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g           # Use up to 120g on Cedar's base GPU nodes and Graham's GPU nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24   # 6 per GPU at Cedar, 16 per GPU at Graham.
#SBATCH --gres=gpu:lgpu:4    # Use =gpu:4 for Cedar's base GPU node, =gpu:2 for Graham's GPU node

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
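If you want one script to serve nodes with different GPU counts, a possible variation (not part of the original example) is to derive the --gpus list from the devices Slurm actually assigns; this sketch assumes Slurm exports CUDA_VISIBLE_DEVICES inside the job and keeps the 32-images-per-GPU batch size used above:

# Count the allocated GPUs and build a device list such as "0,1,2,3"
NUM_GPUS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
GPU_LIST=$(seq -s, 0 $((NUM_GPUS - 1)))
python resnet50_trainer.py --gpus=$GPU_LIST --batch_size=$((32 * NUM_GPUS))   # other options as in the script above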


Distributed multi-GPU job

Due to a limitation in the Gloo library, running Caffe2 on multiple nodes requires all GPUs to have direct peer-to-peer (P2P) access. Currently, only Cedar's large GPU nodes meet this requirement. The program must be launched on all nodes with srun, and the --gres flag must be passed to srun as well.


File : caffe2_distributed_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
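The directory passed as --file_store_path acts as the rendezvous point shared by the nodes, so it must be on a filesystem visible to all of them (here /scratch). If the trainer does not create it itself, adding a line such as the following before the srun call will do so (an addition, not part of the original example):

mkdir -p /scratch/$USER/resnet50/nfs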


I/O considerations

If loading the data takes more time than the computation, the local solid-state disk (SSD) should be used to store the dataset temporarily and speed up loading. For each job, copy the data from the /project or /scratch space to $SLURM_TMPDIR, which resides on the SSD. For a multi-node job, you must use srun so that the data is copied to every node. The data in $SLURM_TMPDIR is deleted after the job finishes. The SSD capacity is 800GB on Cedar GPU nodes and 1.6TB on Graham GPU nodes.

File : caffe2_distributed_multi-gpu_job_ssd.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun rsync -rl /scratch/$USER/ilsvrc12_train_lmdb $SLURM_TMPDIR/

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=$SLURM_TMPDIR/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
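Whether copying to the SSD pays off depends on how long the transfer takes relative to the job itself; one simple check (not in the original script) is to time the copy so the duration shows up in the job's output:

time srun rsync -rl /scratch/$USER/ilsvrc12_train_lmdb $SLURM_TMPDIR/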