Caffe2

General

Caffe2 is a lightweight, modular, and scalable deep learning framework. Building on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind. See the project homepage and the GitHub repository for more information.

Quickstart Guide

This section summarizes configuration details.

Environment Modules

The following Caffe2 versions have been installed centrally:

  • caffe2/0.8.1

They have been compiled with the GCC 5.4.0 compilers, together with Python 2.7, OpenCV 2.4.13.3, CUDA 8.0.44 and cuDNN 7.0 libraries. In order to load the Caffe2 module, the gcc, cuda and cudnn modules need to be loaded first:

$ module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
$ module load caffe2/0.8.1
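
A quick sanity check (a suggestion, not part of the official instructions) is to import Caffe2 from Python and count the visible GPUs; on a login node without GPUs the count will simply be 0:

$ python -c 'from caffe2.python import core' 2>/dev/null && echo "Caffe2 OK" || echo "Caffe2 failed to import"
$ python -c 'from caffe2.python import workspace; print(workspace.NumCudaDevices())'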

For more information on Environment Modules, please refer to the Using modules page.

Submission Scripts

The instructions below use the distributed training example resnet50_trainer.py; the code can be found on GitHub. Please refer to the Running jobs page for help on using the SLURM workload manager.
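
One way to obtain the example script is to clone the Caffe2 repository; the path below reflects the Caffe2 0.8.x source layout and may have changed since:

$ git clone https://github.com/caffe2/caffe2.git
$ cp caffe2/caffe2/python/examples/resnet50_trainer.py .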

Single-GPU Job

A Caffe2 single-GPU job can be run on all GPU node types on Cedar and Graham.

File : caffe2_single-gpu_job.sh

#!/bin/bash
#SBATCH --mem=30g                  
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6  # Caffe2 is threaded. Use up to 6 per GPU on Cedar and up to 16 per GPU on Graham.
#SBATCH --gres=gpu:1

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0 --batch_size=32 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
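
The script is then submitted with sbatch:

$ sbatch caffe2_single-gpu_job.sh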


Single-node Multi-GPU Job

A Caffe2 single-node multi-GPU job can be run on all GPU node types on Cedar and Graham. Cedar's large GPU node type, which is equipped with 4 x P100-PCIE-16GB GPUs with GPUDirect P2P enabled between each pair, is highly recommended for multi-GPU Caffe2 jobs.

File : caffe2_single-node_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g             # Use up to 120g for Cedar base and Graham GPU node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24     # Caffe2 is threaded. Use up to 6 per GPU on Cedar and up to 16 per GPU on Graham.
#SBATCH --gres=gpu:lgpu:4      # Use --gres=gpu:4 for Cedar's base GPU node,  --gres=gpu:2 for Graham's GPU node

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
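
To see whether the GPUs assigned to a job can in fact reach each other over P2P, the interconnect topology can be printed from inside the job by adding the following line to the script (an optional check, not required by Caffe2):

nvidia-smi topo -m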


Distributed Multi-GPU Job

Due to a limitation in the Gloo library, Caffe2 multi-node training requires all GPUs to have direct P2P access. Currently, only Cedar's large GPU nodes meet this requirement. The program needs to be launched on all nodes with srun, and the --gres flag must also be passed to srun.


File : caffe2_distributed_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
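
The distributed script is submitted the same way; once it starts, squeue shows which nodes were assigned:

$ sbatch caffe2_distributed_multi-gpu_job.sh
$ squeue -u $USER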


I/O consideration

If loading the data takes longer than the computation, the node-local SSD can be used to temporarily host the dataset. Users need to copy the data from the /project or /scratch space to $SLURM_TMPDIR, which resides on the SSD, at the start of each job. For multi-node jobs, srun is needed to copy the data to every node. The data in $SLURM_TMPDIR is deleted when the job finishes. (The SSD capacity is 800GB on Cedar GPU nodes and 1.6TB on Graham GPU nodes.)

File : caffe2_distributed_multi-gpu_job_ssd.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun rsync -rl /scratch/$USER/ilsvrc12_train_lmdb $SLURM_TMPDIR/

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=$SLURM_TMPDIR/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
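
To verify that the dataset actually landed on every node's local SSD, a line such as the following can be added after the rsync step (optional; shown only as an illustration):

srun du -sh $SLURM_TMPDIR/ilsvrc12_train_lmdb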