Caffe2


General

Caffe2 is a lightweight, modular, and scalable deep learning framework. It aims to provide an easy and straightforward way for you to experiment with deep learning and leverage community contributions of new models and algorithms. There is a home page and a GitHub repository.

Quickstart guide

Installing Caffe2

As of May 2018, Caffe2 is no longer a standalone project; it has been merged into PyTorch. Caffe2 has been included in the Python wheels we build for PyTorch since version 0.4.1, so please follow the instructions on the PyTorch page to install it in your own account. The instructions below are outdated, no longer supported, and will be updated in the future.
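
For example, installing PyTorch (and with it Caffe2) into a virtual environment in your home directory could look like the following. This is a minimal sketch; the python module version and the wheel name are assumptions, so check the PyTorch page for the currently recommended procedure.

$ module load python/3.6
$ virtualenv --no-download ~/pytorch_env
$ source ~/pytorch_env/bin/activate
$ pip install --no-index torch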

Environment module

Outdated

Some information in this section has been marked as outdated. Our team is aware of this and is working to update the documentation. In the meantime, the information in this section should not be considered current or authoritative.



The following Caffe2 version is installed centrally:

  • caffe2/0.8.1

It was compiled with the GCC 5.4.0 compilers, together with the Python 2.7, OpenCV 2.4.13.3, CUDA 8.0.44 and cuDNN 7.0 libraries. In order to load the Caffe2 module, the gcc, cuda and cudnn modules need to be loaded first:

$ module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
$ module load caffe2/0.8.1
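
Once the modules are loaded, a quick sanity check that Caffe2 imports correctly and detects the GPUs can be run from within a job on a GPU node (not on a login node); this is only a suggested check, not part of the official instructions:

$ python -c "from caffe2.python import workspace; print(workspace.NumCudaDevices())"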

For more information on environment modules, please refer to the Using modules page.

Submission scripts

The following instructions use the Distributed Training example. The resnet50_trainer.py script can be found on GitHub. Please refer to the page Running jobs if you want more information about using the Slurm workload manager.

Single-GPU job

Here is an example job submission script for a single-GPU job. See Using GPUs with Slurm for guidance on choosing a value for --cpus-per-task appropriate to the GPU nodes you are using.

File : caffe2_single-gpu_job.sh

#!/bin/bash
#SBATCH --mem=30g
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6   # 6 per GPU at Cedar, 16 at Graham.
#SBATCH --gres=gpu:1

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0 --batch_size=32 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs
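
Once saved, the script is submitted in the usual way, assuming the training data has already been placed under /scratch/$USER:

$ sbatch caffe2_single-gpu_job.sh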


Single-node multi-GPU job

Here is a job submission script for a single-node, multi-GPU job. Cedar's large GPU node type, equipped with four P100-PCIE-16GB GPUs, is highly recommended for multi-GPU Caffe2 jobs.

File : caffe2_single-node_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g           # Use up to 120g for Cedar's base GPU nodes and Graham's GPU nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24   # 6 per GPU at Cedar, 16 per GPU at Graham.
#SBATCH --gres=gpu:lgpu:4    # Use =gpu:4 for Cedar's base GPU node, =gpu:2 for Graham's GPU node

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs


Distributed multi-GPU job

Due to a limitation in the Gloo library, running Caffe2 on multiple nodes requires all GPUs to have direct peer-to-peer (P2P) access to one another. Currently, only Cedar's large GPU nodes meet this requirement. The program must be launched on all nodes with srun, and the --gres flag must be passed to srun as well.
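
To see whether the GPUs on a node have direct P2P connectivity with one another, you can inspect the topology matrix reported by the driver from within a job on that node; this is only a suggested check:

$ nvidia-smi topo -m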


File : caffe2_distributed_multi-gpu_job.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=/scratch/$USER/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs


I/O considerations

If loading the data takes more time than the computation, the node's local solid-state disk (SSD) should be used to store the dataset temporarily and speed up loading. For each job, copy the data from the /project or /scratch space to $SLURM_TMPDIR, which resides on the local SSD. For a multi-node job, you must use srun so that the data is copied to every node. The data in $SLURM_TMPDIR is deleted when the job finishes. The SSD capacity is 800GB on Cedar GPU nodes and 1.6TB on Graham GPU nodes.
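
To confirm where the local SSD is mounted and how much space is available, you can check $SLURM_TMPDIR from within a job (the variable only exists inside a running job):

$ df -h $SLURM_TMPDIR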

File : caffe2_distributed_multi-gpu_job_ssd.sh

#!/bin/bash
#SBATCH --mem=250g
#SBATCH --nodes=2              # number of nodes to use
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:lgpu:4

module load gcc/5.4.0 cuda/8.0.44 cudnn/7.0
module load caffe2/0.8.1

srun rsync -rl /scratch/$USER/ilsvrc12_train_lmdb $SLURM_TMPDIR/

srun --gres=gpu:lgpu:4 python resnet50_trainer.py --train_data=$SLURM_TMPDIR/ilsvrc12_train_lmdb/ --gpus=0,1,2,3 --batch_size=128 --num_shards=$SLURM_JOB_NUM_NODES --run_id=$SLURM_JOBID --num_epochs=70 --epoch_size=1200000 --file_store_path=/scratch/$USER/resnet50/nfs