Machine Learning tutorial
This page is a beginner's guide to porting a machine learning job to a Compute Canada cluster.
Step 1: Archiving a data set
The shared storage on Compute Canada clusters is not designed to handle large numbers of small files (it is optimized for very large files). Make sure that the data set you need for training is stored in an archive format such as tar, which you can then transfer to your job's compute node when the job starts. If you do not follow this rule, you risk causing an enormous number of I/O operations on the shared filesystem, leading to performance problems on the cluster for all of its users. If you want to learn more about handling large collections of files, we recommend that you spend some time reading this page.
Assuming that the files which you need are in the directory mydataset:
$ tar cf mydataset.tar mydataset/*
The above command does not compress the data. If you believe that compression is appropriate, you can use tar czf instead.
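Before relying on the archive, it can be worth checking that it holds as many files as the original directory. The following is a minimal Python sketch, not part of the original tutorial; the function names count_files and count_archived_files are hypothetical, and symbolic links are counted differently on each side, so the comparison is only meaningful for data sets made of regular files.

```python
import tarfile
from pathlib import Path


def count_files(directory):
    """Count regular files under a directory, recursively."""
    return sum(1 for p in Path(directory).rglob("*") if p.is_file())


def count_archived_files(archive):
    """Count regular-file members of a tar archive (compressed or not)."""
    with tarfile.open(archive) as tar:
        return sum(1 for member in tar if member.isfile())
```

A call such as count_files("mydataset") == count_archived_files("mydataset.tar") should then hold. In bash's default configuration the pattern mydataset/* does not match hidden files at the top level of the directory, so a mismatch may point to skipped dotfiles.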
Step 2: Preparing your virtual environment
We recommend that you try running your job in an interactive job before submitting it using a script (discussed in the following section). You can diagnose problems more quickly using an interactive job. An example of the command for submitting such a job is:
$ salloc --account=def-someuser --gres=gpu:1 --cpus-per-task=6 --mem=32000M --time=1:00:00
Once the job has started:
- Create and activate a virtual environment in $SLURM_TMPDIR (this variable points to a directory on the local disk, directly attached to the compute node). Do not use Anaconda. For example:
$ virtualenv --no-download $SLURM_TMPDIR/env
$ source $SLURM_TMPDIR/env/bin/activate
- Install the modules that you will need. For TensorFlow, install the module tensorflow_gpu; it is a version of this software optimized for Compute Canada clusters.
- Try to run your program.
- Install any missing modules if necessary. (Since the compute nodes don't have internet access, some packages won't be available. In that case, contact us for support.)
- Create a file requirements.txt in order to recreate this virtual environment:
(env) $ pip freeze > ~/requirements.txt
Now is a good time to verify that your job reads and writes as much as possible on the compute node's local storage ($SLURM_TMPDIR) and as little as possible on the shared filesystems (home, scratch and project).
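One way to keep temporary files off the shared filesystems is to resolve working paths against $SLURM_TMPDIR inside the program itself. The following is a minimal Python sketch under stated assumptions: the helper name local_workdir is hypothetical, and the fallback to the system temporary directory (for runs outside a Slurm job) is an illustrative choice, not something the tutorial prescribes.

```python
import os
import tempfile
from pathlib import Path


def local_workdir(subdir=""):
    """Return a directory on node-local storage, creating it if needed.

    Uses $SLURM_TMPDIR when it is set (inside a job); otherwise falls back
    to the system temporary directory. The fallback is an assumption.
    """
    base = Path(os.environ.get("SLURM_TMPDIR", tempfile.gettempdir()))
    path = base / subdir
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A program could then write its caches and intermediate files under, for example, local_workdir("cache"), and copy only final results and checkpoints to scratch or project.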
Step 3: Preparing your job submission script
You must submit your jobs using a script in conjunction with the sbatch command, so that they can be entirely automated as a batch process. Interactive jobs are just for preparing and debugging your work.
Important elements of an sbatch script
- Account that will be "billed" for the resources used
- Resources required:
  - Number of CPUs, suggestion: 6
  - Number of GPUs, suggestion: 1 (Use a single GPU unless you are certain that your program can use several. By default, TensorFlow and PyTorch use just one GPU.)
  - Amount of memory, suggestion: 32000M
  - Duration (Maximum Béluga: 7 days, Graham and Cedar: 28 days)
- Bash commands:
  - Preparing your environment (modules, virtualenv)
  - Transferring data to the compute node
  - Starting the executable
#!/bin/bash
#SBATCH --gres=gpu:1        # Request GPU "generic resources"
#SBATCH --cpus-per-task=6   # Cores proportional to GPUs: 6 on Cedar, 10 on Béluga, 16 on Graham.
#SBATCH --mem=32000M        # Memory proportional to GPUs: 32000 Cedar, 47000 Béluga, 64000 Graham.
#SBATCH --time=0-03:00      # DD-HH:MM

module load python/3.6 cuda cudnn

SOURCEDIR=~/ml-test

# Prepare virtualenv
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index -r $SOURCEDIR/requirements.txt

# Prepare data
mkdir $SLURM_TMPDIR/data
tar xf ~/projects/def-xxxx/data.tar -C $SLURM_TMPDIR/data

# Start training
python $SOURCEDIR/train.py $SLURM_TMPDIR/data
Checkpointing a long-running job
We recommend that you checkpoint your jobs in 24-hour units. Submitting jobs with short durations makes them more likely to start sooner. By creating a daisy chain of jobs, it is possible to overcome the seven-day limit on Béluga.
- Modify your job submission script (or your program) so that your job can be interrupted and resumed. Your program should be able to access the most recent checkpoint file (see the example script below).
- Determine how many epochs (or iterations) can be carried out in a 24-hour unit.
- Calculate how many of these 24-hour units you will need, rounding up: n_units = n_epochs_total / n_epochs_per_24h
- Use the argument --array 1-<n_units>%1 to ask for a chain of n_units jobs.
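The division above rarely comes out even, so the result should be rounded up to cover the final partial unit. A minimal Python sketch of that arithmetic follows; the function name n_chained_jobs and the figures of 100 epochs in total and 12 epochs per 24 hours are made up for illustration.

```python
def n_chained_jobs(n_epochs_total, n_epochs_per_24h):
    """Number of 24-hour jobs needed, rounded up (integer ceiling division)."""
    if n_epochs_per_24h <= 0:
        raise ValueError("n_epochs_per_24h must be positive")
    return (n_epochs_total + n_epochs_per_24h - 1) // n_epochs_per_24h


# Illustrative numbers only: 100 epochs at 12 epochs per day need 9 jobs.
n_units = n_chained_jobs(100, 12)
print(f"--array=1-{n_units}%1")  # prints --array=1-9%1
```

Integer arithmetic is used instead of floating-point division so that the result stays exact for large iteration counts.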
The job submission script will look like this:
#!/bin/bash
#SBATCH --array=1-10%1   # 10 is the number of jobs in the chain
#SBATCH ...

module load python/3.6 cuda cudnn

# Prepare virtualenv
...

# Prepare data
...

# Get most recent checkpoint
CHECKPOINT_EXT='*.h5'  # Replace by *.pt for PyTorch checkpoints
CHECKPOINTS=~/scratch/checkpoints/ml-test
LAST_CHECKPOINT=$(find $CHECKPOINTS -maxdepth 1 -name "$CHECKPOINT_EXT" -print0 | xargs -r -0 ls -1 -t | head -1)

# Start training
if [ -z "$LAST_CHECKPOINT" ]; then
    # $LAST_CHECKPOINT is null; start from scratch
    python $SOURCEDIR/train.py --write-checkpoints-to $CHECKPOINTS ...
else
    python $SOURCEDIR/train.py --load-checkpoint $LAST_CHECKPOINT --write-checkpoints-to $CHECKPOINTS ...
fi
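On the program side, train.py has to accept the two options used above and resume from the saved state. The following is a framework-neutral Python sketch, not the tutorial's actual train.py. It assumes the state is a small dictionary stored as JSON (so CHECKPOINT_EXT would have to be '*.json' for this toy version), and the --epochs option and all function names are hypothetical. A real program would save model and optimizer state with its framework's own routines.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Resumable training sketch")
    parser.add_argument("--load-checkpoint", type=Path, default=None)
    parser.add_argument("--write-checkpoints-to", type=Path, required=True)
    parser.add_argument("--epochs", type=int, default=10)  # hypothetical option
    return parser.parse_args(argv)


def load_state(path):
    """Return the saved state, or a fresh one when no checkpoint is given."""
    if path is None:
        return {"epoch": 0}
    return json.loads(Path(path).read_text())


def save_state(state, directory):
    """Write one checkpoint per epoch, renaming into place when complete."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"epoch-{state['epoch']:04d}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(final)  # a job killed mid-write leaves only a .tmp file
    return final


def train(args):
    state = load_state(args.load_checkpoint)
    for epoch in range(state["epoch"], args.epochs):
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        save_state(state, args.write_checkpoints_to)
    return state
```

Ending the file with train(parse_args()) would connect it to the command line. Writing to a temporary name and renaming afterwards means the submission script never picks up a half-written checkpoint as the most recent one.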