AI and Machine Learning
Getting the most out of our clusters for machine learning applications requires special care. A cluster is a complex system that differs greatly from the local machine you use for prototyping. Notably, a cluster uses a distributed filesystem that links many storage devices seamlessly. Accessing a file on /project feels the same as accessing one on the current node, but under the hood these two I/O operations have very different performance implications. In short, you need to choose wisely where to put your data.
The sections below list links relevant to AI practitioners, and good practices to be observed on our clusters.
If you are ready to port your program to run on a Compute Canada cluster, please follow our tutorial.
Python is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to our documentation about Python to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc.
We ask our users to avoid Anaconda and to use virtualenv instead, for the following reasons. Anaconda:
- Tries to handle library management that should be reserved to staff
- Ships binaries that are not optimized for the specific CPU architecture
- Makes wrong assumptions about library locations
- Installs in /home by default, where it puts a large number of files (virtual environments should be created on the compute node whenever possible)
- Is slower at installing packages
- Modifies your .bashrc file, which can cause conflicts
Switching to virtualenv is easy in most cases. Just install all the same packages, except CUDA, cuDNN and other low-level libraries, which are already installed on our clusters.
Useful information about software packages
For installation instructions, common pitfalls, and other useful information, please refer to the page for your machine learning package of choice.
Managing your datasets
Storage and file management
Compute Canada provides a wide range of storage options to cover the needs of our very diverse users. These storage solutions range from high-speed temporary local storage to different kinds of long-term storage, so you can choose the storage medium that best corresponds to your needs and usage patterns. Please refer to our documentation on Storage and file management.
Datasets containing lots of small files (e.g. image datasets)
In machine learning, it is common to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, two problems arise:
- Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
- Your software could be significantly slowed down by streaming lots of small files from /project (or /scratch) to a compute node.
On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to Handling large collections of files.
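As an illustration of the single-file archive idea, here is a minimal sketch using Python's standard tarfile module. The tar format is only one possible choice and is an assumption here, not a recommendation taken from this page. The function names pack_dataset and unpack_dataset are hypothetical, and the paths in the usage note are placeholders.

```python
import tarfile
from pathlib import Path


def pack_dataset(source_dir, archive_path):
    """Pack every file under source_dir into one uncompressed tar archive.

    Returns the number of files added. The archive should live outside
    source_dir so that it is not added to itself.
    """
    source_dir = Path(source_dir)
    count = 0
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset root.
                archive.add(path, arcname=str(path.relative_to(source_dir)))
                count += 1
    return count


def unpack_dataset(archive_path, target_dir):
    """Extract the whole archive into target_dir (e.g. node-local storage)."""
    with tarfile.open(archive_path, "r") as archive:
        archive.extractall(target_dir)
```

A typical pattern would be to call pack_dataset once, keep the resulting archive on /project, and call unpack_dataset at the start of each job so that training reads the small files from storage local to the compute node.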
Long running computations
If your computations are long, you should use checkpointing. For example, if your training time is 3 days, you should split it into three 24-hour chunks. This prevents you from losing all your work in case of an outage, and gives you an edge in terms of priority (more nodes are available for short jobs). Most machine learning libraries natively support checkpointing; the typical case is covered in our tutorial. If your program does not natively support this, we provide a general checkpointing solution.
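To make the idea concrete, here is a minimal, library-agnostic sketch of a resumable training loop. It is not the method from our tutorial: the state is a plain dictionary saved as JSON, the "training" step is a placeholder, and the names load_checkpoint, save_checkpoint and train are hypothetical. A real program would save model and optimizer state using its library's own checkpoint functions.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "loss_history": []}


def save_checkpoint(state, path):
    """Write to a temporary file, then rename, so that an interruption
    during the write never leaves a half-written checkpoint."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def train(total_epochs, checkpoint_path, max_epochs_this_run):
    """Resume from the last checkpoint and run at most max_epochs_this_run."""
    state = load_checkpoint(checkpoint_path)
    stop_at = min(total_epochs, state["epoch"] + max_epochs_this_run)
    while state["epoch"] < stop_at:
        # Placeholder for one epoch of real training.
        state["loss_history"].append(1.0 / (state["epoch"] + 1))
        state["epoch"] += 1
        save_checkpoint(state, checkpoint_path)
    return state
```

Each job then calls train with the same checkpoint path: the first job starts from epoch 0, and every later job picks up where the previous one stopped, which is what allows a long training run to be split into several shorter jobs.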
Running many similar jobs
If you are in one of these situations:
- Hyperparameter search
- Training many variants of the same method
- Running many optimization processes of similar duration