AI and Machine Learning
To get the most out of our clusters for machine learning applications, special care must be taken. A cluster is a complicated beast that is very different from your local machine that you use for prototyping. Notably, a cluster uses a distributed filesystem, linking many storage devices seamlessly. Accessing a file on /project feels the same as accessing one from the current node; but under the hood, these two IO operations have very different performance implications. In short, you need to choose wisely where to put your data.
The sections below are a starting point for machine learning practitioners looking for solutions, or just getting started working with our clusters.
Python is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to our documentation about Python to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc.
Useful information about software packages
Please refer to the page of your machine learning package of choice for useful information about how to install, common pitfalls, etc.:
Managing your datasets
Storage and file management
Compute Canada provides a wide range of storage options to cover the needs of our very diverse users. These storage solutions range from high-speed temporary local storage to different kinds of long-term storage, so you can choose the storage medium that best corresponds to your needs and usage patterns. Please refer to our documentation on Storage and file management.
Datasets containing lots of small files (e.g. image datasets)
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise:
- Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
- Your software could become be significantly slowed down from streaming lots of small files from /project (or /scratch) to a compute node.
On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to Handling large collections of files.
Long running computations
If your computations are long, you should use checkpointing. For example, if your training time is 3 days, you could split it in 3 chunks of 24 hours. This would prevent you from losing all the work in case of an outage, and would give you an edge in terms of priority (more nodes are available for short jobs). Most machine learning libraries natively support checkpointing. Please see our suggestions about resubmitting jobs for long running computations. If your program does not natively support this, we provide a general checkpointing solution.
Running many similar jobs
If you are in one of these situations:
- Hyperparameter search
- Training many variants of the same method
- Running many optimization processes of similar duration