AI and Machine Learning

The sections below list links relevant to AI practitioners, and good practices to be observed on our clusters.
== Tutorial ==


If you are ready to port your program for use on a Compute Canada cluster, please follow [[Tutoriel_Apprentissage_machine/en|our tutorial]].


== Python == <!--T:3-->


<!--T:4-->
Python is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to [[Python|our documentation about Python]] to get important information about Python versions, virtual environments on login or on compute nodes, <tt>multiprocessing</tt>, Anaconda, Jupyter, etc.


=== Avoid Anaconda === <!--T:21-->


<!--T:22-->
'''Switching to virtualenv is easy in most cases. Just install all the same packages, except CUDA, CuDNN and other low-level libraries, which are already installed on our clusters.'''
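As an illustration, the switch might look like the sketch below. This is a minimal example, not cluster-specific instructions: it assumes <tt>python3</tt> is already on your path (on our clusters you would typically load a Python module first), and the environment name <tt>ml-env</tt> is arbitrary.

```shell
# Minimal sketch: create and use a virtual environment instead of Anaconda.
# Assumption: python3 is available (e.g. after loading a Python module).
python3 -m venv "$HOME/ml-env"             # create the environment
source "$HOME/ml-env/bin/activate"         # activate it for this shell
# pip install <your packages>              # everything except CUDA, CuDNN and
                                           # other low-level libraries, which
                                           # are already on the clusters
python -c "import sys; print(sys.prefix)"  # confirm the venv's Python is active
deactivate                                 # leave the environment
```

In a job script you would repeat the <tt>source .../activate</tt> step, since each job starts a fresh shell.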


== Useful information about software packages == <!--T:5-->


<!--T:6-->
* [[Caffe2]]


== Managing your datasets == <!--T:8-->


=== Storage and file management === <!--T:9-->


<!--T:10-->
Compute Canada provides a wide range of storage options to cover the needs of our very diverse users. These storage solutions range from high-speed temporary local storage to different kinds of long-term storage, so you can choose the storage medium that best corresponds to your needs and usage patterns. Please refer to our documentation on [[Storage and file management]].


=== Datasets containing lots of small files (e.g. image datasets) === <!--T:11-->


<!--T:12-->
Please see our advice on [[Handling large collections of files]].
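For example, one common pattern (sketched below as an illustration) is to keep the dataset as a single archive on shared storage and extract it once to node-local scratch at the start of the job. Here <tt>$SLURM_TMPDIR</tt> stands for the job's local scratch directory, with <tt>/tmp</tt> as a stand-in when run off-cluster; the file names are placeholders.

```shell
# Sketch: store many small files as one archive, unpack to local disk per job.
mkdir -p dataset
touch dataset/img_1.png dataset/img_2.png dataset/img_3.png  # stand-in dataset
tar -cf dataset.tar dataset            # one large file instead of many small ones
DEST="${SLURM_TMPDIR:-/tmp}/job_data"  # node-local scratch; /tmp as fallback
mkdir -p "$DEST"
tar -xf dataset.tar -C "$DEST"         # extract once, then read files locally
ls "$DEST/dataset"                     # training code reads from here
```

Reading the extracted copy from local disk avoids hammering the shared filesystem with many small-file operations.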


== Long running computations == <!--T:15-->


<!--T:16-->
If your computations are long, you should use checkpointing. For example, if your training time is 3 days, you could split it into three chunks of 24 hours each. This would prevent you from losing all the work in case of an outage, and would give you an edge in terms of priority (more nodes are available for short jobs). Most machine learning libraries natively support checkpointing. Please see our suggestions about [[Running jobs#Resubmitting_jobs_for_long_running_computations|resubmitting jobs for long running computations]]. If your program does not natively support this, we provide a [[Points de contrôle/en|general checkpointing solution]].
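The idea can be sketched in a few lines of shell. This is a toy illustration, not the general checkpointing solution linked above: the loop stands in for your training chunks, and the checkpoint file lets a resubmitted job continue from where the previous one stopped.

```shell
# Toy sketch of checkpoint-and-resume: each run continues from the last
# recorded step, so an interrupted or resubmitted job loses no earlier work.
CKPT=progress.ckpt
step=$(cat "$CKPT" 2>/dev/null || echo 0)  # resume from checkpoint, else start at 0
TOTAL=5                                    # stands in for the full training length
while [ "$step" -lt "$TOTAL" ]; do
    step=$((step + 1))                     # placeholder for one chunk of real work
    echo "$step" > "$CKPT"                 # record progress after every chunk
done
echo "finished at step $(cat "$CKPT")"
```

In practice the "step" would be whatever state your library saves (model weights, optimizer state, epoch number), and each 24-hour job would run one slice of this loop before resubmitting itself.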


== Running many similar jobs == <!--T:17-->


<!--T:18-->