AI and Machine Learning: Difference between revisions

If your computations are long, you should use checkpointing. For example, if your training takes 3 days, split it into three 24-hour jobs. This prevents you from losing all your work in case of an outage, and gives you an advantage in scheduling priority, since more nodes are available for short jobs. Most machine learning libraries natively support checkpointing; the typical case is covered in our [[Tutoriel_Apprentissage_machine/en#Checkpointing_a_long-running_job|tutorial]]. If your program does not natively support it, we provide a [[Points de contrôle/en|general checkpointing solution]].
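
As a rough illustration of the idea, the following minimal sketch uses only the Python standard library to save progress after every epoch and to resume from the last saved state when the job is resubmitted. The function names (<code>save_checkpoint</code>, <code>load_checkpoint</code>, <code>train</code>), the JSON file format and the dummy "training" step are assumptions made for this example; a real job would save model and optimizer state with its library's own checkpointing tools.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so a job killed mid-write never leaves a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "total": 0.0}


def train(path, total_epochs, epochs_per_job):
    """Run at most epochs_per_job epochs, resuming from the checkpoint."""
    state = load_checkpoint(path)
    stop = min(state["epoch"] + epochs_per_job, total_epochs)
    for epoch in range(state["epoch"], stop):
        state["total"] += epoch  # hypothetical stand-in for one epoch of training
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

Submitting the same script three times with <code>train("ckpt.json", total_epochs=9, epochs_per_job=3)</code> would complete the nine epochs in three short jobs, each one continuing where the previous one stopped.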


<!--T:37-->
For more examples, please see the following sections:


<!--T:38-->
[[PyTorch#Creating_Model_Checkpoints|Checkpointing with PyTorch]]


<!--T:39-->
[[TensorFlow#Creating_Model_Checkpoints|Checkpointing with TensorFlow]]

