Frequently Asked Questions: Difference between revisions

explain "Nodes required for job are DOWN, DRAINED or RESERVED ..."
* <tt>Resources</tt>: The cluster is simply very busy. You will have to be patient, or consider whether you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).
* <tt>Priority</tt>: Your job is waiting to start due to its lower priority. This is because you and other members of your research group have consumed more than your fair share of the cluster resources in the recent past, something you can track using the command <tt>sshare</tt> as explained in [[Job scheduling policies]]. The <tt>LevelFS</tt> column tells you about your over- or under-consumption of cluster resources: when <tt>LevelFS</tt> is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you over-consume resources, the closer the value gets to zero and the lower the priority of your pending jobs. There is a memory effect in this calculation, so the scheduler gradually "forgets" any over- or under-consumption of resources from months past. Finally, note that the value of <tt>LevelFS</tt> is specific to each cluster.
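
As a rough illustration of how to read the <tt>LevelFS</tt> value described above, the following Python sketch classifies a number copied by hand from the <tt>sshare</tt> output. The function name <tt>interpret_levelfs</tt> and the strings it returns are hypothetical, chosen only for this example; they are not part of Slurm.

```python
def interpret_levelfs(level_fs: float) -> str:
    """Classify a LevelFS value read from sshare output.

    Illustrative helper only; the name and return strings are assumptions,
    not part of Slurm.
    """
    if level_fs < 0:
        raise ValueError("LevelFS is not expected to be negative")
    if level_fs > 1.0:
        # Greater than one: consuming fewer resources than the fair share.
        return "under-consuming"
    if level_fs < 1.0:
        # Less than one: consuming more than the fair share.
        # The closer to zero, the lower the priority of pending jobs.
        return "over-consuming"
    return "at fair share"


if __name__ == "__main__":
    for value in (2.5, 1.0, 0.1):
        print(value, interpret_levelfs(value))
```

Because <tt>LevelFS</tt> is specific to each cluster, a value read on one cluster says nothing about your standing on another.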
== Why do my jobs show "Nodes required for job are DOWN, DRAINED or RESERVED for jobs in higher priority partitions"? ==
This string may appear in the "Reason" field of <tt>squeue</tt> output for a waiting job, and is new in Slurm 19.05.
It means just what it says: one or more of the nodes Slurm considered for the job are down, have been deliberately taken offline,
or are reserved for other jobs. On a large, busy cluster there will almost always be such nodes. The message means
effectively the same thing as the reason <tt>Resources</tt> that appeared in Slurm version 17.11.
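
Since the two reasons mean effectively the same thing, a script that summarizes your pending jobs may want to treat them alike. Here is a minimal sketch, assuming you have already collected the reason strings yourself; the function name <tt>normalize_reason</tt> is hypothetical, and the exact way your <tt>squeue</tt> prints the reason (for example, with or without surrounding parentheses) is an assumption.

```python
# Start of the long reason string quoted in the heading above.
LONG_REASON_PREFIX = "Nodes required for job are DOWN, DRAINED or RESERVED"


def normalize_reason(reason: str) -> str:
    """Map the long 'Nodes required ...' reason onto 'Resources'.

    Illustrative helper only (not part of Slurm). Other reasons are
    returned unchanged, apart from stripping whitespace and any
    surrounding parentheses.
    """
    cleaned = reason.strip().strip("()")
    if cleaned.startswith(LONG_REASON_PREFIX):
        return "Resources"
    return cleaned


if __name__ == "__main__":
    sample = (
        "(Nodes required for job are DOWN, DRAINED or RESERVED "
        "for jobs in higher priority partitions)"
    )
    print(normalize_reason(sample))
    print(normalize_reason("Priority"))
```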


== How accurate is START_TIME in <tt>squeue</tt> output? == <!--T:33-->
We don't show the start time by default with <tt>squeue</tt>, but it can be printed with the <tt>--start</tt> option. The start times Slurm forecasts depend on rapidly changing conditions, and are therefore not very useful.


<!--T:34-->