Frequently Asked Questions
- 1 Forgot my password
- 2 Disk quota exceeded error on /project filesystems
- 3 sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
- 4 Why are my jobs taking so long to start?
- 5 How accurate is START_TIME in squeue output?
Forgot my password
To reset your password for any Compute Canada national cluster, visit https://ccdb.computecanada.ca/security/forgot.
Disk quota exceeded error on /project filesystems
Some users have seen this message or some similar quota error on their project folders. Other users have reported obscure failures while transferring files into their
/project folder from another cluster. Many of the problems reported are due to bad file ownership.
Use diskusage_report to see if you are at or over your quota:
[ymartin@cedar5 ~]$ diskusage_report
Description                      Space          # of files
Home (user ymartin)              345M/50G       9518/500k
Scratch (user ymartin)           93M/20T        6532/1000k
Project (group ymartin)          5472k/2048k    158/5000k
Project (group def-zrichard)     20k/1000G      4/5000k
The example above illustrates a frequent problem: /project for user ymartin contains too much data in files belonging to group ymartin. The data should instead be in files belonging to group def-zrichard.
Note the two lines labelled Project:
- Project (group ymartin) describes files belonging to group ymartin, which has the same name as the user. This user is the only member of this group, which has a very small quota (2048k).
- Project (group def-zrichard) describes files belonging to a project group. Your account may be associated with one or more project groups, and they will typically have names like def-zrichard.
In this example, files have somehow been created belonging to group
ymartin instead of group
def-zrichard. This is neither the desired nor the expected behaviour.
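The comparison described above can also be done mechanically. The following is a minimal sketch, assuming the report layout shown in the example (a label ending in a parenthesis, followed by used/limit figures) and assuming the suffixes k, M, G and T are powers of 1024. The name over_quota_lines is a hypothetical helper, not part of diskusage_report.

```shell
# Hypothetical helper: read diskusage_report-style text on stdin and print
# the label of every line whose used space is at or above its limit.
# Assumption: the suffixes k, M, G and T are powers of 1024.
over_quota_lines() {
    awk '
        # Convert a figure such as "5472k" or "50G" to a plain number.
        function to_bytes(s,    n, u) {
            n = s + 0                      # numeric prefix
            u = substr(s, length(s), 1)    # last character is the unit suffix
            if (u == "k") return n * 1024
            if (u == "M") return n * 1024 ^ 2
            if (u == "G") return n * 1024 ^ 3
            if (u == "T") return n * 1024 ^ 4
            return n
        }
        # Only consider lines of the form "label) used/limit ...".
        match($0, /\) +[0-9.]+[kMGT]?\/[0-9.]+[kMGT]?/) {
            label = substr($0, 1, RSTART)               # up to and including ")"
            sub(/^[ \t]+/, "", label)
            pair = substr($0, RSTART + 1, RLENGTH - 1)  # the used/limit figures
            gsub(/ /, "", pair)
            split(pair, parts, "/")
            if (to_bytes(parts[1]) >= to_bytes(parts[2])) print label
        }
    '
}

# Example (not run here):
#   diskusage_report | over_quota_lines
```

With the report shown earlier, only the Project (group ymartin) line would be printed, since 5472k exceeds 2048k.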
By design, new files and directories in
/project will normally be created belonging to a project group. The two main reasons why files may be associated with the wrong group are that
- files were moved from another filesystem into /project with the mv command, which keeps the original group ownership; to avoid this, use cp instead; or
- files were transferred from another cluster using rsync or scp with an option to preserve the original group ownership. If you have a recurring problem with ownership, check the options you are using with your file transfer program.
For rsync you can use the following command to transfer a directory from a remote location to your project directory:
$ rsync -axvpH --no-g --no-p firstname.lastname@example.org:remote/dir/path $HOME/project/$USER/
You can also compress the data to get a better transfer rate.
$ rsync -axvpH --no-g --no-p --compress-level=5 email@example.com:remote/dir/path $HOME/project/$USER/
To see the project groups you may use, run the following command:
[name@server $] stat -c %G $HOME/projects/*/
If you are the owner of the files, you can run the
chgrp command to change their group ownership to the appropriate project group. To ask us to change the group owner for several users, contact technical support.
You can also use the command chmod g+s <directory name> to ensure that files created in that directory will inherit the directory's group membership.
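The chgrp and chmod g+s steps can be combined for a whole directory tree. This is a minimal sketch, assuming you own the files and are a member of the target group. The name fix_project_group is a hypothetical helper, and def-zrichard stands in for your own project group.

```shell
# Hypothetical helper: hand a directory tree to a project group and make
# new files created inside it inherit that group.
# Usage: fix_project_group <directory> <group>
fix_project_group() {
    # Change the group of the directory and everything below it.
    # -h changes symbolic links themselves rather than their targets.
    chgrp -R -h "$2" "$1" || return 1
    # Set the setgid bit on every directory so that new entries
    # inherit the directory's group.
    find "$1" -type d -exec chmod g+s {} +
}

# Example (not run here):
#   fix_project_group "$HOME/project/$USER/data_dir" def-zrichard
```

Only directories receive the setgid bit. Regular files are left with their existing permissions and only change group.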
Each file in Linux belongs to a user and a group at the same time. By default, a file you create belongs to you, user username, and to your group, which has the same name, username. That is, it is owned by username:username. Your group was created at the same time as your account, and you are the only user in that group.
This file ownership is good for your home directory and the scratch space, as shown here:
Description                      Space          # of files
Home (user username)             15G/53G        74k/500k
Scratch (user username)          1522G/100T     65k/1000k
Project (group username)         34G/2048k      330/2048
Project (group def-professor)    28k/1000G      9/500k
The first two lines show the quotas set for the user username in the home and scratch spaces.
The other two lines are set for the groups username and def-professor in the project space. It does not matter which users own the files in that space; the group the files belong to determines which quota applies.
You can see that files belonging to the group username (your default group) have a very small limit in the project space, only 2 MB, and you already have 34 GB of data belonging to that group. This is why you cannot write more data there: you are trying to place data owned by a group that has very little allocation in the project space.
The allocation for the group def-professor, your professor's group, on the other hand, uses almost no space and has a 1 TB limit. Files placed there should have username:def-professor ownership.
Now, depending on how you copy your files and what software you use, that software will either respect the ownership of the directory and apply the correct group, or it will retain the ownership of the source data. In the latter case you will have a problem like the one you have now.
Most probably your original data properly belongs to username:username. After being moved it should belong to username:def-professor, but your software probably keeps the original ownership, and this causes the problem.
If you already have data in your project directory with the wrong ownership, you can correct this with the following commands:
$ cd project/$USER
$ chown -R username:def-professor data_dir
This will correct the ownership of the files inside the data_dir directory in your project space.
Finding files with the wrong group ownership
You may find it difficult to identify files that are contributing to an over-quota condition in /project. The lfs find command can be used in conjunction with readlink to solve this:
[name@server $] lfs find $(readlink $HOME/projects/*) -group $USER
This will identify files belonging to the user's unique group, e.g. ymartin in the example shown earlier. If the output of quota indicates that a different group is over quota, use that group name instead of $USER.
See Project layout for further explanations.
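When it is unclear which group holds most of the files, a per-group count can help narrow the search. This is a minimal sketch, assuming GNU find (for -printf) rather than lfs find, so it may be slow on a large Lustre tree. The name files_per_group is a hypothetical helper.

```shell
# Hypothetical helper: count regular files per owning group below a directory.
# Assumes GNU find; %g prints the group name of each file.
# Usage: files_per_group <directory>
files_per_group() {
    find "$1" -type f -printf '%g\n' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (not run here):
#   files_per_group "$HOME/project/$USER"
```

Each output line gives a group name followed by the number of files it owns. A large count next to your personal group suggests files that should be moved to a project group.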
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
You may see this message when the load on the Slurm manager or scheduler process is too high. We are working both to improve Slurm's tolerance of that and to identify and eliminate the sources of load spikes, but that is a long-term project. The best advice we have currently is to wait a minute or so. Then run
squeue -u $USER and see if the job you were trying to submit appears: in some cases the error message is delivered even though the job was accepted by Slurm. If it doesn't appear, simply submit it again.
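The wait, check and resubmit advice can be scripted. This is a minimal sketch, assuming the job is submitted under a name of your choosing so that it can be looked up afterwards. The name submit_once and the default 60-second wait are illustrative choices. The sketch treats any sbatch failure as possibly spurious, which may not suit every case.

```shell
# Hypothetical helper: submit a job under a known name and, if sbatch
# reports an error, wait and check whether Slurm accepted it anyway
# before submitting again.
# Usage: submit_once <job-name> <script> [wait-seconds]
submit_once() {
    name=$1
    script=$2
    delay=${3:-60}
    if sbatch --job-name="$name" "$script"; then
        return 0
    fi
    # The error may be spurious, so give the scheduler time to recover.
    sleep "$delay"
    # -h suppresses the header, -n filters by job name, %i prints job IDs.
    if [ -n "$(squeue -h -u "$USER" -n "$name" -o %i)" ]; then
        echo "job '$name' was accepted despite the error"
        return 0
    fi
    # The job is not in the queue: submit it one more time.
    sbatch --job-name="$name" "$script"
}

# Example (not run here):
#   submit_once my_analysis job.sh
```

Using a distinct job name for each submission keeps the squeue check from matching an older job of yours.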
Why are my jobs taking so long to start?
You can see why your jobs are in the PD (pending) state by running the squeue -u <username> command on the cluster.
The (REASON) column typically has the values Resources or Priority.
- Resources: The cluster is simply very busy and you will have to be patient or perhaps consider if you can submit a job that asks for fewer resources (e.g. CPUs/nodes, GPUs, memory, time).
- Priority: Your job is waiting to start due to its lower priority. This is because you and other members of your research group have been over-consuming your fair share of the cluster resources in the recent past, something you can track using the command sshare as explained in Job scheduling policies.
The LevelFS column gives you information about your over- or under-consumption of cluster resources: when LevelFS is greater than one, you are consuming fewer resources than your fair share, while if it is less than one you are consuming more. The more you overconsume resources, the closer the value gets to zero and the more your pending jobs decrease in priority. There is a memory effect to this calculation so the scheduler gradually "forgets" about any potential over- or under-consumption of resources from months past. Finally, note that the value of LevelFS is unique to the specific cluster.
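The two checks described above can be wrapped in small functions. This is a minimal sketch. The names pending_reasons and levelfs_meaning are hypothetical helpers, and the LevelFS value is assumed to have been read by hand from the sshare output.

```shell
# List your pending jobs with the scheduler's reason for each.
# -t PD selects pending jobs; %i is the job ID, %j the name, %r the reason.
pending_reasons() {
    squeue -u "$USER" -t PD -o "%.12i %.20j %r"
}

# Interpret a LevelFS value following the rule described above.
# Usage: levelfs_meaning <value>
levelfs_meaning() {
    awk -v v="$1" 'BEGIN {
        if (v > 1)      print "under-consuming: usage is below the fair share"
        else if (v < 1) print "over-consuming: usage is above the fair share"
        else            print "usage matches the fair share"
    }'
}

# Example (not run here):
#   pending_reasons
#   levelfs_meaning 0.25
```

A value well below one, such as 0.25, falls in the over-consuming case, so pending jobs would be expected to wait longer.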
How accurate is START_TIME in squeue output?
Start times shown by squeue depend on rapidly-changing conditions, and are therefore not very useful.
Slurm computes START_TIME for high-priority pending jobs. These expected start times are computed from currently-available information:
- what resources will be freed by running jobs that complete; and
- what resources will be needed by other, higher-priority jobs waiting to run.
Slurm invalidates these future plans:
- if jobs end early, changing which resources become available; and
- if prioritization changes, due to submission of higher-priority jobs or cancellation of queued jobs for example.
On Compute Canada general purpose clusters, new jobs are submitted about every five seconds, and 30-50% of jobs end early, so Slurm often discards and recomputes its future plans.
Most waiting jobs have a START_TIME of "N/A", which stands for "not available", meaning Slurm is not attempting to project a start time for them.
For jobs which are already running, the start time reported by squeue is perfectly accurate.