Job Status and Reason Codes¶

The squeue command details a variety of information on an active job’s status with state and reason codes. Job state codes describe a job’s current state in queue (e.g. pending, completed). Job reason codes describe the reason why the job is in its current state.

The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.

Job State Codes¶

Status	Code	Explaination
CANCELLED	`CA`	The job was explicitly cancelled by the user or system administrator.
COMPLETED	`CD`	The job has completed successfully.
COMPLETING	`CG`	The job is finishing but some processes are still active.
DEADLINE	`DL`	The job terminated on deadline
FAILED	`F`	The job terminated with a non-zero exit code and failed to execute.
NODE_FAIL	`NF`	The job terminated due to failure of one or more allocated nodes
OUT_OF_MEMORY	`OOM`	The Job experienced an out of memory error.
PENDING	`PD`	The job is waiting for resource allocation. It will eventually run.
PREEMPTED	`PR`	The job was terminated because of preemption by another job.
RUNNING	`R`	The job currently is allocated to a node and is running.
SUSPENDED	`S`	A running job has been stopped with its cores released to other jobs.
STOPPED	`ST`	A running job has been stopped with its cores retained.
TIMEOUT	`TO`	Job terminated upon reaching its time limit.

A full list of these Job State codes can be found in squeue documentation. or sacct documentation.

Job Reason Codes¶

Reason Code	Explaination
`Priority`	One or more higher priority jobs is in queue for running. Your job will eventually run.
`Dependency`	This job is waiting for a dependent job to complete and will run afterwards.
`Resources`	The job is waiting for resources to become available and will eventually run.
`InvalidAccount`	The job’s account is invalid. Cancel the job and rerun with correct account.
`InvaldQoS`	The job’s QoS is invalid. Cancel the job and rerun with correct account.
`QOSGrpCpuLimit`	All CPUs assigned to your job’s specified QoS are in use; job will run eventually.
`QOSGrpMaxJobsLimit`	Maximum number of jobs for your job’s QoS have been met; job will run eventually.
`QOSGrpNodeLimit`	All nodes assigned to your job’s specified QoS are in use; job will run eventually.
`PartitionCpuLimit`	All CPUs assigned to your job’s specified partition are in use; job will run eventually.
`PartitionMaxJobsLimit`	Maximum number of jobs for your job’s partition have been met; job will run eventually.
`PartitionNodeLimit`	All nodes assigned to your job’s specified partition are in use; job will run eventually.
`AssociationCpuLimit`	All CPUs assigned to your job’s specified association are in use; job will run eventually.
`AssociationMaxJobsLimit`	Maximum number of jobs for your job’s association have been met; job will run eventually.
`AssociationNodeLimit`	All nodes assigned to your job’s specified association are in use; job will run eventually.

A full list of these Job Reason Codes can be found in Slurm’s documentation.

Running Job Statistics Metrics¶

The sstat command allows users to easily pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as such:

# /!\ ADAPT <jobid> accordingly
$ sstat --jobs=<jobid>

By default, sstat will pull up significantly more information than what would be needed in the commands default output. To remedy this, we can use the --format flag to choose what we want in our output. A chart of some these variables are listed in the table below:

Variable	Description
`avecpu`	Average CPU time of all tasks in job.
`averss`	Average resident set size of all tasks.
`avevmsize`	Average virtual memory of all tasks in a job.
`jobid`	The id of the Job.
`maxrss`	Maximum number of bytes read by all tasks in the job.
`maxvsize`	Maximum number of bytes written by all tasks in the job.
`ntasks`	Number of tasks in a job.

For an example, let's print out a job's average job id, cpu time, max rss, and number of tasks. We can do this by typing out the command:

# /!\ ADAPT <jobid> accordingly
sstat --jobs=<jobid> --format=jobid,cputime,maxrss,ntasks

A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the slurm page on sstat.

Past Job Statistics Metrics¶

You can use the custom susage function in /etc/profile.d/slurm.sh to collect statistics information.

$ susage -h
Usage: susage [-m] [-Y] [-S YYYY-MM-DD] [-E YYYT-MM-DD]
  For a specific user (if accounting rights granted):    susage [...] -u <user>
  For a specific account (if accounting rights granted): susage [...] -A <account>
Display past job usage summary

But by default, you should use the sacct command allows users to pull up status information about past jobs. This command is very similar to sstat, but is used on jobs that have been previously run on the system instead of currently running jobs.

# /!\ ADAPT <jobid> accordingly
$ sacct [-X] --jobs=<jobid> [--format=metric1,...]
# OR, for a user, eventually between a Start and End date
$ sacct [-X] -u $USER  [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...]
# OR, for an account - ADAPT <account> accordingly
$ sacct [-X] -A <account> [--format=metric1,...]

Use -X to aggregate the statistics relevant to the job allocation itself, not taking job steps into consideration.

The main metrics code you may be interested to review are listed below.

Variable	Description
`account`	Account the job ran under.
`avecpu`	Average CPU time of all tasks in job.
`averss`	Average resident set size of all tasks in the job.
`cputime`	Formatted (Elapsed time * CPU) count used by a job or step.
`elapsed`	Jobs elapsed time formated as DD-HH:MM:SS.
`exitcode`	The exit code returned by the job script or salloc.
`jobid`	The id of the Job.
`jobname`	The name of the Job.
`maxdiskread`	Maximum number of bytes read by all tasks in the job.
`maxdiskwrite`	Maximum number of bytes written by all tasks in the job.
`maxrss`	Maximum resident set size of all tasks in the job.
`ncpus`	Amount of allocated CPUs.
`nnodes`	The number of nodes used in a job.
`ntasks`	Number of tasks in a job.
`priority`	Slurm priority.
`qos`	Quality of service.
`reqcpu`	Required number of CPUs
`reqmem`	Required amount of memory for a job.
`reqtres`	Required Trackable RESources (TRES)
`user`	Userna

A full list of variables that specify data handled by sacct can be found with the --helpformat flag or by visiting the slurm page on sacct.