Job Monitoring

Show job queue

To list all jobs currently known to the scheduler, use

$ squeue --all
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)

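JOBID: job ID
PARTITION: partition the job was submitted to (use sinfo to list all available partitions)
NAME: job name
USER: username of the job owner
ST: job state, one of
- R: Running
- PD: Pending
- TO: Timeout
- S: Suspended
- CD: Completed
- CA: Cancelled
- F: Failed
- NF: Node Fail
TIME: time used by the job so far
NODES: number of nodes allocated to the job
NODELIST(REASON): nodes allocated to a running job, or the reason a pending job is waiting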

To list jobs only for your user, use

squeue -u username
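
When you have many jobs, a per-state summary can be easier to read than the full listing. The following is a minimal sketch, not part of the site tooling: the helper name count_by_state is hypothetical, and it assumes squeue is called with -h (no header) and -o "%T" (state in long form) so that each line holds exactly one state name.

```shell
# Count jobs per state. Reads one state name per line on stdin
# and prints "STATE COUNT" lines, sorted by state name.
count_by_state() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Usage on a cluster where Slurm is installed:
#   squeue -h -u "$USER" -o "%T" | count_by_state
```

The function only post-processes text, so it can be tried on saved squeue output as well.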

To check the estimated start time of pending jobs, use

squeue --start

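To customize the output columns, pass a format string with -o. The example below shows the job ID, partition, job name, user, state in long form, number of CPUs, number of nodes, minimum memory, time limit, time used, time left, and node list or pending reason:

squeue -o "%.8i %.9P %.10j %.10u %.8T %.5C %.4D %.6m %.10l %.10M %.10L %.16R"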

See the squeue man page for more information.

man squeue
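
To avoid retyping a long format string, squeue also reads a default format from the SQUEUE_FORMAT environment variable. The snippet below is a sketch of that setup, reusing the format string shown on this page; placing it in a shell startup file such as ~/.bashrc is an assumption about your environment, not a site requirement.

```shell
# Default output format for squeue; an explicit -o on the command line
# still overrides this value.
export SQUEUE_FORMAT="%.8i %.9P %.10j %.10u %.8T %.5C %.4D %.6m %.10l %.10M %.10L %.16R"

# Afterwards a plain call uses the format above:
#   squeue -u "$USER"
```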

Job information

To view detailed information about a job, pass its job ID to scontrol show job:

$ scontrol show job 689

JobId=689 JobName=test
   UserId=user(1831) GroupId=user(1831) MCS_label=N/A
   Priority=3348 Nice=0 Account=testproj QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-04:02:39 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2025-06-03T09:54:41 EligibleTime=2025-06-03T09:54:41
   AccrueTime=2025-06-03T09:54:41
   StartTime=2025-06-03T09:54:41 EndTime=2025-06-05T09:54:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-06-03T09:54:41 Scheduler=Backfill
   Partition=compute AllocNode:Sid=login05:3056374
   ReqNodeList=m02 ExcNodeList=(null)
   NodeList=m02
   BatchHost=m02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=8G,node=1,billing=1
   AllocTRES=cpu=1,mem=8G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/users/staff/user/../test1.job
   WorkDir=/gpfs/users/staff/user/benchmarks/lammps/spce
   StdErr=/gpfs/users/staff/user/../test.err
   StdIn=/dev/null
   StdOut=/gpfs/users/staff/user/../test.out
   TresPerTask=cpu=1
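
The output above is a sequence of Key=Value pairs, so a single field can be pulled out in a script. The following is a minimal sketch: the helper name job_field is hypothetical, and it assumes the value contains no spaces (true for fields such as JobState, RunTime, or NodeList, but not necessarily for a JobName or Command that contains spaces).

```shell
# Print the value of one Key=Value field from `scontrol show job`
# output read on stdin. Only the first match is printed.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")          # position of the first "="
            if (n > 0 && substr($i, 1, n - 1) == key) {
                print substr($i, n + 1) # everything after the first "="
                exit
            }
        }
    }'
}

# Usage on a cluster where Slurm is installed:
#   scontrol show job 689 | job_field JobState
#   scontrol show job 689 | job_field RunTime
```

Splitting at the first "=" keeps nested values intact, so a field such as ReqTRES is returned as cpu=1,mem=8G,node=1,billing=1.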

Pending Jobs

Common reasons why a job remains in the pending (PD) state, as shown in the NODELIST(REASON) column of squeue:

Reason	Description
Dependency	The job is waiting for a job it depends on to complete.
NodeDown	A node required by the job is down.
PartitionDown	The partition (queue) required by the job is in a DOWN state and is temporarily not accepting jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up.
Priority	One or more jobs with higher priority than yours exist for this partition or advanced reservation.
ReqNodeNotAvail	No nodes can be found that satisfy your limits, for instance because maintenance is scheduled and the job cannot finish before it begins.
Reservation	The job is waiting for its advanced reservation to become available.
Resources	The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
SystemFailure	Failure of the Slurm system, a file system, the network, etc.
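
Priority and Resources are part of normal queueing, while the other reasons may deserve a closer look. The sketch below filters a list of pending jobs down to those other reasons. The helper name unusual_pending is hypothetical, and it assumes squeue is called with -h (no header) and -o "%i %r" so that each line holds a job ID followed by a reason.

```shell
# Read "JOBID REASON" lines on stdin and print only the jobs whose
# pending reason is something other than Priority or Resources.
unusual_pending() {
    awk '$2 != "Priority" && $2 != "Resources" { print $1, $2 }'
}

# Usage on a cluster where Slurm is installed:
#   squeue -h -u "$USER" -t PENDING -o "%i %r" | unusual_pending
```

An empty result means all of your pending jobs are simply waiting their turn.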