Monitoring & Efficiency
Tools such as sstat, sacct, and seff provide real-time and historical insights into job status, runtime, memory consumption, and exit codes. Regularly checking these outputs helps users detect issues like stalled jobs, excessive memory usage, or unexpected termination. Complementing these tools with well-structured output and error log files ensures that important diagnostic information is always available.
| Command | sstat | sacct | seff |
|---|---|---|---|
| Job state | Running (only) | Running & Completed | Completed |
| Data freshness | Real-time | Historical | Historical |
| Use case | Monitoring | State Analysis | Efficiency |
Best Practices
Best practice is proactive logging and instrumentation. Users should design their job scripts to produce meaningful progress indicators—such as timestamps, iteration counters, or checkpoint messages—written to standard output. This allows for quick assessment of whether a job is progressing normally. For long-running workflows, implementing checkpointing mechanisms can be especially valuable, enabling jobs to resume from intermediate states rather than restarting from scratch after failure.
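As a minimal sketch of such instrumentation (the job name and the loop workload are placeholders, not a real application), a job script can emit timestamped progress messages to standard output:

```shell
#!/bin/bash
#SBATCH --job-name=progress-demo   # hypothetical job name
#SBATCH --time=01:00:00            # placeholder walltime

# Prefix every progress message with a timestamp so stalls
# are easy to spot in the output log.
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

log "job started on $(hostname)"
for i in 1 2 3; do                 # stand-in for the real work iterations
    log "iteration $i finished"
done
log "job finished"
```

Comparing the timestamps of consecutive iterations in the output file immediately shows whether the job is still making progress.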
Understanding job efficiency is equally important. After completion, users should review accounting data (e.g., via sacct) to evaluate CPU and memory utilization. Jobs that consistently underutilize resources can be tuned to request fewer resources, improving cluster fairness and reducing queue times. Conversely, jobs that hit memory or time limits may require adjustments to resource requests or code optimization.
Finally, effective monitoring involves automation and notification strategies. Users can configure email alerts for job state changes, or integrate lightweight scripts that periodically query job status and summarize key metrics. This is particularly useful in environments where users manage multiple concurrent jobs or complex pipelines.
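One possible shape for such a lightweight script is sketched below; summarize_states is a hypothetical helper, and sample data stands in for a live scheduler:

```shell
# Count jobs per state from sacct-style parseable output ("JobID|State" lines).
summarize_states() {
    awk -F'|' '{count[$2]++} END {for (s in count) print s, count[s]}' | sort
}

# On a real cluster you would feed live accounting data, for example:
#   sacct --starttime=today --format=JobID,State --parsable2 --noheader | summarize_states
# Here sample data stands in for the scheduler:
printf '1001|COMPLETED\n1002|FAILED\n1003|COMPLETED\n' | summarize_states
```

Run periodically (e.g. from cron), such a summary gives a quick overview of many concurrent jobs without inspecting each one individually.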
State Analysis
The sacct command [1] allows analysis of jobs after they have finished. It queries the accounting database and provides detailed information about completed (and sometimes still running) jobs.
```shell
# View all jobs from today…
sacct --starttime=today

# …or define a custom time range
sacct --starttime=2026-03-25 --endtime=2026-03-26
```

To check the status of a specific job:

```shell
sacct -j <jobid>
```

Output Format
By default, the sacct output is limited to a few fields. You can request more useful ones with the option:
| Option | Description |
|---|---|
| --format | Comma-separated list of fields |
```shell
# show the list of available fields
sacct --helpformat

# example for a custom output format
sacct --format=JobID,JobName,Partition,State,ExitCode,Elapsed
```

For a permanent change of the default output format, use:
| Variable | Description |
|---|---|
| SACCT_FORMAT | Override the default output format |
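For example, the variable can be set in a shell startup file (such as ~/.bashrc) so that every sacct call uses the custom field list:

```shell
# Make the custom field list the default for all future sacct calls
export SACCT_FORMAT="JobID,JobName,Partition,State,ExitCode,Elapsed"
```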
Exit Codes
The exit code reported by sacct is an important clue for understanding why a job finished the way it did. For example:

```shell
sacct --format=JobID,State,ExitCode
```

The output looks similar to the following:
```
#…
182520    COMPLETED    0:0
182521    FAILED       17:0
182522    RUNNING      0:0
182524    FAILED       255:0
182529    CANCELLED    0:9
#…
```

The ExitCode (last column) has two parts: <exit_status>:<signal>
- The exit status (first number) is the return value from your application or script. It follows standard Unix conventions:
  - 0 — Execution successful
  - non-zero — Error (application-defined meaning)
- The signal (second number) indicates whether the job was terminated by a system signal:
  - 0 — No signal
  - non-zero — Job was killed by a signal
Some exit code examples:
| State | ExitCode | Signal | Meaning |
|---|---|---|---|
| COMPLETED | 0:0 | - | Success |
| FAILED | 1:0 | - | Application error |
| FAILED | 0:9 | SIGKILL | Killed (often out of memory) |
| CANCELLED | 0:9 | SIGKILL | User/system stopped job |
| TIMEOUT | 0:15 | SIGTERM | Hit walltime |
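The two-part code can be split with plain shell parameter expansion; explain_exit below is a hypothetical helper for illustration, not part of Slurm:

```shell
# Interpret a Slurm ExitCode field of the form <exit_status>:<signal>.
explain_exit() {
    local code="$1"
    local status="${code%%:*}"   # part before the colon
    local signal="${code##*:}"   # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "$status" -ne 0 ]; then
        echo "application error (exit status $status)"
    else
        echo "success"
    fi
}

explain_exit "0:0"    # → success
explain_exit "17:0"   # → application error (exit status 17)
explain_exit "0:9"    # → terminated by signal 9
```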
Monitoring
The sstat command [2] is used to monitor resource usage of running jobs or job steps in real time. It's especially useful for tracking CPU, memory, or I/O usage while your simulation or analysis is still running.
To check the status of a specific job:

```shell
sstat -j <job_id>

# look at job steps
sstat -j <job_id>.batch     # batch script step
sstat -j <job_id>.[0-9]*    # individual job steps/tasks
```

sstat supports configuration of the output format like sacct does. Commonly used options are:
```shell
sstat --format=JobID,MaxRSS,AveCPU,AveRSS -j <job_id>
```

Efficiency
Use the seff command ("Slurm efficiency") to check how a completed job used the resources it requested. This helps users adjust the resource constraints of future jobs more accurately.

After a job finishes, run:
```shell
seff <job_id>

# for an individual job step
seff <job_id>.<job_step>
```

seff reports CPU resources, CPU time, and memory utilization:
```
>>> seff 169761
Job ID: 169761
Cluster: virgo
User/Group: #…
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 4-13:28:42
CPU Efficiency: 96.40% of 4-17:33:44 core-walltime
Job Wall-clock time: 14:11:43
Memory Utilized: 8.71 GB
Memory Efficiency: 54.44% of 16.00 GB (2.00 GB/core)
```

- CPU Efficiency — How much of the allocated CPU time was actually used. If the percentage is low, too many CPUs have been requested or the program is not parallelized well.
- Memory Efficiency — The fraction of the requested memory that was actually used. If the percentage is low, more memory was requested than needed. Note that CPUs are allocated with a fixed memory-to-CPU ratio.
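The memory-efficiency figure is simply utilized memory divided by requested memory. A quick check against the sample output above:

```shell
# Memory Efficiency = utilized / requested, as a percentage.
# Values taken from the seff sample: 8.71 GB used of 16.00 GB requested.
awk 'BEGIN { printf "%.2f%%\n", 8.71 / 16.00 * 100 }'   # → 54.44%
```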
Remember that clusters are shared systems, and users are required to contribute to make efficient utilization possible. Configuring the resource requirements of a job correctly prevents wasting resources and improves queue wait times.
Footnotes
1. sacct Manual Page, SchedMD Documentation, https://slurm.schedmd.com/sacct.html
2. sstat Manual Page, SchedMD Documentation, https://slurm.schedmd.com/sstat.html