Monitoring & Efficiency
Tools such as sstat, sacct, and seff provide real-time and historical insights into job status, runtime, memory consumption, and exit codes. Regularly checking these outputs helps users detect issues like stalled jobs, excessive memory usage, or unexpected termination. Complementing these tools with well-structured output and error log files ensures that important diagnostic information is always available.
| Command | sstat | sacct | seff |
|---|---|---|---|
| Job state | Running (only) | Running & Completed | Completed |
| Data freshness | Real-time | Historical | Historical |
| Use case | Monitoring | State Analysis | Efficiency |
Best Practices
Best practice is proactive logging and instrumentation. Users should design their job scripts to produce meaningful progress indicators—such as timestamps, iteration counters, or checkpoint messages—written to standard output. This allows for quick assessment of whether a job is progressing normally. For long-running workflows, implementing checkpointing mechanisms can be especially valuable, enabling jobs to resume from intermediate states rather than restarting from scratch after failure.
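As a minimal sketch of such instrumentation (the job name and the loop workload are placeholders, not a real application), a job script can emit timestamped progress messages to standard output:

```shell
#!/bin/bash
#SBATCH --job-name=progress-demo   # hypothetical job name
#SBATCH --time=01:00:00            # placeholder walltime

# Prefix every progress message with a timestamp so stalls
# are easy to spot in the output log.
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

log "job started on $(hostname)"
for i in 1 2 3; do                 # stand-in for the real work iterations
    log "iteration $i finished"
done
log "job finished"
```

Comparing the timestamps of consecutive iterations in the output file immediately shows whether the job is still making progress.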
Understanding job efficiency is equally important. After completion, users should review accounting data (e.g., via sacct) to evaluate CPU and memory utilization. Jobs that consistently underutilize resources can be tuned to request fewer resources, improving cluster fairness and reducing queue times. Conversely, jobs that hit memory or time limits may require adjustments to resource requests or code optimization.
Finally, effective monitoring involves automation and notification strategies. Users can configure email alerts for job state changes, or integrate lightweight scripts that periodically query job status and summarize key metrics. This is particularly useful in environments where users manage multiple concurrent jobs or complex pipelines.
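One possible shape for such a lightweight script is sketched below; summarize_states is a hypothetical helper, and sample data stands in for a live scheduler:

```shell
# Count jobs per state from sacct-style parseable output ("JobID|State" lines).
summarize_states() {
    awk -F'|' '{count[$2]++} END {for (s in count) print s, count[s]}' | sort
}

# On a real cluster you would feed live accounting data, for example:
#   sacct --starttime=today --format=JobID,State --parsable2 --noheader | summarize_states
# Here sample data stands in for the scheduler:
printf '1001|COMPLETED\n1002|FAILED\n1003|COMPLETED\n' | summarize_states
```

Run periodically (e.g. from cron), such a summary gives a quick overview of many concurrent jobs without inspecting each one individually.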
State Analysis
The sacct command [1] allows analysis of jobs after they have finished. It queries the accounting database and provides detailed information about completed (and sometimes still running) jobs.
```shell
# View all jobs from today…
sacct --starttime=today

# …or define a custom time range
sacct --starttime=2026-03-25 --endtime=2026-03-26
```

To check the status of a specific job:

```shell
sacct -j <jobid>
```

Output Format
By default, the sacct output is limited to a few fields. You can request more useful ones with the option:
| Option | Description |
|---|---|
| --format | Comma-separated list of fields |
```shell
# show the list of available fields
sacct --helpformat

# example for a custom output format
sacct --format=JobID,JobName,Partition,State,ExitCode,Elapsed
```

For a permanent change of the default output format, use:
| Variable | Description |
|---|---|
| SACCT_FORMAT | Override the default output format |
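For example, the variable can be set in a shell startup file (such as ~/.bashrc) so that every sacct call uses the custom field list:

```shell
# Make the custom field list the default for all future sacct calls
export SACCT_FORMAT="JobID,JobName,Partition,State,ExitCode,Elapsed"
```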
Exit Codes
The exit code reported by sacct is an important clue for understanding why a job finished the way it did. For example:

```shell
sacct --format=JobID,State,ExitCode
```

The output looks similar to the following:
```
#…
182520    COMPLETED    0:0
182521    FAILED       17:0
182522    RUNNING      0:0
182524    FAILED       255:0
182529    CANCELLED    0:9
#…
```

The ExitCode (last column) has two parts: <exit_status>:<signal>
- The exit status (first number) is the return value from your application or script. It follows standard Unix conventions:
  - 0 — Execution successful
  - non-zero — Error (application-defined meaning)
- The signal (second number) indicates whether the job was terminated by a system signal:
  - 0 — No signal
  - non-zero — Job was killed by a signal
Some exit code examples:
| State | ExitCode | Signal | Meaning |
|---|---|---|---|
| COMPLETED | 0:0 | - | Success |
| FAILED | 1:0 | - | Application error |
| FAILED | 0:9 | SIGKILL | Killed (often out of memory) |
| CANCELLED | 0:9 | SIGKILL | User/system stopped job |
| TIMEOUT | 0:15 | SIGTERM | Hit walltime |
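The two-part code can be split with plain shell parameter expansion; explain_exit below is a hypothetical helper for illustration, not part of Slurm:

```shell
# Interpret a Slurm ExitCode field of the form <exit_status>:<signal>.
explain_exit() {
    local code="$1"
    local status="${code%%:*}"   # part before the colon
    local signal="${code##*:}"   # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "$status" -ne 0 ]; then
        echo "application error (exit status $status)"
    else
        echo "success"
    fi
}

explain_exit "0:0"    # → success
explain_exit "17:0"   # → application error (exit status 17)
explain_exit "0:9"    # → terminated by signal 9
```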
Monitoring
The sstat command [2] is used to monitor resource usage of running jobs or job steps in real time. It's especially useful for tracking CPU, memory, or I/O usage while your simulation or analysis is still running.
To check the status of a specific job:

```shell
sstat -j <job_id>

# look at job steps
sstat -j <job_id>.batch     # batch script step
sstat -j <job_id>.[0-9]*    # individual job steps/tasks
```

sstat supports configuration of the output format like sacct does. Commonly used options are:
```shell
sstat --format=JobID,MaxRSS,AveCPU,AveRSS -j <job_id>
```

Efficiency
Use the seff command ("Slurm efficiency") to check how a completed job used the resources it requested. This helps users adjust the resource constraints of future jobs more accurately.

After a job finishes, run:
```shell
seff <job_id>

# for an individual job step
seff <job_id>.<job_step>
```

seff reports CPU resources, CPU time, and memory utilization:
```
>>> seff 169761
Job ID: 169761
Cluster: virgo
User/Group: #…
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 4-13:28:42
CPU Efficiency: 96.40% of 4-17:33:44 core-walltime
Job Wall-clock time: 14:11:43
Memory Utilized: 8.71 GB
Memory Efficiency: 54.44% of 16.00 GB (2.00 GB/core)
```

- CPU Efficiency — How much of the allocated CPU time was actually used. If the percentage is low, too many CPUs have been requested or the program is not parallelized well.
- Memory Efficiency — The fraction of the requested memory that was actually used. If the percentage is low, more memory was requested than needed. Note that CPUs are allocated with a fixed memory-to-CPU ratio.
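The memory-efficiency figure is simply utilized memory divided by requested memory. A quick check against the sample output above:

```shell
# Memory Efficiency = utilized / requested, as a percentage.
# Values taken from the seff sample: 8.71 GB used of 16.00 GB requested.
awk 'BEGIN { printf "%.2f%%\n", 8.71 / 16.00 * 100 }'   # → 54.44%
```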
Remember that clusters are shared systems, and users are required to contribute to make efficient utilization possible. Configuring the resource requirements of a job correctly prevents wasting resources and improves queue wait times.
Footnotes
1. sacct Manual Page, SchedMD Documentation, https://slurm.schedmd.com/sacct.html
2. sstat Manual Page, SchedMD Documentation, https://slurm.schedmd.com/sstat.html