Cluster
Abstract
The Virgo cluster uses Slurm for workload management. This section covers how to configure, submit, and monitor compute jobs.
Introduction
Think of an HPC cluster as a shared supercomputer that many people use at the same time. A scheduler such as Slurm¹ decides who gets to use which part of the machine, when, and for how long. The basic idea: you don't run programs directly as you would on your laptop. Instead, you ask the scheduler for resources, and it runs your job when those resources become available.
The following steps illustrate the typical workflow:
- Login — You connect via SSH to a submit node.
- Batch script — You write a batch job script that defines which program to run, which resources (CPUs, memory) it needs, and how long it may run.
- Submit job — You send the script to Slurm, and your job goes into a queue.
- Waiting in the queue — Slurm keeps all user jobs in a queue and starts them on free resources when available.
- Job runs — Slurm assigns compute resources and your job is started using the batch script.
- Access output — After the job has finished, you access the output data and log files.
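The steps above can be sketched as a minimal batch script. Note that the partition name, resource limits, and program name below are illustrative placeholders, not values specific to this cluster; consult the Partitions and Resource Constraints pages for the actual limits.

```shell
#!/bin/bash
# example.sbatch — a minimal sketch of a Slurm batch job script.
# All values below are placeholder assumptions; adapt them to your cluster.
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --partition=main          # placeholder partition name
#SBATCH --cpus-per-task=4         # number of CPUs requested
#SBATCH --mem=8G                  # memory requested
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=job_%j.out       # log file; %j expands to the job ID

srun ./my_program                 # placeholder for your actual program
```

Such a script would be submitted with `sbatch example.sbatch`, which prints the job ID and returns immediately; `squeue --me` then shows the job waiting in the queue or running, and once it has finished the log can be read from the `job_<jobid>.out` file.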
Overview
| Page | Description |
|---|---|
| Environment | Slurm commands, meta-commands, working directory, I/O redirection |
| Partitions | Available partitions and their resource limits |
| Resource Allocation | Interactive jobs, batch jobs, and recurring jobs with scrontab |
| Resource Constraints | Runtime, memory, CPU, and feature constraints |
| Scheduler Queue | Job states and job queue priority |
| Monitoring & Efficiency | Monitor resource usage, analyse job failures and efficiency |
| Accounts | Slurm accounts, coordinators, and fair-share |
| Reservations | Requesting and using resource reservations |
Footnotes
1. Slurm Overview, SchedMD Documentation, https://slurm.schedmd.com/overview.html