First Job

Modified: March 20, 2026

Abstract

The examples introduced in this section use the debug partition to test an application launch. Remember that the debug partition provides only a limited runtime: its short wall-clock and response times exist so that jobs can be executed on short notice.

Before You Begin

Make sure you have completed the following steps before continuing:

  • You can log in to a submit node via SSH
  • You have a Slurm account association (see Accounts)
  • You know the path to your Lustre working directory (see Lustre)

Display an overview of the allocatable resources in the debug partition with the sinfo 1 command:

sinfo -lN -p debug

Setup

The examples on this page use the following environment variables. Adjust LUSTRE_HOME to match your group’s directory on shared storage:

# shared storage on the cluster (adjust to your group's path)
export LUSTRE_HOME=/lustre/$(id -g -n)/$USER

# use the debug partition for quick allocation times in this tutorial
export SBATCH_PARTITION=debug
export SALLOC_PARTITION=debug
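You can sanity-check how the path resolves before proceeding; id -g -n prints your primary group name and id -u -n your user name. The snippet below only prints the path, it does not create anything:

```shell
# show the path that LUSTRE_HOME from the Setup above resolves to
echo "LUSTRE_HOME resolves to: /lustre/$(id -g -n)/$(id -u -n)"
```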

First Job

If you are interested in more elaborate examples of how to run applications on the cluster, we would like to draw your attention to The Virgo Blog. The cluster group plans to continually publish articles there illustrating common use cases and best practices.

Let’s walk through the steps required to execute your first application (cf. batch jobs) on the cluster. Slurm expects an executable as an argument to the sbatch command. Typically this is a wrapper script containing Slurm meta-commands that set runtime configuration options, plus everything needed to launch the user application.

The following minimal wrapper script identifies the compute node executing the job. Create it in your $LUSTRE_HOME working directory:

cat > $LUSTRE_HOME/sleep.sh <<'EOF'
#!/bin/bash
#SBATCH --output %j_%N.out
hostname ; sleep ${1:-30}  # first positional argument, default: 30
EOF
chmod +x $LUSTRE_HOME/sleep.sh
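The ${1:-30} expansion in sleep.sh substitutes the first positional argument and falls back to 30 when none is given; you can verify the pattern directly in your shell without submitting anything:

```shell
# simulate calling the script with and without an argument
set -- 300        # as if invoked: sleep.sh 300
echo "${1:-30}"   # prints 300
set --            # as if invoked with no arguments
echo "${1:-30}"   # prints 30
```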

Once you have created the file above, submit it to the Slurm workload management system using the sbatch command. To simplify monitoring of this job, specify a job name with the command-line option --job-name. We will use your $LUSTRE_HOME as the working directory:

# submit a job to sleep for 300 seconds
sbatch --job-name sleep --chdir $LUSTRE_HOME -- $LUSTRE_HOME/sleep.sh 300

If the job has been accepted, the system answers with its JOBID. The job name can be used as an option to the squeue command, which prints the state of the scheduling queue:

# list all jobs with a given name
squeue --name sleep

In the debug queue, the job state should quickly become R for running. It is of course possible to submit multiple jobs with the same name; each remains identifiable by its unique JOBID. Go on and submit some more jobs with different sleep times. The scontrol command shows details about the runtime configuration of a job, and jobs can be removed from the system using scancel 2.

# show the runtime configuration of the latest sleep job
scontrol show job $(squeue -h -o %A -n sleep | tail -n1)

Once a job disappears from squeue, it has finished. Use sacct to check completed jobs and their exit status:

# show recent jobs for your user
sacct --starttime now-1hour --format=JobID,JobName,State,ExitCode,Elapsed

The output file specified by #SBATCH --output (in this example %j_%N.out) will be written to the working directory.

First Issue

In case of a failure during job execution, it is important to distinguish between a problem internal to the application and an issue with the job execution environment. If you want to report a problem with the runtime environment on the cluster, follow the guidelines in report-issues. For this example, we work with a deliberately “broken” program called segfaulter, a variant of the famous “Hello World” program.

// segfaulter.c
int main(void)
{
    char *s = "hello world";
    *s = 'H';  // writing to a string literal is undefined behavior; typically crashes
}

The following commands prepare $LUSTRE_HOME/bin and $LUSTRE_HOME/src; save the program above as $LUSTRE_HOME/src/segfaulter.c before continuing:

export PATH=$LUSTRE_HOME/bin:$PATH
mkdir -p $LUSTRE_HOME/bin $LUSTRE_HOME/src

Build an executable binary from this small C program using the following commands. Executing the program yields a segmentation fault 3 with a non-zero exit code:

# Compile the program...
» gcc -g $LUSTRE_HOME/src/segfaulter.c -o $LUSTRE_HOME/bin/segfaulter

# ...and execute it
» $LUSTRE_HOME/bin/segfaulter ; echo $?
zsh: segmentation fault  $LUSTRE_HOME/bin/segfaulter
139            # <-- exit code
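The exit code 139 is not arbitrary: when a process is killed by a signal, the shell reports 128 plus the signal number, and 139 − 128 = 11. The kill -l builtin translates a signal number to its name:

```shell
# look up signal 11 (= 139 - 128) by number
kill -l 11    # prints: SEGV
```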

In the execution environment of a compute cluster, it is important to monitor both the runtime requirements of an application and its runtime behavior. This information helps to diagnose problems occurring during job execution. Users can implement environment checks and basic application monitoring in an application wrapper script.

#!/bin/bash

puts() {
  echo "[$(date +%Y/%m/%dT%H:%M:%S)]" "$@"
}

uid=$(id -u)

puts JOB CONFIGURATION ----------------------
scontrol show job -d -o $SLURM_JOB_ID
puts JOB CONFIGURATION END ------------------
puts JOB ENVIRONMENT ------------------------
env | grep ^SLURM_
puts JOB ENVIRONMENT END --------------------
puts NUMA CONFIGURATION ----------------------
lscpu | grep -e '^Model name' -e '^NUMA node[0-9]'
puts NUMA CONFIGURATION END ------------------

#####################################################
## APPLICATION
#####################################################

# Launch the user application via srun; an array keeps arguments intact
command=(srun -- "$@")

#####################################################

puts EXEC "${command[@]}"
"${command[@]}" &

# The process ID of the last spawned child process
child=$!
sleep 1 # wait for start-up
puts PROCESSES -------------------------------
ps -u $USER -o user,pid,cpuid,args -H
puts PROCESSES END ---------------------------

puts WAIT PID $child
wait $child
# Exit status of the child process
state=$?
puts EXIT $state
# Propagate the child's exit status to Slurm
exit $state

For the following example, store the code above in a file called generic-wrapper and use it to monitor the segfaulter program during execution. The segfaulter program is passed as the first argument to generic-wrapper, which executes it using srun 4.
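The $!/wait pattern at the core of the wrapper can be tried in isolation on the submit node; wait returns the exit status of the awaited child process:

```shell
# background a child that exits with status 3, then collect its status
( sleep 0.1; exit 3 ) &
child=$!
wait $child
echo "child exited with status $?"   # prints: child exited with status 3
```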

# make the wrapper script executable
chmod +x $LUSTRE_HOME/generic-wrapper

# use the wrapper script to launch the segfaulter program
sbatch --partition debug \
       --job-name segfaulter \
       --chdir $LUSTRE_HOME \
       --no-requeue \
       -- $LUSTRE_HOME/generic-wrapper \
          $LUSTRE_HOME/bin/segfaulter

In the job’s standard output (cf. I/O redirection) you will find the following information:

  • The runtime configuration of the job in Slurm including:
    • Submit node and submit time
    • Slurm job id with partition and account details
    • Start time, working directory, output streams
    • Resource allocation details, including execution node(s)
  • The Slurm environment variables available during job execution
  • The hardware NUMA architecture
  • The application launch command executed
  • The application process ID, and a process tree during runtime
  • Exit code of the program

The information described above will be instrumental when reporting issues (cf. report-issues) to the cluster support group.

Footnotes

  1. sinfo Manual Page, SchedMD https://slurm.schedmd.com/sinfo.html↩︎

  2. scancel Manual Page, SchedMD https://slurm.schedmd.com/scancel.html↩︎

  3. Segmentation Fault, Wikipedia https://en.wikipedia.org/wiki/Segmentation_fault↩︎

  4. srun Manual Page, SchedMD https://slurm.schedmd.com/srun.html↩︎