Submitting Jobs to Cerberus

The Slurm workload scheduler manages the compute nodes on the cluster. Jobs must be submitted through the scheduler to gain access to compute resources on the system. There are two partitions (aka "queues") currently configured. The first is designated for CPU jobs, i.e., jobs that do not require a GPU.  NOTE: Running jobs on the login nodes is prohibited and may result in account suspension.

  • defq - Each node in this partition has 28 CPU cores and 128 GB of memory. It is intended for jobs that do not require GPU cards. The maximum time limit is 4 hours. Its nodes do not overlap with any other partition.

The other partition is specifically for jobs requiring GPU resources and should not be used for jobs that do not need them:

  • gpu – Same node configuration as defq, but each node also has an NVIDIA P100 GPU.

Note that you must set a time limit when submitting your jobs (with the -t flag; for example, -t 1:00:00 sets a limit of one hour), otherwise they will be immediately rejected. You must also set a partition (with the -p flag; for example, -p defq submits to the defq partition), otherwise jobs will likewise be rejected. Explicit time limits allow the Slurm scheduler to keep the system busy by backfilling smaller jobs while it allocates resources for larger ones. The maximum time limit for any job is 4 hours.
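
For instance, a minimal submission that sets both flags on the command line (assuming the script is named job_script.sh) looks like this:

sbatch -p defq -t 1:00:00 job_script.sh

The same options can instead be placed inside the script on #SBATCH lines, as in the examples below.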

Please estimate your required run time carefully and request a time limit accordingly. Longer-running processes should checkpoint and restart so that a hardware or cluster-configuration problem does not cost a significant amount of outstanding work; a sketch of this pattern is shown below. We can provide guidance if you are unsure of your job's requirements.
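
The script below is only a sketch of one way to structure a checkpointing job; ./solver, its options, and state.chk are placeholders for whatever checkpoint/restart mechanism your own application provides:

#!/bin/bash
#SBATCH -p defq
#SBATCH -t 4:00:00

# Resume from a previous checkpoint file if one exists, otherwise start fresh.
# "./solver" and its flags are hypothetical; substitute your application's own mechanism.
if [ -f state.chk ]; then
    ./solver --resume state.chk
else
    ./solver --checkpoint-file state.chk
fi

Each 4-hour job then picks up where the previous run left off and can simply be resubmitted.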

Below is an example of a simple MPI job script, which could be submitted with sbatch job_script.sh:

#!/bin/sh
# one hour timelimit:
#SBATCH --time 1:00:00
# defq partition, 32 MPI tasks (spans two of the 28-core nodes)
#SBATCH -p defq -n 32

module load openmpi

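# ./test is your compiled MPI executable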
mpirun ./test

The Slurm documentation describes the more advanced features, and the Slurm Quick-Start User Guide provides a good overview. The use of job arrays (see Job Array Support) is mandatory when submitting a large number of similar jobs; a minimal example follows.
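
The script below is a minimal sketch of a job array; ./process_input and the input file naming are placeholders for your own workflow:

#!/bin/bash
#SBATCH -p defq
#SBATCH -t 1:00:00
#SBATCH --array=1-100

# Slurm runs 100 independent tasks; each task sees its own value of
# SLURM_ARRAY_TASK_ID (1..100 here) and uses it to select its input file.
./process_input input_${SLURM_ARRAY_TASK_ID}.dat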

Matlab Example

#!/bin/bash

# set output and error output filenames, %j will be replaced by Slurm with the jobid
#SBATCH -o testing%j.out
#SBATCH -e testing%j.err 

# single node in the defq partition
#SBATCH -N 1
#SBATCH -p defq

# half hour timelimit
#SBATCH -t 0:30:00

module load matlab/2022a

# test.m is your matlab code
matlab -nodesktop -r "run('test.m'), exit"

R Example

#!/bin/bash

# set output and error output filenames, %j will be replaced by Slurm with the jobid
#SBATCH -o testing%j.out
#SBATCH -e testing%j.err 

# single node in the defq partition
#SBATCH -N 1
#SBATCH -p defq

# half hour timelimit
#SBATCH -t 0:30:00

module load R/4.1.2

# test.R is your R code
Rscript test.R
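
Either script can be submitted and monitored with the standard Slurm commands (job_script.sh is just an assumed filename, and <jobid> is the ID printed at submission):

sbatch job_script.sh     # submit; prints "Submitted batch job <jobid>"
squeue -u $USER          # list your pending and running jobs
scancel <jobid>          # cancel a job that is no longer needed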