Job Submission¶
To create a resource allocation and launch tasks, you can submit a batch script.
A batch script submitted to the scheduling system must define the job specifications (a minimal example follows this list):
- resource queue (default is compute)
- number of nodes required
- number of cores per node required
- maximum wall time for the job (jobs exceeding the wall time limit will be killed)
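For example, these specifications map to #SBATCH directives like the following (a minimal sketch; the queue name and the requested values are illustrative):
#SBATCH --partition=compute # Resource queue
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=4 # Cores (tasks) per node
#SBATCH --time=01:00:00 # Maximum wall time (hh:mm:ss)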
To submit a job, use the sbatch command.
sbatch my_script
Please check the sbatch man page for more information.
man sbatch
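On successful submission, sbatch prints the ID assigned to the job, for example (the job ID shown is illustrative):
Submitted batch job 123456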
Define batch script¶
Batch scripts contain:
- scheduler directives: lines beginning with #SBATCH
- shell commands: UNIX shell (bash) commands
- job steps: created with the srun command (see the sketch after the example below)
#!/bin/bash -l
#SBATCH --job-name=my_script # Job name
#SBATCH --ntasks=2 # Number of tasks
#SBATCH --time=01:30:00 # Run time (hh:mm:ss) - 1.5 hours
module load gnu #load any needed modules
echo "Start at `date`"
cd $HOME/workdir
./a.out
echo "End at `date`"
To submit this batch script
sbatch my_script
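The script above runs ./a.out directly in the batch shell. To run it as a job step instead (see the list of batch script components), the executable can be launched with srun; a minimal sketch, reusing the same a.out:
cd $HOME/workdir
srun ./a.out # Run a.out as a job step on the allocated resources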
Job Specifications¶
| Option | Argument | Specification |
|---|---|---|
| --job-name, -J | job_name | Job name is job_name |
| --partition, -p | queue_name | Submits to queue queue_name |
| --account, -A | project_name | Project to charge compute hours |
| --ntasks, -n | number_of_tasks | Total number of tasks |
| --nodes, -N | number_of_nodes | Number of nodes |
| --ntasks-per-node | ntasks_per_node | Tasks per node |
| --cpus-per-task, -c | cpus_per_task | Threads per task |
| --time, -t | HH:MM:SS | Time limit (hh:mm:ss) |
| --mem | memory_mb | Memory per node (MB) |
| --mem-per-cpu | memory_mb | Memory per CPU (MB) |
| --output, -o | stdout_filename | Direct job standard output to stdout_filename (%j expands to jobID) |
| --error, -e | stderr_filename | Direct job standard error to stderr_filename (%j expands to jobID) |
| --depend, -d | afterok:jobid | Job dependency |
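These options can also be passed directly on the sbatch command line, where they override the corresponding #SBATCH directives in the script, for example (the values are illustrative):
sbatch --job-name=test_run --ntasks=4 --time=00:30:00 my_script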
SLURM Environment Variables¶
SLURM provides environment variables for most of the values used in the #SBATCH directives.
| Environment Variable | Description |
|---|---|
| $SLURM_JOBID | Job id |
| $SLURM_JOB_NAME | Job name |
| $SLURM_SUBMIT_DIR | Submit directory |
| $SLURM_SUBMIT_HOST | Submit host |
| $SLURM_JOB_NODELIST | Node list |
| $SLURM_JOB_NUM_NODES | Number of nodes |
| $SLURM_CPUS_ON_NODE | Number of cores/node |
| $SLURM_CPUS_PER_TASK | Threads per task |
| $SLURM_NTASKS_PER_NODE | Number of tasks per node |
#!/bin/bash -l
#SBATCH --job-name=slurm_env
#SBATCH --nodes=2 # 2 nodes
#SBATCH --ntasks-per-node=12 # Number of tasks to be invoked on each node
#SBATCH --mem-per-cpu=1024 # Minimum memory required per CPU (in megabytes)
#SBATCH --time=00:01:00 # Run time in hh:mm:ss
#SBATCH --error=job.%J.out
#SBATCH --output=job.%J.out
echo "Start at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running $SLURM_NTASKS_PER_NODE tasks per node"
echo "Job id is $SLURM_JOBID"
echo "End at `date`"
Job Scripts¶
Here are some sample job submission scripts for different runtime models.
- MPI job: Run multi-process programs with MPI.
- Hybrid job: Parallel programs with MPI and OpenMP threads.
- GPU job: Utilize GPU accelerators.
Pure MPI batch script¶
Launch MPI jobs with the srun command.
DON’T USE mpirun OR mpiexec
#!/bin/bash -l
#-----------------------------------------------------------------
# Pure MPI job, using 256 procs on 2 nodes,
# with 128 procs per node and 1 thread per MPI task
#-----------------------------------------------------------------
#SBATCH --job-name=mpijob # Job name
#SBATCH --output=mpijob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=mpijob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=256 # Total number of tasks
#SBATCH --nodes=2 # Total number of nodes requested
#SBATCH --ntasks-per-node=128 # Tasks per node
#SBATCH --cpus-per-task=1 # Threads per task(=1) for pure MPI
#SBATCH --mem=128000 # Memory per node in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=compute # Submit queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load gnu/13.3.0
module load openmpi/4.1.8/gnu
# Launch the executable
srun EXE ARGS
Hybrid MPI/OpenMP batch script¶
Launch MPI jobs with the srun command.
DON’T USE mpirun OR mpiexec
#!/bin/bash -l
#-----------------------------------------------------------------
# Hybrid MPI/OpenMP job, using 256 cores on 2 nodes,
# with 64 MPI tasks per node and 2 threads per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=hybridjob # Job name
#SBATCH --output=hybridjob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=hybridjob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=128 # Total number of tasks
#SBATCH --nodes=2 # Total number of nodes requested
#SBATCH --ntasks-per-node=64 # Tasks per node
#SBATCH --cpus-per-task=2 # Threads per task
#SBATCH --mem=56000 # Memory per node in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=compute # Submit queue
#SBATCH -A testproj # Accounting project
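# Use one OpenMP thread per CPU allocated to each task; fall back to 1 thread if SLURM_CPUS_PER_TASK is not set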
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Load any necessary modules
module purge # Clean environment from loaded modules
module load gnu/13.3.0
module load openmpi/4.1.8/gnu
# Launch the executable
srun EXE ARGS
GPU batch script - 1x A100¶
Use up to 32 CPU cores per GPU
Use up to 124 GB of RAM per GPU
Launch GPU-accelerated jobs.
#!/bin/bash -l
#-----------------------------------------------------------------
# GPU job
# with 1 gpu, 16 procs and 2 threads per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=gpujob # Job name
#SBATCH --output=gpujob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=gpujob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=16 # Total number of tasks
#SBATCH --gres=gpu:a100:1 # GPUs per node
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks-per-node=16 # Tasks per node
#SBATCH --cpus-per-task=2 # Threads per task
#SBATCH --mem=126976 # Memory per job in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=gpu # Run on the GPU nodes queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load cuda/12.5.1
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Launch the executable
srun EXE ARGS
GPU batch script - 2x A100¶
Launch GPU-accelerated jobs.
Use up to 32 CPU cores per GPU
Use up to 124 GB of RAM per GPU
#!/bin/bash -l
#-----------------------------------------------------------------
# GPU job
# with 2 gpus, 32 procs and 2 threads per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=gpujob # Job name
#SBATCH --output=gpujob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=gpujob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=32 # Total number of tasks
#SBATCH --gres=gpu:a100:2 # GPUs per node
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks-per-node=32 # Tasks per node
#SBATCH --cpus-per-task=2 # Threads per task
#SBATCH --mem=253952 # Memory per job in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=gpu # Run on the GPU nodes queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load cuda/12.5.1
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Launch the executable
srun EXE ARGS
GPU MIG batch script - 1x 1g.10gb¶
Launch GPU-accelerated jobs.
Use up to 4 CPU cores and 15.5 GB of RAM for 1g.10gb
Never use 2x 1g.10gb instead of 1x 2g.20gb
Processes running on separate MIG GPUs cannot communicate via NVLink
#!/bin/bash -l
#-----------------------------------------------------------------
# GPU job
# with 1 MIG GPU, 2 procs and 2 threads per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=gpujob # Job name
#SBATCH --output=gpujob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=gpujob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=2 # Total number of tasks
#SBATCH --gres=gpu:1g.10gb:1 # GPUs per node
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks-per-node=2 # Tasks per node
#SBATCH --cpus-per-task=2 # Threads per task
#SBATCH --mem=15872 # Memory per job in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=mig # Run on the GPU nodes queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load cuda/12.5.1
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Launch the executable
srun EXE ARGS
GPU MIG batch script - 1x 2g.20gb¶
Launch GPU-accelerated jobs.
Use up to 8 CPU cores and 31 GB of RAM for 2g.20gb
Never use 2x 2g.20gb instead of 1x 3g.40gb
Processes running on separate MIG GPUs cannot communicate via NVLink
#!/bin/bash -l
#-----------------------------------------------------------------
# GPU job
# with 1 MIG gpu, 4 procs and 2 threads per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=gpujob # Job name
#SBATCH --output=gpujob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=gpujob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=4 # Total number of tasks
#SBATCH --gres=gpu:2g.20gb:1 # GPUs per node
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks-per-node=4 # Tasks per node
#SBATCH --cpus-per-task=2 # Threads per task
#SBATCH --mem=31744 # Memory per job in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=mig # Run on the GPU nodes queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load cuda/12.5.1
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Launch the executable
srun EXE ARGS
GPU MIG batch script - 1x 3g.40gb¶
Launch GPU-accelerated jobs.
Use up to 16 CPU cores and 62 GB of RAM for 3g.40gb
Never use 2x 3g.40gb instead of 1x a100
Processes running on separate MIG GPUs cannot communicate via NVLink
#!/bin/bash -l
#-----------------------------------------------------------------
# GPU job
# with 1 MIG gpu, 16 procs and 1 thread per MPI task.
#-----------------------------------------------------------------
#SBATCH --job-name=gpujob # Job name
#SBATCH --output=gpujob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=gpujob.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=16 # Total number of tasks
#SBATCH --gres=gpu:3g.40gb:1 # GPUs per node
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks-per-node=16 # Tasks per node
#SBATCH --cpus-per-task=1 # Threads per task
#SBATCH --mem=63488 # Memory per job in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=mig # Run on the GPU nodes queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module purge # Clean environment from loaded modules
module load cuda/12.5.1
if [ x$SLURM_CPUS_PER_TASK == x ]; then
export OMP_NUM_THREADS=1
else
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
# Launch the executable
srun EXE ARGS
Multiple Serial batch script¶
Multiple srun commands executed simultaneously from a single batch script.
Please note the wait at the end of the script; it ensures the batch job does not exit before all tasks have completed.
#!/bin/bash -l
#-----------------------------------------------------------------
# Multiple Serial job, 4 tasks, requesting 1 node, 3968 MB of memory per task
#-----------------------------------------------------------------
#SBATCH --job-name=multiple-serialjob # Job name
#SBATCH --output=multiple-serialjob.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=multiple-serialjob.%j.err # Stderr (%j expands to jobId)
#SBATCH --nodes=1 # Total number of nodes requested
#SBATCH --ntasks=4 # Total number of tasks
#SBATCH --ntasks-per-node=4 # Tasks per node
#SBATCH --cpus-per-task=1 # Threads per task
#SBATCH --mem-per-cpu=3968 # Memory per task in MB
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - (max 48h)
#SBATCH --partition=compute # Submit queue
#SBATCH -A testproj # Accounting project
# Load any necessary modules
module load gnu/13
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Launch the executable a.out
srun -n 1 -c 1 ./a.out input0 &
srun -n 1 -c 1 ./a.out input1 &
srun -n 1 -c 1 ./a.out input2 &
srun -n 1 -c 1 ./a.out input3 &
wait
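The same pattern can also be written as a loop over the input files; a minimal sketch, assuming the inputs are named input0 through input3 as above:
for i in 0 1 2 3; do
    srun -n 1 -c 1 ./a.out input$i &
done
wait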