Often a user of the campus cluster has a number of single-core (serial) jobs that they need to run. Since the campus cluster nodes have multiple cores (16/20/24/28/40/56 cores per node), using these resources efficiently means running several serial jobs on a single node. This can be done with multiple batch jobs (one per serial process) or by combining the processes within a single batch job.

Keep memory needs in mind when deciding how many serial processes to run concurrently on a node. If running your jobs in the secondary queue, also be aware that the compute nodes on the Campus Cluster have different amounts of memory. To avoid overloading a node, make sure that the combined memory required by your jobs or processes fits on that node. Assume that approximately 90% of the memory on a node is available for your jobs (the remainder is needed for system processes).
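
As a rough illustration, the shell arithmetic below estimates how many serial processes fit on the smallest node type; the usable-memory figure matches the note further down, while the per-process requirement (8000 MB) is only an assumed example value.

    # Minimal sketch: estimate how many serial processes fit on one node.
    USABLE_MB=54000                       # usable memory on a 16-core / 64 GB node
    PER_PROC_MB=8000                      # assumed memory needed by one serial process
    echo $(( USABLE_MB / PER_PROC_MB ))   # prints 6 -> run at most 6 such processes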

Multiple Batch Jobs

The --mem-per-cpu option in SLURM can be used to submit serial jobs that will run concurrently on a node. Specifying a memory amount of 3375 megabytes (--mem-per-cpu=3375), one node (--nodes=1), and one task per node (--ntasks-per-node=1) allows jobs owned by the same user to share a node. Multiple serial jobs submitted this way will then be scheduled to run concurrently on a node.

The following SLURM specification will schedule multiple serial jobs on one node:


    #!/bin/bash
    #SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks-per-node=1              # Number of tasks (cores/ppn) per node
    #SBATCH --mem-per-cpu=3375               # Memory per core (value in MBs)
    <other sbatch options>
    #
    cd ${SLURM_SUBMIT_DIR}

    # Run the serial executable
    ./a.out < input > output
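
Submitting several copies of such a job script (the script names below are only placeholders) then lets the scheduler pack the jobs onto the same node:

    sbatch serial-job1.sbatch
    sbatch serial-job2.sbatch
    sbatch serial-job3.sbatch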

Note:

  1. Increasing the --mem-per-cpu value for each job will cause fewer jobs to be scheduled concurrently on a single compute node, leaving more memory available to each job (see the example after this list).
  2. The above sbatch specifications are based on the smallest compute node configuration: a compute node with 16 cores and 64 GB of memory (54000 MB usable).
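
For example, doubling the memory request (the 6750 MB value below is only an illustrative figure) limits the jobs to about 8 per 54000 MB node:

    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks-per-node=1              # Number of tasks (cores/ppn) per node
    #SBATCH --mem-per-cpu=6750               # Twice the memory per job; ~8 jobs fit in 54000 MB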

Single Batch Job

Specify the maximum value for --ntasks-per-node as an sbatch option and execute multiple serial processes within a single one-node batch job. This works best if all the processes are expected to take approximately the same amount of time to complete, because the batch job does not exit until all the processes finish. The basic template for the job script is:


    #!/bin/bash
    #SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks-per-node=16             # Number of tasks (cores/ppn) per node
    #SBATCH --job-name=multi-serial_job      # Name of batch job
    #SBATCH --partition=secondary            # Partition (queue)
    #SBATCH --output=multi-serial.o%j        # Name of batch job output file

    executable1 &
    executable2 &
    .
    .
    .
    executable16 &
    wait

The ampersand (&) at the end of each command backgrounds the process, allowing all 16 processes to start concurrently. The wait command at the end is important so that the shell waits until all of the background processes are complete (otherwise the batch job will exit right away). Depending on the syntax of the commands, the processes can also be launched from a do loop, as in the MATLAB example below.

Note: When running multiple processes in a job, the total number of processes should generally not exceed the number of cores. Also be aware of memory needs so as not to run a node out of memory.

The following example batch script runs 16 instances of MATLAB concurrently:


    #!/bin/bash
    #SBATCH --time=00:30:00                  # Job run time (hh:mm:ss)
    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks-per-node=16             # Number of tasks (cores/ppn) per node
    #SBATCH --job-name=matlab_job            # Name of batch job
    #SBATCH --partition=secondary            # Partition (queue)
    #SBATCH --output=multi-serial.o%j        # Name of batch job output file


    cd ${SLURM_SUBMIT_DIR}
    
    module load matlab
    for (( i=1; i<=16; i++ ))
    do
        matlab -nodisplay -r num.9x2.$i > output.9x2.$i &
    done
    wait