Often a user of the campus cluster has a number of single-core (serial) jobs to run. Since campus cluster nodes have multiple cores (16, 20, 24, 28, 40, or 56 cores per node), using these resources efficiently means running several serial jobs concurrently on a single node. This can be done with multiple batch jobs (one per serial process) or within a single batch job.

Keep memory needs in mind when deciding how many serial processes to run concurrently on a node. If you run jobs in the secondary queue, also be aware that compute nodes on the Campus Cluster have different amounts of memory. To avoid overloading a node, make sure the combined memory required by your jobs or processes fits on that node. Assume that approximately 90% of the memory on a node is available for your jobs (the rest is needed for system processes).
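For example, a quick back-of-the-envelope calculation (the 64 GB node size and 4 GB per-process figures here are hypothetical, not Campus Cluster specifics):

```shell
# Estimate how many serial processes fit on a node by memory.
# Hypothetical numbers: a 64 GB node, ~90% usable, 4 GB per process.
node_mem_gb=64
usable_gb=$(( node_mem_gb * 90 / 100 ))   # ~57 GB available to your jobs
per_proc_gb=4
max_procs=$(( usable_gb / per_proc_gb ))  # how many processes fit by memory
echo "$max_procs"
```

If this number is smaller than the core count, memory, not cores, sets the limit on how many processes you should run concurrently.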

Multiple Batch Jobs

The queue Node Access Policy feature in PBS TORQUE can be used to run serial jobs concurrently on a node. With the singleuser policy, jobs owned by the same user can share a node, so multiple serial jobs you submit can be scheduled to run concurrently on the same node.

The following PBS specification will schedule multiple serial jobs on one node:


    #!/bin/bash
    # Request one core on one node
    #PBS -l nodes=1:ppn=1
    # Allow this user's other jobs to share the node
    #PBS -l naccesspolicy=singleuser
    <other qsub options>
    #
    cd ${PBS_O_WORKDIR}

    # Run the serial executable
    ./a.out < input > output

Note:

  1. Increase the ppn value for each job to account for its memory needs; fewer jobs then get scheduled concurrently on a node, leaving more memory available to each job.
  2. You can also choose the shared node access policy; in this case, jobs belonging to other users could also be scheduled concurrently on the node (in queues set up as singlejob, those jobs would also need to specify the -l naccesspolicy=shared option).
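As an illustration of note 1 (the numbers are hypothetical): if each of your jobs needs roughly a quarter of the memory on a 16-core node, requesting four cores per job means at most four of your jobs share the node, so each gets about a quarter of its memory:

```shell
#PBS -l nodes=1:ppn=4
#PBS -l naccesspolicy=singleuser
```

The job still runs a single serial process; the extra cores are requested only to reserve a larger share of the node's memory.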

Single Batch Job

Specify the maximum value for ppn in the qsub -l option and execute multiple serial processes within a single one-node job. This works best when all the processes take approximately the same amount of time to complete, because the batch job does not exit until all the processes finish. The basic template for the job script is:


    executable1 &
    executable2 &
    .
    .
    .
    executable16 &
    wait

The & at the end of each command backgrounds the process, allowing all 16 processes to start concurrently. The wait command at the end is important: it makes the shell wait until the background processes complete (otherwise the batch job would exit immediately). The commands can also be generated in a loop, depending on the syntax of the processes.

Note: When running multiple processes in a job, the total number of processes should generally not exceed the number of cores on the node. Also be aware of memory needs so that you do not run the node out of memory.
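If you have more tasks than cores, one simple way to honor that limit is to launch the tasks in batches and wait for each batch to drain before starting the next. This is a minimal sketch only: echo stands in for a real serial executable, and the task and core counts are illustrative.

```shell
#!/bin/bash
# Run 32 placeholder tasks, at most 16 at a time.
outdir=$(mktemp -d)
ncores=16
ntasks=32
for (( i=1; i<=ntasks; i++ ))
do
    echo "task $i" > "$outdir/output.$i" &   # stand-in for a real serial executable
    # Once a full batch is in flight, wait for it to drain
    # so no more than $ncores processes run at once.
    if (( i % ncores == 0 )); then
        wait
    fi
done
wait
```

The drawback is that each batch runs only as fast as its slowest task; if your tasks vary a lot in runtime, submitting separate batch jobs (as in the previous section) keeps the cores busier.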

The following example batch script runs 16 instances of Matlab concurrently:


    #!/bin/bash
    #PBS -l walltime=00:30:00
    #PBS -l nodes=1:ppn=16
    #PBS -N matlab_job
    #PBS -q secondary
    #PBS -j oe

    cd $PBS_O_WORKDIR
    
    module load matlab
    for (( i=1; i<=16; i++ ))
    do
        matlab -nodisplay -r num.9x2.$i > output.9x2.$i &
    done
    wait

If you are running in the secondary queue and want your processes to run on every available core, use the allprocs option in the PBS resource specification, described under the qsub section of the User Guide. (This option can also be used in your primary queue if your investor nodes have different numbers of cores.)

The following syntax will run an executable named a.out on all cores:


    #!/bin/bash
    #PBS -l nodes=1,flags=allprocs
    <other qsub options>
    #
    cd ${PBS_O_WORKDIR}

    # One line per allocated core in the node file
    numprocs=$(wc -l < $PBS_NODEFILE)
    for (( i=1; i<=$numprocs; i++ ))
    do
        ./a.out < input.$i > output.$i &
    done
    wait
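As an alternative to managing the backgrounding and wait yourself, xargs -P can keep a fixed number of processes running at once, starting a new one as soon as another finishes. This is a sketch only: echo stands in for a real serial executable, and the task and process counts are illustrative (in a real job you might derive them from $PBS_NODEFILE as above).

```shell
#!/bin/bash
# Run 8 placeholder tasks, at most 4 concurrently, via xargs -P.
# echo stands in for a real serial executable.
ntasks=8
maxprocs=4
ndone=$(seq 1 $ntasks | xargs -P $maxprocs -I{} sh -c 'echo "task {}"' | wc -l)
echo "$ndone tasks completed"
```

Unlike the fixed-batch approach, xargs keeps all slots busy even when task runtimes vary, which can shorten the overall walltime of the job.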