Quick Summary
The PBSPro service, which manages the scheduling and monitoring of jobs on HPC cluster nodes, is configured to kill jobs that use more resources than requested. Some useful tips:
- Do not have any blank lines before the first #PBS directive. See HPRC PBSPro script files for examples of PBS script files.
- Always specify the resources your job(s) will require. This will usually be done with two directives: #PBS -l walltime=... and #PBS -l select=1:ncpus=#:mem=...
- Defaults are in place - e.g., if you don't specify memory, your job will be given access to 8GiB of memory/RAM.
- The PBSPro service will kill jobs that use more resources than requested or assigned (default).
- If in doubt, ask HPC staff to look over your PBS script.
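The tips above can be sketched as a minimal job script. This is an illustration only, not one of the official HPC example scripts: the job name, resource values, and message are assumptions chosen for the example. Note that the #PBS directives follow the first line directly, with no blank lines in between.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -l walltime=1:00:00
#PBS -l select=1:ncpus=1:mem=2gb

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from.
# Fall back to the current directory so the script also runs outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}"

# PBSPro sets NCPUS inside a job; default to 1 when testing by hand.
msg="Job running on $(hostname) with ${NCPUS:-1} CPU core(s)"
echo "$msg"
```

Saved as, say, example.pbs (a hypothetical file name), the script would be submitted with `qsub example.pbs` and monitored with `qstat -u $USER`.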
Some jobs can be run on login nodes
HPC login nodes (40 CPU cores and 384GiB of RAM) are intended for tasks such as:
- Human modification of files/data,
- Submission and monitoring of PBS jobs,
- Running jobs that involve a GUI and therefore require human interaction,
- Transfer of files to/from desktops, laptops, or other HPC facilities,
- Using software (e.g., MATLAB or R) in a way that isn't consistently CPU intensive.
- Testing and development of software, including understanding how much resource future jobs are likely to consume.
HPC login nodes can be accessed by using SSH software (e.g., PuTTY) to connect to zodiac.hpc.jcu.edu.au
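For the testing and development use case listed above, one way to estimate what a future job will need is to run a short, representative test under GNU time on a login node. The sketch below assumes GNU time is installed at /usr/bin/time (common on Linux, but not confirmed for this cluster) and uses `sleep 1` as a stand-in for your real command.

```shell
#!/bin/bash
# Hypothetical workload: replace with a short, representative run of your software.
workload=(sleep 1)

if [ -x /usr/bin/time ]; then
    # GNU time -v writes a resource report to stderr, including the
    # elapsed wall clock time and the maximum resident set size (peak RAM).
    log="$(mktemp)"
    /usr/bin/time -v "${workload[@]}" 2> "$log"
    grep -E "Elapsed|Maximum resident" "$log"
    rm -f "$log"
else
    # Fallback: the bash built-in reports elapsed time only, not memory.
    time "${workload[@]}"
fi
```

The peak memory and elapsed time from a small test are only a starting point for the mem and walltime requests; larger inputs usually need more of both.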
Submitting jobs to compute nodes
HPC compute nodes are basically identical to login nodes, except that they are not publicly visible. Each compute node has 40 CPU cores, 384GiB of memory/RAM, and a small amount of SSD storage. Many people who are new to HPC wonder why the HPC system has to be so 'difficult', so here are some of the reasons:
- Accountability - Reports can be generated on usage.
- Sustainability - Correctly used, an HPC cluster has a much lower 'total cost of ownership' than DIY research.
- Reliability - HPC is a multi-user environment. A single user's resource over-consumption can impact many others.
- Performance - Dedicating resources to job(s) maximises performance. Over-consumption of resources could result in a hundred- or thousand-fold increase in the time it takes to complete a workflow/job.
- Consistency - JCU provides an entry-level HPC cluster that can help you move workflows onto much larger and stricter HPC environments such as NCI or Pawsey.
- Options - Environment modules software is used to allow easy access to several versions of software.
The best method for submitting jobs to HPC is by creating PBS script files. Several examples can be found at HPC PBSPro script files. The example scripts provided will need to be modified, both in terms of PBS directives and commands required to complete job/workflow execution. The following table explains, by examples, some of the PBS directive components you may be modifying frequently.
Job requirement | PBS Directive component |
---|---|
5 minutes of clock time | -l walltime=5:00 |
8 hours of clock time | -l walltime=8:00:00 |
20 days of clock time | -l walltime=20:00:00:00 |
1 CPU core & 500MB of memory | -l select=1:ncpus=1:mem=500mb |
1 CPU core & 16GB of memory | -l select=1:ncpus=1:mem=16gb |
2 CPU cores & 1GB of memory | -l select=1:ncpus=2:mem=1gb |
4 CPU cores & 24GB of memory | -l select=1:ncpus=4:mem=24gb |
20 CPU cores & 100GB of memory | -l select=1:ncpus=20:mem=100gb |
40 CPU cores & 380GB of memory | -l select=1:ncpus=40:mem=380gb |
The JCU HPC cluster isn't large enough for a single job to be assigned more than 1 node - every job submission must use select=1.
The current JCU HPC cluster consists of servers with 40 cores and 384GB of memory, so if your jobs will use 40 cores, you may as well request most of the memory resource.
The above table and all examples provided by HPC staff are just a small sample of the options available. Consult the PBSPro User Guide for a full listing of PBS directives and more.
See HPC PBSPro script files for examples of PBS script files.
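To show how the rows of the table above map onto the two resource directives, here is a small sketch of a helper that prints them for a single-node job. The function name `pbs_directives` is hypothetical and not part of PBSPro; it only formats text.

```shell
#!/bin/bash
# Hypothetical helper: print the two resource directives for a single-node job.
# Arguments: number of CPU cores, memory (e.g. 24gb), walltime (e.g. 8:00:00).
pbs_directives() {
    local ncpus="$1" mem="$2" walltime="$3"
    printf '#PBS -l walltime=%s\n' "$walltime"
    # select=1 is fixed because every job on this cluster uses a single node.
    printf '#PBS -l select=1:ncpus=%s:mem=%s\n' "$ncpus" "$mem"
}

# Example: 4 CPU cores, 24GB of memory, 8 hours of clock time.
pbs_directives 4 24gb 8:00:00
```

The two printed lines match the "8 hours of clock time" and "4 CPU cores & 24GB of memory" rows of the table and can be pasted at the top of a job script, directly after the first line.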
Essential Understanding
- Most software will only use 1 CPU core. For such software, requesting more CPUs will only result in your blocking HPC resources from other researchers.
- Jobs submitted to compute nodes will get dedicated access to the requested resource, provided mistakes haven't been made.
- The PBSPro system may delete jobs that use more resources than requested. The alternative has resulted in outages affecting several researchers.
- Researchers who consistently over-specify their resource requirements will be contacted by HPC staff and requested to change their habits.
- The JCU HPC cluster is designed for jobs requiring less than 40 CPU cores and 380GB of memory. If this isn't sufficient for your job(s), please contact HPC staff to discuss alternatives.
- Our PBSPro configuration will accept jobs with walltime requests up to 180 days. There is no guarantee that any given server will remain online for 180 days.
- HPC staff reserve the right to kill any jobs running on login nodes without notification (very rare).
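For software that can use more than one core, the first two points above imply that the thread count should match the ncpus request, so the job neither wastes nor exceeds its allocation. A minimal sketch follows, assuming an OpenMP-style program that reads OMP_NUM_THREADS; the program name and its flag in the final comment are hypothetical.

```shell
#!/bin/bash
#PBS -l walltime=2:00:00
#PBS -l select=1:ncpus=4:mem=8gb

# PBSPro sets NCPUS inside a job to reflect the ncpus request.
# Default to 1 so the script behaves sensibly when run outside PBS.
threads="${NCPUS:-1}"

# Many multi-threaded programs honour OMP_NUM_THREADS; setting it here
# stops them from starting one thread per physical core on the node.
export OMP_NUM_THREADS="$threads"
echo "Using $threads thread(s)"

# ./my_threaded_program --threads "$threads"   # hypothetical program and flag
```

If the software offers no way to limit its thread count, requesting ncpus=1 for it is usually the safer choice.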