The management of HPRC cluster resources is done using three tools (in combination, referred to as PBS):
- Resource (cluster) manager,
- Node manager,
- Job scheduler.
The job scheduler decides which job should run where and when. Once a job has started, the node manager and the resource (cluster) manager communicate on a regular basis. These communications carry information about the job's progress and resource consumption, which the job scheduler uses when handling new job submissions. When you submit a job, request information about jobs, or do something similar, you are talking to the resource (cluster) manager. Some important points to note about cluster management software:
- Software configurations determine placement of jobs.
- The software really doesn't come into its own until the cluster is fully consumed and jobs are waiting in the queue. Until then, it behaves much like a FIFO (first in, first out) queue.
- The more accurate and complete your submission information, the more efficient HPRC systems will be (as a whole).
- Underestimating resource requirements (CPU and memory) can lead to node crashes (killing all jobs running on that node). Underestimating wall time requirements leads to inefficient job scheduling.
- You should always overestimate resource requirements; however, you should aim to minimize the amount of resource wasted on each job.
- HPRC staff have the ability to override the system, although this is usually a last resort.
- Cluster efficiency often requires patience from individual users.
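
The estimation advice above (overestimate, but waste as little as possible) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the `pad` function, the 20% headroom, the measured values, and the CPU count are all hypothetical, and the `select=` resource syntax assumes a PBS Professional-style installation, which may differ from what HPRC systems accept.

```shell
#!/bin/bash
# Hypothetical helper: pad a measured integer value by a percentage, rounding up.
# The 20% default is an illustrative assumption, not an HPRC recommendation.
pad() {
    local measured=$1 percent=${2:-20}
    echo $(( (measured * (100 + percent) + 99) / 100 ))
}

# Hypothetical figures observed in an earlier test run of the same workload.
measured_mem_gb=10
measured_hours=5

mem_gb=$(pad "$measured_mem_gb")
hours=$(pad "$measured_hours")

# Print directive lines that could be pasted into a PBS script.
# ncpus=4 is a placeholder; the select= form assumes PBS Professional syntax.
printf '#PBS -l select=1:ncpus=4:mem=%sgb\n' "$mem_gb"
printf '#PBS -l walltime=%02d:00:00\n' "$hours"
```

Rounding up keeps the request at or above the measured usage, while a small fixed percentage limits how much is reserved and left idle.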
Hints for Researchers
- The `qsub` command is what you'll need to submit jobs for execution on compute or big memory nodes. It is also advisable to write a PBS script containing options for the resource management system and the command(s) to be run.
- The job number (returned from the `qsub` command) is very important.
- The `qstat` command can be used to get a snapshot view of the status of your jobs. For example, it can report the status of jobs for a given userid.
- The `qdel` command can be used to delete jobs. For example, `qdel 12345` will attempt to delete job number 12345. There are situations where users will not be able to delete jobs. In such cases, please contact an HPRC staff member (e.g., by email).
- By default, messages to standard output or standard error are placed in files. Options exist to merge the streams into a single file and to choose filename(s) for what would normally be screen output. An example of such a file (default naming scheme) might be `myjob.o12345`, if you named your job `myjob` and the job number returned from