HPRC Cluster Job Management Explained

HPRC cluster resources are managed by three tools (collectively referred to as PBS):

  • Resource (cluster) manager,
  • Node manager,
  • Job scheduler.

The job scheduler decides which job runs where and when. Once a job has started, the node manager and the resource (cluster) manager communicate on a regular basis. These communications carry information about the job's progress and resource consumption, which the job scheduler uses when placing newly submitted jobs. When you submit a job, request information about jobs, or perform similar operations, you are talking to the resource (cluster) manager. Some important points to note about cluster management software:

  1. Software configurations determine placement of jobs.
  2. The software really doesn't come into its own until the cluster is fully consumed, with jobs waiting in the queue. Until then, it behaves much like a FIFO (first in, first out) queue.
  3. The more accurate and complete your submission information, the more efficient HPRC systems will be (as a whole).
  4. Underestimating resource requirements (CPU and memory) can lead to node crashes (killing all jobs running on that node). Underestimating wall time requirements leads to inefficient job scheduling.
  5. You should always overestimate resource requirements; however, aim to minimize the amount of resource wasted on each job.
  6. HPRC staff have the ability to override the system, although this is usually a last resort.
  7. Cluster efficiency often requires patience from individual users.
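
Points 4 and 5 above suggest padding a measured requirement by a modest margin rather than guessing. A minimal sketch of that arithmetic for wall time follows. The function name pad_walltime and the default 20% margin are illustrative assumptions, not HPRC policy; pick a margin that suits how variable your jobs are.

```shell
# pad_walltime SECONDS [PERCENT]
# Adds a safety margin (default 20%, an assumed value) to a measured run time
# and prints the result as HH:MM:SS, the usual walltime format.
pad_walltime() {
    secs=$1
    pct=${2:-20}
    # Round the margin up so short runs still receive some padding.
    padded=$(( secs + (secs * pct + 99) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( padded / 3600 )) $(( padded % 3600 / 60 )) $(( padded % 60 ))
}

# A test run that took 90 minutes (5400 s), padded by the default margin.
pad_walltime 5400   # prints 01:48:00
```

The same approach applies to memory: measure a representative run, then request slightly more than it used.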

Hints for Researchers

  1. The qsub command is what you need to submit jobs for execution on compute or big memory nodes. It is also advisable to write a PBS script containing options for the resource management system and the command(s) to be run.
  2. The job number (returned from the qsub command) is very important: it identifies the job in later commands such as qdel and appears in the default output file names.
  3. The qstat command gives a snapshot view of the status of your jobs. For example,
    qstat -1n -u jc999999
    will return the status of jobs for userid jc999999.
  4. The qdel command can be used to delete jobs. For example,
    qdel 12345
    will attempt to delete job number 12345. There are situations where users will not be able to delete jobs. In such cases, please contact an HPRC staff member (e.g., Email
  5. By default, messages to standard output or standard error are placed in files. Options exist to merge the streams into a single file and to choose the filename(s) for what would normally be screen output. An example of such a file (default naming scheme) might be myjob.o12345, if you named your job myjob and the job number returned from qsub was 12345.
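
The hints above can be tied together in a short sketch. The script name myjob.pbs, the program ./my_program, its input file, and the two-hour wall time are placeholders. The helper default_output_file is hypothetical and assumes qsub returns an ID of the form 12345.servername (or a bare number). CPU and memory requests are omitted because their syntax depends on the PBS variant in use.

```shell
# Write a small PBS script; nothing is submitted by this sketch.
cat > myjob.pbs <<'EOF'
#!/bin/bash
#PBS -N myjob
#PBS -l walltime=02:00:00
#PBS -j oe

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
./my_program input.dat
EOF

# default_output_file JOBNAME JOBID
# Prints the default output file name, using the numeric part of the job ID.
default_output_file() {
    printf '%s.o%s\n' "$1" "${2%%.*}"
}

# On a login node the workflow would be roughly:
#   jobid=$(qsub myjob.pbs)      # keep the returned job number
#   qstat -1n -u "$USER"         # snapshot of your jobs
#   qdel "$jobid"                # only if the job must be removed
default_output_file myjob 12345.pbsserver   # prints myjob.o12345
```

Here -N sets the job name and -j oe merges standard error into the standard output file, so a single myjob.o12345 file would hold everything that would otherwise appear on screen.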

Links to Similar Content

HPRC Cluster Explained (v1.0)
HPRC Storage Explained (v0.1)
HPRC Services Availability (v1.0)
HPRC Performance Considerations (v0.1)
HPRC Acronyms (v0.1)
