
What does it cost to run HPC?

  • In 2011, approximately $600K was spent on HPC cluster and storage infrastructure.
  • The estimated yearly cost to run the HPC cluster (power and cooling) is approximately $294K.

  • The estimated yearly cost to provide HPC storage (power and cooling) is approximately $49K.
  • At present, JCU pays for 1 staff member to manage HPC services, although other ITR staff are also required to perform work to assist HPC.  As a guesstimate, this cost might be $160K/year (including on-costs).

A researcher using 1 CPU core (no more than 2.5GB of memory) for an entire year receives up to $570 of indirect HPC funding from JCU.

A researcher consuming an average of 1TB of disk space for an entire year receives up to $1242 of indirect HPC funding from JCU.  The cost of tape capacity consumption is significantly more difficult to calculate but a cost of $150/TB/year might be close.
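
As a rough worked example using the indicative per-unit figures above (and treating the $150/TB/year tape figure as the guesstimate it is), the indirect HPC funding attributable to a hypothetical project could be estimated along these lines. The workload numbers are made up purely for illustration:

    # Rough estimate of indirect HPC funding for a hypothetical project,
    # using the indicative per-unit figures quoted above.
    CORE_YEAR_COST = 570      # $ per CPU core kept busy for a full year (<= 2.5GB memory)
    DISK_TB_YEAR_COST = 1242  # $ per TB of disk held for a full year
    TAPE_TB_YEAR_COST = 150   # $ per TB of tape per year (guesstimate quoted above)

    def indirect_funding(core_years, disk_tb_years, tape_tb_years=0):
        """Approximate indirect HPC funding ($) for a year of usage."""
        return (core_years * CORE_YEAR_COST
                + disk_tb_years * DISK_TB_YEAR_COST
                + tape_tb_years * TAPE_TB_YEAR_COST)

    # Hypothetical project: 4 cores busy all year, 2TB on disk, 5TB on tape.
    print(indirect_funding(4, 2, 5))  # -> 5514, i.e. roughly $5.5K/year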

Note:  Costs associated with generators, cooling infrastructure, data-centre provisioning, and networking hardware and software are not known or explicitly included.  A guesstimated multiplier of power costs was used in the above cost calculations to implicitly accommodate these things.  Costs associated with backup of research data are not included at all.

Why should I consider using HPRC services?

  1. The HPRC cluster has hundreds of CPU cores.
  2. All nodes have at least 64GB of memory installed.
  3. You can run many jobs at the same time with minimal impact on your desktop/laptop.
  4. Virtual machines can be provisioned (subject to approval) for special task/service requirements.
  5. HPRC servers use ECC memory. A memory error (corruption of results) on your desktop may go undetected.
  6. HPRC infrastructure is well protected against power surges or loss.
  7. HPRC has 228TB of raw disk space (purchased by JCU).  RDSI storage should be available from Sep-2013.
  8. HPRC storage has a significant amount of resiliency built-in (individual disk failures will not result in loss of data).
  9. Selected HPRC filesystems are backed up daily. If disaster does strike, recovery of files is possible (but not guaranteed).

What resources are available in the HPRC cluster?

The exact composition of the HPRC cluster at any point in time depends on many factors, including hardware failures and reassignment of hardware to other tasks.  In Aug-2013, the cluster comprised:

    Qty   Description    CPU cores   Memory   Local Scratch Disk             Network capacity
    30    Compute node   2x12        64GB     1621GB (/tmp)                  40Gb/s (IB)
    2     Memory node    4x12        256GB    1621GB (/tmp), 221GB (/fast)   40Gb/s (IB)
    2     Login node     2x12        64GB     1621GB (/tmp)                  40Gb/s (IB)
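
Purely as an illustration of reading the table above, a quick tally of the Aug-2013 inventory might look like the following sketch (assuming "2x12" means two 12-core CPUs, i.e. 24 cores per node, and "4x12" means 48 cores per node):

    # Tally of the Aug-2013 node inventory listed above.
    # Assumption: "2x12" means 2 CPUs x 12 cores = 24 cores per node,
    # and "4x12" means 4 CPUs x 12 cores = 48 cores per node.
    nodes = [
        # (quantity, cores per node, memory per node in GB, role)
        (30, 2 * 12,  64, "compute"),
        ( 2, 4 * 12, 256, "memory"),
        ( 2, 2 * 12,  64, "login"),
    ]

    total_cores  = sum(qty * cores for qty, cores, _mem, _role in nodes)
    total_memory = sum(qty * mem   for qty, _cores, mem, _role in nodes)

    print(f"{total_cores} CPU cores, {total_memory}GB of memory in total")
    # -> 864 CPU cores, 2560GB of memory in total
    # (login nodes are counted here, but jobs normally run on compute
    #  and memory nodes only)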

How many jobs may I submit (at one time) on the HPRC cluster?

  • There is no upper limit to the number of jobs you may submit, but try to keep the count under 1000.
  • If you are submitting more than 100 jobs at a time, please include a pause of at least 5 seconds between submissions (a small submission sketch follows this list).
  • HPRC staff have configured the job management software with a fair-share policy that has some bias toward high-end users. The bias is required to meet requirements imposed on us by external funding bodies. The software does have limitations which can be exploited - action may be taken against users knowingly exploiting these limitations.
  • In extreme cases, HPRC staff may kill jobs if one user is (solely) preventing other users from accessing the HPRC cluster. This will only happen if that user's jobs are likely to take several days (or longer) to complete.
  • Users running lots of single-processor jobs should consult HPRC staff/documentation for options that have less impact on our systems and other users.
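
For bulk submissions, something along the lines of the following sketch keeps the pause between jobs. It assumes you already have a directory of PBS job scripts; the job_scripts/*.pbs layout is made up for illustration only:

    # Submit a batch of PBS job scripts, pausing 5 seconds between
    # submissions as requested above. The job_scripts/*.pbs layout is
    # purely illustrative - adjust paths and names to suit your own jobs.
    import glob
    import subprocess
    import time

    scripts = sorted(glob.glob("job_scripts/*.pbs"))

    for i, script in enumerate(scripts, start=1):
        subprocess.run(["qsub", script], check=True)
        print(f"submitted {i}/{len(scripts)}: {script}")
        time.sleep(5)  # keep at least 5 seconds between submissions

If you find yourself doing this for hundreds of near-identical single-processor jobs, please talk to HPRC staff first; there are usually options (e.g., job arrays) with less impact on the systems and other users.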

HPRC doesn't have the software I require, how do I go about getting it installed?

  • If the software is available for Linux, then just contact HPRC staff.
  • If the software is only available for Windows, just contact HPRC staff.
  • If the software is only available for OSX, the answer is a definite no.

HPRC staff ask users not to install software in their home directory.
Please also read answers to the next FAQ if you are talking about commercially licensed software (including Microsoft Windows).

Will HPRC pay for the software I require?

The following answer relates only to commercially licensed software.
HPRC will not pay for commercially licensed software, with the following possible exceptions:

  1. HPRC may pay for the software if there is general interest (i.e., use across multiple faculties).
  2. ITR allocate significant funds each year to the ongoing maintenance of MATLAB and selected toolboxes.
  3. ITR have CAUDIT (and other) agreements with some vendors that may be used to provide commercial software for your use.

Please include HPRC staff in any software purchase decisions, if the software is to be run on HPRC systems.

May I backup my desktop/laptop to HPRC storage?

Having at least 2 copies of all research data is something that HPRC staff recommend.
All JCU researchers are entitled to hold one (1) copy of all their research data on HPRC storage.
It is your responsibility to maintain the integrity of this copy.  Holding multiple daily backups of your research data is not permitted.
IMPORTANT: You cannot legally use HPRC storage for storing protected content (e.g., commercial movies or music).

I've lost some files. Could you restore them from backups?

The answer to this question must be split up as follows:

  1. HPRC fileshares - No, ITR backups are for DR purposes only (e.g., fire in the datacentre). The primary reasons for this are cost and the sheer volume of data and number of files. The latter makes genuine file-level backup impossible - next time you move 1TB of data around, multiply the time taken by 300 and you'll come close to understanding the issue. It is for this reason that HPRC chose to implement an HSM solution in 2000 and has stuck with it ever since.
  2. Virtual machines - Yes, assuming you have requested a backup and that the amount to be backed up is less than 100GB.
  3. Physical servers - Yes, assuming you have requested a backup and that the amount to be backed up is less than 100GB.

ITR or HPRC staff will probably need to be involved in the restoration of files that have been backed up (and are restorable). A service request (https://jcueduau.service-now.com/) should be created for this task.

One user is using most of the cluster, will you delete their jobs?

Generally, the answer to this one is no, for the following reasons:

  1. External funding for HPRC is allocated based on merit (historical CPU usage). Until we hit 100% utilization for at least 1 year, the current biased policies will be maintained.
  2. HPRC systems are not purchased to sit idle waiting for jobs. Ideally, we aim to have all CPUs running jobs 24x7. In this ideal world (for HPC), all users will have to wait for jobs to be executed.
  3. HPRC systems are configured for overall efficiency, not efficiency for the individual user. Individual users should seek to optimize their code if they want better efficiency.
  4. At this time (Oct-2012), the number of jobs an individual user can run is based on usage patterns. When we have only a few users running jobs, each user may run hundreds of jobs at a time. If there are (regularly) many users actively using the HPC cluster compute nodes, then the limit will be reduced. HPC staff realize that this is a very reactive policy - due mostly to the first item in this list.
  5. The PBS software has been configured with a "fair share" policy. Once a job is running, nothing can be done. However, jobs waiting to be run are not dispatched on a FIFO (first in, first out) basis, so your job will not necessarily have to wait for all previously submitted jobs to complete. On a cluster with idle nodes, fair-share ordering makes no practical difference, since every job starts promptly. The following snapshot of part of an overloaded system demonstrates the non-FIFO nature of our system (a toy illustration of fair-share ordering also appears after this list).

    545120.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545121.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545122.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545123.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545124.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545125.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545126.admin.def     jc230643 normal   dummytest           --      1   1    --  2160: Q   --     -- 
    545128.admin.def     ctbccr   normal   06.4215.summary.    914     1   1    --  2160: R 16:02   n028/19
    545129.admin.def     ctbccr   normal   06.4223.summary.  24264     1   1    --  2160: R 14:04   n020/0
    545130.admin.def     ctbccr   normal   06.4226.summary.  24289     1   1    --  2160: R 14:04   n020/1
    545131.admin.def     ctbccr   normal   06.4227.summary.  28295     1   1    --  2160: R 13:36   n027/8
    545132.admin.def     ctbccr   normal   06.4229.summary.  11554     1   1    --  2160: R 13:02   n006/22
    545136.admin.def     ctbccr   normal   06.4271.summary.  29879     1   1    --  2160: R 12:13   n027/7
    545137.admin.def     ctbccr   normal   06.4275.summary.  13574     1   1    --  2160: R 12:13   n029/1
    545138.admin.def     ctbccr   normal   06.4310.summary.  13709     1   1    --  2160: R 12:10   n029/0
    545139.admin.def     ctbccr   normal   06.4372.summary.  32750     1   1    --  2160: R 12:00   n027/4
    
  6. Finally, the job name that users choose may mean absolutely nothing. Just because a job is called "test" doesn't mean it isn't running some analysis for someone's research.
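
The following toy sketch is not the actual PBS algorithm or HPRC's real configuration; it simply illustrates how a fair-share policy can reorder queued jobs so that a user with heavy recent usage does not monopolise the cluster just by submitting first. The usernames and usage figures are invented:

    # Toy illustration of fair-share ordering - NOT the real PBS algorithm
    # or HPRC's actual configuration. Queued jobs are ordered by how much
    # recent usage their owner has accumulated: the more a user has
    # consumed lately, the further back their queued jobs are pushed.
    recent_core_hours = {"heavy_user": 5000, "light_user": 50}  # made-up usage history

    queued_jobs = [
        # (submission order, owner, job name)
        (1, "heavy_user", "dummytest_a"),
        (2, "heavy_user", "dummytest_b"),
        (3, "light_user", "analysis_1"),
    ]

    # Lower recent usage => higher priority; submission order only breaks ties.
    dispatch_order = sorted(queued_jobs,
                            key=lambda job: (recent_core_hours[job[1]], job[0]))

    for rank, (submitted, owner, name) in enumerate(dispatch_order, start=1):
        print(f"{rank}. {name} ({owner}, submitted #{submitted})")
    # -> analysis_1 is dispatched first even though it was submitted last.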