Impact of Network Outages
- JCU network outages won't affect PBS jobs (generally).
- HPC IP network outages won't affect PBS jobs (generally).
- HPC IB network outages will affect all jobs using fileshares.
Impact of Entry Point Outages
- Service outages on login nodes don't affect running PBS jobs.
Note that many HPC centres don't offer an interactive node. Researchers should avoid, if possible, submitting interactive jobs to the cluster.
Impact of PBS Service Outages
Service outages on the PBS server don't affect running jobs (generally). They do affect your ability to monitor existing jobs and/or submit new jobs. The possible exception to the first statement is when a significant PBS software upgrade is undertaken.
Impact of Storage Service Outages
- Service outages (longer than a few minutes) to one of the HPC storage arrays will likely be fatal for all jobs. Depending on the nature of the failure, such events could lead to data loss.
- Service outages related to HSM services simply mean that file recalls will take longer than normal. Most jobs won't be affected by an outage of HSM services.
- HPRC NAS services have been configured for high availability, e.g., two NAS servers exist in an active-standby mode. This doesn't rule out service outages; it simply means that most service outages should be short (minutes at most). HPRC cluster nodes have been configured so that short HA failover events won't impact running jobs.
There are several single points of failure in HPRC infrastructure. Turnaround time for hardware failures is almost always greater than 1 business day (mostly due to shipment of replacement parts). Note that HPC staff do not work 24/7 and our contract with the vendor does not provide 24/7 cover. Occasionally, we are required to work with staff in other countries, which introduces timezone issues that can prolong outages.
While HPC would like to guarantee JCU researchers a 99.99% available service (see http://en.wikipedia.org/wiki/High_availability), realistically, 95% availability is probably closer to what you should expect from the infrastructure design we have.
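To put the two availability figures above in perspective, the following minimal sketch converts an availability percentage into expected downtime per year. It assumes a 365-day year and that availability is measured across all hours, not just business hours; the function name `expected_downtime_hours` is hypothetical and used only for illustration.

```python
# Assumption: a non-leap year of 365 days, with availability measured 24/7.
HOURS_PER_YEAR = 365 * 24


def expected_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.99, 95.0):
        hours = expected_downtime_hours(pct)
        print(f"{pct}% availability: about {hours:.2f} hours "
              f"({hours / 24:.2f} days) of downtime per year")
```

Under these assumptions, 99.99% availability allows a little under one hour of downtime per year, while 95% allows roughly 18 days. Long-running jobs should be planned with the latter figure in mind.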