While JCU HPC would like to guarantee researchers a service with 99.99% availability (see High_availability), the infrastructure we have can only deliver "best-effort" reliability. Historically, we have achieved at least 99% reliability in most years.
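To put the two figures in perspective, the short sketch below converts an availability percentage into the downtime it allows per year. The function name `max_downtime_hours` is introduced here for illustration only, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.99):
        hours = max_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Under these assumptions, 99% allows roughly 87.6 hours of downtime per year, while 99.99% allows under an hour (about 53 minutes).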
There are many single points of failure in the current (2021) HPC infrastructure design:
- There is no Ethernet switch redundancy (public or private networks).
- The only component redundancy present in servers is the SSDs that house the operating system.
- There is no redundancy built into NAS (network attached storage) services.
- High-density server configurations mean that network port redundancy isn't possible.
Hardware failures on servers or switches may lead to lengthy outages, although workarounds may be implemented to shorten them. Failures on storage infrastructure are much more problematic because their impact is universal, and local workarounds may not be possible. All of the above risks could be addressed with additional investment and staff time.
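As a rough illustration of why single points of failure matter, the sketch below compares a chain of non-redundant components with the same chain where each component is duplicated. The per-component availability of 0.999 is a made-up figure, not a measurement of JCU HPC hardware, and failures are assumed to be independent. The function names are hypothetical.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(a, copies=2):
    """Availability of `copies` independent components when any one suffices."""
    return 1 - (1 - a) ** copies


# Hypothetical availabilities for a switch, a server and a NAS (illustrative only).
parts = [0.999, 0.999, 0.999]

single = series_availability(parts)
paired = series_availability([redundant_availability(a) for a in parts])

if __name__ == "__main__":
    print(f"No redundancy:   {single:.6f}")
    print(f"Duplicated pair: {paired:.6f}")
```

The point of the comparison is that availabilities multiply along a chain, so every non-redundant component lowers the overall figure, whereas duplicating a component sharply reduces its contribution to downtime.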
Impact of QCIF Service Outages
JCU HPC provides a cache of nationally significant data housed at QCIF (Bne). Outages at QCIF may mean that only files already held in the cache will be accessible for the duration of the outage. JCU HPC staff aren't in a position to fix such problems or even to escalate their priority.