HPC would like to guarantee JCU researchers a 99.99% available service (see High_availability), but the infrastructure we have can only offer "best-effort" reliability. Historically, we have achieved at least 99% reliability in most years.
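To put these percentages in perspective, the short sketch below converts an availability figure into the amount of downtime it permits per year. It assumes a 365-day year and counts every hour equally (no allowance for scheduled maintenance); the function name `downtime_hours_per_year` is made up for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, all hours count


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


# Compare the 99.99% target with the 99% figure achieved historically.
for target in (99.99, 99.0):
    hours = downtime_hours_per_year(target)
    print(f"{target}% available: up to {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.99% leaves less than one hour of downtime per year, while 99% leaves roughly three and a half days.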
There are many single points of failure in the current (2021) HPC infrastructure design.
Hardware failures on servers or switches may lead to lengthy outages, although workarounds can sometimes be implemented to shorten them. Failures in the storage infrastructure are much more problematic because they affect all users, and local workarounds may not be possible. All of these risks could be addressed with additional investment and staff time.
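As a rough illustration of why single points of failure matter: when a service depends on several components that must all be working, and their failures are independent, the overall availability is the product of the individual availabilities. The sketch below shows that calculation. The component names and figures are hypothetical and chosen only for illustration; they are not measurements of JCU HPC infrastructure.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of one another.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only; not JCU HPC measurements.
chain = {
    "login node": 0.995,
    "network switch": 0.999,
    "storage": 0.995,
}

combined = series_availability(chain.values())
print(f"combined availability: {combined:.4%}")
```

The combined figure is always lower than that of the weakest component, which is why removing single points of failure (for example, through redundant hardware) is the usual route to higher availability.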
Impact of QCIF Service Outages
JCU HPC provides a cache of nationally significant data housed at QCIF (Brisbane). During an outage at QCIF, only files already held in the cache may be accessible. JCU HPC staff are not in a position to fix such problems, or even to escalate their priority.
Realistically, given the infrastructure design we have, 95% availability is probably closer to what you should expect than the 99.99% target (see http://en.wikipedia.org/wiki/High_availability).