Update - HPC Cluster Service Down



Who is Affected: Researchers who run jobs on HPC cluster nodes
Service Affected: High Performance Computing (HPC) Cluster Service
When: 10th June 2014 – 4:10pm AEST
ETA: Services have been partially restored as at 3:30pm AEST on 10th June 2014

Description: HPC cluster services are partially available.  A new job management system has been configured so that researchers can run and monitor new jobs.  Many old jobs are still running but cannot easily be monitored.  Some existing jobs failed when compute nodes crashed over the long weekend.  HPC staff will make every effort to see existing jobs run to completion.  Because a number of researchers have under-requested resources, more compute node failures can be expected: 8 nodes are still under significant memory pressure.

What do I need to do? Submit jobs bearing in mind that the _new_ cluster is about 15% of the size of the original cluster.  Nodes will be added as soon as they become available.  Please continue to monitor the Central Computing Bulletins for further updates.


Ref: PRB0000138