
Unplanned Outage - HPC Cluster Service Down



Who is Affected: Clients who access the HPC cluster nodes
Service Affected: High Performance Computing (HPC) Cluster Service
When: 10th June 2014 – 7:30am AEST
ETA: Unknown

Description: HPC cluster nodes and services are currently experiencing several problems. The Torque+Maui service is non-responsive, and NO compute nodes are able to talk to the Torque+Maui server. Most nodes are also unable to maintain NFS-mounted filesystem connections. At present, there are two main symptoms: "too many open files" errors on the Torque+Maui server when trying to close completed jobs, and NFS mounts failing on most compute nodes.

This issue should not affect researchers who only access HPC fileshares from their desktop/laptop computer.

What do I need to do? There are currently no workarounds available. Please continue to monitor the Central Computing Bulletins for further updates.


Ref: PRB0000138