
Unplanned Outage - Researchers wanting to submit or check HPC jobs



Who is Affected: Any JCU Researchers wanting to submit or check HPC jobs
Service Affected: HPC job submission and monitoring
When: 13th August 2015 – 2:40pm AEST
ETA: TBA

Description: The HPC job management service failed at about 2:40pm. At this stage, the most likely cause appears to be too many user job submissions in too short a time frame: the system was unable to keep up and became unresponsive. The system has been reset, but the service will not restart. Based on a search of related forums, it is possible that the request that caused the outage is still being made. Past outages have shown that jobs already running are not affected.

What do I need to do? Researchers who are trying to submit new jobs should stop the submitting process(es) immediately. Please be patient and await further advice. If you urgently need to run jobs, use the login nodes.

Ref: PRB0000296