Restored - Researchers wanting to submit or check HPC jobs



Who is Affected: Any JCU Researchers wanting to submit or check HPC jobs
Service Affected: HPC job submission and monitoring
When: 13th August 2015 – 2:40pm AEST
ETA: 13th August 2015 – 7:15pm AEST

Description: The HPC job management service failed at about 2:40pm. The service could not be restarted, and the root cause of that was eventually traced to corruption of a critical configuration file. The corruption may have been caused by the hard reset occurring at a time of extreme I/O activity (inferred from the "unresponsive system" symptom). No running or queued jobs appear to have been impacted, and the issue has been resolved.

What do I need to do? Use the HPC system as you did prior to the problem. If you will be submitting hundreds or thousands of jobs in quick succession, try to leave 5-10 seconds between submissions. Please contact the IT Help Desk if you are still experiencing issues as a result of this outage and require further assistance.
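
For anyone scripting bulk submissions, the spacing advice above could be applied along the following lines. This is only a minimal sketch, not an official JCU tool: the helper names (submit_throttled, qsub_submit) are hypothetical, and the use of a "qsub" command is an assumption, since this notice does not name the scheduler's submission command. Substitute whatever command you normally use.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def submit_throttled(
    job_scripts: Iterable[str],
    submit: Callable[[str], str],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit each job script in turn, pausing between submissions.

    The pause is applied before every submission except the first, so
    N jobs incur N - 1 pauses. The default of 5 seconds is the lower
    end of the 5-10 second spacing suggested in the notice.
    """
    results = []
    for index, script in enumerate(job_scripts):
        if index > 0:
            sleep(delay_seconds)
        results.append(submit(script))
    return results


def qsub_submit(script: str) -> str:
    """Hypothetical submit function.

    Assumption: the scheduler provides a `qsub` command that prints a
    job identifier on success. Replace with your usual command if not.
    """
    completed = subprocess.run(
        ["qsub", script], check=True, capture_output=True, text=True
    )
    return completed.stdout.strip()
```

A call such as submit_throttled(list_of_scripts, qsub_submit, delay_seconds=10) would then space submissions 10 seconds apart. Passing the submit function in as a parameter keeps the pacing logic independent of any particular scheduler.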

Ref: PRB0000296