As a general rule, Technology Solutions (including HPC) provides support for research services on a best-effort basis.
Contacting TS-DIS Staff
If you need to contact TS-DIS staff, it is best to use Service-Now.
It is important that you provide as much detail as possible. The more detail you provide, the more likely a timely resolution becomes.
TS staff follow an agile way of working. If a job is unlikely to be completed within a few hours, it will be moved to a system for plannable work. Plannable work is prioritised against other planned work, so be aware that your priorities may not match TS priorities.
TS-DIS (HPC) Staff Hours
There are only two (2) staff members supporting HPC cluster operations. There will be days when neither staff member is working.
Support for HPC is only provided during approved staff working hours. Work done outside these times needs to be approved by a TS leadership team member.
There is very little redundancy/resiliency built into JCU research infrastructure managed by HPC staff. Hardware failures usually equate to service outages, prematurely terminated jobs, and/or reduced computational capacity. Hardware failures occur more often than most researchers realise: on average, failures occur on a weekly basis. Sometimes one failure triggers other failures, which can mean that we cannot provide operational support in a timely fashion.
TS-DIS (HPC) Supported Services
The current JCU HPC (CPU) cluster contains hardware purchased between 2016 & 2019. TS-DIS staff are responsible for maintaining current HPC cluster infrastructure and related services/systems until at least Q2-2023.
TS-DIS staff manage a few VMware ESXi clusters, which are currently used to deliver over 50 essentially independent virtual machines. The vast majority of these VMs provide bespoke, non-HPC services in support of JCU research. Note that capacity challenges exist in our virtualisation infrastructure.
HPC staff also install software/environments, available for general use, on the HPC cluster and related systems.
What TS-DIS (HPC) staff are NOT obliged to do
- Maintain and/or fix software written by others.
- Extensively test software installed on the HPC cluster.
- Provide support for any issues/problems associated with software and/or environments not installed by HPC staff. You are fully responsible for maintaining all software you create/install. If updates to HPC systems break your software, it is up to you to modify/update your software to work with the current HPC system configuration. Updates/patches are generally applied for security reasons, so rollback is not permitted.
Assistance may be provided for the above (or similar) situations on a best-effort basis. Timeliness of support will be subject to an evaluation of the benefit to JCU. Be aware that there may be problems that we don't have the skills to fix.