The Australian government and its research funding bodies do not consider the use of personal computing devices (e.g., desktop/laptop computers) to be responsible research. Using personal computing devices for research could result in funding or a postgraduate degree not being awarded. For career researchers, it could also lead to your research being labelled as untrusted.
During my time at JCU, I have heard several stories of researchers (PhD students mostly) having to "start from scratch" after a disk/laptop has failed. HPC storage has significant levels of protection against hardware failures.
The following list contains some key advantages of using HPC resources:
- You can run many jobs in parallel, increasing your research productivity.
- Some software has been written to run on many CPU cores. All HPC compute nodes have 40 CPU cores, so such software finishes sooner, again increasing your research productivity.
- Each HPC compute node (as of 1-Jan-2020) has 384GiB of memory installed, an amount you won't see in any personal computing device. So you can perform research on the HPC infrastructure that isn't possible on a personal computing device.
- HPC infrastructure uses ECC memory. Single-bit errors are reported and corrected, and multiple-bit errors result in server shutdown, so your results won't be corrupted by events invisible to you. Unless you are using workstation-class hardware, you cannot guarantee that results of computational research done on personal computing devices can be trusted.
JCU CPU Cluster
As of 1-Jan-2020, the CPU cluster has:
- 2 login nodes; each with 40 CPU cores (Intel Xeon 6248) and 384GiB of memory.
- 17 compute nodes; each with 40 CPU cores (Intel Xeon 6148/6248) and 384GiB of memory.
Login nodes may be used for interactive workflows (e.g., where GUIs are required), testing and/or development purposes, and for short (<4 hours), single-core jobs (no more than 4 at a time).
This is not an easy question to answer as there are many potential reasons, for example:
- HPC clusters are configured to give dedicated resources to jobs, so a job waits until the resources it requested are free. This results in the greatest organisation-level efficiency.
- Your job's request cannot be fulfilled by any node in the cluster (physical limitation or cluster/queue configuration).
- You have already hit per-user limits configured for the cluster and/or a queue.
- You have requested or been assigned to a specific node which doesn't have sufficient free resources to run your job yet.
- JCU runs FIFO (first in, first out) queues. Recent history has shown that some part of our cluster is idle most (~70%) of the time. FIFO queues were considered the "minimum viable product" at the time the current PBSPro job management system was put in place. Fairshare algorithms only provide benefits when a cluster is at 100% consumption for long periods of time - once a job starts running, it will remain running until completion (at JCU).
PS. Most HPC facilities impose a relatively short maximum walltime limit on jobs (e.g., 7 days). Modifying workflows (e.g., writing checkpoint/resume capabilities) is the responsibility of each researcher. JCU has a much more lenient approach to maximum walltime requests. JCU will most likely introduce much shorter maximum walltimes in time, possibly at the same time as a fairshare algorithm is introduced (to achieve maximum value from fairshare).
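The checkpoint/resume idea mentioned above can be illustrated with a minimal Python sketch. It is not a JCU-provided tool. The function names, the JSON checkpoint format and the `stop_after` argument (which only simulates a job being killed) are all assumptions made for illustration. The "work" is a placeholder sum that you would replace with your own computation.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the checkpoint.

    os.replace is atomic on POSIX filesystems, so a job killed mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from the checkpoint at `path`.

    `stop_after` (hypothetical, for demonstration only) simulates the job
    being killed after that many steps in this run.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # simulated kill: work since the last checkpoint is lost
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Submitting the same script again after a walltime limit is hit would then continue from the last saved step rather than from zero. How often to checkpoint is a trade-off between the IO cost of each save and the amount of work you are prepared to repeat.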
The HPC cluster is built upon the Linux operating system, not Windows.
However, there may be a version of the software you want to use available for Linux (features/functionality may differ).
Software such as Mono or Wine is not installed due to the risk of ransomware in an environment that cannot be backed up (due to size/cost).
Researchers using computational research software that only works under Windows can request a virtual machine (see Virtual Machines section below).
Note: HPC staff have commenced work on a high-end graphical environment (remote desktop) for researchers. Windows/Linux VMs with NVIDIA Quadro graphics/GPU capabilities may be provisioned on this environment in future.
No. HPC does not have any resources that could be used to run computational research software written only for OSX.
However, there may be a Linux or Windows version of the software you wish to use (features/functionality may differ).
JCU GPU Cluster
Yes. There is one (1) remaining server with two NVIDIA V100 (Volta) GPU cards installed - this server will hit end of life in Apr-2021 and is unlikely to be replaced.
A virtual GPU resource provisioning capability is in proof-of-concept phase. At this point in time (Mar-2021), the outcomes from testing have been very positive.
In 2019, JCU purchased three (3) years of access to 10% of the GPU capacity at Queensland Brain Institute (QBI). Please contact eResearch centre staff to gain access to the QBI GPU cluster.
JCU Virtual Machines
JCU HPC run three servers that provide virtual machine provisioning capabilities to support JCU research activities. As of 1-Jan-2020, the HPC ESXi cluster consists of:
- 2 servers; each with 28 CPU cores and 512GiB of memory.
- 1 server with 24 CPU cores and 192GiB of memory. It is also capable of providing virtual Quadro graphics card capabilities for up to 10 virtual machines.
Virtual machine platforms may be provisioned to perform tasks or provide services that cannot be provided on the HPC CPU cluster.
Update (Mar-2021): The servers mentioned above are due to be replaced in Q2-2021.
By default, VMs are created with 1 vCPU and 4GB of vRAM. VM requests up to 4 vCPUs and 32GB of memory may be provisioned (approval required). Initially configured resources may be reduced if there is little evidence of use.
Note that virtual CPU and memory resources are shared - do not expect physical server levels of performance.
JCU Technology Solutions provide/support two platforms for use by researchers. Development of all service-providing platforms will be based on one of the following operating systems:
- Microsoft Windows Server (excluding end-of-life versions)
- RedHat Enterprise Linux (excluding end-of-life versions)
VMs get decommissioned for multiple reasons, some of which are:
- Following notification from owner that it is no longer required.
- When (possibly before) the operating system installed reaches end-of-life.
- Non-compliance with ICT Acceptable Usage Policy.
- Following a system or service compromise.
- After service owner has left JCU.
In the event of an operating system reaching end-of-life, a service that is still relevant/important can be migrated to a new VM. Alternatively, a distribution upgrade of the operating system may be requested and performed.
As of 1-Jan-2020, there are two storage platforms for JCU researcher consumption. They provide multiple filesystems:
- /home - 512TiB of space for researchers' home directories.
- /scratch - 80TiB of "scratch" space (similar to /tmp in terms of usage).
- /sw - 200GiB of space to house software installed by HPRC staff, available for all researchers.
- /gpfs01 - 516TiB of cache space for "nationally significant" ARDC collections. The primary copy of all ARDC collections is held by QCIF in South-East Queensland.
One of the most important things for you to note is that there is insufficient space for all researchers to actually consume their default quota.
Quota enforcement is in place on all research filesystems:
- /home - 5TiB per researcher, 250,000 inodes per researcher.
- /scratch - 5TiB per researcher, 1,000,000 inodes per researcher. Not suitable for long-term data housing.
- /gpfs01 - ARDC/QCIF have a merit-based approval process for research projects that request storage. JCU's cache quota per collection will be lower than the quota set at QCIF.
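To get a rough idea of how close a directory tree is to the space and inode limits listed above, something like the following Python sketch could be used. It is only an approximation of what the filesystem's own quota accounting reports: it sums apparent file sizes rather than allocated blocks, and it counts hard-linked files once per name. The function names are hypothetical, and the default limits are simply the /home figures quoted above.

```python
import os
import stat

TIB = 1024 ** 4  # bytes in one TiB


def usage(root):
    """Return (bytes_used, inode_count) for everything under `root`.

    Every file, directory and symlink counts as one inode; the root
    directory itself is included in the count.
    """
    total_bytes = 0
    inodes = 1  # the root directory itself
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            inodes += 1
            if not stat.S_ISDIR(st.st_mode):
                total_bytes += st.st_size
    return total_bytes, inodes


def report(root, byte_quota=5 * TIB, inode_quota=250_000):
    """Return usage under `root` as a percentage of each quota."""
    used_bytes, used_inodes = usage(root)
    return {
        "bytes_pct": 100 * used_bytes / byte_quota,
        "inodes_pct": 100 * used_inodes / inode_quota,
    }
```

For example, `report(os.path.expanduser("~"))` would walk your home directory. Note that the walk itself generates a lot of metadata IO on a large tree, so it is best run sparingly.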
Individual jobs/workflows can also use SSD space available on each server. Each CPU node should have a 300TiB /tmp filesystem for such requirements. Scheduled processes regularly clean up (delete) old files in this filesystem (on all nodes).
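One way a job could make use of node-local /tmp is to do its intermediate IO in a private temporary directory and copy only the final outputs back to shared storage. The sketch below assumes Python 3.8 or later. `run_in_node_local_tmp`, `work` and `results_dir` are hypothetical names, not part of any JCU tooling.

```python
import os
import shutil
import tempfile


def run_in_node_local_tmp(work, results_dir, tmp_root="/tmp"):
    """Run work(scratch_dir) in a private directory under `tmp_root`,
    then copy everything it produced into `results_dir`.

    The scratch directory is removed when the block exits, so nothing is
    left behind for the scheduled clean-up to find.
    """
    os.makedirs(results_dir, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=tmp_root) as scratch:
        work(scratch)
        for name in os.listdir(scratch):
            src = os.path.join(scratch, name)
            dst = os.path.join(results_dir, name)
            if os.path.isdir(src):
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)
    return results_dir
```

Because old files in /tmp are deleted by scheduled processes, anything you need to keep should be copied out before the job ends, which is what the final loop does.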
Perception of filesystem performance decreases with increasing inode count.
JCU replicate all research data to an offsite location. Our current total inode count is the primary reason that a replication process (all data) will take well over a month, even if there has been no/little change to data.
The default quotas can be thought of as "what a researcher gets with an HPC account". Until processes mentioned below are introduced, reasonable quota increases will be actioned. However, such quota increases may be removed at a later date (at short notice).
Mechanisms are likely to be introduced that allow research groups to pay for increased quota.
Disk (space) quota increases without extra payment will likely be subject to an approval process.
Firstly, "slow" is subjective - what you consider slow others may consider fast. Note that use of personal computers for research computing (or data retention) isn't regarded as "responsible research" and I'm not aware of any personal computing device that can scale up to 2PB, so comparisons with speed of an SSD on your personal computer are meaningless.
There are numerous factors affecting the speed of uploads/downloads:
- I am not aware of any shared network that provides a performance guarantee. Think about it this way - on your home NBN/ADSL link, do you always get the speed you pay for? I definitely don't.
- Storage is often the biggest bottleneck to perceived network performance. Once again, all HPC storage is shared across users and the jobs they are running (there could be hundreds of parallel IO operations at any given time).
- HPC houses about 2PiB (2038TiB) of research data, almost wholly on 7200RPM NL-SAS disks. Performance at that scale comes at an extremely high price - way above HPC budget.
- JCU HPC filesystem (home directories) performance caps out at about 12Gb/s (theoretical). Filesystem performance decreases with increasing inode (file) count. Also, the smaller the file the worse the maximum speed you'll see.
Ultimately, it really comes down to cost. If you are unhappy with the performance provided by HPC, there are other options - most of which you will have to find budget and/or time to set up and use.
This is a very difficult question to answer, some of the factors involved are:
- Size of file. Small files will never see high transfer speeds (even when there is no network involved).
- Number of concurrent IO requests active on the system/environment or number of concurrent users and jobs using the infrastructure.
- Time of day (related to the previous point). The biggest factor in performance seems to be time of day - highest speeds are achieved outside normal working hours.
- Anti-virus software, encryption level, security devices, etc. In a world where more people are having their accounts compromised, think very carefully before bypassing or not using such devices/methods.
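Since small files never see high transfer speeds, one common mitigation is to bundle a directory of small files into a single archive before moving it. Below is a minimal sketch using Python's standard `tarfile` module. The function name `bundle` is hypothetical, and gzip compression is just one choice (for data that is already compressed, plain `"w"` mode avoids the CPU cost).

```python
import os
import tarfile


def bundle(src_dir, archive_path):
    """Pack `src_dir` into one gzip-compressed tar archive.

    Returns the number of members in the archive (directories count as
    members too). `archive_path` should live outside `src_dir`.
    """
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=top)  # recurses into sub-directories
    with tarfile.open(archive_path, "r:gz") as tar:
        return len(tar.getnames())
```

Transferring the one archive replaces many per-file operations with a single large sequential one. It also reduces inode consumption if the archive is what you keep on HPC storage.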
For uploads of old research data to AWS, I see average speeds ranging between 10MB/s and 240MB/s. At a given instant though, I have seen transfer rates that are less than 1MB/s. Generally speaking, I don't like seeing <10MB/s, happy with 30MB/s, but love seeing more.
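To put those rates in perspective, the arithmetic for a sustained transfer is easy to sketch. The helper below is hypothetical and assumes "MB/s" means 10^6 bytes per second and that the rate is held constant, which, as noted above, it rarely is.

```python
def transfer_hours(size_tib, rate_mb_per_s):
    """Hours needed to move `size_tib` TiB at a sustained `rate_mb_per_s` MB/s.

    Assumes 1 TiB = 1024**4 bytes and 1 MB = 10**6 bytes.
    """
    size_bytes = size_tib * 1024 ** 4
    rate_bytes_per_s = rate_mb_per_s * 10 ** 6
    return size_bytes / rate_bytes_per_s / 3600
```

Under these assumptions, 1TiB takes roughly 30.5 hours at 10MB/s, about 10.2 hours at 30MB/s, and about 1.3 hours at 240MB/s. At that rate a full 5TiB quota would take more than six days to move at 10MB/s.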
Note: There are ways/tools to improve network performance, however many come with a high price tag or risk.
There are numerous options available to JCU researchers, three of which are:
- QCIF/QRIScloud - There is a short-term merit allocation scheme in place. Pay-for-service options exist.
- NCI - There is a merit allocation scheme in place. Pay-for-service options may exist.
- Public cloud (e.g., AWS or Azure) - This is a pay-for-service option in most cases.
If you are collaborating with researchers from other institutions, you may consider requesting access to their resources.