The HPC cluster is built upon a Red Hat Enterprise Linux operating system.  The following list covers some of the more frequent problems JCU researchers have faced:

  • Linux is case sensitive, Windows isn't.
  • Most Linux software is command-line only (no GUI).  I dare say that most Windows software is GUI based.
  • Windows software will often work with formatted (e.g., RTF) files.  Most Linux software will not understand such files.  When creating files on Windows for use on Linux, I recommend saving them as "plain text".
  • Windows uses carriage return and new-line characters, similar to typewriters, for what most people think of as a "new line".  In Linux, a new-line (\n) is all that is required.  Not all software written for Linux will be able to ignore the carriage return (\r) - leading to errors or unexpected results.  There is a command (dos2unix) that can convert files created on a Windows platform for use on Linux.
  • Using spaces, punctuation, non-printable characters, etc. in filenames can be problematic.  I recommend limiting file/directory names to only use alpha-numeric characters and/or underscores.
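The last two points can be sketched in Python.  This is a minimal illustration, not a replacement for dos2unix: the function names crlf_to_lf and safe_name are hypothetical, and keeping the dot (so file extensions survive) is an assumption that goes slightly beyond the alpha-numeric/underscore recommendation above.

```python
import re


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (\\r\\n) with Linux line endings (\\n)."""
    return data.replace(b"\r\n", b"\n")


def safe_name(name: str) -> str:
    """Replace anything that is not a letter, digit, underscore or dot with an underscore.

    Keeping the dot is an assumption made here so that extensions like .txt survive.
    """
    return re.sub(r"[^A-Za-z0-9_.]", "_", name)


print(crlf_to_lf(b"line one\r\nline two\r\n"))
print(safe_name("my results (final).txt"))
```

Working on bytes rather than text means the conversion does not depend on knowing the file's character encoding.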

The HPC cluster is a multi-user environment where not all users choose, or are able, to use the same version of software.  Hence, software needs to be installed in such a way that any given user can choose which version of software they want/need to use.

  • To allow people to access a chosen version of software without altering the software itself, environment modules software has been installed - see Environment modules cheat sheet
  • For many researchers, the only environment module you should load is anaconda3.  Note that the documented "conda init" process that many people use may be problematic if the HPC directory structure changes (e.g., after new/replacement storage is purchased).
  • Many users install software into their home directory.  The location of your home directory will change every time the underlying storage platform is changed.  HPC staff are not required to provide assistance with problems experienced with software you install yourself - this is considered to go against JCU's sustainability core value.  Instead, submit a Service-Now request to have HPC staff install software on the system rather than in your home directory.

Note:  Several pieces of software are very complex/large.  Some examples are: R, perl, python, conda, and matlab.  Environment modules do not exist for sub-components of a software package; the sub-components should be visible, though, once the correct environment module is loaded.
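One rough way to check what is "visible" after loading a module is to ask whether a command can be found on your PATH.  The sketch below uses Python's standard shutil.which; the helper name visible_commands and the list of commands checked are only illustrative, and the actual command names provided by a given module on the JCU cluster may differ.

```python
import shutil


def visible_commands(names):
    """Map each command name to its full path on PATH, or None if it cannot be found."""
    return {name: shutil.which(name) for name in names}


# Illustrative command names only; substitute the ones you expect your module to provide.
for name, path in visible_commands(["python3", "R", "perl"]).items():
    print(f"{name}: {path or 'not found (is the right module loaded?)'}")
```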

  • The most common mistake in PBS scripts is blank lines.  As a general rule, avoid blank lines before or between lines starting with #PBS (PBS directives).
  • Requesting fewer resources than your job(s) use may result in job(s) being killed.  In the worst case scenario, the consequences of such requests could lead to other jobs being killed - e.g., servers crash when all RAM and swap are fully consumed.
  • Avoid requesting significantly more resources than your job(s) use.
    • Many new HPC users think that requesting 10 CPU cores will mean the job will complete 10 times faster (rarely true).  Unless the software version installed has been written to use multiple cores and you have used the correct syntax to run the job on multiple cores, your job will only use 1 CPU core.
    • Many large HPC facilities charge for resources requested - if you request 10 CPU cores, you would be charged for using 10 even if your job/workflow only used 1.
  • One mistake HPC staff have seen many times is researchers "jumping in at the deep end" with PBSPro.  HPC login nodes exist for testing and development.  If you have written your first (or a new) PBSPro script, you can test the execution components on login nodes by replacing qsub ... myjob.pbs with bash myjob.pbs (say) - assuming the name of your script file is "myjob.pbs".
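The blank-line rule in the first point above can be checked before submitting.  The following is a minimal sketch, not a PBSPro tool: the function name blank_lines_in_pbs_header is hypothetical, and the job name and resource values in the example script are placeholders rather than recommended settings.

```python
def blank_lines_in_pbs_header(script_text: str) -> list[int]:
    """Return 1-based numbers of blank lines found before the last #PBS directive."""
    lines = script_text.splitlines()
    directive_rows = [i for i, line in enumerate(lines) if line.startswith("#PBS")]
    if not directive_rows:
        return []  # no directives at all, so nothing to flag
    last_directive = directive_rows[-1]
    return [i + 1 for i in range(last_directive) if lines[i].strip() == ""]


# Placeholder script: the blank third line separates two directives.
example = """#!/bin/bash
#PBS -N myjob

#PBS -l select=1:ncpus=1:mem=2gb
echo "running"
"""
print(blank_lines_in_pbs_header(example))
```

An empty list means no blank lines were found before or between the directives; blank lines after the last directive are not reported.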

The most common storage problem affects QCIF/ARDC allocation owners (directories & files found under /gpfs01).  Requests to recall data from QCIF back to the JCU medici cache (/gpfs01) can fail or time out for many reasons (some examples: tape drive congestion, recall request congestion, link issues anywhere between QCIF infrastructure and JCU infrastructure).  The most common symptom that you may see is I/O errors - due to an "incomplete" file, associated with timeouts in the kernel processes controlling I/O.  When these occur, the following steps may be useful:

  • The issue may be temporary, so try again after a wait (30 minutes, say).
  • Alert JCU HPC staff of the issue you are experiencing.
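The "wait and try again" step can be scripted.  The sketch below is one possible approach under stated assumptions: read_with_retry is a hypothetical helper, the 30-minute wait and 3 attempts are arbitrary defaults, and it assumes the failed recall surfaces in Python as an OSError with errno EIO (an "Input/output error"), which may not cover every failure mode.

```python
import errno
import time


def read_with_retry(path, attempts=3, wait_seconds=30 * 60, opener=open, sleep=time.sleep):
    """Read a file, waiting and retrying when an I/O error (EIO) occurs."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as handle:
                return handle.read()
        except OSError as error:
            # Only retry genuine I/O errors; a missing file or bad permissions will not fix itself.
            # After the final attempt the error is re-raised: time to alert JCU HPC staff.
            if error.errno != errno.EIO or attempt == attempts:
                raise
            print(f"attempt {attempt} hit an I/O error; waiting {wait_seconds} seconds")
            sleep(wait_seconds)
```

Calling read_with_retry on a file in your allocation returns the file's bytes on success.  The opener and sleep parameters exist only so the behaviour can be exercised without real waits.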

The following information should be understood and remembered by all QCIF/ARDC allocation users:

  • JCU research storage houses a small fraction (<=20%) of all "nationally significant" data.
  • JCU has only purchased 1 server to provide the caching service known as "medici".  Hardware failures on a server generally take O(weeks) to resolve - vendor verification/resolution occurs on their time scale.
  • JCU HPC staff can only address problems with the local cache-hosting system.  Most medici service problems are due to issues at the QCIF(Bne) home node or networking issues between the JCU HPC and QCIF(Bne) home node infrastructure.  QCIF staff will determine the timeframe for resolving issues identified - they may have higher priority tasks (than your issue) to attend to.

Technology Solutions advertise weekly maintenance windows (Wednesday evenings & Sunday mornings) when minor works can be done.  Some of these works may interrupt HPC services for a short time, even if the changes aren't HPC specific.  Where outages are expected, they will be posted as a JCU central computing bulletin (on the JCU website).  Many HPC services are delivered by corporate infrastructure.  In some cases, workarounds exist when certain devices are undergoing upgrades/changes that result in outages:

  • If isn't working for you, you can try accessing the systems that lie behind the service - or 
  • If isn't working for you, you can try accessing the only system that currently lies behind the service -
  • For most other service outages, you will need to be patient.  If a service isn't available after the advertised outage times, log a Service-Now request or contact IT HelpDesk.

Despite what many HPC users believe, at least 80% of HPRC staff time is devoted to non-user facing items - e.g., security/risk.  Generally speaking, Service-Now requests and incidents will be known (picked up) within 1 business day.  Actions addressing confirmed incidents will always take priority over those addressing requests.  Project work may be prioritized higher than requests, although this is more of an exception than a rule.  When it comes to requests be aware of the following:

  • Unless a request is likely to take less than an hour, its priority will be assessed against all other existing work (corporate and research).
  • Requests for delivery of services will generally trigger one or more meetings to discuss requirements.  Use of cloud/external options will be given preference, if available.
  • Requests that require budget/approval can be and have been rejected by the Technology Solutions (TS) leadership team.  Discussions around alternatives may be initiated.
  • Requests that aren't in line with TS security/risk guidelines will be rejected.  Discussions around alternatives may be initiated.