WARNING - Using desktop/laptop computers for research

Memory Errors

Researchers using laptop/desktop computers for research should consider the following errors, which have been seen on servers on numerous occasions.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: MC4 Error (node 6): L3 data cache ECC error.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: CPU:24 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c60c1d0001c010b

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: MC4_ADDR: 0x00000000000181c8

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN

Memory errors of this sort, if frequent, usually indicate that a memory DIMM is nearing failure.  HPC systems use ECC memory, hence the "Corrected error, no action required" status in the messages above.  HPC servers will crash on multi-bit errors rather than continue with corrupted data, so computational results are safe from such hidden errors.
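To make the "corrected single-bit error, crash on multi-bit error" behaviour concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) Hamming code over a 4-bit value. It is a toy illustration only: the function names are hypothetical, and real ECC hardware protects much wider words with codes implemented in the memory controller, not in software.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit SECDED word (index 0 = overall parity)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code

def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value); value is None when the word cannot be trusted."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single flip at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detectable, not fixable.
        return "uncorrectable", None
    value = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, value

word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1

print(decode(word))       # ('ok', 11)
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

The "uncorrectable" case is where a server halts instead of handing a wrong value back to the running job. Memory without ECC has no equivalent check, so both cases pass through silently.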

Very few desktop/laptop computers use ECC memory, so this sort of error would go undetected.  The result would be corruption of data, potentially invalidating your conclusions/publications.  Depending on which bit of data was corrupted and how you validate computational output, the error might be obvious or might go completely unnoticed by you.
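As a rough illustration of why the outcome depends on which bit is hit, the sketch below flips individual bits of a single double-precision number. The helper `flip_bit` is hypothetical and simply simulates in software what an uncorrected memory error would do to a stored value.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped

measurement = 1.0
print(flip_bit(measurement, 0))   # 1.0000000000000002 (lost in rounding noise)
print(flip_bit(measurement, 52))  # 0.5 (plausible but wrong)
print(flip_bit(measurement, 63))  # -1.0 (plausible but wrong)
print(flip_bit(measurement, 62))  # inf (obviously broken)
```

Only the last case is likely to be noticed without an independent check. The middle two look like ordinary values and would flow into downstream results.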

This is the first time I've ever seen such a message for L3 data cache.  However, I have seen single-bit memory errors on many occasions.  SGI (past vendor for JCU HPC servers) wrote software to trap and log such events.  HPC staff have replaced several DIMMs over time and have seen crashes caused by multi-bit errors.

Storage Errors/Failures

Prior to choosing a desktop/laptop computer for research, consider the fact that disks fail.  Most personal computing devices only have a single disk and many people don't have data protection measures in place. Technology Solutions staff have replaced many thousands of disks over time.  HPC storage has significant protection against data loss due to disk failures.

I personally know of past and present researchers who lost almost everything when a laptop/desktop computer disk failed.  I've also seen corruption on disks - researchers have lost files that can never be recaptured (e.g., videos/photos) due to errors on disks.

IMPORTANT:  HPC staff suggest that all researchers know about "silent data corruption" - read https://en.wikipedia.org/wiki/Data_corruption (with particular attention to the Silent section).  This is a far more insidious situation because data has changed without any physical error.  If you place any value on your research data, you should be using a storage platform that protects against silent data corruption for the primary copy of all important research data.
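One common way to at least detect silent corruption on storage that does not protect against it is to record a checksum for every file and re-verify later. The following is a minimal sketch under that assumption; the function names are hypothetical and this is not a tool provided by HPC staff.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(root, manifest):
    """List files from `manifest` that are now missing or hash differently."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A typical use would be to call `build_manifest` once the data is final, keep the result somewhere separate from the data, and run `changed_files` periodically. A checksum only reports that a file has changed, including deliberate edits. It cannot repair the file, so it complements rather than replaces storage that protects against silent corruption.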

Compliance

Australian government agencies are very much aware of people doing research that isn't reproducible.  Using personal computing devices for research is seen as one of the biggest threats to verification of research outcomes after a person leaves JCU.
