WARNING - Using desktop/laptop computers for research

Researchers using their laptop/desktop computers for research should consider the following error, of a kind that has been seen on servers on numerous occasions.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: MC4 Error (node 6): L3 data cache ECC error.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: CPU:24 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c60c1d0001c010b

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: MC4_ADDR: 0x00000000000181c8

Message from syslogd@n101 at Jun 18 10:28:58 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN

Memory errors of this sort, if frequent, usually indicate that a memory DIMM is nearing failure.  HPC systems use ECC memory, hence the "Corrected error, no action required" message.  HPC servers will crash on multi-bit errors, which ECC can detect but not correct, so computational results are safe from such hidden errors.

Very few desktop/laptop computers use ECC memory, so on those machines this sort of error would go undetected.  The result would be corruption of data, potentially invalidating your conclusions/publications.  Depending on which bit of data was corrupted and how you validate computational output, the error might be obvious or might go completely unnoticed by you.
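To illustrate why the position of the corrupted bit matters, here is a minimal Python sketch that flips a single bit in a 64-bit floating point number.  The function name flip_bit is hypothetical and chosen only for this example; it simulates a bit flip in software and says nothing about how real hardware faults are distributed.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double representation flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit, then reinterpret the integer as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    original = 1.0
    for bit in (0, 30, 52, 63):
        print(f"bit {bit:2d}: {original!r} -> {flip_bit(original, bit)!r}")
```

Flipping the sign bit (63) turns 1.0 into -1.0 and flipping bit 52 turns it into 0.5, both of which are likely to be noticed.  Flipping bit 0 changes the value only in the 16th significant digit, which could easily pass every sanity check you apply to your output.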

This is the first time I've ever seen such a message for the L3 data cache.  However, I have seen single-bit memory errors on many occasions.  SGI (past vendor for JCU HPC servers) has written software to trap and log such events.  HPC staff have replaced several DIMMs over time and have seen crashes caused by multi-bit errors.

Storage

The situation is similar for storage.  The main thing to keep in mind is that disks fail.  Most personal computing devices have only a single disk, and many people don't have routine backups (e.g., daily) in place.  Technology Solutions staff have replaced many thousands of disks over time.  HPC storage has RAID protection, and in some cases up to 12 disks can fail without data loss, due to the level of redundancy built into our storage platforms.

I personally know of past and present researchers who've lost almost everything when a laptop/desktop computer disk failed (e.g., having to restart a thesis).  I've also seen corruption on USB disks - researchers have lost files that can never be recaptured (e.g., videos/photos).

IMPORTANT: HPC staff suggest that all researchers learn about "silent data corruption" - read https://en.wikipedia.org/wiki/Data_corruption (with particular attention to the Silent section).  If you place any value on your research data, you should be using a storage platform that protects against silent data corruption.
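If some of your data must live on a personal device for a while, one way to at least notice silent corruption is to record a checksum for every file and re-check it later.  The following is a minimal Python sketch of that idea, not a tool provided by HPC staff; the function names (sha256_of, build_manifest, changed_files) are hypothetical.  It can only detect that a file's contents changed, it cannot repair anything, and it cannot tell corruption apart from a deliberate edit, so it is no substitute for protected storage and backups.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """List files from `manifest` that are now missing or have different contents."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A typical use would be to call build_manifest on a data directory once the files are final, store the result somewhere safe (for example as JSON on a different device), and run changed_files against it periodically or before an analysis.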
