For all ECC skeptics

omniscence

[H]ard|Gawd
Joined
Jun 27, 2010
Messages
1,311
I just found the following entry in my kernel log. This is the first time I have seen such a report across several systems, and I live only 517 m above sea level. The error is in the IPMI log, too.

Code:
[133651.455223] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[133651.455229] {1}[Hardware Error]: APEI generic hardware error status
[133651.455230] {1}[Hardware Error]: severity: 2, corrected
[133651.455231] {1}[Hardware Error]: section: 0, severity: 2, corrected
[133651.455232] {1}[Hardware Error]: flags: 0x01
[133651.455233] {1}[Hardware Error]: primary
[133651.455234] {1}[Hardware Error]: fru_text: CorrectedErr
[133651.455235] {1}[Hardware Error]: section_type: memory error
[133651.455235] {1}[Hardware Error]: node: 0
[133651.455236] {1}[Hardware Error]: device: 0
[133651.455237] {1}[Hardware Error]: error_type: 2, single-bit ECC
 
I just found the following entry in my kernel log. This is the first time I have seen such a report across several systems.

This is actually why I am an ECC skeptic. The chance of a single-bit error on known-good RAM is so low that I do not worry about it much. At work I have had systems with ECC installed for 5+ years that never reported a single ECC correction. Also, 100 TB of monthly scrubs (on systems that do not have ECC) never finding a single corruption keeps me from worrying. With that said, I am going to purchase ECC for all new servers at work, most likely Supermicro Intel boards with IPMI.
 
I don't really see it as something I expect to happen, though; it's more of a safety net.
 
This is actually why I am an ECC skeptic. The chance of a single-bit error on known-good RAM is so low that I do not worry about it much. At work I have had systems with ECC installed for 5+ years that never reported a single ECC correction. Also, 100 TB of monthly scrubs (on systems that do not have ECC) never finding a single corruption keeps me from worrying.
This is odd. If you read Google's study on ECC RAM, they report lots of ECC errors. Amazon too. Something is weird and maybe wrong here.
 
I thought the Google study basically showed that systems with bad RAM reported lots of ECC errors, and that a certain percentage of new RAM will be bad.
 
We get single-bit errors reported in the logs here and there across many systems. Those are usually just random occurrences. It is when you get entire row or column errors that you should change out the RAM immediately. In any case, we only use ECC RAM in all servers.
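
As a rough illustration of that rule of thumb, the sketch below separates scattered single-bit errors from errors that pile up along one row or column of a device. Everything here is an assumption for the sake of the example: the `(device, row, column)` tuples, the function name `clustered_errors`, and the threshold of 3 are hypothetical, and whether your platform reports row and column at all depends on the firmware.

```python
from collections import defaultdict


def clustered_errors(events, threshold=3):
    """Find rows and columns hit at several distinct positions.

    events: iterable of (device, row, column) tuples, one per corrected error.
    threshold: hypothetical cutoff for how many distinct positions along one
               row or column count as a cluster.
    Returns (bad_rows, bad_cols) as sorted lists of (device, index) pairs.
    """
    cols_per_row = defaultdict(set)
    rows_per_col = defaultdict(set)
    for device, row, column in events:
        cols_per_row[(device, row)].add(column)
        rows_per_col[(device, column)].add(row)
    bad_rows = sorted(k for k, v in cols_per_row.items() if len(v) >= threshold)
    bad_cols = sorted(k for k, v in rows_per_col.items() if len(v) >= threshold)
    return bad_rows, bad_cols


if __name__ == "__main__":
    # Invented sample data: three errors along row 5 of device 0,
    # plus one isolated error on device 1.
    events = [(0, 5, 1), (0, 5, 2), (0, 5, 3), (1, 9, 9)]
    print(clustered_errors(events))
```

An isolated hit lands in neither list, which matches the "random occurrence" case; a row or column that shows up in the result is the pattern that would justify swapping the module.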
 