For all ECC skeptics

omniscence

[H]ard|Gawd
Joined
Jun 27, 2010
Messages
1,311
I just found the following entry in my kernel log. This is the first time I have seen such a report across several systems. And I live at only 517 m above sea level. The error is in the IPMI log, too.

Code:
[133651.455223] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[133651.455229] {1}[Hardware Error]: APEI generic hardware error status
[133651.455230] {1}[Hardware Error]: severity: 2, corrected
[133651.455231] {1}[Hardware Error]: section: 0, severity: 2, corrected
[133651.455232] {1}[Hardware Error]: flags: 0x01
[133651.455233] {1}[Hardware Error]: primary
[133651.455234] {1}[Hardware Error]: fru_text: CorrectedErr
[133651.455235] {1}[Hardware Error]: section_type: memory error
[133651.455235] {1}[Hardware Error]: node: 0
[133651.455236] {1}[Hardware Error]: device: 0
[133651.455237] {1}[Hardware Error]: error_type: 2, single-bit ECC
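For anyone who wants to check their own logs for the same thing, here is a rough Python sketch that groups `[Hardware Error]` lines like the ones above into records and pulls out the `key: value` fields. The function names and the `SAMPLE_LOG` constant are made up for illustration, and it assumes the log lines look exactly like the excerpt in this post; other kernels or firmware may format them differently.

```python
import re
from collections import Counter

# Matches lines such as:
# [133651.455237] {1}[Hardware Error]: error_type: 2, single-bit ECC
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s*\{\d+\}\[Hardware Error\]:\s*(?P<body>.*)$"
)

# A few lines copied from the log above, used only as sample input.
SAMPLE_LOG = """\
[133651.455223] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[133651.455230] {1}[Hardware Error]: severity: 2, corrected
[133651.455235] {1}[Hardware Error]: section_type: memory error
[133651.455237] {1}[Hardware Error]: error_type: 2, single-bit ECC
"""


def summarize_hardware_errors(log_text):
    """Group [Hardware Error] lines into records, one dict per reported error."""
    records = []
    current = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a hardware error line
        body = match.group("body").strip()
        if body.startswith("Hardware error from APEI"):
            # This line opens a new error record.
            current = {"timestamp": float(match.group("ts")), "fields": {}}
            records.append(current)
            continue
        if current is None or ":" not in body:
            continue  # stray line, or a bare word such as "primary"
        key, _, value = body.partition(":")
        current["fields"][key.strip()] = value.strip()
    return records


def count_error_types(records):
    """Count records by their error_type field."""
    return Counter(r["fields"].get("error_type", "unknown") for r in records)


records = summarize_hardware_errors(SAMPLE_LOG)
print(count_error_types(records))
```

To run it against a real system, pass the saved text of the kernel log to `summarize_hardware_errors` instead of `SAMPLE_LOG`.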
 

drescherjm

[H]F Junkie
Joined
Nov 19, 2008
Messages
14,889
I just found the following entry in my kernel log. This is the first time I have seen such a report across several systems.

This is actually why I am an ECC skeptic. The chance of a single-bit error on known-good RAM is so low that I do not worry about it that much. At work I have had systems with ECC installed for 5+ years that have never reported a single ECC correction. Also, 100 TB of monthly scrubs (on systems that do not have ECC) never finding a single corruption keeps me from worrying. With that said, I am going to purchase ECC for all new servers at work, most likely Supermicro Intel boards with IPMI.
 

vegaman

n00b
Joined
Sep 17, 2013
Messages
41
I don't really see it as something I expect to happen though, more of a safety net.
 

brutalizer

[H]ard|Gawd
Joined
Oct 23, 2010
Messages
1,598
This is actually why I am an ECC skeptic. The chance of a single-bit error on known-good RAM is so low that I do not worry about it that much. At work I have had systems with ECC installed for 5+ years that have never reported a single ECC correction. Also, 100 TB of monthly scrubs (on systems that do not have ECC) never finding a single corruption keeps me from worrying.
This is odd. If you read the Google study on ECC RAM, they report lots of ECC errors. Amazon does too. Something is weird, and maybe wrong, here.
 

drescherjm

[H]F Junkie
Joined
Nov 19, 2008
Messages
14,889
I thought the Google study basically showed that systems with bad RAM reported lots of ECC errors, and that a certain percentage of new RAM will be bad.
 

mwroobel

Supreme [H]ardness
Joined
Jul 24, 2008
Messages
5,011
We get single-bit errors reported in the logs here and there across many systems. Those are usually just random occurrences; it is when you get entire row or column errors that you should change out the RAM immediately. In any case, we use only ECC RAM in all servers.
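To make the distinction concrete, here is a minimal Python sketch of that rule of thumb. It assumes you already have the (row, column) address of each corrected error on one module, which the log at the top of this thread does not show. The function name and the threshold of 3 are hypothetical, picked only for illustration.

```python
from collections import Counter


def repeated_rows_and_columns(errors, threshold=3):
    """Return (rows, columns) that appear in at least `threshold` errors.

    `errors` is an iterable of (row, column) tuples for corrected errors
    on a single module. The threshold is an arbitrary illustrative value.
    """
    errors = list(errors)
    row_counts = Counter(row for row, _ in errors)
    col_counts = Counter(col for _, col in errors)
    bad_rows = sorted(r for r, n in row_counts.items() if n >= threshold)
    bad_cols = sorted(c for c, n in col_counts.items() if n >= threshold)
    return bad_rows, bad_cols


# Scattered single-bit errors: no row or column repeats.
scattered = [(5, 1), (812, 44), (90, 7)]
# Several errors sharing row 5: the pattern that suggests a failing module.
clustered = [(5, 1), (5, 9), (5, 20), (7, 3)]

print(repeated_rows_and_columns(scattered))
print(repeated_rows_and_columns(clustered))
```

Isolated hits that never share a row or column come back empty, while a row or column that keeps showing up gets flagged.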
 