Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

The Importance Of ECC RAM

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with its horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."
https://www.phoronix.com/scan.php?page=news_item&px=Linus-Torvalds-ECC


Original comments by Torvalds are here: https://www.realworldtech.com/forum/?threadid=198497&curpostid=198647
 
The only argument I can think of against ECC is the higher latency, which can potentially lower performance in some circumstances.

Well, that and the higher cost.

And it's not enough that there is higher latency at the given speed with ECC. You also can't buy ECC at higher speeds in most cases. My current ram is DDR4-3600 CL16. Fastest compatible RAM I can find is DDR4-2933 CL21....

Truth is, most applications just aren't critical enough to warrant the cost and performance impact of ECC.

As far as Intel goes, I'm no fan of Intel, but I feel like the industry as a whole, not just Intel, has been very poor at documenting pro features, like which platforms support ECC, or VT-d/IOMMU, etc. Intel, AMD and all the motherboard partners included. Sometimes if you want these features it's been guesswork: buy it, test it and find out. Oh, and sometimes doing something silly like updating the BIOS can make it stop working.

I THINK my Threadripper supports ECC, but honestly, I am not sure. And if my Threadripper supports ECC, does my motherboard? Does the current version of my BIOS?
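
To put rough numbers on the latency point above, here is a minimal sketch that converts a kit's data rate and CAS latency into nanoseconds. It only looks at CAS latency and ignores every other timing, so treat it as a back-of-the-envelope comparison rather than a benchmark. The function name is made up for illustration.

```python
def cas_latency_ns(data_rate_mt_s: float, cas_cycles: int) -> float:
    """Approximate CAS latency in nanoseconds.

    DDR memory transfers twice per clock, so one clock period is
    2000 / data_rate nanoseconds when the data rate is in MT/s.
    """
    return cas_cycles * 2000.0 / data_rate_mt_s


if __name__ == "__main__":
    non_ecc = cas_latency_ns(3600, 16)  # the DDR4-3600 CL16 kit from the post
    ecc = cas_latency_ns(2933, 21)      # the DDR4-2933 CL21 ECC kit from the post
    print(f"DDR4-3600 CL16: {non_ecc:.2f} ns")
    print(f"DDR4-2933 CL21: {ecc:.2f} ns")
    print(f"CAS latency ratio: {ecc / non_ecc:.2f}x")
```

How much of that difference shows up in real workloads depends heavily on the application and on cache behaviour.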
 
The only time I have found ECC guaranteed to work is on Xeon and EPYC platforms; outside of those it's a crapshoot, and even then I have had AMD BIOS updates break that compatibility.

Then you have AMD pulling shit like this:
"while Threadripper supports ECC RAM and you can put ECC RAM in a Threadripper motherboard, most of the current crop of Ryzen motherboards quietly ignores the ECC functionality altogether. The board will boot, and it'll use the RAM, but it will not actually utilize the ECC bits." *1

So while a board may say it offers ECC support, you need to do a deep dive on the motherboard manuals and support forums to find out whether it does in fact actually utilize ECC.

There are no good guys here. Intel and AMD both look for ways to cut corners, and skimping on ECC support was probably the one they all felt most comfortable getting away with.


1. https://arstechnica.com/gadgets/201...-guide-winter-2019-the-one-about-the-servers/
 
I'm glad to see that I'm not the only one that thinks Intel's ridiculous market segmentation is total bullshit.
Good for Linus for stating all of this - it needs to be said, and Intel's toxic practices and anti-competitive and anti-innovative practices need to end.
 
The only time I have found ECC guaranteed to work is on Xeon and EPYC platforms; outside of those it's a crapshoot, and even then I have had AMD BIOS updates break that compatibility.

Then you have AMD pulling shit like this:
"while Threadripper supports ECC RAM and you can put ECC RAM in a Threadripper motherboard, most of the current crop of Ryzen motherboards quietly ignores the ECC functionality altogether. The board will boot, and it'll use the RAM, but it will not actually utilize the ECC bits." *1

So while a board may say it offers ECC support, you need to do a deep dive on the motherboard manuals and support forums to find out whether it does in fact actually utilize ECC.

There are no good guys here. Intel and AMD both look for ways to cut corners, and skimping on ECC support was probably the one they all felt most comfortable getting away with.


1. https://arstechnica.com/gadgets/201...-guide-winter-2019-the-one-about-the-servers/

Yep. That part is REALLY annoying.

My Asus ROG Zenith II Extreme Alpha claims to support unbuffered ECC, but I have no idea if it is real ECC support, or if it is just "will run ECC RAM like non-ECC RAM" support, like you mention.

And I think it is pretty optimistic of you to expect to find that information in the manual.

Last time I checked, there wasn't even a good way to test if ECC was actually working, once installed....
 
The 'best' way to test ECC is to run with your ram at intentionally too tight of timings, or too low of voltage, so that it will have errors. And confirm that reporting is working. I saw in the mailing list thread there is a fault injection tool for AMD processors, that could work, too; if so, it would be a lot less work.

But of course, this is after you spent more money on ECC than regular ram, and if it doesn't work, are you going to return the board and rebuild your system? Or just the ram and get faster and cheaper ram?
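
For the "confirm that reporting is working" part, here is a rough sketch of how one might read the corrected/uncorrected error counters on Linux. It assumes the kernel's EDAC driver for your platform is loaded and exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. The function name is made up for illustration, and an empty result only means no EDAC controller was found, not that ECC is definitely off.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    Assumption: each memory controller shows up as a directory mc0, mc1, ...
    containing plain-text ce_count and ue_count files.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read or parse
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (ECC reporting may not be active).")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

If you push the RAM into instability as described above, the corrected count should start climbing on a board where ECC reporting really works. Counters that stay at zero on a stable system do not prove anything either way.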
 
Yep. That part is REALLY annoying.

My Asus ROG Zenith II Extreme Alpha claims to support unbuffered ECC, but I have no idea if it is real ECC support, or if it is just "will run ECC RAM like non-ECC RAM" support, like you mention.

And I think it is pretty optimistic of you to expect to find that information in the manual.

Last time I checked, there wasn't even a good way to test if ECC was actually working, once installed....
The only way I have really found to "test" it is to: 1) check the BIOS for an ECC flag (no flag, no ECC bit support); 2) boot up a copy of Memtest86+ and check the ECC flag there; and lastly 3) use a live boot of Ubuntu and run the eec_check.c test there. But that is a lot of steps, and by the time you can do them you are already out of pocket. It is a most annoying process, and honestly I've just come to the conclusion that if I have something I need ECC support for, I will justify the added cost of ordering from Dell and ensuring that it is part of the initial configuration. Sadly that means no Threadrippers, only Xeons and EPYC.
 
"while Threadripper supports ECC RAM and you can put ECC RAM in a Threadripper motherboard, most of the current crop of Ryzen motherboards quietly ignores the ECC functionality altogether. The board will boot, and it'll use the RAM, but it will not actually utilize the ECC bits." *1

So while a board may say it offers ECC support, you need to do a deep dive on the motherboard manuals and support forums to find out whether it does in fact actually utilize ECC.
Many boards (Asus at least) now say on the functionality page either that they support ECC RAM in a limited way ("Supports ecc ram in non-ecc mode" or something similar) or that they support ECC completely. I believe Linux has functionality to inject errors, which should be caught and logged if it is working.
 
Yeah, an Ubuntu live USB key with eec_check.c compiled and ready to go is in my tool kit for when I am doing my dry test of servers before deploying them. But sadly the flags for "Supports ecc ram in non-ecc mode" and whatnot are at the board maker's discretion, and they're not always up to date on the site either, so you have to deep dive the documentation. You should do that beforehand anyway when making a purchase, so really it's just one more step to add to the server/workstation building process.
 
I wish board reviewers would cover this aspect of memory support and actually test it. It goes further than just the manufacturers: more light needs to be shined on it, including the real use cases where it ranges from beneficial to vital. Where is the ECC memory aspect in motherboard reviews, and why is it not even a standard test for HEDT configurations?
 
This doesn't get tested because effectively no one looking at HEDT systems cares.
 
FWIW: I run an old server (ivy bridge) with ecc reg ddr3 (64G) and I periodically poke around to check for ecc errors & I haven't seen one in the 2 years I've had this server, outside of a bad dimm. It's in a basement so I wonder if that helps...
I DID get a bad dimm & it caught that with ecc errors so I'm sure the functionality works (it was also a multi-thousand dollar server when it came out) and it also caught a bad cpu which I replaced. I'd love to replace this with a nice little zen2 box with ecc for both more speed & less power usage, hence my motherboard purchase/experimentation.
 
This doesn't get tested because effectively no one looking at HEDT systems cares.

I have a HEDT system.

I currently have non-ECC DDR4-3600 CL16 RAM.

I know I'd sacrifice a little in the timings for ECC, but if I could buy unbuffered ECC at DDR4-3600 CL17 or CL18, I would...

I don't know if this poses an unusual technical challenge in RAM production, or if it doesn't exist simply because there isn't a market for it, but either way it is a shame.
 
It's more that the added time it takes to verify the ECC bits adds to the timings, as technically it is moving more content. But generally, for the applications you would want ECC for, the difference in timings is not going to make any real change to the speed at which they run, because other aspects of the system are far slower. In those cases it is the amount of memory that is important, not the speed it runs at, for the most part.
 
Sometimes it takes someone with a higher profile and larger reach than just a lot of enthusiasts complaining about it. Maybe it will finally be addressed by manufacturers.
I doubt he is anyone important enough to stir the pot. He just seems like a cranky fuck.
 
He has a pretty wide reach, though. He is definitely a very cranky fuck and always has been. He's kind of a dick all the time. But, he can get a lot of people riled up.
Yeah, that's all I see from him. He might have pull in the Linux community, but I doubt he has any with hardware manufacturers.
 
The reason mainstream ECC died is that once NT got wide release, people found they could get months of windows uptime without such an expensive investment (when everything in your chain is unstable, you are desperate for some solution, but now business owners have seen the light.).

Also, because Intel processors support ECC cache, and modern hard drives/SSDs have their own ECC for writes, the only way you can get errors from a validated hard disk is an in-memory bit flip.

These bit flips still happen, but most folks don't think one corrupt file every few years is worth the investment in ECC memory plus zfs pool on every machine.

Linus just wishes more people were control freaks like he is but even developers have mostly taken the easy way out and ditched ECC. You can't convince folks that what they're doing is worth the time + money invested, if you produce virtually the same result! High-IO-throughput servers need ECC plus ZFS a lot more than your average workstation.
 
If I could get ECC I would. Back 22 years ago, when I owned my own computer store, I put ECC memory in everything. (God I'm old.) It was more expensive, but man, so many fewer support calls. If I could get it, I would, just for the peace of mind. The bad part of his argument, however, is that ECC doesn't prevent rowhammer at all...
 
ECC doesn't prevent rowhammer, but it does mitigate it. All single bit flips are reported, and if nothing else, the interrupts for reporting will slow down further attempts. All (most?) double bit flips are reported as uncorrectable and are likely to halt the machine (it depends on your OS and the modified page, though; some OSes will determine where the bad memory was mapped and kill the affected processes rather than the whole thing). Even if you don't put correctable errors into monitoring, uncorrectable errors should make it in. Only triple or larger flips have a chance of going undetected; those aren't impossible with rowhammer, but they're less likely (unless the technique has been significantly refined since I last looked), and you'd leave a big trail behind you. For the types of servers I've run, that's been good enough for my peace of mind; but then, those servers weren't expected to have adversarial code running on them.
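
The single/double/triple behaviour described above can be shown with a toy single-error-correct, double-error-detect (SECDED) code. This is only a sketch using an extended Hamming(8,4) code on 4 data bits. Real DIMM ECC works on 64-bit words with a wider code whose exact layout is up to the memory controller, so the function names and bit layout here are assumptions for illustration only.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds overall parity; indices 1, 2, 4 hold Hamming parity bits;
    indices 3, 5, 6, 7 hold the data bits.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def decode(word):
    """Return (status, data), status being 'ok', 'corrected' or 'uncorrectable'."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: the decoder assumes exactly one and flips the
        # position the syndrome points at (0 means the overall parity bit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [c[3], c[5], c[6], c[7]]


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    original = [1, 0, 1, 1]
    word = encode(original)
    print("no flips    :", decode(word))
    print("one flip    :", decode(flip(word, [5])))
    print("two flips   :", decode(flip(word, [2, 6])))
    # Three flips look like a single error to the decoder, so it "corrects"
    # the wrong bit and hands back bad data without raising an alarm.
    print("three flips :", decode(flip(word, [1, 2, 3])))
```

In this tiny code every triple flip gets silently miscorrected. Wider codes used on real hardware can catch some of them, so the exact odds differ, but the general shape (one corrected, two detected, three or more not guaranteed) is the same.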
 
From the article:
"The only reason Intel says 'ECC is for servers and embedded' is because Intel marketing people have convinced the powers that be that they can sell otherwise inferior chips for a higher price by enabling ECC functionality," Torvalds added.
He is totally correct, and that is exactly what Intel has done over the last 20+ years with their marketing nonsense.
Glad this is finally biting them in the ass, and that they finally have some real competition on all fronts.
 
My understanding is the DDR5 spec has ECC built in so I wonder how Intel will find a way to disable it on consumer boards and find a way to charge for enabling it? Just a layman opinion, don't know if it is possible or not.
 
DDR5 ECC has 72 lanes and non-ECC has 64. Pretty easy to disable lanes in a memory controller. The DDR5 spec actually defines BOTH.
 
My understanding is the DDR5 spec has ECC built in so I wonder how Intel will find a way to disable it on consumer boards and find a way to charge for enabling it? Just a layman opinion, don't know if it is possible or not.
Apparently there is a nuance between ECC capability inside the RAM (which makes sure that what goes into the RAM is and stays valid) and total ECC: if the issue occurs before the data reaches the RAM, then even with DDR5's built-in ECC issues can still arise. Do not quote me on this; it's something I just read for the first time yesterday, so I could be repeating it wrong.

Glad this is finally biting them in the ass, and that they finally have some real competition on all fronts.
I am not sure any of Intel's current issues or competition have to do with the lack of ECC support on non-server chips; it is not like their biggest competitors, such as AMD, are not doing the exact same thing on that front. Maybe it did play a part in losing Apple as a client, since having something called MacBook Pro not support ECC is strange, but maybe not.
 
The only argument I can think of against ECC is the higher latency, which can potentially lower performance in some circumstances.

Well, that and the higher cost.

And it's not enough that there is higher latency at the given speed with ECC. You also can't buy ECC at higher speeds in most cases. My current ram is DDR4-3600 CL16. Fastest compatible RAM I can find is DDR4-2933 CL21....

Truth is, most applications just aren't critical enough to warrant the cost and performance impact of ECC.

As far as Intel goes, I'm no fan of Intel, but I feel like the industry as a whole, not just Intel, has been very poor at documenting pro features, like which platforms support ECC, or VT-d/IOMMU, etc. Intel, AMD and all the motherboard partners included. Sometimes if you want these features it's been guesswork: buy it, test it and find out. Oh, and sometimes doing something silly like updating the BIOS can make it stop working.

I THINK my Threadripper supports ECC, but honestly, I am not sure. And if my Threadripper supports ECC, does my motherboard? Does the current version of my BIOS?

Chicken and egg problem.
One of the reasons ECC RAM is more expensive is that it doesn't sell as well, so the fixed operating cost per unit sold is going to be high. So that's really not a good argument against normalizing it, as normalizing it will make it cheaper.
The same goes for higher speeds: there's no need to make faster kits when the scenarios that use ECC don't allow for them.

Latency is definitely an issue that is universal, though.

You are spot on with the documentation of features. It is also weird that Intel would sometimes have ECC on its lowest-range consumer CPUs, then remove it from the higher ones, then add it back for its professional line.

We really need some clarification and standardization on this.
 
no read the spec

So single bit error correction only...still nice!
So on-die ECC is a bit of a mixed-blessing. To answer the big question in the gallery, on-die ECC is not a replacement for DIMM-wide ECC.

On-die ECC is to improve the reliability of individual chips. Between the number of bits per chip getting quite high, and newer nodes getting successively harder to develop, the odds of a single-bit error is getting uncomfortably high. So on-die ECC is meant to counter that, by transparently dealing with single-bit errors.

It's similar in concept to error correction on SSDs (NAND): the error rate is high enough that a modern TLC SSD without error correction would be unusable without it. Otherwise if your chips had to be perfect, these ultra-fine processes would never yield well enough to be usable.

Consequently, DIMM-wide ECC will still be a thing. Which is why in the JEDEC diagram it shows an LRDIMM with 20 memory packages. That's 10 chips (2 ranks) per channel, with 5 chips per rank. The 5th chip is to provide ECC. Since the channel is narrower, you now need an extra memory chip for every 4 chips rather than every 8 like DDR4.
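
The chip counts in that quote can be checked with a bit of arithmetic. This sketch assumes x8 DRAM devices (8 data pins per chip), a 64-bit DDR4 channel with 8 ECC bits, and a 32-bit DDR5 channel with 8 ECC bits. The x8 width and the function name are assumptions for illustration; other device widths give different chip counts.

```python
def chips_per_rank(data_bits: int, ecc_bits: int, device_width: int = 8) -> int:
    """Number of DRAM chips needed to fill one rank of one channel."""
    total = data_bits + ecc_bits
    if total % device_width:
        raise ValueError("channel width is not a multiple of the device width")
    return total // device_width


if __name__ == "__main__":
    ddr4 = chips_per_rank(64, 8)  # one 72-bit DDR4 ECC channel
    ddr5 = chips_per_rank(32, 8)  # one 40-bit DDR5 ECC channel
    print(f"DDR4 ECC: {ddr4} chips per rank, {8 / 64:.1%} extra bits")
    print(f"DDR5 ECC: {ddr5} chips per rank, {8 / 32:.1%} extra bits")
    # Two channels per DDR5 DIMM, two ranks each, as in the quoted example.
    print(f"DDR5 DIMM total: {2 * 2 * ddr5} packages")
```

Under those assumptions the numbers line up with the quote: one ECC chip per 4 data chips on DDR5 versus one per 8 on DDR4, and 20 packages on a dual-rank DIMM.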
 
The last 2 boards I bought for my ryzen boxes both officially support (unbuffered) ecc. Am actually thinking of swapping out my 4x8GB ddr4-3200 (3900x) to 2x16 w/ ecc or possibly even 4x16.
I’ve got 2x16 3200 Kingston in my FreeNAS box. Happy so far.
 
DDR5 ECC has 72 lanes and non-ECC has 64. Pretty easy to disable lanes in a memory controller. The DDR5 spec actually defines BOTH.
Do you mean bits instead of lanes?
For modern SDRAM, non-ECC is 64-bit and ECC is 72-bit (a 64-bit data path plus an 8-bit path for ECC).
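
As a side note on where the 8 extra bits come from: a Hamming-style single-error-correct, double-error-detect code needs a minimum number of check bits for a given data width, and for 64 data bits that minimum is 8. The sketch below just evaluates the standard bound. The function name is made up, and actual memory controllers may use different codes on those same 8 bits.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for a Hamming SECDED code over data_bits of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one more overall-parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    for width in (4, 32, 64):
        extra = secded_check_bits(width)
        print(f"{width} data bits -> {extra} check bits -> {width + extra} total")
```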
 