Intel Enables ECC Memory on Consumer Alder Lake CPUs

This was always artificial segmentation, going back to when they started doing it; it never really made much sense from the consumer standpoint. I'm glad to see them sort of changing course. They'll get their markup out of the chipset, I suppose.
 
So, it is no longer limited to shit processors like the Pentium G4500?

Still have to pay over $500 for the motherboard!
 
This is good; the consumer market can very much benefit from ECC, and the penalties for running it on DDR5 modules are basically nonexistent. I have a VM running upstairs that I use for slicing my STLs for 3D prints, and it makes a big difference there. You might be surprised how much cleaner some of the jobs come out on a system with ECC compared to one without.
 
This is good; the consumer market can very much benefit from ECC, and the penalties for running it on DDR5 modules are basically nonexistent. I have a VM running upstairs that I use for slicing my STLs for 3D prints, and it makes a big difference there. You might be surprised how much cleaner some of the jobs come out on a system with ECC compared to one without.
Interesting. Got any screens of the difference?
 
Interesting. Got any screens of the difference?
Not at the moment, but I generally found far fewer weird print anomalies and failures, especially on larger or more detailed prints. Now, this is totally anecdotal and may be completely in my head, but it's not like having PrusaSlicer installed on my remote workstation VM really uses any extra resources.
I stumbled across a post in the help forums from another user who swore by it and thought, well, I have the hardware, let's give it a go. Sure enough it fixed my problem, so I've just stuck with it. Until I f-ed up my hot end it was pumping out beautiful prints.
 
Not at the moment, but I generally found far fewer weird print anomalies and failures, especially on larger or more detailed prints. Now, this is totally anecdotal and may be completely in my head, but it's not like having PrusaSlicer installed on my remote workstation VM really uses any extra resources.
I'm not an expert, but if you have a system where the results are noticeably different on every run, that does not sound normal to me. Memory bit errors are typically supposed to be pretty rare, aren't they?
 
I can't imagine how ECC would drastically change slicing, unless your non-ECC RAM is faulty (maybe overclocked, wrong voltage, or something?).

This is pretty welcome for me, though for other reasons. I want to replace my dual-socket Xeon server, which runs ZFS. ECC is a boon for ZFS, so having a "consumer" platform with ECC is great: hopefully less spendy than Xeon overall, and less power hungry. Though I doubt any of the consumer boards have any kind of IPMI.
 
I'm not an expert, but if you have a system where the results are noticeably different on every run, that does not sound normal to me. Memory bit errors are typically supposed to be pretty rare, aren't they?
Yeah, so it could just be that the RAM in my desktop system has an issue. I had to overclock it to get the system stable, and it's just been a generally unpleasant computer from the get-go.

AMD 3900X, on an ASUS B450-I, with Corsair Dominator RAM.
I could not get it stable at stock timings for anything: it would be happily gaming along, then bam, it would just turn off. ASUS support had me run their AI OC Tuner in the BIOS and that stabilized it, but it's now been two years and I still don't trust it, even though everything passes when I run the various tests. So it very well could be something fundamentally wrong with my desktop, but until this crazy-ass market changes I am stuck with the build I have.
 
This is interesting. So yes, it will be exclusive to the W680 chipset, but that chipset fills a very interesting gap that both AMD and Intel have left in their product stacks. Depending on pricing, this could very well fill the hole left in the HEDT market for enthusiasts whom the current Xeon and Threadripper Pro lineups have generally priced out.
 
I'm not an expert, but if you have a system where the results are noticeably different on every run, that does not sound normal to me. Memory bit errors are typically supposed to be pretty rare, aren't they?
Memory bit errors are supposed to be rare, yes, but the key thing is they're usually not independent. If a stick in a system has a bit error, it's way more likely to have another bit error than the average bit error rate would suggest. Sometimes it's a defect in the RAM, the CPU, or the motherboard; sometimes it's a setup issue (timings, voltage); and sometimes it's just a cosmic ray or whatever. Depends on how lucky you are, I guess. I had one system with apparently terrible RAM, but tweaking the config made it go away. For work systems with ECC, we'd just swap the RAM if it threw errors and that would usually fix it; sometimes we'd swap (or reseat) the CPUs if swapping the RAM didn't help.
 
While this is a step in the right direction, it's a bit disappointing to limit support to the W680 chipset. Given that the memory controller is on the CPU, a board with any chipset should be able to provide working ECC if the vendor sets up the tracing for it.

It's like with AMD Ryzen: you can do unbuffered ECC on all non-APUs (and also "Pro"-branded APUs); it's just up to the motherboard vendor to support it. Most board partners except MSI support ECC across a wide range of chipsets on AM4.
 
While this is a step in the right direction, it's a bit disappointing to limit support to the W680 chipset. Given that the memory controller is on the CPU, a board with any chipset should be able to provide working ECC if the vendor sets up the tracing for it.

It's like with AMD Ryzen: you can do unbuffered ECC on all non-APUs (and also "Pro"-branded APUs); it's just up to the motherboard vendor to support it. Most board partners except MSI support ECC across a wide range of chipsets on AM4.
But they don't all support it the same way. Many "support" it in the sense that you can seat it and the board will recognize it and boot, but they ignore the correction bits.
They did that on the Threadrippers, too.
 
The lack of ECC memory support on "consumer" platforms has long been a complaint of mine, and I'm pleased to see any lessening of that restriction. The W680 chipset limitation is disappointing as another poster commented, but this could have some positive competitive effects down the road that lead to more widespread adoption.
But they don't all support it the same way. Many "support" it in the sense that you can seat it and the board will recognize it and boot, but they ignore the correction bits.
They did that on the Threadrippers, too.
Yes. Especially since the reporting of errors is arguably the most important part of the feature, motherboard selection does require a bit of homework.
 
Memory bit errors are typically supposed to be pretty rare, aren't they?

Memory bit errors are anything but rare; the computer you've been typing on probably had at least one in the time it took you to type that post. You may have had anywhere from dozens to hundreds of them in a day.

The reason you don't notice them is that a single bit error in a memory pool of several gigabytes, on your average home computer, is more often than not going to hit an unused memory address, or somewhere it doesn't cause too many problems. On a server, however, it can cause havoc, especially if that server has high memory utilization, where something important is likely to be sitting at the flipped memory address. Thirty years ago, when average home computers had just a couple of megabytes of memory, single bit errors were a lot more problematic, because they'd more often hit the OS kernel or a driver and make the system unstable or crash. DOS in particular was extremely intolerant of bad memory and would do all sorts of nasty things.

One terrifying aspect of silent bit flipping is that if the flipped bit lands somewhere in the network stack, it can cause erroneous DNS queries: your computer thinks it's looking up one website but is actually looking up a potential threat without your knowledge. It's such a well-known phenomenon that "bit squatting" is a thing, where people buy up bit-flipped domains to intercept traffic intended for legitimate websites and mount a sort of MITM attack.
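For a sense of how this works, the single-bit-flip neighbours of a domain name can be enumerated mechanically. A toy Python sketch (the domain here is just an example; only flips that still yield valid hostname characters are kept):

```python
# Enumerate every name reachable from a given domain by flipping one bit
# in one character, keeping only results that are still valid hostname
# characters. "example.com" is just a stand-in domain.
import string

VALID = set(string.ascii_lowercase + string.digits + "-.")

def bitflip_variants(domain: str) -> list[str]:
    variants = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped != ch and flipped in VALID:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)

for v in bitflip_variants("example.com")[:8]:
    print(v)
```

Registering the handful of names this produces and logging what arrives is essentially the experiment described further down the thread.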

I've also seen it affect game server maps. I play on a TFC server which runs on hardware with sketchy RAM, and maps will often have their state corrupted: doors open the wrong way or not at all, event timers break or have their time changed, spawn turrets have their orientation flipped, brush entities have their flags, render modes, or other attributes changed, etc.

Temperature and altitude also have a significant effect on bit errors. The hotter the memory runs, the higher the chance it'll have a bit error. Likewise, the higher the altitude the memory runs at, the greater its chance of being hit by cosmic radiation and flipping, due to the thinner atmosphere. Other, more obvious things like terrestrial hard radiation will also cause bit flipping, but the source isn't always obvious. The video goes over a couple of examples, like one of Intel's fabs being downwind of an old uranium mine, causing memory chip contamination, and Sun's UltraSPARC having radioactive cache. All of the past nuclear bomb testing and accidents have also spread contamination worldwide, some locations more affected than others. So you'd be more likely to have memory errors in Bikini Atoll or New Mexico than in Florida or Tennessee.

The video goes over quite a lot of interesting aspects of the reliability of memory, or more specifically how unreliable it actually is, even though people treat it as if it were reliable.
 
Memory bit errors are anything but rare; the computer you've been typing on probably had at least one in the time it took you to type that post. You may have had anywhere from dozens to hundreds of them in a day.
That would be crazy (I'm not saying you're wrong, just that it seems crazy to me). It should be pretty easy to test, I think? Allocate, say, 8 GB of memory, overwrite all of it with 1s, wait one minute, and read it back. Repeat with all 0s, and maybe with the mixed patterns 0xAA and 0x55. Would that be a sufficient test?
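The test described above is easy to sketch. Here's a toy Python version; the buffer size and wait are scaled way down, and a user-space run like this can't control physical placement or what the OS and caches do underneath, so treat it as an illustration rather than a real diagnostic:

```python
# Toy version of the proposed test: fill a buffer with a pattern, wait,
# then re-read it and count bytes that no longer match. A real run would
# use gigabytes and minutes; this uses 1 MiB and a token pause.
import time

def pattern_test(size: int, pattern: int, delay: float = 0.0) -> int:
    buf = bytearray([pattern]) * size
    time.sleep(delay)                       # window for bit flips to occur
    return sum(1 for b in buf if b != pattern)

for p in (0xFF, 0x00, 0xAA, 0x55):          # all 1s, all 0s, mixed patterns
    errors = pattern_test(1024 * 1024, p, delay=0.1)
    print(f"pattern 0x{p:02X}: {errors} mismatched bytes")
```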
 
That would be crazy (I'm not saying you're wrong, just that it seems crazy to me). It should be pretty easy to test, I think? Allocate, say, 8 GB of memory, overwrite all of it with 1s, wait one minute, and read it back. Repeat with all 0s, and maybe with the mixed patterns 0xAA and 0x55. Would that be a sufficient test?
That's effectively one of the tests that Memtest does, IIRC, just without the pause.
 
Competition is good! With a W680 motherboard, I won't have to spend so much time researching what will or will not work with ECC. Many of these boards are really what I want anyway, so this provides a great alternative to consider.
 
That would be crazy (I'm not saying you're wrong, just that it seems crazy to me). It should be pretty easy to test, I think? Allocate, say, 8 GB of memory, overwrite all of it with 1s, wait one minute, and read it back. Repeat with all 0s, and maybe with the mixed patterns 0xAA and 0x55. Would that be a sufficient test?

While it may seem easy to test for, it really isn't. You need a very low-level OS, like a stripped-down Linux kernel or DOS, and to do everything on a single thread, so that timing errors and bugs in context/process switching don't give you skewed readings. In the video linked above, the guy briefly discusses (I think) IBM's work on the impact of cosmic rays on memory reliability, and shows a chart where higher altitude directly correlates with more memory errors.

Memtest could probably be altered to have an additional "set and hold" test, but Memtest has been shown to be unreliable in the past, with buggy releases like 4.20 that gave 100% false positives on specific tests. Using the experimental SMP option in 5.01 is also known to give false positives, especially when threading is enabled.

Here are a couple of examples I've encountered.

A Phenom II X4 940 system where Memtest context switching on multiple cores causes it to crash. It doesn't happen in single-core mode, and the CPU and RAM are known to be good.


Here's another, caused by context switching on a Ryzen 7 3700X. Take note that it all happens in one specific area of memory, and only on one thread. Also note that the written and read data are entirely different, which would normally be almost impossible.


An easier test would be to just buy up a bunch of bit-flipped domains for a year, point them to a VPS, and log and analyze the traffic. I did this years ago when I first heard about bit flipping in DNS, and I got some pretty concrete results about memory errors; it didn't cost that much, either.
 
I think I am learning something about RAM error rates. (I have a vague memory of having read about this in the past, while reading about the importance of ECC for filmmakers, photographers, and other people who not only have large, valuable data, but for whom an error would simply go undetected: often the program would not crash, and the file would still save and open later on, but its contents would have changed. But I forgot about it and am now relearning it.)

In 2008-2009 Google ran a study, and:
The study (download .pdf), which used tens of thousands of Google's servers, showed that about 8.2% of all dual in-line memory modules (DIMM) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.

"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

With good error correction, you end up with something like one error every three years instead of 22,000 a year, at least back in the day.

Just find a good excuse to tell clients when the program crashes while opening or dealing with really large .stl or .iges 3D files.
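For scale, the quoted figures work out to a surprising number of events per day. A quick back-of-envelope in Python:

```python
# Back-of-envelope on the Google study figures quoted above.
per_dimm_per_year = 3_700      # correctable errors per affected DIMM per year
per_machine_per_year = 22_000  # average correctable errors per machine per year

print(f"per affected DIMM: ~{per_dimm_per_year / 365:.1f} errors/day")
print(f"per machine:       ~{per_machine_per_year / 365:.0f} errors/day")
```

That's roughly ten corrected errors a day on an affected DIMM, and around sixty a day on an average affected machine.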
 
While it may seem easy to test for, it really isn't. You need a very low level RTOS like a stripped down Linux kernel or DOS and to do everything on a single thread to avoid timing errors and bugs in context/process switching giving you skewed readings. In the video linked above, the guy briefly discusses I think IBM doing cosmic ray impact on memory reliability and has a chart showing higher altitudes directly correlates to more memory errors.
I don't understand why it's not simple. Write to an 8 GB buffer, then continuously keep reading back from it for the next hour (so that it's not paged out) and check for changes. What's wrong with this setup? If context/process switching creates more memory errors, then I'd rather not avoid that: I want to see all of the errors! I mean, from a programmer's point of view, I want to create a stupid-simple program, with the assumption that when I set a variable to a value, it generally keeps that value.
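The proposed program really is only a few lines. A scaled-down Python sketch of the idea; one caveat is that continuously re-reading a small buffer keeps it hot in the CPU caches, so a real run would want a buffer much larger than the cache, and would still be at the mercy of the OS:

```python
# Scaled-down sketch of the "write once, keep re-reading" test. Writes a
# pattern into a buffer, then loops over it until the deadline, counting
# (and restoring) any byte that no longer matches.
import time

def hold_and_check(size: int, seconds: float, pattern: int = 0x5A) -> int:
    buf = bytearray([pattern]) * size
    flips = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        for i, b in enumerate(buf):
            if b != pattern:
                flips += 1
                buf[i] = pattern   # restore so the same flip isn't recounted
    return flips

print(hold_and_check(1024 * 1024, 1.0), "flips observed")
```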
 
I don't understand why it's not simple. Write to an 8 GB buffer, then continuously keep reading back from it for the next hour (so that it's not paged out) and check for changes. What's wrong with this setup? If context/process switching creates more memory errors, then I'd rather not avoid that: I want to see all of the errors! I mean, from a programmer's point of view, I want to create a stupid-simple program, with the assumption that when I set a variable to a value, it generally keeps that value.

It's not simple because we're testing for one very specific type of memory failure: bit flipping. If you want to test for general memory errors, there are other programs that already do that. Bit flip testing requires a controlled environment where you can rule out as much interference as possible. If you just malloc an 8 GB chunk under the OS, you don't know where it is physically, what the OS is doing with it in the background, or what other processes are near it. You could have some application bit-banging memory rows around the allocated block and inducing errors in it.
 
If you want to test for general memory errors, there are other programs that already do that.
It sounds like you're saying there are other programs that already test for ALL memory errors (a superset of just bit flips), and I agree with you; I have used various memory tools before. You also said that memory errors are very common: like, my computer probably had a bit flip while I typed this sentence. So this is where I don't understand. Typically when I memtest a machine, I expect it to pass overnight without errors. Once in a while I see a machine that fails, but that's either rare, or there is actually something wrong with the machine. Also, like you mention, perhaps the test has some bug or incompatibility with a certain CPU/chipset, in which case we need to update the test software. If a computer can pass memtest86, for example, overnight without a single error (and this is my experience with most computers), then how can I be getting dozens to hundreds of bit errors a day like you said?

Even if a computer passes something like memtest86 but fails some other test, like Prime95 blend, I do not consider that a normal or stable machine. (I had a machine that did this; it would crash about once a week. Finally I found one DIMM was bad, and after replacing it the system was 100% stable.)

My only conclusion from all of this is that perhaps none of the memory tests actually write data and then wait a while for bit errors to creep in before reading it back (I don't know). So I was suggesting making a simple program that does exactly that. I think you are claiming that I'll actually get tons of errors with this simple program, which seems nuts to me.

Basically, if a computer fails memory tests overnight, I do NOT consider that system stable; I would return/RMA whatever parts I needed to. BTW, I live at sea level, and not near any nuclear power plants :p.

What do you think?

EDIT: One more thing. The last time I did a lot of PC diagnostics was about 10 years ago, and I haven't really fixed any PCs since. So please let me know if things have changed since then?
 
It sounds like you're saying there are other programs that already test for ALL memory errors (a superset of just bit flips), and I agree with you; I have used various memory tools before. You also said that memory errors are very common: like, my computer probably had a bit flip while I typed this sentence. So this is where I don't understand. Typically when I memtest a machine, I expect it to pass overnight without errors. Once in a while I see a machine that fails, but that's either rare, or there is actually something wrong with the machine. Also, like you mention, perhaps the test has some bug or incompatibility with a certain CPU/chipset, in which case we need to update the test software. If a computer can pass memtest86, for example, overnight without a single error (and this is my experience with most computers), then how can I be getting dozens to hundreds of bit errors a day like you said?

Even if a computer passes something like memtest86 but fails some other test, like Prime95 blend, I do not consider that a normal or stable machine. (I had a machine that did this; it would crash about once a week. Finally I found one DIMM was bad, and after replacing it the system was 100% stable.)

My only conclusion from all of this is that perhaps none of the memory tests actually write data and then wait a while for bit errors to creep in before reading it back (I don't know). So I was suggesting making a simple program that does exactly that. I think you are claiming that I'll actually get tons of errors with this simple program, which seems nuts to me.

Basically, if a computer fails memory tests overnight, I do NOT consider that system stable; I would return/RMA whatever parts I needed to. BTW, I live at sea level, and not near any nuclear power plants :p.

What do you think?

EDIT: One more thing. The last time I did a lot of PC diagnostics was about 10 years ago, and I haven't really fixed any PCs since. So please let me know if things have changed since then?
ECC reports when there was an error; that's the difference. This is a big deal on a server, where you can't be dealing with running memtest and just need it deployed to production ASAP. When the error is reported, you can RMA and get a replacement next day from Dell, etc.
 
ECC reports when there was an error; that's the difference. This is a big deal on a server, where you can't be dealing with running memtest and just need it deployed to production ASAP. When the error is reported, you can RMA and get a replacement next day from Dell, etc.
Or better yet, with Dell, if you're still under warranty and have SupportAssist configured in iDRAC, it sends them the errors; then one day you just get an email from Dell letting you know it's detecting errors and your new module is on the way.
 
If a computer can pass memtest86, for example, overnight without a single error (and this is my experience with most computers), then how can I be getting dozens to hundreds of bit errors a day like you said?

Memtest doesn't test all memory addresses at the same time. It's still possible for addresses not currently under test to flip, and Memtest wouldn't know, because they're not being actively tested. Nothing else would know either, unless you have ECC memory.

It wouldn't be detected on a second pass either, since the data is overwritten, unless the bit is stuck and doesn't change when Memtest tries to change it.
 
From the article linked in the OP:
Speaking of Intel’s W680, it is necessary to note that this chipset has essentially the same features as Z690, but given its workstation nature, it lacks support for overclocking.
The above part about overclocking is not true. Here's a comparison between the Z690 and the W680 chipset: https://www.intel.com/content/www/us/en/products/compare.html?productIds=218833,218834

AnandTech also reviewed the chipset: The Intel W680 Chipset Overview: Alder Lake Workstations Get ECC Memory and Overclocking Support. As the chipset essentially serves as a licensing mechanism, there wasn't a lot to cover, but they did mention and provide links to six motherboards, four from ASRock Industrial and two from Supermicro.

I assume this signifies the death of the entry-level Xeon processors (Xeon E- or W-series or whatever), but don't know whether there's been any official statement to that effect either way.

I'm hoping that with time we'll see an improvement in the price and availability of ECC UDIMMs, which is a big problem at the moment, but I doubt this marketing change is going to have any effect on that. It's ridiculous that over the years the reporting and correction of main memory errors has become regarded as a special feature reserved for high-end workstations and servers, when it should be a universal standard. Linux kernel developers have been pleading for mainstream ECC support for ages (no one more vocal than Linus himself), and even Microsoft pushed hardware vendors for ECC support prior to Vista's launch.
 
From the article linked in the OP:

The above part about overclocking is not true. Here's a comparison between the Z690 and the W680 chipset: https://www.intel.com/content/www/us/en/products/compare.html?productIds=218833,218834

AnandTech also reviewed the chipset: The Intel W680 Chipset Overview: Alder Lake Workstations Get ECC Memory and Overclocking Support. As the chipset essentially serves as a licensing mechanism, there wasn't a lot to cover, but they did mention and provide links to six motherboards, four from ASRock Industrial and two from Supermicro.

I assume this signifies the death of the entry-level Xeon processors (Xeon E- or W-series or whatever), but don't know whether there's been any official statement to that effect either way.

I'm hoping that with time we'll see an improvement in the price and availability of ECC UDIMMs, which is a big problem at the moment, but I doubt this marketing change is going to have any effect on that. It's ridiculous that over the years the reporting and correction of main memory errors has become regarded as a special feature reserved for high-end workstations and servers, when it should be a universal standard. Linux kernel developers have been pleading for mainstream ECC support for ages (no one more vocal than Linus himself), and even Microsoft pushed hardware vendors for ECC support prior to Vista's launch.
The Alder Lake Xeon-Ws are expected in Q3, which lines them up with the TR Pros in terms of availability. Q3 is a big time for enterprise procurement contracts, so it's a pretty important launch date.

From what I understand, ECC is cheaper to implement with DDR5 than with DDR4, which is probably the main reason we're seeing this. It's something to do with the changes to the bus; I remember an article about it from when Rambus announced it back in the day, but I can't find it. I'll update if/when I do.
 