Switch to ECC?

fatryan

[H]ard|Gawd
Joined
Feb 19, 2004
Messages
1,402
I'm not an expert with computers by any means, and the only thing I know about ecc memory is that it has an extra chip that's used to correct errors. Having said that, does this error correcting feature of ecc ram also provide protection in the event of a ram stick failure? As in, is there a failsafe that will effectively shut off the ram so that the user can't inadvertently make things worse by running a computer on faulty ram?

The backstory is that I just had a non-ECC RAM failure on a new Unraid server I built, and it seems to have destroyed everything. I already RMA'd the kit and confirmed the replacement is good (4 passes of memtest). I also rebuilt my Unraid OS USB drive and reinstalled all my Docker containers, but I'm still having weird issues that I cannot figure out. It's turned into a complete nightmare, as nothing I try seems to work.

I only built this server in July of this year, and the memory failure occurred the first week of September, so I got 1-1.5 months of a fun new server before it all went to crap. My build was centered around using Intel Quick Sync (QS) with UHD 770 graphics, so when planning the build my only option for an ECC setup was to go with W680, and more specifically the Asus Pro WS W680-ACE (LGA1700) board. There are no other ECC boards for Intel that have approved ECC memory as of this writing. However, I passed on the ACE because it only listed 4 SATA ports (I now know that there's a way to expand it to 8 😒). So I decided to forgo ECC in favor of more SATA ports and a cheaper build price, and I settled on Z790 with non-ECC DDR5.

After this nightmare I've dealt with from the memory stick failure, I'm kicking myself for not going ecc from the beginning. But would ecc even protect against a stick failure anyway? Would there be any benefit to me right now to buy the ace board and some approved ecc sticks to protect against this scenario occurring again?
 
On a properly set up ECC system, you'll get notifications somehow, somewhere for correctable errors, and uncorrectable errors will most likely halt the system. Some OSes do better: they look at which page has the error and, if you're lucky, you don't have to halt. If it's a clean disk cache page, throw it away and reread from disk; if it's a user process's memory, kill that process; otherwise there's no safe option, so still halt. But I think that's mostly commercial Unix and DEC VMS, and mainframes will have something too.
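
To make that decision tree concrete, here's a rough sketch of the policy in Python. It's a toy model, not how any real kernel is written; the names `PageKind`, `Action`, and `handle_uncorrectable_error` are made up for illustration.

```python
from enum import Enum, auto


class PageKind(Enum):
    """What the page with the bad memory was being used for (hypothetical categories)."""
    CLEAN_DISK_CACHE = auto()  # unmodified copy of data that still exists on disk
    USER_PROCESS = auto()      # memory owned by a single user process
    OTHER = auto()             # kernel memory, dirty cache, anything else


class Action(Enum):
    DROP_AND_REREAD = auto()   # throw the page away and read it from disk again
    KILL_PROCESS = auto()      # terminate only the process that owns the page
    HALT = auto()              # no safe option, stop the machine


def handle_uncorrectable_error(kind: PageKind) -> Action:
    """Pick the least drastic safe response to an uncorrectable error in one page."""
    if kind is PageKind.CLEAN_DISK_CACHE:
        return Action.DROP_AND_REREAD
    if kind is PageKind.USER_PROCESS:
        return Action.KILL_PROCESS
    return Action.HALT


if __name__ == "__main__":
    for kind in PageKind:
        print(kind.name, "->", handle_uncorrectable_error(kind).name)
```

The point is that halting is the fallback, not the first choice: the system only stops when it can't prove the bad data is disposable.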

You might not notice the correctable errors, since things aren't always plumbed well, but you should notice the crashes. If you're running a consumer-ish platform with ECC, you can usually generate errors by tweaking the RAM timings. Workstation/server platforms may have an error injection feature to intentionally cause an error so you can verify. When I was running ECC on FreeBSD, the correctable errors would be reported once an hour, so there could be some delay.
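
On Linux, the usual place those reports end up is the EDAC sysfs tree, where each memory controller gets a `ce_count` (corrected) and `ue_count` (uncorrected) file. Here's a minimal sketch that reads them. It assumes an EDAC driver for your memory controller is actually loaded, which I can't promise for Unraid on W680; the function name `read_edac_counts` and the status strings are just mine.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from the EDAC sysfs tree."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters can't be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found; errors may not be reported here.")
    for name, (ce, ue) in counts.items():
        status = "OK" if ce == 0 and ue == 0 else "CHECK YOUR RAM"
        print(f"{name}: corrected={ce} uncorrected={ue} [{status}]")
```

Something like this could be run on a schedule, so a rising corrected count gets noticed before it turns into corruption. An empty result means "not plumbed", not "no errors".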
 
Thanks. I don't know about the error reporting in Unraid, and that's what concerns me. When my stick failed, I had no crashes or any major process failures that I could tell. Everything was still running, but some things were acting weird. The logs didn't say "memory failure" or anything obvious like that which a novice could pick up on. So I didn't even realize there was a major problem at first, and even when I did realize I had errors in the log, I didn't know what they meant.

I didn't know what I was dealing with until I ran memtest and had the 1 stick abort the test in under a minute with 10k errors. But getting to that point of running memtest took days. I had already tried many other fixes first, which probably contributed to all the damage that was done. Moreover, memtest requires me to be at the physical server to install a USB drive with memtest on it, boot to bios, switch boot order, etc. So it's not exactly something that can be run easily or scheduled to run automatically.

I assume I had uncorrectable errors in my failed stick, or else a reboot would fix it, no? If you aren't aware, unraid loads into and runs entirely in ram, so everything in the system resets with each reboot. I rebooted many times while troubleshooting, and it fixed nothing. It did still boot up every time though. So if I could potentially have the same issue with ecc memory, then maybe it's not really worth upgrading.
 
In a system without ECC, all ram errors are uncorrectable, and most aren't exactly detectable either. Pretty easy to go from bad to worse like that.

In an ECC system, single-bit errors are correctable, double-bit errors are detectable, and multi-bit errors are a maybe. But some systems hide the reports of single-bit errors, so you don't know anything's going on until you get an error of two or more bits and the system halts.
 
So if I am understanding you correctly, the ecc will actually stop the system from running if a dual bit or multi-bit error occurs. Is that correct? Is that by design, or is that just a high probability based on the nature of dual/multi bit errors?

Another question I had was about types of ecc memory. I'm probably going to butcher this, but I was reading that there are different types of ecc memory modules and I believe they were differentiated as "single bit" and "dual bit". This sounds very similar to what you're describing, except that it's a memory module specification. I was reading about this in the context of the w680, and it was said that only "single bit" modules were available for the w680. So does this mean that switching to Ecc on a w680 would only afford me protection against a single bit error? Dual/multi bit errors would go undetected?

I guess we might as well discuss module models anyway, as it's my understanding that there's really only 1 option available for the Asus ace w680 boards: Kingston KSM48E40BD8KM-32HM. The other modules on the list aren't available for purchase in the US yet.
 
Yes, when a RAM stick goes bad you will get lots of notifications about ECC errors. I had that when a separate chip overheated my RAM. Very handy, I slapped a fan on that chip and everything was fixed. If I didn't have reporting I would have seen the same corruption you did over time.

Surviving the failure of a whole DRAM chip is what chipkill-class ECC is for; actually taking failing memory out of use takes something like memory sparing or outright mirroring (like RAID1). I don't know what you have to do to get either, and I bet they aren't possible on W680. So when you get messages about corrected errors you have to do something physical about it, such as adding cooling or taking the machine down to remove the faulty stick.

Whether the machine stops on uncorrectable but detectable errors should be configurable, but again I don't know how since I have never done it. I certainly don't see options for that in the BIOS.
 
A two-bit error is detectable but uncorrectable by design (SEC-DED). With more than two bits, whether it can be detected depends on which bits are flipped; either way, it won't be corrected.
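
If it helps to see SEC-DED in action, here's a toy version using an extended Hamming code on 4 data bits (8-bit codeword). Real memory controllers work on much wider words with their own codes, so treat this as an illustration of the principle only; `encode` and `decode` are names I made up.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat it as a single-bit error and fix that bit.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome but even parity: two bits flipped, position unknown.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

Any single flip comes back as "corrected" with the original data, and any two flips come back as "uncorrectable". Three flips look like one flip to this decoder, so it "fixes" the wrong bit, which is the "it depends" part.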

An uncorrectable error will throw a machine check exception, and it's up to the OS to decide what to do. AFAIK, Windows, Linux, and FreeBSD will halt --- that's the safe option if you have a memory read you know is bad. You do need ECC enabled in the BIOS for this to happen, so you could disable it, but then you miss out on other things.
 
Any "server" you build should be using ECC. The way I see it, people like to use desktop and gamer boards for home labs and home servers, but you can get used server boards and such for cheap as heck off eBay, and they are more reliable for "server" roles at home than anything else. Also, for most server-related stuff you don't need or want the latest and greatest hardware; you want solid and reliable. And used DDR3/4 ECC is fairly cheap to find as well.
 
Registered ECC DDR4 is dirt cheap; that's what you use in servers such as most EPYC and Xeon v3/v4 systems. But DDR4 ECC UDIMM for the desktop-like platforms is unfortunately very expensive.
 
I have no issue taking the system offline in the event of an uncorrectable memory error. I just need to know that there's a memory error. If ECC memory gets me clear notifications of memory errors, then maybe I should just switch.
 
I don't know how unraid handles ecc errors or what the w680 bios settings look like. The latter can presumably be looked up online. I may be able to get feedback on the unraid part.
 
It really all came down to the CPU, or more specifically the integrated GPU. This server is primarily a Plex media server. Intel QS is the gold standard in hardware transcoding for Plex, beating virtually any dedicated GPU on the market. So this build was centered around using an Intel chip with UHD 770 integrated graphics. AMD integrated graphics aren't supported, and Xeon chips don't have integrated graphics at all (except maybe the newest ones, I think?). So basically I was stuck using consumer/gaming boards no matter what. I really wanted to do ECC, but the only board that supports it is the Asus Pro W680. Lots of people in the Unraid community claimed ECC was overkill, so I convinced myself that it was and went with non-ECC. Perhaps that was a mistake. All I know is that this has been a nightmare, and I don't plan to use G.Skill RAM ever again.
 
You could consider a cheap low end NVIDIA GPU if you have an open slot as well.

Personally I just use NVIDIA Shields, or put Kodi on my Sony TV, and let them do the heavy lifting, as most modern TVs are fully capable of decoding 4K Blu-ray; at least my Sony X950 does so fine. Of course, it also depends on whether you are all wired or rely on Wi-Fi for the throughput; in that case, using Plex to transcode makes sense to keep the bandwidth requirements down, and you know you can play anything to any device no matter how old or new.

I grabbed an HP Z6 G4 box for cheap from someone and it is my TrueNAS server. It has an NVIDIA GPU in it, but I don't use it for decoding.
 
I can't use Kodi because audio and video always go out of sync after a while. It doesn't do that for you? I am on an Amazon Fire TV quad.
 
I built this server as a replacement for a Windows-based Plex server that was using a 2060 KO to transcode. Yeah, the 2060 KO worked fine, but I was trying to build my ideal Plex server this time around, so I didn't want a dedicated GPU. The iGPU is slightly faster than the dGPU in my experience, but that really only comes into play at the beginning of the stream. At this point I'm definitely not going to completely scrap the whole server in favor of a different approach. I'm $1500 into it, excluding HDDs, and the server is only 2 months old. It'll be hard enough trying to explain to my wife why I need a new board and memory if I go the ECC route lol
 
It does not for me currently, but since you mention it, I do recall having that issue years ago on some content.

I can't say I ever dove into it deeply; I think I just adjusted the audio offset in Kodi at one point for it.
 
Understandable, we make do with what we've got (I am married as well, so I understand!). Try explaining to the wife why I need a new Brocade ICX 6610 switch because I wanted 40Gbps links for my TrueNAS server :D But for me it is easy since I am in IT for a career; I just explain that my hobbies directly relate to my job and experience.
 
ECC is something we'd probably have a lot more general understanding of if Intel didn't specifically market-segment it out of the reach of casual computer users.

I suppose enabling it on consumer Core i5s and i7s with a W680 motherboard (as opposed to needing a Xeon CPU and a workstation chipset) is a step in the right direction, but not enough when I can't just go to the local Micro Center and pick up an Asus Pro WS W680-ACE or similar board. No, it's all Z690/790, which don't support ECC because Intel would rather have you spend the big bucks on W790 and Sapphire Rapids for that.

AMD doesn't do as much of the frustrating market segmentation with their platforms, but the motherboard vendors are substantially more hit-and-miss with properly implementing ECC on consumer Ryzen boards - making sure it works, reporting to the OS properly, etc. It's naturally less of a problem on Threadripper and EPYC, but again, that's a lot more money spent on the platform and registered DIMMs. (Until enough years pass by and RDIMMs become much, much cheaper than ECC UDIMMs, if DDR4 is any indication.)

However I passed on the ace, because it only listed 4 sata ports (i now know that there's a way to expand it to 8 😒). So I decided to forgo ecc in favor of more sata ports and a cheaper build price, and I settled on z790 with non-ecc ddr5.
Anyone serious about storage at an enthusiast level uses LSI SAS HBAs with IT mode firmware anyway, regardless of whether you're running Unraid, TrueNAS or any other system. Each SAS port can be broken out into four SAS or SATA ports with the right cable.

This isn't even getting into external SAS JBOD enclosures/drive shelves yet. It's pretty impressive stuff, if you're willing to invest a bit into it.

And yes, they're also serious about using ECC wherever possible, especially with how ZFS is heavy on RAM caching and how a memory error in said cache could potentially result in bit-rot on what is otherwise a very resilient file system. (This is true of any file system, really, but ZFS users in particular tend to be overzealous about it.)
 
I actually already have an lsi hba in my old windows server that I can use in this one now. I couldn't remove it until transferring all my data between servers though, so I'd need a second hba if I only had 4 sata ports on the w680. And I just figured it was preferable to not spend the extra cash, take up a slot, prolong the boot time, etc. I didn't mention it, but I also had concerns about the NIC on the Asus w680 boards. I read a number of complaints about it crapping out. Had I known ecc would be so vital, I would have just gotten the w680 anyway. Everyone in the unraid community says running ecc is overkill, which is why I didn't put so much emphasis on it initially.

I'm happy to report though that my system seems to be good now. After rebuilding unraid twice and reinstalling all dockers, everything seemed to start working again.
 