X470 ECC Support

drescherjm

In less than 2 weeks I am headed to MC to purchase an X470 / 2700 combo; however, I very much want ECC support for this build, since it will be for a server application (ZFS on Linux / PVR). I have spent a little time looking into this and have found very few options. Of the Gigabyte boards, only the X470 AORUS Gaming 7 WIFI lists ECC support; the cheaper models say ECC memory operates in non-ECC mode.

https://www.gigabyte.com/us/Motherboard/X470-AORUS-GAMING-7-WIFI-rev-10#sp

I am having trouble finding info for ASUS.

ASRock does mention ECC support.
AMD Ryzen series CPUs (Pinnacle Ridge) support DDR4 3466+(OC) / 3200(OC) / 2933/2667/2400/2133 ECC & non-ECC, un-buffered memory*
 
Good luck; none of this shit is well documented, and you get endless finger pointing when you try to verify it. It's all unofficial. Hell, it's hard enough just to verify on Threadripper, where it's official... maybe. Summit Ridge worked on X370 on good boards, and Pinnacle Ridge should work on good X470 boards; they really aren't much different. It does seem that Raven Ridge is fucked with no ECC possible, though, probably something to do with the GPU added to the IMC (random AMD reddit comments saying half-yes vs. motherboard vendors saying no).

The only true positive is ECC corrected errors logged by your OS. All the bit width, registry and other bullshit is unverifiable. It's also possible to have false negatives (no logging), and inducing errors without totally crashing or fucking up your system is a problem in itself.

Still better than Intel, though, where you know the answer is "fuck you", plus you just reminded them of another feature to remove/segment with the next generation.
 
I think the only way to ensure you get ECC support without getting Threadripper or EPYC is to get a Ryzen Pro CPU (which only comes in made-to-order workstations as far as I can tell, at least for now).

Don't know about support on X470; just that my Gigabyte X370 Aorus K5 only supports unbuffered ECC, and I don't know how well it's supported.
 
At least for us, ECC (unbuffered) support is based on PCB layers. You need 6-layer boards to support it, e.g. the Gaming 7. (Of course the CPU has to support it as well.)

The full list for GIGABYTE is:

X470 AORUS Gaming 7 WIFI
X370 Gaming 5, Gaming K7, AB350N Gaming WIFI.
 
At least for us, ECC (unbuffered) support is based on PCB layers. You need 6-layer boards to support it, e.g. the Gaming 7. (Of course the CPU has to support it as well.)

The full list for GIGABYTE is:

X470 AORUS Gaming 7 WIFI
X370 Gaming 5, Gaming K7, AB350N Gaming WIFI.

Thanks, I really appreciate the official response.
 
Good luck; none of this shit is well documented, and you get endless finger pointing when you try to verify it. It's all unofficial. Hell, it's hard enough just to verify on Threadripper, where it's official... maybe. Summit Ridge worked on X370 on good boards, and Pinnacle Ridge should work on good X470 boards; they really aren't much different. It does seem that Raven Ridge is fucked with no ECC possible, though, probably something to do with the GPU added to the IMC (random AMD reddit comments saying half-yes vs. motherboard vendors saying no).

The only true positive is ECC corrected errors logged by your OS. All the bit width, registry and other bullshit is unverifiable. It's also possible to have false negatives (no logging), and inducing errors without totally crashing or fucking up your system is a problem in itself.

Still better than Intel, though, where you know the answer is "fuck you", plus you just reminded them of another feature to remove/segment with the next generation.

What do you mean by "All the bit width, registry and other bullshit is unverifiable"?

Are there any systems that do this?
 
I got my Ryzen 2700 and the ASUS X470 Prime Pro, plus a 16 GB Crucial DDR4-2400 EUDIMM kit (ECC, unbuffered). I put it together last night. In the BIOS / UEFI I did not see any mention of ECC. The specifications say ECC; I hope the ASUS specifications were not lying. I did not have time to do any real testing since it was late.

https://www.asus.com/us/Motherboards/PRIME-X470-PRO/specifications/
 
After at least 1 day of kill-ryzen testing, I am pretty sure that my Ryzen 2700 does not have the Linux / GCC bug discussed in this thread:

https://community.amd.com/thread/215773

Also, I am very impressed with its energy efficiency. It's been running for hours on end compiling with all 16 threads at full load, and it's staying around 60C with the stock heatsink at a low fan speed. I believe it is on the ASUS normal profile I selected in the UEFI/BIOS.
 
BTW, I believe ECC is enabled and working.

Here is a little info on the CPU and OS / kernel:

jmd1 ~/shell-scripts # uname -a

Linux jmd1.comcast.net 4.16.13-gentoo-20180603-1145-jmd1.comcast.net #3 SMP Sun Jun 3 11:52:55 EDT 2018 x86_64 AMD Ryzen 7 2700 Eight-Core Processor AuthenticAMD GNU/Linux


This tells me it's enabled:

jmd1 ~/shell-scripts # dmesg | grep ECC
[ 8.557846] systemd[1]: systemd 238 running in system mode. (+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL -XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
[ 9.132922] EDAC amd64: Node 0: DRAM ECC enabled.

This tells me there have been 0 errors (from server experience, I expect ECC errors to be rare):
jmd1 ~/shell-scripts # edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
edac-util: No errors to report.
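If you want to watch those counters without re-running edac-util by hand, the kernel also exposes them directly in sysfs. Here's a minimal Python sketch that reads the per-controller counts; the read_edac_counts helper is just something I threw together for illustration, and it assumes the standard EDAC sysfs layout with the amd64_edac driver loaded.

```python
from pathlib import Path

def read_edac_counts(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return (corrected, uncorrected) ECC error counts for one memory controller."""
    mc = Path(mc_dir)
    ce = int((mc / "ce_count").read_text())   # corrected (single-bit) errors
    ue = int((mc / "ue_count").read_text())   # uncorrected (multi-bit) errors
    return ce, ue

# Usage on a live system (needs the amd64_edac driver loaded):
#   ce, ue = read_edac_counts()
```

Drop that in a cron job and you'll notice a corrected error even if you never catch it in dmesg.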

This tells me the mode of error correction for the first rank:
jmd1 ~/shell-scripts # cat /sys/devices/system/edac/mc/mc0/rank0/dimm_edac_mode
SECDED

The same goes for the 2nd rank:
jmd1 ~/shell-scripts # cat /sys/devices/system/edac/mc/mc0/rank1/dimm_edac_mode
SECDED


Here is what SECDED means:
EDAC_SECDED
Single-bit error correction, double-bit error detection
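Since SECDED comes up here, a toy illustration of how it works: the sketch below implements an extended Hamming(8,4) code in Python. Real DRAM ECC uses a wider code (8 check bits per 64 data bits, 72 total), but the single-error-correct / double-error-detect principle is the same. The function names are just for illustration, nothing from the EDAC driver.

```python
def encode(data4):
    """Encode 4 data bits (list of 0/1) into 8 code bits (extended Hamming)."""
    d = data4
    p1 = d[0] ^ d[1] ^ d[3]                  # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                  # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                  # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    p0 = 0
    for b in code:                           # overall parity for double detection
        p0 ^= b
    return [p0] + code

def decode(code8):
    """Return (data4, status); status is 'ok', 'corrected', or 'double-error'."""
    c = code8[1:]                            # Hamming positions 1..7
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3          # position of flipped bit, 0 = none
    overall = 0
    for b in code8:
        overall ^= b                         # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                       # odd number of flips: single, fixable
        if syndrome:
            c[syndrome - 1] ^= 1
        status = "corrected"
    else:                                    # even flips with nonzero syndrome
        return None, "double-error"
    return [c[2], c[4], c[5], c[6]], status
```

Flip one bit of a codeword and decode() repairs it; flip two and it can only report "double-error", which is exactly the behavior the "Single bit error correction, Double detection" label describes.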
 
I am thinking of trying to increase the memory frequency to see if I can trigger ECC corrections.

Edit: I set it to 2666 to begin with (output is from lshw):

*-memory
description: System Memory
physical id: 2e
slot: System board or motherboard
size: 16GiB
capabilities: ecc
configuration: errordetection=multi-bit-ecc
*-bank:0
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2666 MHz (0.4 ns)
product: 9ASF1G72AZ-2G3B1
vendor: Micron Technology
physical id: 0
serial: 1C1B9C25
slot: DIMM_A1
size: 8GiB
width: 64 bits
clock: 2666MHz (0.4ns)
*-bank:1
description: [empty]
product: Unknown
vendor: Unknown
physical id: 1
serial: Unknown
slot: DIMM_A2
*-bank:2
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2666 MHz (0.4 ns)
product: 9ASF1G72AZ-2G3B1
vendor: Micron Technology
physical id: 2
serial: 1C1B9ECB
slot: DIMM_B1
size: 8GiB
width: 64 bits
clock: 2666MHz (0.4ns)
*-bank:3
description: [empty]
product: Unknown
vendor: Unknown
physical id: 3
serial: Unknown
slot: DIMM_B2
 
That is one way, but in my experience the Zen IMC and/or the AGESA memory training/timing boot check is usually the weaker link versus known-good B-die. Don't forget to feed the DIMMs decent voltage too.

Another way is a hairdryer: start slowly, at a low setting and far away.

The foolproof way from ancient times is to tape over one R/W pin to force an error to be corrected every time, but I have no idea if this is still a viable method with current DDR design. (If you can even boot this way past the first billion+ errors, you're golden.)
 
The foolproof way from ancient times is to tape over one R/W pin to force an error to be corrected every time, but I have no idea if this is still a viable method with current DDR design.

I actually thought about that.
 
I have the ASRock 470 ITX and it didn't work with the ECC 32GB registered DIMMs I have (Samsung chips).
 
Yes, I expected this, since Ryzen only supports unbuffered memory; just wanted to let anyone else know in case they were wondering. This stick did work inside X99 and X299 boards, though.
 
I have an Asus Prime X370 Pro running with 2 x 16GB DDR4-2400 Crucial ECC DIMMs, and ECC is enabled. When I was shopping for my X470 motherboard, I considered the Asus X470 Prime Pro, but saw that its webpage didn't mention ECC and found no ECC DIMMs listed in its QVL. So I went with an ASRock X470 Taichi instead. I might have bought the X470 Prime Pro if I had known it would support ECC.
 
At least for us, ECC (unbuffered) support is based on PCB layers. You need 6-layer boards to support it, e.g. the Gaming 7. (Of course the CPU has to support it as well.)

The full list for GIGABYTE is:

X470 AORUS Gaming 7 WIFI
X370 Gaming 5, Gaming K7, AB350N Gaming WIFI.
That was a great response. I went with ASRock, as ECC is specifically listed in their specs, but had I known this I would have considered Gigabyte too. Please ask Gigabyte engineering or marketing to put a little effort into this. With the lack of workstation boards from Supermicro or Tyan, people who want to use Ryzen for ZFS or ESXi workstations need to find boards with ECC. It will drive a few more board sales, I promise. My last two Ryzen board purchases were driven by this feature, which a lot of boards likely already have.
 
At least for us, ECC (unbuffered) support is based on PCB layers. You need 6-layer boards to support it, e.g. the Gaming 7. (Of course the CPU has to support it as well.)

The full list for GIGABYTE is:

X470 AORUS Gaming 7 WIFI
X370 Gaming 5, Gaming K7, AB350N Gaming WIFI.
Do any B450 boards meet this 6-layer PCB requirement for unbuffered ECC support (running in ECC mode)?
 
The ASRock B450M Pro4 mentions ECC support as well.

https://www.asrock.com/mb/AMD/B450M Pro4/index.us.asp

- AMD Ryzen series CPUs (Pinnacle Ridge) support DDR4 3200+(OC) / 2933/2667/2400/2133 ECC & non-ECC, un-buffered memory*
- AMD Ryzen series CPUs (Summit Ridge) support DDR4 3200+(OC) / 2933(OC) / 2667/2400/2133 ECC & non-ECC, un-buffered memory*

I may have to buy one to test it out. I need to put in a SAS HBA.
 
Question here: will Ryzen Pro CPUs be needed for full ECC support, and would these boards support that?

Also be interested if they could get that support down to ITX... :D
 
Ryzen Pro only matters for ECC on the APUs; the non-APU parts should all have ECC support baked in.

I was thinking about a B350/B450 for a transcoding storage server, but then I realized there's no display output if you use Pinnacle/Summit Ridge, and also no IPMI.

So it looks like AMD still has a ways to go in the 1S server market. Basically, it seems like it's EPYC or nothing.
 
Ryzen Pro only matters for ECC on the APUs.

Not that I'm disagreeing with your point, but AMD does have Ryzen Pro SKU equivalents for all consumer Ryzen CPUs, not just the APUs, which is the source of my question.

I'm also poking at the idea that there's a difference between 'officially supported' and 'it probably works/AMD didn't try to break it'. And I would certainly be looking at the APUs!
 
Not that I'm disagreeing with your point, but AMD does have Ryzen Pro SKU equivalents for all consumer Ryzen CPUs, not just the APUs, which is the source of my question.

I'm also poking at the idea that there's a difference between 'officially supported' and 'it probably works/AMD didn't try to break it'. And I would certainly be looking at the APUs!

Didn't even realize AMD made a Ryzen Pro based on Summit Ridge. But those look MIA; I can't find stock of them anywhere.

I think the only difference between 'officially supported' and 'AMD didn't disable it' is that if you ever need support, you're SOL. As people have mentioned before, actually forcing an ECC event and seeing how it is handled is the only way to verify that it works.
 
Didn't even realize AMD made a Ryzen Pro based on Summit Ridge. But those look MIA; I can't find stock of them anywhere.

Then that's the case for all of them, really. They've been announced and talked about for a year, and they're really just Xeon'd-up consumer parts, so if they are a 'thing' we should be able to buy them, right?

And having 'official', as in 'you know it's working', support for ECC is kind of a big deal for the stuff that more or less needs it the most, like file servers. The 'Pro' APUs are especially interesting because they would combine strong dependability for NAS duty with enough performance for VM work, plus the ability to pass real GPU compute resources to said VMs or to use for, say, media transcoding.

Of course, we'd also want a Supermicro-style board that provides 10Gbit for the NAS channel, a pair of 1Gbit ports for WAN/LAN (firewall/IDS/IPS and routing if you're brave), and a 1Gbit port for IPMI, along with an eight-channel SAS controller, all on mITX so it could be used as a converged workcenter appliance. :D
 
I have not had a single recorded ECC correction (not really expected at stock, but possible when overclocking the RAM). I only pushed it up to DDR4 2700, though, and the system has been off for most of the last few weeks due to an incompatibility (or other issue) with a G1 WDC Black 512GB NVMe drive. I just don't have time to debug at the moment.
 
incompatibility (or other issue) with a G1 WDC Black 512GB NVMe drive

I've actually had some issues with the G1 WDC Black 512GB PCIe drive as well. This is the first time I have seen anyone else report a similar experience.

I have two and was able to confirm it was a device issue because they both would fall off the bus in AM4 motherboards but were fine in an Intel C236 motherboard. I found that lower quality PSUs without DC-DC regulation (Corsair CX430, CX430M) were more problematic but I still saw this issue on an ASRock Taichi X470 with better PSUs (Corsair RM550x, EVGA 550 G3). I installed a G2 WDC Black 500GB NVMe in the same system and had no problems. I also found a FW update for the G1 WD Black but haven't tested the drive since applying the update.
 
I have a reasonably high-end 1000W modular PSU on that system. The problem seems to have become less frequent (less than once per week of 24/7 uptime); however, that is no good for what I want to do with the system.
 
I have a reasonably high-end 1000W modular PSU on that system. The problem seems to have become less frequent (less than once per week of 24/7 uptime); however, that is no good for what I want to do with the system.
Did you try the updated firmware? Mine shipped with B35500WD, but the latest is B35900WD.

Code:
root@nanoluteus:~/server-status# nvme fw-log /dev/nvme0n1
Firmware Log for device:nvme0n1
afi  : 0x1
frs1 : 0x4457303035353342 (B35500WD)

I have also found that setting the BIOS Power Supply Idle Control option to "Typical Current Idle" helped with some idle lockup issues I experienced (see https://bugzilla.kernel.org/show_bug.cgi?id=196683)
 
Thanks. I was going to move it to a Windows box to update the firmware, but it looks like I already have the latest, so no need for that. I have not updated the BIOS/UEFI on my X470 board, however.

Code:
jmd1 ~ # nvme fw-log /dev/nvme0n1
Firmware Log for device:nvme0n1
afi  : 0x1
frs1 : 0x4457303039353342 (B35900WD)
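Side note on reading that frs1 value: it's just the 8-character ASCII firmware revision packed into a little-endian 64-bit integer, so you can decode it yourself. A quick Python sketch (decode_frs is just a throwaway helper name, not part of nvme-cli):

```python
def decode_frs(frs_hex):
    """Turn an frs value like '0x4457303039353342' back into its ASCII string."""
    raw = int(frs_hex, 16).to_bytes(8, byteorder="big")
    return raw[::-1].decode("ascii")  # stored little-endian, so reverse the bytes

# decode_frs("0x4457303039353342") returns "B35900WD"
```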
 
Since you mentioned Linux: the reduction in frequency seemed to be related to a kernel update. I noticed there were some NVMe updates a few revisions back, somewhere around 4.17.12.
 
I have also found that setting the BIOS Power Supply Idle Control option to "Typical Current Idle" helped with some idle lockup issues I experienced (see https://bugzilla.kernel.org/show_bug.cgi?id=196683)

I just finished reading the entire thread. I am not sure that this issue is what I have.

What happens to me is that after some period of time the system hard locks up. After pressing the reset button, sometimes (not always) the NVMe drive is not detected. When the NVMe drive disappears, I have to power off the machine, boot from a sysrescue USB stick, reinstall GRUB on the NVMe drive, and then reboot.

There is one other weird symptom that has happened twice. If I was logged in via SSH (the GUI had already locked up) when this happened, htop showed one thread with an entire core 100% busy in a kernel task. Several tasks were unkillable; I could still read and write to the NVMe drive, but eventually the SSH session would disconnect.

Before the kernel update (that I mentioned in the previous post), I was able to trigger this behavior by rebuilding a lot of large packages in Gentoo (like chromium).


BTW, the NVMe drive has ZFS on Linux (ZoL) installed and always scrubs without errors, even after it disappears from the system.
 
I've actually had some issues with the G1 WDC Black 512GB PCIe drive as well. This is the first time I have seen anyone else report a similar experience.

I have two and was able to confirm it was a device issue because they both would fall off the bus in AM4 motherboards but were fine in an Intel C236 motherboard. I found that lower quality PSUs without DC-DC regulation (Corsair CX430, CX430M) were more problematic but I still saw this issue on an ASRock Taichi X470 with better PSUs (Corsair RM550x, EVGA 550 G3). I installed a G2 WDC Black 500GB NVMe in the same system and had no problems. I also found a FW update for the G1 WD Black but haven't tested the drive since applying the update.

I had a compatibility problem with a FusionIO ioDrive2 1.2TB card. It couldn't boot with the card in the 3rd x16 PCIe slot, or in the 2nd x16 PCIe slot in Gen3 mode; it only works with the PCIe mode set to Gen2 in the first two PCIe slots. I imagine it didn't work in the 3rd slot because that card needs at least x8 bandwidth. My FusionIO ioDrive Duo works just fine in the 3rd x16 slot. This was on the Gigabyte Aorus Gaming 7 WIFI motherboard.
 
I just finished reading the entire thread. I am not sure that this issue is what I have.

What happens to me is that after some period of time the system hard locks up. After pressing the reset button, sometimes (not always) the NVMe drive is not detected. When the NVMe drive disappears, I have to power off the machine, boot from a sysrescue USB stick, reinstall GRUB on the NVMe drive, and then reboot.

There is one other weird symptom that has happened twice. If I was logged in via SSH (the GUI had already locked up) when this happened, htop showed one thread with an entire core 100% busy in a kernel task. Several tasks were unkillable; I could still read and write to the NVMe drive, but eventually the SSH session would disconnect.

Before the kernel update (that I mentioned in the previous post), I was able to trigger this behavior by rebuilding a lot of large packages in Gentoo (like chromium).


BTW, the NVMe drive has ZFS on Linux (ZoL) installed and always scrubs without errors, even after it disappears from the system.
I had the 2017 WD Black 512GB become inaccessible and go undetected by the BIOS even after I pressed the reset button. I would have to power cycle the machine to see the drive again. That matches your experience, although mine would boot up again from the drive without any problems.

I was hoping that the FW update would fix the issue. Oh well. All of my testing was with Ubuntu 16.04 and 18.04, so my kernels were considerably older than yours.

I considered requesting support from WDC, but it's been easier to just get new drives. In addition to the 2018 WD Black, I've also used ADATA NVMe drives with Realtek controllers, and they've all been trouble-free.

I'm about to give the 2017 WD Black with updated FW another try in a new system and will report what happens.
 
I just purchased a 1TB 960 EVO (on sale for $248 + tax in an Amazon lightning deal). Hopefully that fixes the issue.
 
I've got the 960 EVO installed and the system cloned. It may be a placebo effect, but it seems significantly faster in STR (sequential transfer rate).

One thing I did not mention: since I am using ZFS (on Linux) for my root filesystem, I don't have TRIM (yet; it's under development). I am not sure if that caused any problems with the WDC Black; however, I had less than 100GB used of a 5XX GB G1 Black, and now the same amount used in a 1TB 960 EVO.
 
One thing I did not mention: since I am using ZFS (on Linux) for my root filesystem, I don't have TRIM (yet; it's under development). I am not sure if that caused any problems with the WDC Black; however, I had less than 100GB used of a 5XX GB G1 Black, and now the same amount used in a 1TB 960 EVO.

If you're not using more than one drive, why use ZFS with its known limitations? It's a brilliant filesystem, but what does it do for you that, say, EXT4 wouldn't?
 
I have ZFS on several single-drive installations (home and work). I make good use of the other features like snapshots and send/receive. The lack of TRIM has not caused an issue, except possibly for the WDC G1 Black. My 10+ year old Core 2 Quad system (which the Ryzen is supposed to replace) has had a 1TB SanDisk SSD as a single-disk ZFS root for probably 2 years now.
 