Google calls out AMD and Intel on high failure rates on excessively dense chips.

Lakados

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf

TLDR;
Google researchers have published a paper describing what they call "mercurial" cores: cores that are subject to what Google calls "corrupt execution errors," or CEEs. One critical characteristic of CEEs is that they are silent.

“We expect CPUs to fail in some noticeable way when they miscalculate a value, whether that results in an OS reboot, application crash, error message, or garbled output. That does not happen in these cases. CEEs are symptoms of what Google calls “silent data corruption,” or the ability for data to become corrupted when written, read, or at rest without the corruption being immediately detected.”

Testing was done on CPUs with 8 to 64 cores built on newer, smaller processes. No determinable pattern was found, and Google is asking AMD and Intel to step up their internal testing.
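
For anyone wondering how you even catch an error that is silent by definition: the general trick is redundancy, i.e. do the same deterministic work more than once and compare. A toy sketch of that idea (this is not Google's actual tooling, and the workload and iteration counts below are made up):

```python
# Toy sketch of what screening for "corrupt execution errors" looks like in
# spirit: repeat a deterministic computation and flag any disagreement.
# NOT Google's fleet tooling; names and counts here are made up.
import hashlib
import os

def workload(blob: bytes) -> str:
    """A deterministic, CPU-heavy computation with exactly one right answer."""
    digest = blob
    for _ in range(10_000):                # arbitrary iteration count
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

def screen_core(rounds: int = 100) -> bool:
    """True if every repetition agrees; a mismatch hints at silent corruption."""
    blob = os.urandom(1 << 16)             # one fixed random input per screening pass
    reference = workload(blob)
    for i in range(rounds):
        if workload(blob) != reference:
            print(f"mismatch on round {i}: possible CEE / mercurial core")
            return False
    return True

if __name__ == "__main__":
    print("core looks healthy" if screen_core() else "core is suspect")
```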

By intentionally using these broken cores, however, they were able to generate encrypted data that could only be decrypted by the individual faulty CPU that encrypted it, which they describe as both exciting and terrifying.
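
To make the "only that CPU can decrypt it" part concrete, here is a deliberately contrived toy (not the example from the paper): if a flaky core consistently botches one step of a key derivation, it derives a key nobody else can reproduce, so its ciphertext only comes back cleanly on that same broken part.

```python
# Contrived toy: a "mercurial" core that consistently miscomputes one step of a
# key derivation produces a key that no healthy CPU can reproduce.
import hashlib

def derive_key(passphrase: bytes, buggy: bool = False) -> bytes:
    data = passphrase
    for i in range(1000):
        data = hashlib.sha256(data + i.to_bytes(4, "little")).digest()
        if buggy and i == 500:
            data = bytes([data[0] ^ 0x04]) + data[1:]   # the silent, repeatable flip
    return data

def xor_crypt(key: bytes, msg: bytes) -> bytes:
    return bytes(m ^ key[i % len(key)] for i, m in enumerate(msg))

secret = b"backup of the finance database"
ciphertext = xor_crypt(derive_key(b"hunter2", buggy=True), secret)

print(xor_crypt(derive_key(b"hunter2", buggy=True), ciphertext))   # round-trips on the faulty core
print(xor_crypt(derive_key(b"hunter2", buggy=False), ciphertext))  # garbage on a healthy CPU
```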
 
Sounds great until you have to recover encrypted data from a server with a bad CPU. I guess don't do any disk encryption with these CPUs unless you don't care if it's portable?

I kinda suspected this, anyway. There were too many issues with memory stability for it to just be a memory issue. Likely something wrong with the memory controllers or how they communicate with the CPU and RAM.
 
Also, it sounds like hardware dyslexia: the CPU has an issue with reading/writing data. It understands anything it reads/writes fine because it always misinterprets the data the same way (or nearly the same way), but data from other CPUs gets garbled because they (mis)interpret the data differently.
 
What the paper is describing isn't a memory thing but an odd failure in the transistors on the CPUs themselves and how that interacts with the built-in error-correction algorithms. From what I understood, anyway; that paper is above me in so many ways.
 
In other words, an endianness/byteswap problem taken up to 11.

The difference being that if some processors and transmission protocols are big-endian (68k, PowerPC) and others are little-endian (x86), you can at least put standards in place to account for this and byteswap where appropriate. (Bonus points if it's bi-endian like ARMv3 and later and can work either way.)
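
(To be clear about the contrast: endianness is the easy case precisely because you can normalize it in software. A quick sketch with Python's struct module, picking big-endian as the agreed wire order just for illustration:)

```python
# Endianness is fixable in software: agree on a byte order for the wire and
# convert at each end.
import struct

value = 0xDEADBEEF

wire = struct.pack(">I", value)            # ">I": big-endian 32-bit on the wire
as_little = struct.unpack("<I", wire)[0]   # naively reading it little-endian...
as_big = struct.unpack(">I", wire)[0]      # ...vs honoring the agreed byte order

print(hex(as_little))  # 0xefbeadde -> garbled, bytes read back to front
print(hex(as_big))     # 0xdeadbeef -> correct once both sides agree
```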

This sounds more like CPUs of the same family not outputting bits and bytes consistently with each other due to manufacturing errors, despite supposedly being the exact same product built to the exact same architectural standards. That is understandably much more difficult to correct for, more so if it isn't consistent throughout an entire stepping/revision of that CPU and a workaround for one example completely breaks other examples. You'd have to hand-tailor a fix for each defective CPU, which is by no means realistic for anyone to implement.

I can only hope that whatever platform I upgrade to next year, once DDR5 platforms roll out, won't have this problem. I already remember how TSX is effectively useless on my 4770K, despite being one of Haswell's much-touted features, because of some errata that Intel had overlooked.

EDIT: Just now saw Lakados' reply above. If it's not memory-related, it definitely wouldn't be akin to the endianness problem at all. Makes you wonder what makes transistors fail like that... physics becomes more and more of a pain with how small they're getting, as I understand.
 
Other places reporting on this describe it as a randomized series of hardware failures with results similar to the early Pentium FDIV bug, where the processor makes basic math mistakes. But due to the large number of cores and the billions of transistors on each core, no two failures produce the same wrong results.
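
(For comparison, the FDIV flaw was at least easy to demonstrate once the bad operand pairs were known. The classic spreadsheet-era check, from memory, was a single division that a correct FPU reduces to zero:)

```python
# The classic FDIV sanity check, as I remember it: a correct FPU returns 0.0,
# while the flawed original Pentium famously returned 256 for this expression.
def fdiv_check() -> float:
    x, y = 4195835.0, 3145727.0
    return x - (x / y) * y

residue = fdiv_check()
print("FPU divides correctly" if residue == 0.0 else f"FDIV-style error, residue = {residue}")
```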
 
Ah, the FDIV bug... I remember that. Makes me wonder how Quake functioned as intended if that game sold lots of Pentiums specifically because of how much it relied on the FPU in the days before 3dfx and GLQuake.

So each defective CPU is uniquely wrong, as it were, but still functional enough to somehow run most software without crashing because it's consistently wrong, or the errors are subtle enough to where you don't notice until it's too late... kind of like having a system with a bad stick of RAM that still passes POST and boots the OS somehow, but screws you over with silent data corruption. Not fun!
 
They are subtle enough that the internal error-correction algorithms "correct" for it, but those corrections aren't always 100% accurate. So you don't get crashes; you get subtle differences. In operational data centers, where you have 100,000 CPUs running upwards of 6 million cores, those inconsistencies are leading to data corruption and compile errors as data is shifted around: not because the data is bad, but because those CPUs have misinterpreted it and passed on a slight fault that compounds over time as it hits more bad cores. So you, me, and the other guy are probably never going to have an issue with this; it's a huge data center problem.
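
The scale argument is really just probability: a per-operation error rate that rounds to zero on one desktop becomes a near-certainty across millions of cores. The rate and throughput numbers below are invented purely to show the shape of the math, not taken from the paper.

```python
# Invented numbers, just to show why a vanishingly small per-operation error
# rate still guarantees corruption events at data-center scale.
import math

p = 1e-15                                   # hypothetical chance one op silently corrupts
ops_desktop = 1e9 * 3600                    # ~an hour of work on one busy core
ops_fleet = 1e9 * 3600 * 24 * 6_000_000     # a day across ~6 million cores

def p_any_error(ops: float) -> float:
    # P(at least one error) = 1 - (1 - p)^ops, computed stably for tiny p
    return 1.0 - math.exp(ops * math.log1p(-p))

print(f"one core, one hour: {p_any_error(ops_desktop):.6f}")   # ~0.0036
print(f"fleet, one day:     {p_any_error(ops_fleet):.6f}")     # ~1.0
```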
 