AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

erek

[H]F Junkie
Joined
Dec 19, 2005
Messages
10,890
“A clock timer bug brings second-gen EPYCs to a halt.”

“And then there are the folks that just want to join the uptime club and set a record. To do that, you have to beat the computer onboard the Voyager 2 spacecraft. Yeah, the one that was the second to enter interstellar space. That computer has been running for 16,735 days (45+ years), and counting.

For terrestrial records, 6,014 days (16 years) seems to be the record for a server, but I've seen plenty of debate over other contenders for the crown. (The small r/uptimeporn Reddit community has plenty of examples of extended uptimes.)

In either case, you won't get to break that type of record with any of the EPYC Rome chips: the erratum will not be fixed, so your cores won't exceed the 1,044-day threshold by much under any circumstances. AMD's note says it won't fix the issue; perhaps the company decided it is too costly to fix in silicon, a microcode/firmware fix carries too much performance overhead, or there simply aren't enough impacted customers to make the fix worthwhile.

In either case, disabling the server's CC6 sleep state will help you sleep at night, or you could just make sure to reboot every 1,000 days or so.”


Source: https://www.tomshardware.com/news/amds-epyc-rome-chips-could-hang-after-1044-days-of-uptime
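For anyone who wants to try the CC6 workaround from software rather than waiting on a BIOS toggle, here is a minimal sketch, assuming a Linux host that exposes the cpuidle sysfs interface and names the deep core idle state something like "C6". The exact state name and index vary by platform and kernel, so treat the matching below as an assumption and check /sys/devices/system/cpu/cpu0/cpuidle/state*/name on your own boxes first:

```python
# Sketch: disable deep core C-states (e.g. C6/CC6) via the Linux cpuidle sysfs
# interface. Assumes a Linux host exposing /sys/devices/system/cpu/*/cpuidle.
# The state index that maps to CC6 differs by platform, so we match on the
# state's advertised name instead of a fixed index. Run as root.
import glob
import os

TARGET_NAMES = {"C6", "CC6"}  # platform-dependent; adjust after checking the name files


def disable_deep_cstates():
    for state_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpuidle/state*"):
        try:
            with open(os.path.join(state_dir, "name")) as f:
                name = f.read().strip()
            if name in TARGET_NAMES:
                with open(os.path.join(state_dir, "disable"), "w") as f:
                    f.write("1")  # 1 = do not enter this idle state
                print(f"disabled {name} at {state_dir}")
        except OSError as e:
            print(f"skipping {state_dir}: {e}")


if __name__ == "__main__":
    disable_deep_cstates()
```

The write does not survive a reboot, so in practice most shops would set the equivalent BIOS/BMC option or bake something like this into their provisioning tooling instead.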
 
We've experienced this and weren't sure what was going on. Definitely up that long! Came in, hypervisor was ON, fans spinning full speed (loud AF) and only a power cycle fixed it. Everything came back fine.
 

Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

by Fouquin
AMD recently published an erratum for their second-generation EPYC processors based on Zen 2 which states that, "A core will fail to exit CC6 after about 1044 days after the last system reset." 1044 days is roughly 34 months, or just shy of 3 years of total uptime, and is actually an overestimate according to some sysadmin sleuths on Reddit and Twitter who did the math and discovered the actual time is 1042 days and 12 hours. The problem occurs because the CPU REFCLK counts 10ns ticks in a 54-bit signed integer, and if you count just over 9 quadrillion of these ticks you get the resulting overflow at 1042.4999 days. Once this overflow occurs the cores are stuck forever in a zombie state and will not take any external interrupt requests. Well, forever until you flip the power switch off and back on again, which will reset the counter.

It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart. Though this does put EPYC "Rome" out of the running for any possible awards for longest-running systems, it may serve as a reminder to initiate system updates or patches for other vulnerabilities that have been discovered in the four years since that generation of processors first launched. AMD does not plan to issue any fix for the CC6 bug, instead recommending that administrators disable CC6 to avoid the cores entering the zombified state, or simply initiate a reboot before the counter rolls over.
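Quick sanity check on the numbers above, since they are the whole story here: with a 54-bit signed counter the sign bit flips after 2^53 ticks, and at 10 ns per REFCLK tick that works out to almost exactly 1042.5 days, matching the 1042.4999-day figure in the article:

```python
# Back-of-the-envelope check of the erratum math described in the article:
# a 54-bit signed counter of 10 ns REFCLK ticks goes negative at 2**53 ticks.
TICK_NS = 10              # one REFCLK tick = 10 ns
OVERFLOW_TICKS = 2 ** 53  # sign bit of a 54-bit signed counter flips here

seconds = OVERFLOW_TICKS * TICK_NS * 1e-9
days = seconds / 86_400
print(f"{OVERFLOW_TICKS:,} ticks -> {days:.4f} days")
# 9,007,199,254,740,992 ticks -> 1042.4999 days
```

Which is why the Reddit/Twitter sleuths call AMD's "about 1044 days" an overestimate.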
 
Amazing that all updates can be done hot these days, meaning these machines don't even have to be rebooted otherwise after 3 years.

Oh, the pandemonium that must occur when the IT department has to schedule the one reboot cycle every 3 years for all their servers.
 
Yeah, imagine hundreds of physical racks needing to be rebooted, because you know they're all hosting thousands of VMware vSphere VMs and rebooting a VM doesn't mitigate this bug.
 
The dreaded… reboot day. It's a blackout day, boys. We're going to be rebooting for 14 hours.
 
That's when you do controlled reboots, and with proper failover and stretched clusters, no one should even notice. These days, if rebooting a single server impacts anything, you're doing it wrong.
 
This. We have continuous reboots happening here... right now at any given moment (if anyone cared to look - no one does), you'd see between 100 and 200 baremetal machines rebooting. We don't pay VMware (I wouldn't even want to see what that'd cost, it'd be stupid), and we don't bother with any live migration on qemu... it's literally not worth the engineering effort to implement for us. Similar with implementing hot patching - why bother when it's super easy to just let the reboots keep going and pull the new image to boot.
 
Was going to say, in a properly built host cluster, rebooting the physical hosts shouldn't be that big of a problem. Just make sure to put them in maintenance mode and migrate the VMs first.
 
I doubt we can go even a year without rebooting due to server updates, software updates, and general maintenance. None of my users ever notice, because the VM load is shifted to the other boxes automatically while one is being maintained. Then we hit end of service life (the hardware warranty basically expires) and have no choice but to replace the servers on a roughly 4-5 year time span, so that means maybe one forced reboot in the life of a server chip, and only if the server software somehow had no major updates in 3 years that required a reboot.
 
Exactly, and if someone does have a box like this running for that long, then they are way behind on patching and are running an insecure environment, and possibly an unsupported config as well.
 