Ryzen issues with Linux

GiGaBiTe

Has anyone been able to get a Ryzen chip stable in Linux?

I built a Ryzen 1500X system about a month ago for use as a multipurpose Linux server, and I can't get it to run stably no matter what I do. I did the standard debugging and found the original memory sticks were veeeeery bad.



I returned them for replacements and ran another memtest for 40+ hours, and the new sticks were good. I've removed all the cards and tried four different power supplies, different drive configurations, different BIOS versions, BIOS configurations, memory speeds, etc., and nothing helps. CPU temperatures are fine; it runs between 80 and 130°F at all times.

Basically what happens is after a variable amount of time, the system will hard lock on whatever it's currently doing. The Xorg server will freeze and the machine becomes unresponsive to pings, so it's dead until a reboot.
 
There could be several things gumming up the works here. A more specific list of the core hardware and what Linux distribution and version you are running might very well be helpful. Does this happen at stock settings? If so, you could try disabling SMT and see if it gets any better. Not saying that SMT is bad, but disabling it could help narrow the scope.
 
At stock I have no issues on Arch Linux with my R5 1600. Overclocked, it's fine unless I adjust the BCLK, but then it's not quite stable in Windows either (I blame memory instability for this).
 
I've built a multimedia rig using an 1800X and memory according to the motherboard specs, G.Skill in this case. I'm running Zorin 12.4 Linux on one hard drive because I'm a noob in Linux Land, but it was one of the easiest installs I have ever encountered. No hiccups whatsoever. I run multiple operating systems on my rig due to legacy issues with equipment that I still use and, of course, for screwing around, so each OS is on a different SSD in a swap caddy. I've had similar situations with Micro-$haft 10, and it was internet related. I also have a backup rig and a test bench for situations like this.

Each computer built has its own life, it seems, and what works on one might not work on another. I feel for you in your frustration because I think we've all been there at one time or another. In my case I would do a complete teardown: swapping out components to make sure they are not the problem (including the motherboard) and making sure all of my connections are snug and clean. Again, I feel for you because this is going to be a long and frustrating task, but if you can rule out any hardware issues then it has to be a software one.

Good luck with this.

Added: OH I liked what Nobu just commented above too!
 
That Ryzen bug is supposed to be kind of difficult to randomly trigger if that is what it is. It happens when there is a lot of heavy multi-threaded activity of a specific type going on with SMT enabled, and disabling SMT eliminates that (a reason I suggested it).
 
There could be several things gumming up the works here. A more specific list of the core hardware and what Linux distribution and version you are running might very well be helpful. Does this happen at stock settings? If so, you could try disabling SMT and see if it gets any better. Not saying that SMT is bad, but disabling it could help narrow the scope.

Fedora 28 (also tried OpenSUSE which had the same issue.)

Ryzen 1500x (Stock)
2 x 4 GB Patriot DDR4-2666 (tried both stock 2133 and the XMP 2666)
Gigabyte AB350M-DS3H
Geforce 8600GT and R7 240 (not at the same time, swapped them out for testing, currently has R7 240)

The four PSUs were either new or known working units from Cooler Master, Thermaltake or Antec from 450-600W

I haven't tried disabling SMT yet, wasn't aware that was an option. The UEFI setup is irritating to get around in to say the least.
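If the UEFI menus make it hard to tell whether the SMT setting actually took effect, it can be checked from inside Linux. The following is only a rough sketch: it assumes the kernel exposes the usual sysfs topology file for cpu0, and the helper names `parse_cpu_list` and `smt_active` are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location listing the hardware threads that share a core with cpu0.
SIBLINGS_PATH = "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-1' or '0,4' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_active(path=SIBLINGS_PATH):
    """Return True if cpu0 has a sibling thread, False if not, None if unknown."""
    try:
        text = Path(path).read_text()
    except OSError:
        return None  # file missing or unreadable, so we cannot tell
    return len(parse_cpu_list(text)) > 1


if __name__ == "__main__":
    print("SMT active:", smt_active())
```

With SMT disabled in the firmware, cpu0 should list only itself, so the function would return False.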

That said, could it be the issue some were having with early Ryzen chips? Try the kill-ryzen script (linked on that Phoronix article) and see if it crashes, and check the manufacture date of your chip. ;)

I've been reading there are lots of issues with some Ryzen chips, but didn't come across that yet. I'll try it out once I can clear some space off my bench, as I have a pile of machines which also need attention. I figure I had waited long enough for all of these issues to be ironed out, but it seems like both AMD and Intel have had all sorts of grief with the latest generations of chips.

I've built a multimedia rig using an 1800X and memory according to the motherboard specs, G.Skill in this case. I'm running Zorin 12.4 Linux on one hard drive because I'm a noob in Linux Land.

It's good to stay with newbie-friendly distros for the first few years. I wouldn't recommend jumping into any Red Hat based distro without extensive experience, because you'll get lost and frustrated quickly. Debian can be the same way.

Each computer built has its own life, it seems, and what works on one might not work on another. I feel for you in your frustration because I think we've all been there at one time or another. In my case I would do a complete teardown: swapping out components to make sure they are not the problem (including the motherboard) and making sure all of my connections are snug and clean. Again, I feel for you because this is going to be a long and frustrating task, but if you can rule out any hardware issues then it has to be a software one.

I've been a system builder for around 25 years and this is by far one of the top 10 rigs which have given me the most hell. Two of those few I can remember off the top of my head were a Rambus Pentium 4, and a PIII-550 which would BSOD when installed into a case.
 
If that is the case I bet AMD might warranty replace that chip with a newer stepping?

On the long AMD forum thread about this issue, there were reports of AMD testing CPUs before shipping an RMA replacement. That said, some users got a defective CPU from the RMA and had to go through the process again. I have not seen any reports of AMD shipping a second-generation Ryzen as a replacement for a defective first-generation CPU. Also, second-generation Ryzen CPUs don't seem to have this problem at all.
 
If that is the case I bet AMD might warranty replace that chip with a newer stepping?
Possible. I haven't been following closely as I didn't suffer from this issue, but I do know there were RMAs fulfilled.
 
If that is the case I bet AMD might warranty replace that chip with a newer stepping?

The bug's insanely hard to trigger unless you follow very specific steps, which is what the kill bug tester does. In the real world it's unlikely the average user will ever be affected by it.

As far as the actual issue, it could be the board. The other thing I'd check is which memory slots are used as the two primary ones when all four aren't populated, since it differs between manufacturers: some use A1/B1, some use A2/B2.

Another thing you could try, which helped my setup, is changing the Gear Down Mode setting in the BIOS. It affects the effective command rate (roughly 1.5T versus 1T) and can help with stability.
 
I'm running Manjaro on a Ryzen 2700X with no issues here.

Sounds like you have defective hardware.
 
The bug's insanely hard to trigger unless you follow very specific steps, which is what the kill bug tester does. In the real world it's unlikely the average user will ever be affected by it.

The bug tester was created specifically because it was happening in 'the real world', though I note that you disclaim it by saying 'average user'; but we're on [H]ardOCP which I wouldn't call an 'average user' place.
 
On the lighter side of the news, if the issue is with the board, a replacement B350 is super cheap. I personally have nothing but nice things to say about the Asus B350 Prime Plus - it even has 2 honest-to-god PCI slots for your legacy cards should you have any. My wife is running an R7 1700 @ 3.8GHz and RAM @ XMP 2933 24-hour Prime95 stable all day long on this rig. I had gotten the X370 Asus equivalent (Prime X370-A; same exact board layout, but with the X370 instead of the B350) and it was nowhere near as stable on this same setup so I returned it.
 
What kernel are you using? At least with Red Hat, until recently there was a bug in the power-saving features (automatic downclocking), and disabling them made systems stable. As of the 4.17 family of kernels on my Fedora box, my 1600 has been really stable. Just back to fighting with Nvidia drivers.

Edit: Saw you mention Fedora above. I'm guessing I still have the fix set on my box... I've been in an "it's stable, don't mess with it" mood.

https://bugzilla.redhat.com/show_bug.cgi?id=1450769
 
The bug tester was created specifically because it was happening in 'the real world', though I note that you disclaim it by saying 'average user'; but we're on [H]ardOCP which I wouldn't call an 'average user' place.
Yeah, you can even trigger it on Windows, it's just harder.
 
Should be OK, assuming that you're using a modern version of Linux.

On my Ryzen 5 2600X, I have a separate hard drive that has CentOS 6.10 on it, and the installation went smoothly. It's running 64 bit versions of NMR Pipe, as well as 32 bit versions of CYANA, without any hiccups. I suspect that a reasonably modern version of Fedora should behave the same.
 
Hard lock? Try adjusting the C-state options in the BIOS. Make sure you have the latest motherboard BIOS, etc. See if a notch more voltage or a higher VRM frequency makes any difference.
 
That said, could it be the issue some were having with early Ryzen chips? Try the kill-ryzen script (linked on that Phoronix article) and see if it crashes, and check the manufacture date of your chip. ;)

So I tried this and after it finishes setting up the build environment, it instantly fails. I also tried a second Ryzen testing script and it fails as well.

Disabled SMT and ran it again, it did something for a longer period of time and came back with:

"Test Failed: loopCountToFailure=[0] elapsedTimeInSeconds=[190]"
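For anyone collecting results from several runs (SMT on and off, different memory speeds), a line like the one above can be pulled apart with a small script. This is just a sketch that assumes the output always follows the exact format quoted here; `parse_failure` is a hypothetical helper, not part of either test script.

```python
import re

# Matches the failure line exactly as quoted in this post; the format is assumed stable.
FAILURE_PATTERN = re.compile(
    r"Test Failed: loopCountToFailure=\[(\d+)\] elapsedTimeInSeconds=\[(\d+)\]"
)


def parse_failure(line):
    """Return the loop count and elapsed seconds from a failure line, or None."""
    match = FAILURE_PATTERN.search(line)
    if match is None:
        return None
    return {
        "loop_count": int(match.group(1)),
        "elapsed_seconds": int(match.group(2)),
    }


if __name__ == "__main__":
    sample = '"Test Failed: loopCountToFailure=[0] elapsedTimeInSeconds=[190]"'
    print(parse_failure(sample))
```

Logging the elapsed time per configuration makes it easier to see whether a setting such as SMT only delays the failure or actually prevents it.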
 
I assume you are talking about this

https://github.com/suaefar/ryzen-test

Did you try this on ubuntu-18.04?

I ask because the script does not work on the latest ubuntu. The gcc version fails to build on all systems. There is a pull request that fixes this (by using a newer gcc) after a mod to put the ubuntu install step back in.
 
Got a pic of your cpu?
I would need to see the date code to give you some advice.
You would be looking for something like this: 1726SUS
17 is for 2017, 26 is the week number, SUS has to do with manufacturing plant(s)
Most cpus that came back as RMA replacements were 1726 or newer, and most were SUS.
So if you have say 1712PGT then I would bet the cpu is the problem.
Again, you want 1726SUS or newer. Also the new 2000 series Ryzen cpus seem fine from what I've heard.
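The date-code rule above can be written down as a quick check. This is a minimal sketch that assumes every code has the shape shown in this post (two-digit year, two-digit week, three-letter suffix); the names `DateCode`, `parse_date_code` and `is_at_least` are invented for illustration, and the 1726 threshold is only the rule of thumb given here, not an official AMD cutoff.

```python
import re
from typing import NamedTuple, Optional


class DateCode(NamedTuple):
    year: int    # four-digit year, e.g. 2017
    week: int    # week number within that year
    suffix: str  # three-letter code, e.g. "SUS" or "PGT"


def parse_date_code(code: str) -> Optional[DateCode]:
    """Split a code like '1726SUS' into year, week and suffix; None if malformed."""
    match = re.fullmatch(r"(\d{2})(\d{2})([A-Z]{3})", code.strip().upper())
    if match is None:
        return None
    year_digits, week_digits, suffix = match.groups()
    week = int(week_digits)
    if not 1 <= week <= 53:
        return None
    return DateCode(2000 + int(year_digits), week, suffix)


def is_at_least(code: str, threshold: str = "1726SUS") -> bool:
    """True if the code's year/week is the same as or later than the threshold's."""
    parsed = parse_date_code(code)
    reference = parse_date_code(threshold)
    if parsed is None or reference is None:
        raise ValueError("unrecognized date code")
    return (parsed.year, parsed.week) >= (reference.year, reference.week)


if __name__ == "__main__":
    for sample in ("1712PGT", "1726SUS"):
        print(sample, is_at_least(sample))
```

The comparison only looks at year and week; the suffix is kept for reference because most RMA replacements reported in the thread were SUS.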

More info:

https://www.phoronix.com/scan.php?page=article&item=new-ryzen-fixed&num=1


https://community.amd.com/thread/215773?start=0&tstart=0
 
Oxalin forked and updated the script to use a later version of gcc, if you want to try that before downloading an older ubuntu livecd.
 
Here's a picture of the CPU.

8yO68RY.jpg


In the mean time, I've played around with turning SMT on and off, disabling C states, adjusting voltages, changing power settings in Linux.

The symptoms still persist.
 
Sounds like a genuine hardware failure, then. The only thing left is to swap the CPU or mainboard and see which one the problem follows...
 
Yeah, that one should be one from after they fixed the bug.

Just to verify, have you tried a different video card, or tried putting the video card in a different slot? A locked gpu kernel module can hang the system, and a defective or improperly installed video card can cause that, as well as insufficient power for the video card.
 
Just to verify, have you tried a different video card, or tried putting the video card in a different slot? A locked gpu kernel module can hang the system, and a defective or improperly installed video card can cause that, as well as insufficient power for the video card.

I've tried two known working cards from my hoard. One was an R7 240 and the other was an 8600GT. Only the 8600GT requires a Molex power connector, and the PSU has more than enough power to drive it (500W Cooler Master). I don't have too many other cards which will fit in this box to try.
 
I've tried two known working cards from my hoard. One was an R7 240 and the other was an 8600GT. Only the 8600GT requires a Molex power connector, and the PSU has more than enough power to drive it (500W Cooler Master). I don't have too many other cards which will fit in this box to try.
Cool, just wanted to verify.
 
I swapped in some generic DDR4-2133 from Altex and found that disabling the monitor from sleeping prevents the hard lock (system hasn't locked up in 3 days so far.) Why the monitor going to sleep causes the hard lock is unknown.
 
I swapped in some generic DDR4-2133 from Altex and found that disabling the monitor from sleeping prevents the hard lock (system hasn't locked up in 3 days so far.) Why the monitor going to sleep causes the hard lock is unknown.
Sounds like a bug in the DRM kernel driver, I think. I could be wrong. Err, not DRM... whatever handles C-states, or whatever handles modesetting.
 
Whatever is causing it is really low level since it happens with both AMD and Nvidia cards.
 
I still recommend that you try the kill_ryzen script. Make sure it's the updated version, since the original fails on ubuntu-18.04. If it fails, AMD will be happy to RMA the CPU. I believe they are going to ask for a picture of your case to see if the airflow is reasonable.
 
I still recommend that you try the kill_ryzen script. Make sure it's the updated version, since the original fails on ubuntu-18.04. If it fails, AMD will be happy to RMA the CPU. I believe they are going to ask for a picture of your case to see if the airflow is reasonable.

It fails.

X5fNnbZ.png
 
I've excluded everything but the board and CPU. Gigabyte has a less than stellar reputation for RMA service (I had to use it about a decade ago and they never fixed the problem), so I don't really expect anything but a headache from them. I already started a support ticket with them and I'm getting the runaround: "did you try Windows" and other inane questions.

I've tried that kernel parameter, in addition to disabling various combinations of 0-7 states in the BIOS and several combinations with the kernel parameter with no luck. I've never had to RMA a CPU before, so this will be an adventure I'd rather not go on.
 