Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads

I don't think it's as simple as that. There are many stress tests that heat up the CPU more that run fine on any Ryzen. Also temps tend to plateau pretty quickly, long before the errors start happening. My 1800X hits about 64C on the die when running Prime95 Small FFT or the AIDA64 stress test, which is nowhere near the marginal limits of the CPU. It seems only very specific tasks trigger this issue.

You are correct, it's not supposed to have an issue at 64C; it also has no issue with Prime95. Simple fact is this error is caused by a constant high load, which raises the temps, which then causes the issue when the bandwidth is nearing its max and a signal degrades and creates an error. Epyc and Threadripper are not affected because they use a different connection, since there are multiple dies and not just a CCX; that would be my assumption. Either way it's not a huge deal, and people that have the issue will get a fix; very few people are going to use the computer the way this fault happens.
 
Looks like AMD are telling people who have this problem to RMA the CPU. So it's essentially a silent recall. Wondering whether I should send back my CPU - I want a fully working one, but being without a CPU for weeks kind of sucks and the problem doesn't really affect me at all.

And how do you know the problem is not affecting you now? This is a data corruption problem. It was first detected when someone found that an "f" in the name of a folder had been changed to an "i". The problem is amplified when compiling heavily because the data corruption is reported as errors in the compile logs, and that is how the problem was confirmed and characterized. I would recommend running the kill_RyZen script to confirm the problem and then deciding whether or not to RMA your chip.
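For anyone wondering what a test like that boils down to, here is a minimal sketch. It is not the actual kill_RyZen script: it just runs a command in several parallel workers and reports any run that was killed by a signal such as SIGSEGV. The names `run_once` and `find_crashes`, their parameters, and the placeholder command are all assumptions for illustration; a real test would run a heavy parallel compile instead.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run cmd once; on POSIX a negative return code means it died from a signal."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def find_crashes(cmd, workers=4, runs_per_worker=5):
    """Hypothetical helper: run cmd repeatedly in parallel and list signal deaths."""
    total = workers * runs_per_worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run_once, [cmd] * total))
    # Keep only runs that were terminated by a signal, reported by name.
    return [signal.Signals(-code).name for code in codes if code < 0]


if __name__ == "__main__":
    # Placeholder workload (an assumption): swap in a real compile job here.
    placeholder = [sys.executable, "-c", "pass"]
    print(find_crashes(placeholder))
```

Note that this only catches a process that itself dies from a signal. A build driver such as make typically turns a crashed child into an ordinary non-zero exit status, so a real test would also have to look through the build logs.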
 
You are correct, it's not supposed to have an issue at 64C; it also has no issue with Prime95. Simple fact is this error is caused by a constant high load, which raises the temps, which then causes the issue when the bandwidth is nearing its max and a signal degrades and creates an error. Epyc and Threadripper are not affected because they use a different connection, since there are multiple dies and not just a CCX; that would be my assumption. Either way it's not a huge deal, and people that have the issue will get a fix; very few people are going to use the computer the way this fault happens.

This is all wrong.
 
So we are now seeing the consequence of AMD remaining silent on the issue and leaving users in the dark, so they had to create their own imperfect tests: Users are running the test scripts, seeing conftest segfault and wrongly concluding that their Ryzen CPU has a bug:


This is a data corruption problem. It was first detected when someone found that an "f" in the name of a folder had been changed to an "i".
I consider it unlikely that this folder name change is caused by this problem, which manifests in wrong addresses after IRETQ rather than wrong data (hence the segfaults). If you access data at a wrong address and write it to disk, then much more than just a single letter would become corrupt.
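A quick sanity check on the folder-name report: the short sketch below compares the character codes of the two letters and shows the change is not a single flipped bit. It assumes the name was stored as plain ASCII/UTF-8, and `bit_diff` is just an illustrative helper.

```python
def bit_diff(a: str, b: str) -> int:
    """Count the bits that differ between the code points of two characters."""
    return bin(ord(a) ^ ord(b)).count("1")


print(f"{ord('f'):08b}")   # 01100110
print(f"{ord('i'):08b}")   # 01101001
print(bit_diff("f", "i"))  # 4
```

All four low bits would have to flip at once to turn an "f" into an "i", so whatever caused that rename does not look like a classic single-bit error either.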
 
I think the biggest reason for the RMA is that AMD gets their hands on chips with a high probability of having the issue, so as to ascertain the cause. Trying to duplicate it by going through batches, when not every chip has it, is an arduous process.
 
This is all wrong.

You're always wrong, so we're all good then. You like to predict stuff, yet you're silent. So you can't find what someone else has said it is and repost it as your own yet?
 
So we are now seeing the consequence of AMD remaining silent on the issue and leaving users in the dark, so they had to create their own imperfect tests: Users are running the test scripts, seeing conftest segfault and wrongly concluding that their Ryzen CPU has a bug:


I consider it unlikely that this folder name change is caused by this problem, which manifests in wrong addresses after IRETQ rather than wrong data (hence the segfaults). If you access data at a wrong address and write it to disk, then much more than just a single letter would become corrupt.

Yep, data corruption is more likely to result in totally random changes, probably including illegal characters causing even more problems. Also, data within the files would change, probably making them unreadable.

I agree, keeping silent has potentially caused more problems for AMD than if they had spoken out earlier. Now people are going to attribute a bunch of problems to this erratum that are in fact caused by totally different things, like user error, unstable memory settings (people pushing memory/Infinity Fabric clock speeds to the breaking point probably doesn't help), faulty RAM, unstable overclocks, software bugs, faulty motherboards... AMD need to figure this out ASAP and issue some kind of official statement. They have to provide some kind of tool to check if you're affected and streamline the RMA process. They can't pretend it's just a "Linux issue" because sooner or later other cases of this bug will be discovered.

I just hope this isn't a repeat of the Phenom TLB fiasco and we'll see Ryzen 1750, 1850 etc. soon...
 

People cause their own problems all the time and then blame someone else. If you push hardware, you're going to have an issue at some point. Simple fact is an issue has to be repeatable, and certain variables have to be taken into account. The idea that AMD can just put something out there tomorrow is silly. It takes months in the car industry as well to figure out an issue and issue a fix; even once you know what's broken, you have to come up with a way to fix it. Linux is such a tiny part of the market, unlike Windows, that they have the luxury of time on their side.
 
Linux is such a tiny part of the market unlike Windows that they have the luxury of time on their side.

In reality Windows is the minority OS these days, from a global point of view. And Linux is very important in workstation/server/HPC; that is the reason why AMD's official statement mentioned EPYC and TR.
 
Dammit, now I gotta install Linux and reproduce a fault to get a better chip. Wait! I can sell this rig and get a Threadripper! (rubs hands together with nefarious grin)
 
Now damnit AMD, what are Intel fanboys going to bring up to besmirch the Ryzen chip now that this problem is resolved? The nerve of some companies... fixing problems at the silicon level within 4 months of the problem being identified. Hell, Intel took a year with their BIOS fix for an errata issue, but here is AMD acting like an upstart again and fixing errata within 4 months of a problem being noticed.

Oh well, I'm sure if nothing else the boys over at Intel will think up some new even more ridiculous reason to claim AMD is failing in every possible metric.
 

Wait a moment, isn't this thread full of people that pretended that the problem didn't exist? Didn't they look for excuses, such as it being a problem with the compiler or distro or Linux kernel or B350 mobos or anything else except the chip?

Also, it has been more than four months since the problem was identified by users, but we don't know when the problem was identified by AMD. All we know is that chips manufactured prior to week 25 (middle of June) have the problem.

Also, people purchasing online can still receive a chip with the problem. In that case they have to RMA it to get a new one without the problem.

Finally, don't forget that other issues/problems still remain with RyZen, such as the freeze issue:

https://www.phoronix.com/forums/for...ux-no-compiler-segmentation-fault-issue/page2
 
I think it is really shitty AMD made batches of defective CPUs and isn't posting information on their site about the affected weeks. I think they should be offering replacements to anyone with the affected manufacturing runs. Instead they know not many people will be running Linux so they can just sit back and wait for them to call customer care.
 
I received both my RMA'd chips from AMD (R7-1700X, R7-1700 - both had the SEGV issue). The R7-1700X is UA1727SUS, and the R7-1700 is UA1730SUS. So far - the 1700X passed testing, but the 1700 is still running (got 12 more hours to go).

Same stepping, same microcode (from AGESA 1.0.0.6a); the new chips are running lower volts at stock, though.


Looks like both chips were tested by AMD before sending them to me (boxes were opened through the bottom). This may be a QA / binning issue and may end up affecting chips in those batches once they hit retail. Have not heard of anybody getting a UA1725+ chip in retail yet (unless you purchase a Threadripper).
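If the batch codes above are read as two letters, a two-digit year, a two-digit week and a letter suffix (an assumption based on the UA1727SUS / UA1730SUS examples, not an AMD specification), checking a chip against the week 25 cutoff mentioned earlier in the thread can be sketched like this. `parse_batch` and `made_before_cutoff` are made-up names.

```python
import re

# Assumed layout: 2 letters, 2-digit year, 2-digit week, letter suffix (e.g. UA1727SUS).
BATCH_RE = re.compile(r"^[A-Z]{2}(\d{2})(\d{2})[A-Z]+$")


def parse_batch(code):
    """Return (year, week) from a batch code, under the assumed layout above."""
    match = BATCH_RE.match(code.replace(" ", "").upper())
    if match is None:
        raise ValueError(f"unrecognised batch code: {code!r}")
    return 2000 + int(match.group(1)), int(match.group(2))


def made_before_cutoff(code, cutoff=(2017, 25)):
    """True if the chip was made before the cutoff (year, week)."""
    return parse_batch(code) < cutoff


if __name__ == "__main__":
    # The first two codes are from this thread; UA1710SUS is a hypothetical example.
    for code in ("UA1727SUS", "UA1730SUS", "UA1710SUS"):
        print(code, parse_batch(code), made_before_cutoff(code))
```

A code that parses as older than the cutoff only says the chip falls in the date range people in this thread have reported; it does not prove that a given chip is faulty.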
 
My 1500X is about to go in. I am just waiting for my A12-9800 to arrive, so I can use the machine while it's down. At least I'll get a free HSF with the RMA.
 