Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads

funks · Aug 4, 2017

Ocellaris said:
If it takes that many steps for people to run a stability test, they simply won't care about it.

Half of the instructions are to write the ISO image into a USB Flash drive, and boot with the USB Flash drive. Much easier than trying to install Linux on the HDD and learning all the commands. A bunch of people tried it on reddit and ran into this issue.

My two ryzen systems are affected (tested it using this method).

Ocellaris · Aug 4, 2017

funks said:
Half of the instructions are to write the ISO image into a USB Flash drive, and boot with the USB Flash drive. Much easier than trying to install Linux on the HDD and learning all the commands. A bunch of people tried it on reddit and ran into this issue.

My two ryzen systems are affected (tested it using this method).

Did you log a bug with AMD? Even if people run the test and see the failure, they won't necessarily care since it doesn't affect them.

But good job with the directions. The situation sucks and AMD hasn't acknowledged it yet from what I can tell.

funks · Aug 4, 2017

Ocellaris said:
Did you log a bug with AMD?

Yes, and so did a couple of people on the AMD community forums. I actually wrote the procedure above (LiveUSB) to make it easier for Windows guys to see if they have the same problem on their systems. A bunch of guys tried it and ran into the problem. It may be more widespread, but given the relatively small number of Free*NIX users compared to Windows, many affected owners may simply not know. I believe a large number of chips are affected; if there are good copies around, then it means AMD needs to beef up its QA procedures.

Some have already played Silicon RMA lottery with AMD and got back chips that failed the same way. One guy did note that his new replacement was supposedly tested by AMD internally and looks like it's working better (he's going to perform more tests this weekend).

Michael of Phoronix duplicated this issue on his 1800X system as well and wrote an article about it - http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Ocellaris · Aug 4, 2017

Dumb question... has anyone run your test and NOT run into an issue?

funks · Aug 4, 2017

funks said:
Yes, and so did a couple of people on the AMD community forums. I actually wrote the procedure above (LiveUSB) to make it easier for Windows guys to see if they have the same problem on their systems. A bunch of guys tried it and ran into the problem. It may be more widespread, but given the relatively small number of Free*NIX users compared to Windows, many affected owners may simply not know. I believe a large number of chips are affected; if there are good copies around, then it means AMD needs to beef up its QA procedures.

Some have already played Silicon RMA lottery with AMD and got back chips that failed the same way. One guy did note that his new replacement was supposedly tested by AMD internally and looks like it's working better (he's going to perform more tests this weekend).

Michael of Phoronix duplicated this issue on his 1800X system as well and wrote an article about it - http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Yes, it runs on my Intel systems just fine (ran it for two days). Additionally, if I disable opcache on my primary rig (Taichi), it looks like I gain some stability, but I'm not sure whether it will eventually fail if I run it continuously for several days. I can't disable it on my secondary rig (no UEFI option).

Deleted member 82943 · Aug 4, 2017

Haven’t read all the thread but are there still issues under heavy compilation?

Ocellaris · Aug 4, 2017

gigatexal said:
Haven’t read all the thread but are there still issues under heavy compilation?

The TLDR is now these issues are super easy to reproduce.

Deleted whining member 223597 · Aug 4, 2017

Ocellaris said:
The TLDR is now these issues are super easy to reproduce.

Sounds to me like this is only an issue with Linux in specific cases. I guess that sucks for people using Linux, but I couldn't care less.

/dev/null · Aug 4, 2017

-Strelok- said:
Sounds to me like this is only an issue with Linux in specific cases. I guess that sucks for people using Linux, but I couldn't care less.

Except the issue shows up under WSL in windows as well...and who knows how many other corner cases that haven't been discovered. This is a major problem.

Deleted whining member 223597 · Aug 4, 2017

/dev/null said:
Except the issue shows up under WSL in windows as well...and who knows how many other corner cases that haven't been discovered. This is a major problem.

I still don't get your point, when is anyone outside of developers going to use WSL? Look I understand that we shouldn't be making excuses here for bugs and that AMD should fix it, but at the end of the day who is actually affected by this?

/dev/null · Aug 4, 2017

Wait until games start using all threads & hit the same bug....and you have random crashes. Maybe not now, maybe not next year....

Basically: Using all threads at the same time under high CPU load can cause a program to crash.

Deleted whining member 223597 · Aug 4, 2017

/dev/null said:
Wait until games start using all threads & hit the same bug....and you have random crashes. Maybe not now, maybe not next year....

Basically: Using all threads at the same time under high CPU load can cause a program to crash.

So why doesn't Prime95 crash it? If you set that up right it will stress all threads.

funks · Aug 4, 2017

-Strelok- said:
I still don't get your point, when is anyone outside of developers going to use WSL? Look I understand that we shouldn't be making excuses here for bugs and that AMD should fix it, but at the end of the day who is actually affected by this?

Picture high system CPU usage: playing a game, streaming it as well, some Windows background processes running. A program just crashes randomly and you'll blame it on the OS, or on a bug in the game.

funks · Aug 4, 2017

-Strelok- said:
So why doesn't Prime95 crash it? If you set that up right it will stress all threads.

Prime95 is not a "general" use case from a processor-usage perspective; concurrent code compiles are actually doing "real" work (a lot more context switching, plus use and flushing of caches).

noko · Aug 4, 2017

ManofGod said:
Yep, both my Ryzen R7 systems are rock solid stable as well at 3.8 Ghz. I would imagine that your Custom SFF case work station is fast too, right?

I thought the custom SFF station was fast until I built two Zen builds - they made the i7 6700K look like an i3 from yesterday. It's still great for gaming, so no real loss.

Deleted member 82943 · Aug 4, 2017

Ocellaris said:
The TLDR is now these issues are super easy to reproduce.

Well shit. I might get threadkiller then in hopes they fixed it there

Ocellaris · Aug 4, 2017

-Strelok- said:
So why doesn't Prime95 crash it? If you set that up right it will stress all threads.

The problem looks to appear when many heavy loads are shifting between cores in response to workload changes. Prime95 just maxes out everything.

Deleted member 82943 · Aug 5, 2017

Linux users try this: seems IOMMU related

https://ibiscybernetics.com/blog/2017-05-24.html

xorbe · Aug 5, 2017

I tried for 48 hours to get my 1500X to crash with the one focused program, and parallel compiles out the wazoo, nothing crashed. openSUSE x64 Tumbleweed.

juanrga · Aug 5, 2017

-Strelok- said:
Sounds to me like this is only an issue with Linux in specific cases. I guess that sucks for people using Linux, but I couldn't care less.

As has been stated a dozen times in this thread, this is a bug in RyZen and it affects users on Linux, Windows, and BSD. Why do you believe that AMD remains silent about this issue?

People who only use the computer to chat on Facebook or play games under Windows won't care about this bug, unless tomorrow some software is coded in such a way that the bug affects them directly as well. People who use the computer for serious work under Linux see the computer crash. There is suspicion that this bug is also present on ThreadRipper and EPYC.

Some people claim this bug is only present on B1-stepping silicon, and EPYC uses the B2 stepping. Other people claim that the bug is also present on EPYC, because the B2 stepping only provides uncore improvements.

juanrga · Aug 5, 2017

-Strelok- said:
So why doesn't Prime95 crash it? If you set that up right it will stress all threads.

Prime95 doesn't stress RyZen chips. RyZen consumes about the same power under Prime95 as under Excel or Luxmark, because Prime95 doesn't work as a power virus for RyZen chips. On the other hand, RyZen consumes more power under an ordinary x264 workload than under Prime95. All this was demonstrated in threads dealing with power consumption and TDPs on RyZen.

That is the reason why you can run Prime95 and the chip is stable, then run Larabel's stress test and the system fails in 229 seconds under the heavy load. The stress-run command within the Phoronix Test Suite has been used by enterprise customers for stress testing / burn-ins of hardware and checking for stability. Therefore this is not anecdotal.

JustReason · Aug 5, 2017

juanrga said:
As has been stated a dozen times in this thread, this is a bug in RyZen and it affects users on Linux, Windows, and BSD. Why do you believe that AMD remains silent about this issue?

People who only use the computer to chat on Facebook or play games under Windows won't care about this bug, unless tomorrow some software is coded in such a way that the bug affects them directly as well. People who use the computer for serious work under Linux see the computer crash. There is suspicion that this bug is also present on ThreadRipper and EPYC.

Some people claim this bug is only present on B1-stepping silicon, and EPYC uses the B2 stepping. Other people claim that the bug is also present on EPYC, because the B2 stepping only provides uncore improvements.

So did Intel update you weekly on their YEAR-long issue with SMT in Linux? No, they did not, so stop implying ill intent on AMD's part for the lack of a weekly statement. Linux's share is far too small for a huge push to fix this; unfortunate, but a fact. It will likely get pushed back until their release schedule for current chips is settled, so any fix may be a long time coming.

And don't repeat statements others make regarding EPYC and TR, again insinuating some catastrophic issue with no facts. I could always find someone stating that TR OCs to 4.6 GHz on air, but we KNOW that is definitely not true. So leave statements like those out when lacking any proof; it is just fear mongering, and combined with other posts you have made it does not add to your credibility.

funks · Aug 5, 2017

JustReason said:
And don't repeat statements others make regarding EPYC and TR, again insinuating some catastrophic issue with no facts. I could always find someone stating that TR OCs to 4.6 GHz on air, but we KNOW that is definitely not true. So leave statements like those out when lacking any proof; it is just fear mongering, and combined with other posts you have made it does not add to your credibility.

Well, it's bigger than that - looks like the segfaults are also being reproduced on EPYC. And the number of "Linux" boxes in the datacenters is quite large these days. ESX for example runs the Linux kernel and tools, the googles, the facebooks - they run Linux servers. People running the LAMP stack..

AMD will not survive without EPYC, and datacenter folks won't buy this platform if it has issues with Linux which is why I believe this will eventually be fixed.

juanrga · Aug 5, 2017

Things are now moving fast:

(i)
Users on the AMD support forums are tired of AMD's silence ("Posting on AMD's community forums really won't be getting us anywhere") and are changing tactics, spreading word of this problem everywhere to force AMD to at least acknowledge that the problem exists. Users are also asking other RyZen users for help in gathering statistics on how many people are affected. There is a thread about that, for instance, here

Many folks are now reproducing the problem thanks to the scripts: kill_ryzen and testRyzenGCC.

(ii)
People with access to ThreadRipper have been contacted to test the scripts on ThreadRipper. Kyle has been contacted as well.

(iii)
At least one person was able to test the script on EPYC, and he found that EPYC is also affected by the bug.

funks · Aug 5, 2017

The EPYC failure under the Phoronix Test Suite may have to do with a specific PHP build conftest and may not actually be an issue.

The link you posted above from raydude, though, is a totally different test: it basically compiles GCC (one instance with -j $NUM_LOGICAL_PROCCESORS) in a build loop so that it's easy to repeat (and it can run on any machine with 16 gigs of RAM). Most of the people who tried it in the thread also ran into segv, internal compiler errors, make's sub utils failing, and other issues which should never happen (it needs to run for a while, though). I'm trying to get more people to run it so that we'll have a rough idea how many are affected.
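For anyone curious what such a build loop amounts to, here is a minimal Python sketch. It is not the testRyzenGCC script itself, just an illustration of the idea described above. The function names, the `make clean` / `make -j N` commands, and the assumption that you start it inside an already configured GCC build tree are all mine, not taken from the original script.

```python
import os
import subprocess
import sys


def build_steps(jobs=None):
    """Commands for one iteration: clean, then a parallel build.

    Assumption: the working directory is an already configured GCC build
    tree in which `make clean` and `make -j N` are the right commands.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1  # number of logical processors
    return [["make", "clean"], ["make", "-j", str(jobs)]]


def run_build_loop(steps, max_iterations, cwd=None):
    """Run every command in `steps`, in order, up to `max_iterations` times.

    Returns None if every command exited with status 0, otherwise a tuple
    (iteration, command, returncode) describing the first failure.
    """
    for iteration in range(1, max_iterations + 1):
        for command in steps:
            result = subprocess.run(command, cwd=cwd)
            if result.returncode != 0:
                return iteration, command, result.returncode
    return None


def main():
    """Hypothetical entry point: loop 100 times and report the first failure."""
    failure = run_build_loop(build_steps(), max_iterations=100)
    if failure is None:
        print("no failures seen")
        return 0
    iteration, command, code = failure
    print(f"iteration {iteration}: {' '.join(command)} exited with {code}",
          file=sys.stderr)
    return 1

# To run it for real from inside a build tree: sys.exit(main())
```

A non-zero exit only says that some step failed. You would still have to read the build output to tell a segfault or internal compiler error apart from an ordinary build problem.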

juanrga · Aug 5, 2017

funks said:
The EPYC failure under the Phoronix Test Suite may have to do with a specific PHP build conftest and may not actually be an issue.

The link you posted above from raydude, though, is a totally different test: it basically compiles GCC (one instance with -j $NUM_LOGICAL_PROCCESORS) in a build loop so that it's easy to repeat (and it can run on any machine with 16 gigs of RAM). Most of the people who tried it in the thread also ran into segv, internal compiler errors, make's sub utils failing, and other issues which should never happen (it needs to run for a while, though). I'm trying to get more people to run it so that we'll have a rough idea how many are affected.

Yes, raydude is using the testRyzenGCC script, whereas the EPYC system ran the Phoronix stress run.

mvmiller12 · Aug 5, 2017

-Strelok- said:
I still don't get your point, when is anyone outside of developers going to use WSL? Look I understand that we shouldn't be making excuses here for bugs and that AMD should fix it, but at the end of the day who is actually affected by this?

I believe his point is that although there is * currently no KNOWN * Windows-native software affected by this, the fact that SOFTWARE triggers the bug means that it * COULD * be an issue. It just so happens that there is a common use case in *NIX that triggers it. If there is an erratum in the Ryzen CPU, it is most certainly OS agnostic.

SighTurtle · Aug 7, 2017

http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

I assume this relates to what the thread is about?

chithanh · Aug 7, 2017

The conftest segfaults are actually a problem in the test scripts and not a CPU bug. This was pretty clear to anyone who looked more closely at the tests, and also confirmed by Phoronix in an update to their article.

mvmiller12
Windows Subsystem for Linux (WSL) is affected by this problem, according to at least one user on the AMD community forums.

SighTurtle
Yes, exactly. Finally some communication from AMD.

Deleted member 82943 · Aug 7, 2017

wasn't / isn't kernel 4.12 supposed to fix all of this?

kac77 · Aug 7, 2017

SighTurtle said:
http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

I assume this relates to what the thread is about?

Pretty much a nothing burger. Basically you need more than just maxing out the threads under Linux. You need to have max utilization under Linux which prevents a thread from being allocated during reassignment. Achieving that isn't easy for your average user.

mvmiller12 · Aug 7, 2017

kac77 said:
Pretty much a nothing burger. Basically you need more than just maxing out the threads under Linux. You need to have max utilization under Linux which prevents a thread from being allocated during reassignment. Achieving that isn't easy for your average user.

I wouldn't call it nothing per se, but it is definitely niche.

juanrga · Aug 8, 2017

Although it is good that AMD finally admits the existence of the problem, I would point out that the official communication is not correct when it characterizes this "as a performance marginality problem exclusive to certain workloads on Linux."

The problem is not one of performance, but one of data corruption. Moreover, the problem has been reproduced under linux, windows, and BSD.

JimmiG · Aug 8, 2017

Looks like AMD are telling people who have this problem to RMA the CPU. So it's essentially a silent recall. Wondering whether I should send back my CPU - I want a fully working one, but being without a CPU for weeks kind of sucks and the problem doesn't really affect me at all.

/dev/null · Aug 8, 2017

JimmiG said:
Looks like AMD are telling people who have this problem to RMA the CPU. So it's essentially a silent recall. Wondering whether I should send back my CPU - I want a fully working one, but being without a CPU for weeks kind of sucks and the problem doesn't really affect me at all.

I'd imagine if you do it, resale will go up... Some cpus DO work ok. It's quite possible yours isn't affected.

LurkerLito · Aug 8, 2017

JimmiG said:
Looks like AMD are telling people who have this problem to RMA the CPU. So it's essentially a silent recall. Wondering whether I should send back my CPU - I want a fully working one, but being without a CPU for weeks kind of sucks and the problem doesn't really affect me at all.

I say if you ran the test and it happens on your CPU, just do the RMA, because even if it doesn't affect you at all, that has to be qualified with a "for now". I know it sucks to be without the CPU for the time, but this is one of those things that will probably bite you later when you least expect it. It won't be obvious, because what will happen is a random BSOD or app crash that you will blame on a software bug, and who knows, by then AMD might no longer offer an RMA and you'll have to buy a new CPU.

I was one of those that did do the replacement of the Pentium chip with the FDIV bug. It didn't affect me at all but I still did the replacement because like I always say, I didn't pay them with defective money.

chithanh · Aug 8, 2017

Given that some people received defective Ryzen CPUs from RMA, it is quite possible that many or even most display this issue. If you are in doubt, run the ryzen stress tests on Linux or WSL. If you see segfaults (other than conftest), then your CPU is affected.
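If you want to sift a saved kernel log for this, a rough Python sketch follows. It assumes the usual Linux kernel message shape of `name[pid]: segfault at ...`. The function name and the sample lines are made up for illustration, and other kernels or WSL may log differently.

```python
import re

# Matches e.g. "cc1plus[4242]: segfault at 0 ip ..." anywhere in a log line.
SEGFAULT_RE = re.compile(r"(?P<prog>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at ")


def segfaulting_programs(log_lines, ignore=("conftest",)):
    """Count segfault messages per program name, skipping names in `ignore`.

    conftest is ignored by default because, as noted above, its segfaults
    come from the test scripts rather than from the CPU.
    """
    counts = {}
    for line in log_lines:
        match = SEGFAULT_RE.search(line)
        if match is None:
            continue
        prog = match.group("prog")
        if prog in ignore:
            continue
        counts[prog] = counts.get(prog, 0) + 1
    return counts


# Made-up sample lines in the assumed format, for illustration only.
sample = [
    "[  100.000001] cc1plus[4242]: segfault at 0 ip 0000000000400000 sp 00007ffc00000000 error 4 in cc1plus[400000+1000]",
    "[  101.000002] conftest[4300]: segfault at 8 ip 0000000000400100 sp 00007ffc00000010 error 4 in conftest[400000+1000]",
    "[  102.000003] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]
print(segfaulting_programs(sample))
```

An empty result only means no matching lines were found in that log. It does not prove the CPU is fine, since the failures reportedly take a while to appear.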

juanrga said:
The problem is not one of performance, but one of data corruption. Moreover, the problem has been reproduced under linux, windows, and BSD.

I think key to understanding what AMD said is the meaning of the term "performance marginality".
An AMD employee explained in Phoronix forums:

bridgman said:
I think the intent was "performance within the chip" (internal signals) not "performance on a benchmark".

https://www.phoronix.com/forums/node/967927

Gideon · Aug 8, 2017

It's a heat issue, simple as that. That is why you have to run it for hours before an issue will come up. High thermals can cause changes in circuitry that can cause the signal to be impeded. This is likely why some continue to have the issue despite RMA. I am assuming none of you have talked to engineers before. Performance marginality means that as thermals approach the marginal limits of the chip, the signal degrades before you reach the limit.

JimmiG · Aug 8, 2017

chithanh said:
Given that some people received defective Ryzen CPUs from RMA, it is quite possible that many or even most display this issue. If you are in doubt, run the ryzen stress tests on Linux or WSL. If you see segfaults (other than conftest), then your CPU is affected.
I think key to understanding what AMD said is the meaning of the term "performance marginality".
An AMD employee explained in Phoronix forums:
https://www.phoronix.com/forums/node/967927

Yeah, I don't want to go through the trouble and then end up with another dud. I'll wait a little bit for things to clarify (will be away for 1 week anyway...). They say "early" Ryzen CPUs might have this problem, and I got mine one day after release, so it seems very likely it has this fault. My impression is that most/all early CPUs and many later ones have this issue, but AMD don't want to do a full recall.

Gideon said:
It's a heat issue, simple as that. That is why you have to run it for hours before an issue will come up. High thermals can cause changes in circuitry that can cause the signal to be impeded. This is likely why some continue to have the issue despite RMA. I am assuming none of you have talked to engineers before. Performance marginality means that as thermals approach the marginal limits of the chip, the signal degrades before you reach the limit.

I don't think it's as simple as that. There are many stress tests that heat up the CPU more that run fine on any Ryzen. Also temps tend to plateau pretty quickly, long before the errors start happening. My 1800X hits about 64C on the die when running Prime95 Small FFT or the AIDA64 stress test, which is nowhere near the marginal limits of the CPU. It seems only very specific tasks trigger this issue.

/dev/null · Aug 8, 2017

Gideon said:
It's a heat issue, simple as that. That is why you have to run it for hours before an issue will come up. High thermals can cause changes in circuitry that can cause the signal to be impeded. This is likely why some continue to have the issue despite RMA. I am assuming none of you have talked to engineers before. Performance marginality means that as thermals approach the marginal limits of the chip, the signal degrades before you reach the limit.

No it's not. Read the entire AMD forum thread. AMD specifically asks for temps, cooling, bios settings & asks many people for pictures of the inside of their case prior to RMA.
