Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads

juanrga (2[H]4U; joined Feb 22, 2017; 2,804 messages)
https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Compiler-Issues

It originally looked like a bug in GCC, but the latest research seems to confirm this is part of the older SMT/uop bug. Remember that earlier engineering samples had the uop cache or SMT disabled because of this bug. It seems the bug wasn't completely fixed in retail chips. The problem seems to have been traced to the IRETQ instruction, and it is not reproducible when the threads run on different cores.
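The claim that the crash does not reproduce when the threads run on different cores can be checked by pinning a build either onto the two SMT threads of one core or onto two separate cores. Below is a minimal Python sketch, assuming Linux, where each logical CPU's siblings are listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The helper names (parse_cpu_list, smt_sibling_groups, pick_cpu_pair) are hypothetical, and this is not a test procedure from AMD or from anyone in the thread:

```python
import glob
import os


def parse_cpu_list(text: str) -> list[int]:
    """Parse a Linux CPU list such as "0,8" or "0-3,8-11" into logical CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty fields, e.g. a trailing comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_sibling_groups(sysfs: str = "/sys/devices/system/cpu") -> list[tuple[int, ...]]:
    """Group logical CPUs by physical core using each CPU's thread_siblings_list (Linux only)."""
    groups: set[tuple[int, ...]] = set()
    pattern = os.path.join(sysfs, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)


def pick_cpu_pair(groups: list[tuple[int, ...]], same_core: bool) -> tuple[int, int]:
    """Pick two logical CPUs: both SMT threads of one core, or one thread on each of two cores."""
    if same_core:
        for group in groups:
            if len(group) >= 2:
                return group[0], group[1]
        raise ValueError("no core exposes two SMT threads (is SMT disabled?)")
    if len(groups) < 2:
        raise ValueError("need at least two physical cores")
    return groups[0][0], groups[1][0]
```

On a Linux box, `os.sched_setaffinity(0, set(pick_cpu_pair(smt_sibling_groups(), same_core=True)))` restricts the current process to one core's two threads. Child processes started afterwards (a `make`, for example) inherit that mask, so comparing runs with `same_core=True` against `same_core=False` is one rough way to test the claim.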

AMD is recommending disabling SMT.

https://community.amd.com/thread/216084
https://community.amd.com/message/2796982

Surely a microcode update will fix this.
 
So now SMT is broken too? FMA loads are bugged, virtualization is bugged.
 
I haven't encountered this issue myself on any of my Ryzen Linux boxes, but it seems there are a number of Ryzen Linux users who are facing segmentation faults and sometimes crashes when running concurrent compilation loads on these Zen CPUs.

Better this than the runaway train that is the "Remote security exploit in all 2008+ Intel platforms", which is fused off in desktop models, or is it?

That quote is from the first link you posted.


The last link you used has this:
Hello.



As a software guy, I compile a lot of code, and occasionally gcc crashes with a segmentation fault for no obvious reason. I seem to remember that the problem also sometimes manifested as illegal instruction errors, but I'm not sure about that anymore. I have a Ryzen 1800X CPU and an Asus Prime B350-Plus mainboard with UEFI BIOS 0609 (latest). My RAM is on the QVL and running at 3200 MHz, but that shouldn't matter.

There is a lot of information in this thread to which I did not contribute: Gentoo Forums :: View topic - Segfaults during compilation on AMD Ryzen.

I'll summarize it: different people, different gcc versions, different optimization levels, different software compiled, different RAM clocks including very low ones, different Ryzen models and mainboard models. Some of them tried swapping several pieces of hardware, to no avail.

I have little to add: I can reproduce the segfaults on Ubuntu 17.04. And nothing else crashes for me after the latest UEFI + AGESA update.

The mean time between crashes is about an hour when compiling continuously.



I think you should try hard to reproduce and fix this at AMD. Compiling anything on Linux with gcc while using all CPU threads should suffice.



Thanks in advance.

[Edited: reordered to be more coherent, removed redundancy]
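Because the report says a continuous all-threads GCC build is enough to trigger the crash (roughly one per hour), a small harness that reruns a build and counts failures gives a mean time between failures that can be compared across BIOS, LLC or SMT settings. What follows is a Python sketch under stated assumptions. The build command is whatever you pass in (hypothetical here). Any nonzero exit counts as a failure. The text "Segmentation fault" in the output serves as a crash heuristic, assuming an English locale:

```python
import concurrent.futures
import subprocess
import time


def run_once(cmd: list[str]) -> tuple[int, bool]:
    """Run the build once; return (exit status, whether the output mentions a segfault)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, errors="replace")
    # Heuristic: GCC's "internal compiler error: Segmentation fault" and GNU make's
    # report of a child killed by SIGSEGV both contain this text (English locale).
    return proc.returncode, "Segmentation fault" in proc.stdout + proc.stderr


def stress(cmd: list[str], iterations: int, jobs: int = 1) -> dict:
    """Run `cmd` `iterations` times (`jobs` copies at a time) and tally failures.

    Only use jobs > 1 if the parallel copies do not share a build directory.
    """
    start = time.monotonic()
    failures = 0
    segfaults = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        for returncode, saw_segv in pool.map(lambda _: run_once(cmd), range(iterations)):
            if returncode != 0:
                failures += 1
            if saw_segv:
                segfaults += 1
    elapsed = time.monotonic() - start
    return {
        "runs": iterations,
        "failures": failures,
        "segfaults": segfaults,
        "elapsed_s": elapsed,
        # Mean time between failures over the whole run; None if nothing failed.
        "mtbf_s": elapsed / failures if failures else None,
    }
```

For example, `stress(["sh", "-c", "make clean && make -j16"], iterations=50)` would rebuild the tree in the current directory 50 times in a row with 16 make jobs. The command and job count are placeholders; comparing `mtbf_s` with SMT on and off is the kind of A/B test discussed in the thread.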

Yeah, it is odd that a B350 board crashes under heavy load with 8 cores, since none of the current B350 boards can handle that load for any length of time. You would have known that if you knew anything about Ryzen.

From the first AMD forum link, you just took a reply:

gkeawe 23-May-2017 08:22 (in response to hifigraz)


Can you do me a favor and go into the BIOS under CPU Configuration and disable SMT for the CPU? I'd like to see if it has any effect on the stability of the system with SMT disabled. I've been testing my Ryzen 1700 with and without SMT and would like to see if anyone else is getting similar results regarding random reboots and system instabilities. Keep me posted. Thanks, and hang in there. I know there's a reason for these issues, and a workaround until the platform matures.

Hardly proof of anything, since the person who replied has no standing on the forum; it might as well have been you writing this.
 
Yeah, it is odd that a B350 board crashes under heavy load with 8 cores, since none of the current B350 boards can handle that load for any length of time. You would have known that if you knew anything about Ryzen.
Uhm, you have a preference for AMD, don't you?
Because you're straight up shitting on AM4 as a whole.
 
Uhm, you have a preference for AMD, don't you?
Because you're straight up shitting on AM4 as a whole.

No, he is shitting on the B350 chipset. People are hoarding X370 mobos for a good reason, and that reason is not SLI or the few extra connections that X370 comes with and nobody ever uses. Most B350 boards are apparently quite shitty, with inadequate (and poorly cooled) VRM designs that barely hold together stock Ryzens and flat out overheat when overclocked.
 
No, he is shitting on the B350 chipset. People are hoarding X370 mobos for a good reason, and that reason is not SLI or the few extra connections that X370 comes with and nobody ever uses. Most B350 boards are apparently quite shitty, with inadequate (and poorly cooled) VRM designs that barely hold together stock Ryzens and flat out overheat when overclocked.
And the funny story here is that Ryzen holds much lower value if I have to fucking overpay for my motherboard for it to work properly.

But hey, in the end the marketing stunt from AMD worked: unlocked Ryzens are unlocked, and we don't care if it won't work once overclocked if you paid less than $130 for a mobo.
 
And the funny story here is that Ryzen holds much lower value if I have to fucking overpay for my motherboard for it to work properly.

But hey, in the end the marketing stunt from AMD worked: unlocked Ryzens are unlocked, and we don't care if it won't work once overclocked if you paid less than $130 for a mobo.

I admit the mobo issue is one of the reasons why I am hesitating to switch my Sandy for Ryzen. There is a lot to love about Ryzen's performance/value, but I refuse to pay a premium for a mobo with an RGB fleshlight built in and other nonsense just to get a stable overclocking platform.

But I'm not sure AMD is to blame here. I blame mobo manufacturers personally.
 
But I'm not sure AMD is to blame here. I blame mobo manufacturers personally.
Sure, mobo manufacturers share some of the blame for the implementation, but ultimately everyone knew from the get-go that it would not use quality VRMs and stuff; enabling overclocking on it was just a cheap trick to score sympathy points. You don't name chipsets after Intel's chipsets without consequences.
 
AMD unlocked all their new chips (still waiting to see this from Intel), and they released designs for a few different chipsets for motherboard manufacturers to build, based on customer preference and price. Obviously the lower-priced motherboards don't overclock as well as their beefier brothers; this has been a constant on both sides of the chip aisle since long before these chipsets, or the ones before them, came along.

Sure, mobo manufacturers share some of the blame for the implementation, but ultimately everyone knew from the get-go that it would not use quality VRMs and stuff; enabling overclocking on it was just a cheap trick to score sympathy points. You don't name chipsets after Intel's chipsets without consequences.

From my own experience: I built a 6600K system with a $160 ASUS motherboard for a friend, and it couldn't hold a 100 MHz overclock without shutting down, so we ended up just not overclocking the more expensive 'K' chip. My 1700 with a $150 ASUS X370 board holds a 700 MHz overclock without trouble. So it would seem that Intel is to blame for ASUS making a Z170 board with advertised but ultimately shitty overclocking functions, and I would have to praise AMD for ASUS making a board that can overclock. Does this make sense? Because it doesn't to me.
 
I built a 6600K system with a $160 ASUS motherboard for a friend, and it couldn't hold a 100 MHz overclock without shutting down, so we ended up just not overclocking the more expensive 'K' chip. My 1700 with a $150 ASUS X370 board holds a 700 MHz overclock without trouble
If you are going to bring up anecdotes, let me bring one too: https://www.hardocp.com/article/2017/06/04/asus_rog_crosshair_vi_hero_ryzen_motherboard_review/7

That's the consistent picture that matters, and as Pieter put it, the consistent picture is not good for B350.
 
The B350 chipset should not be the choice for overclocking an 8-core chip. There are cheaper X370 boards out there that will work far better; you don't have to buy the highest-end X370 boards. AMD would have been better off not allowing 8-core overclocking on B350, but some of those boards are built well and can handle it. It's the same posters always trying to find fault with systems they don't even use.
 
I have heard that a couple of real CPU bugs were fixed in a recent AGESA update. Can't tell if that is the end of it, though.
 
B350 bashing is nonsense. B350 barely does anything, so there's nothing to go wrong. No problems here, even with crossfired RX 480s that haters say shouldn't work. I spent about $60 on an ASUS Prime B350+ on day one, and it's still close to the same price today. There are a few too many LEDs, but all the regulators have adequate heatsinks if you have some air movement in the vicinity.

I can't fathom why zero-voltage-switching VRMs haven't caught on. Despite this next link, you don't need 48V distribution to do this.
http://www.vicorpower.com/industries-computing/48v-direct-to-cpu
 
B350 bashing is nonsense. B350 barely does anything, so there's nothing to go wrong. No problems here, even with crossfired RX 480s that haters say shouldn't work. I spent about $60 on an ASUS Prime B350+ on day one, and it's still close to the same price today. There are a few too many LEDs, but all the regulators have adequate heatsinks if you have some air movement in the vicinity.

I can't fathom why zero-voltage-switching VRMs haven't caught on. Despite this next link, you don't need 48V distribution to do this.
http://www.vicorpower.com/industries-computing/48v-direct-to-cpu

I don't think we're really bashing it; it just should not be the first choice for overclocking, since the VRM is usually not as good. X370 is the better choice for pushing clocks and running 24 hours a day.
 
TBH I was wary of the B350 boards but took the plunge anyway. Ryzen really isn't any more of a headache than the earlier AM3(+) platforms; I can say the fun I had with my 1055T was actually worse than Ryzen. There were RAM compatibility issues, and even CPU issues, way back then. The only thing I'm missing from my 4670K is single-threaded performance, and my eyes can't see a difference anyway. But let me tell you, alt-tabbing out of Watch Dogs doesn't turn my PC into a slide show anymore.
 
B350 bashing is nonsense. B350 barely does anything, so there's nothing to go wrong. No problems here, even with crossfired RX 480s that haters say shouldn't work. I spent about $60 on an ASUS Prime B350+ on day one, and it's still close to the same price today. There are a few too many LEDs, but all the regulators have adequate heatsinks if you have some air movement in the vicinity.

I can't fathom why zero-voltage-switching VRMs haven't caught on. Despite this next link, you don't need 48V distribution to do this.
http://www.vicorpower.com/industries-computing/48v-direct-to-cpu



Someone who has put the B350 through some stress testing came to the conclusion that an overclocked 4-core chip is the limit for most, if not all, current boards.
I have posted this several times so users on [H] know what the limitations are, because B350 plus an 8-core is such a good "deal".
 
Really, it's nothing to do with the chipset. Nothing prevents building a B350 board with top-notch VRMs or an X370 board with absolute junk.

Just try to find a model that will meet your needs.
 
B350 boards can have up to a 4+3+2-phase VRM (CPU+SoC+RAM), but most B350 boards are really 3+2+1 or so. Some look like they have more phases, but they actually only have duplicated components for the same phases.
Many X370 boards have a 6-phase VRM for the CPU.
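The phase-count argument can be made concrete with a first-order conduction-loss estimate. This is a simplified sketch, assuming the CPU current I splits evenly across n identical phases, each with effective resistance R, and ignoring switching losses; the 100 A and 5 mΩ figures are hypothetical, not measurements of any board mentioned here:

```latex
\begin{aligned}
P_{\mathrm{cond}} &= n\left(\frac{I}{n}\right)^{2} R = \frac{I^{2} R}{n},
  \qquad I = 100\,\mathrm{A},\; R = 5\,\mathrm{m\Omega} \;\Rightarrow\; I^{2} R = 50\,\mathrm{W} \\
n = 3:&\quad \frac{I}{n} \approx 33.3\,\mathrm{A}, \quad P_{\mathrm{cond}} = \frac{50\,\mathrm{W}}{3} \approx 16.7\,\mathrm{W} \\
n = 6:&\quad \frac{I}{n} \approx 16.7\,\mathrm{A}, \quad P_{\mathrm{cond}} = \frac{50\,\mathrm{W}}{6} \approx 8.3\,\mathrm{W}
\end{aligned}
```

In this model, halving the per-phase current quarters each phase's I²R loss, so the same CPU load turns into less heat per component, which is why boards with fewer phases lean harder on their VRM heat sinks under sustained 8-core load.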

If you are going to run more than four cores, the board will also need heat sinks on the VRM, which not all B350 boards have.
And there were some settings that I don't really understand yet that the X370 did have but the B350 did not.

And BTW, where I live, a motherboard can cost twice as much as it does in the US, and we don't get all of them. So choosing becomes more important. The ridiculous, unnecessary shortage of a good mATX board for Ryzen has held me off for months now.
 
In case the OP wasn't sufficiently clear, this doesn't have anything to do with chipsets or overclocking. The problem is appearing on non-overclocked chips running on X370 mobos.

This is a CPU bug, and it was communicated to AMD in April.
 
In case the OP wasn't sufficiently clear, this doesn't have anything to do with chipsets or overclocking. The problem is appearing on non-overclocked chips running on X370 mobos.

This is a CPU bug, and it was communicated to AMD in April.

Yet the OP has been debunked on almost all counts and has not been able to post any valid information in any of his links, let alone a quote from a pertinent source about the exact issue.
https://hardforum.com/threads/some-...vy-compilation-loads.1936605/#post-1043046431

The YouTube video acknowledges that you cannot run an 1800X under high load on B350 without problems. The OP has yet to understand what this means: when hardware is forced to run at full load, it either crashes or fails, whichever comes first.

Spouting your own opinion on this matter does not make it so.
 
Some are still in denial mode I see. Meanwhile AMD is investigating the issue...

Your links don't prove any of this. Why would you troll this forum with information from April? What purpose do you have in showing links that are neither from April nor confirm any issues?

If I went to the Intel forum now and posted about Intel's dead C2000 series, I would get a warning or a ban, months after it happened.
 
Better this than the runaway train that is the "Remote security exploit in all 2008+ Intel platforms", which is fused off in desktop models, or is it?

This is a showstopper bug for me at home and at work. It absolutely needs to be fixed before I can consider AMD for a workstation purchase (which I am in the market for, both at home and at work).
 
This is a showstopper bug for me at home and at work. It absolutely needs to be fixed before I can consider AMD for a workstation purchase (which I am in the market for, both at home and at work).

The Intel-only remote security exploit, caused by the chip's security component having effectively root-level, bare-metal access beyond the reach of the operating system, is a showstopper for buying an AMD processor?

Or were you saying the AMD micro-op code issue, which generates unexpectedly slow behavior under heavy Linux loads when SMT is enabled, is the showstopper for you?
 
AMD micro-op code issue, which generates unexpectedly slow behavior under heavy Linux loads when SMT is enabled, is the showstopper for you?

That would be my usage (heavy loads under Linux) both at home and at work. And from my reading, it's not just slow behavior. It ranges from application instability and crashes to total system lockups under high load.
 
On several of the linked threads, users were still able to reproduce this last week, with several theories about the cause and several workarounds that seemed to reduce the chances of hitting this bug.
 
On several of the linked threads, users were still able to reproduce this last week.
Shit. Sad. I need this to be solved because my main use case will be compiling tons of stuff all day long.
 
Last week I bought a Ryzen 5 1600X with an Asus Prime X370-Pro and 2x8GB DDR4-2666 for office use. I can't say the board was expensive at 155 €, nor was the R5. What I can say is that it beats my Kaby Lake in every discipline other than gaming. No issues from A to Z; it even overclocks all cores to 4 GHz (again with this cheap Asus board; I'm starting to love the Prime series) and runs cool at 60-65 °C after 1 h of p95 AVX.

I may boot Ubuntu LTS from a stick and test it myself... That would be a really bad situation for AMD... NOOOOOO!

Not being able to compile is a BAD THING in my world as well. Same goes for virtualization support.


AMD, have you fucked it up again?
 
I see that some have mentioned that AGESA 1.0.0.6 fixes the problem. But then others are still reporting the crashing. It is unclear what exactly they have tried however.
 
This is a showstopper bug for me at home and at work. It absolutely needs to be fixed before I can consider AMD for a workstation purchase (which I am in the market for, both at home and at work).

And you forgot the part I linked above, where the poster on Phoronix says he does not experience the bug himself. Some people do and some don't. But that is not my issue: all the links describe different problems with different causes.

This post makes sense, somewhat, maybe:

Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads
Written by Michael Larabel in AMD on 2 June 2017 at 04:09 PM EDT. 108 Comments
I haven't encountered this issue myself on any of my Ryzen Linux boxes, but it seems there are a number of Ryzen Linux users who are facing segmentation faults and sometimes crashes when running concurrent compilation loads on these Zen CPUs. A Phoronix reader pointed out some of the resources for Ryzen Linux customers facing problems namely when running heavy compilation tasks, like on Arch and Gentoo. AMD hasn't yet found the root cause of this issue, but given the spread of users affected, appears to be related to the processor itself. Those interested in learning more about these Ryzen compilation issues can find a number of open threads on the matter such as on the Gentoo forums, AMD Community, as well as some entries via this Google Doc tracking Gentoo users having the problem. AMD is expected to update their community thread when a solution is found. Some workarounds include fiddling with Load Line Calibration (LLC) from the BIOS and some users have found success if disabling the SMT functionality while others are still encountering the problems even if they turn off SMT on their Ryzen 7 CPUs. The issue is happening on multiple versions of GCC but I haven't seen any reports when using LLVM/Clang or alternative compilers.

So, Drescherjm, does the bold part make sense to you: fixing a micro-op cache bug by fiddling with load line calibration?

Not sure if they tested whether the same problems would arise under Windows. What makes Linux so special? ;)
 
Not sure if they tested whether the same problems would arise under Windows. What makes Linux so special?

Possibly the way the OS schedules threads makes it less likely on Windows.
 
Possibly the way the OS schedules threads makes it less likely on Windows.

Or the combination of an unfinished BIOS, a maybe not-that-well-optimized kernel and some user error (someone posted about disabling SMT not working). But I'll stick to my comment that running a 6- or 8-core chip on current B350 boards is asking for problems (especially under heavy load for a sustained time).
 
.....CUT

So, Drescherjm, does the bold part make sense to you: fixing a micro-op cache bug by fiddling with load line calibration?

...CUT....

Why not? LLC may stabilize some circuits that would otherwise crash and produce all sorts of errors. Been there myself, not while compiling but when overclocking and stress testing.

Undervoltage or too little current can cause strange things to happen.

If it cures it, why not? It would be an easy fix, though maybe not with a B350 board, which by definition isn't exactly a VRM wonder.
 
Why not? LLC may stabilize some circuits that would otherwise crash and produce all sorts of errors. Been there myself, not while compiling but when overclocking and stress testing.

Undervoltage or too little current can cause strange things to happen.

If it cures it, why not? It would be an easy fix, though maybe not with a B350 board, which by definition isn't exactly a VRM wonder.

It is not what the OP stated: a micro-op cache bug that can be fixed by load line calibration. It would be a very dubious problem if that were needed on a stock platform, and it would also be much more apparent in stress testing instead of showing up exclusively under Linux.

Load line calibration is in general only needed when a CPU comes out of idle or sleep and gets a stupid amount of work to deal with, without being able to switch to a higher voltage fast enough to handle it.
It makes little to no sense that this would actually "fix" a cache bug. Given the current state of AM4 BIOSes, I wonder whether it really has an impact overall; if it does, it shows that some systems probably don't get enough power under load, rather than a bug in the micro-op cache (because that is all LLC really does).
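For reference, a simplified static load-line model: under load, the delivered core voltage droops below the set point in proportion to the current, and LLC works by reducing the effective load-line resistance R_LL. The numbers below (1.35 V set point, 80 A, 1 mΩ versus 0.5 mΩ) are hypothetical, chosen only to show the size of the effect:

```latex
\begin{aligned}
V_{\mathrm{core}} &= V_{\mathrm{set}} - I_{\mathrm{load}}\, R_{\mathrm{LL}} \\
R_{\mathrm{LL}} = 1\,\mathrm{m\Omega}:&\quad V_{\mathrm{core}} = 1.35\,\mathrm{V} - 80\,\mathrm{A} \times 1\,\mathrm{m\Omega} = 1.27\,\mathrm{V} \\
R_{\mathrm{LL}} = 0.5\,\mathrm{m\Omega}:&\quad V_{\mathrm{core}} = 1.35\,\mathrm{V} - 80\,\mathrm{A} \times 0.5\,\mathrm{m\Omega} = 1.31\,\mathrm{V}
\end{aligned}
```

In this model LLC only changes the voltage the CPU receives under load; it does not alter anything inside the CPU.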

You could say that a workload such as rendering in Blender or Cinebench would show nearly the same symptoms as compiling source with GCC on Linux. That is why I am wondering whether GCC under Windows would show the same problem(s).
 
Not sure if they tested whether the same problems would arise under Windows. What makes Linux so special? ;)
Windows generally does not have make -j9. And no, Prime95 sucks balls as a stress test compared to it, especially P95 on Ryzen.
 