Need help with random BSODs - strongly suspect CPU

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
I've been having random BSODs for a while now, and much more frequently lately. And when I say random, I really mean random. Just in the past week I've had stop codes 0x192, 0x139, 0x3B, 0x50, 0xA and 0xE3, all with different triggering processes and call stacks. Analyzing the memory dumps with WinDbg hasn't turned up anything useful, and WhoCrashed always insists that the crashes are bugs in an unknown driver and unlikely to be caused by hardware failure. The system will also occasionally lock up without a bluescreen, a few seconds after a weird glitch like the start menu failing to display properly or my music suddenly going into a 2-second loop (the size of the audio output buffer in the application), and on one such occasion a LiveKernelEvent 1A1 was recorded in the event log. The crashes do not appear to be affected by load level, as they occur just as often when the system is idle as when I'm using the PC or playing a game. I've noticed unusual instability in some programs as well, particularly games running on the Unity engine, many of which crash regularly on this system.
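For reference, the six stop codes listed above can be translated into their symbolic names. This is only a small lookup sketch: the names are the ones Microsoft documents for these bug check codes, while the `describe_stop_codes` helper is a made-up name for illustration. The spread of unrelated categories (IRQL, page fault, lock ownership, security check) is consistent with the "really random" description.

```python
# Symbolic names as listed in Microsoft's bug check code reference.
BUG_CHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0xE3: "RESOURCE_NOT_OWNED",
    0x139: "KERNEL_SECURITY_CHECK_FAILURE",
    0x192: "KERNEL_AUTO_BOOST_LOCK_ACQUISITION_WITH_RAISED_IRQL",
}


def describe_stop_codes(codes):
    """Return (hex code, name) pairs in the order given (hypothetical helper)."""
    return [(f"0x{c:X}", BUG_CHECK_NAMES.get(c, "UNKNOWN")) for c in codes]


# The codes reported in the post for the past week.
past_week = [0x192, 0x139, 0x3B, 0x50, 0xA, 0xE3]
for code, name in describe_stop_codes(past_week):
    print(code, name)
```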

Troubleshooting steps so far: extensive RAM testing, removing pairs of RAM sticks, both downgrading and updating the BIOS, updating motherboard chipset drivers, updating video drivers, updating Windows, CPU and GPU stability tests, and moving the GPU and SSD to the other slots on the motherboard. The only thing that's made any difference is the BIOS update, which actually made the BSODs much more frequent. That update also fixed the recent "power reporting deviation" issue, so I suspect the extra power the CPU was getting before was helping stability. The system has never been overclocked (I did try PBO briefly but found it did not boost performance at all and turned it off shortly after). All system temps are within normal ranges even under heavy load.

Finally this morning I ran Prime95's stress test and all threads running on core 6 (verified with Ryzen Master) error out almost immediately and threads on core 4 also tend to error within a couple minutes when running the highest level of the stress test. As I've not used this troubleshooting method before, is this a definite indication of a faulty CPU or could the motherboard still be the culprit?

Hardware:
AMD Ryzen 7 3700X w/ Gammaxx 400 cooler
4x Crucial Ballistix 8GB DDR4-3200 16-18-18-36
ASRock X570 Phantom Gaming 4
NVIDIA GTX 1070 (from old PC, no reason to suspect any issues)
Corsair RM750x power supply (also from old PC, and only 2 years old or so)
Adata XPG SX8200 Pro 512 GB NVMe SSD
Windows 10 (May 2020 update)

Edit: Also downloaded OCCT and ran the standard OCCT test, which started spitting out 30+ errors per second shortly after starting.
 

Psycrow

Limp Gawd
Joined
Feb 26, 2010
Messages
446
Hard to say exactly...I will give you something to work with here.

Try giving your RAM or CPU a little more or less voltage in the BIOS.
Fiddle around in the BIOS and disable all the smart and fancy stuff.

Disable everything you can in the BIOS and test.
Then enable things one by one again.

Has it been working from day 1?
Also try unplugging the graphics card and booting with onboard graphics.
Disconnect any DVD, USB, or SSD drives besides the Windows 10 drive.

Not a fix, but some ideas that might help.
 

primetime

Supreme [H]ardness
Joined
Aug 17, 2005
Messages
6,605
-- Try relaxing the RAM timings just to rule that out. Motherboards and their BIOS versions sometimes have issues with the RAM's factory timings.
-- I would also try bumping the CPU and/or RAM voltages to see what effect it has, because undervolting can have similar effects (BIOS bugs).
-- It is possible the power supply is at fault. I would check the rails with a meter both while stress testing and at idle.
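For the rail check suggested in the last point, here is a minimal sketch of how multimeter readings could be compared against tolerance. It assumes the usual ATX limit of +/-5% on the +12 V, +5 V and +3.3 V rails; the function names and the sample readings are made up for illustration and are not measurements from this system.

```python
# Assumed tolerance: +/-5% on the positive rails, per the ATX specification.
RAIL_TOLERANCE = {12.0: 0.05, 5.0: 0.05, 3.3: 0.05}


def rail_in_spec(nominal, measured):
    """True if the measured voltage is within tolerance of the nominal rail."""
    allowed = nominal * RAIL_TOLERANCE[nominal]
    # Small epsilon so a reading exactly on the limit is not rejected by rounding.
    return abs(measured - nominal) <= allowed + 1e-9


def check_rails(readings):
    """Map each nominal rail to True/False for a dict of {nominal: measured}."""
    return {nominal: rail_in_spec(nominal, v) for nominal, v in readings.items()}


# Hypothetical readings taken under load, for illustration only.
under_load = {12.0: 11.92, 5.0: 5.04, 3.3: 3.31}
print(check_rails(under_load))
```

Taking one set of readings at idle and one during a stress test, as suggested above, would show whether a rail sags out of range only under load.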
 

sinisterDei

[H]ard|Gawd
Joined
Dec 1, 2004
Messages
1,221
If you think the problem is with cores 4 and 6, then I would use Ryzen Master to simply disable them and try again without them on your system. If you can turn off cores and your problems are massively reduced / go away, then you're likely looking at a CPU problem.

The motherboard, and power supply, are still possibilities, but checking the CPU can happen first.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
Has it been working from day 1?
I actually think it's been a problem since I built the PC (I put it together right after the Ryzen 3000 release). At the time I was getting a bluescreen every 2 months or so which I couldn't really diagnose, and I ended up just assuming it was something that would go away with BIOS/driver updates. It's gotten progressively worse, particularly in the past 6 months. I also suspect that this failure is the reason I've had so much trouble with games running on the Unity engine crashing on this system, and that's been an issue since I put it together.
-- Try relaxing the RAM timings just to rule that out. Motherboards and their BIOS versions sometimes have issues with the RAM's factory timings.
-- I would also try bumping the CPU and/or RAM voltages to see what effect it has, because undervolting can have similar effects (BIOS bugs).
-- It is possible the power supply is at fault. I would check the rails with a meter both while stress testing and at idle.
I can adjust some RAM timings, but the Prime95 test I ran is specifically designed to stress the CPU and cache and not the RAM so I wouldn't expect it to cause a problem, especially so quickly in that test. I have also done pretty extensive RAM testing during my troubleshooting process and found no problems there.

One thing I've noted is that the recent BIOS update made my system much less stable. I don't think it's because it's undervolting my CPU now, but it was potentially overvolting it before (improving stability) due to the power reporting deviation issue. In any case all my voltages read normal and are rock solid during testing, at least according to all the sensors in my PC. How would I go about testing them with a meter? I don't think motherboards tend to have test pads outside of super high end boards.
If you think the problem is with cores 4 and 6, then I would use Ryzen Master to simply disable them and try again without them on your system. If you can turn off cores and your problems are massively reduced / go away, then you're likely looking at a CPU problem.

The motherboard, and power supply, are still possibilities, but checking the CPU can happen first.
I did not know that Ryzen Master could do that. I just disabled those two cores and OCCT showed no errors in 10 minutes when it had 400+ errors in 30 seconds previously. I think that's pretty definitive evidence, personally, though CPU failures are so rare I don't think I've ever met anyone who's dealt with one so I have no point of comparison.
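To put the before/after numbers above in perspective, here is a rough back-of-the-envelope sketch. It treats "400+ errors in 30 seconds" as a lower bound of 400 and assumes the error rate would have stayed constant, which is a simplification; the function names are made up for illustration.

```python
def errors_per_second(errors, seconds):
    """Average error rate over a test run."""
    return errors / seconds


def expected_errors(errors_before, seconds_before, seconds_after):
    """Errors expected in the later run if the earlier rate had continued."""
    return errors_before * seconds_after / seconds_before


# With all cores enabled: at least 400 errors in 30 seconds.
# With cores 4 and 6 disabled: 0 errors in 10 minutes (600 seconds).
rate_before = errors_per_second(400, 30)
projected = expected_errors(400, 30, 600)
observed = 0

print(f"rate before: {rate_before:.1f} errors/s")
print(f"projected over 10 min: {projected:.0f}, observed: {observed}")
```

At the earlier rate the 10-minute run would have logged thousands of errors, so a clean run with the two cores disabled is hard to explain by chance.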
 

sinisterDei

[H]ard|Gawd
Joined
Dec 1, 2004
Messages
1,221
AMD will replace your CPU, at least I've heard of folks having decent experiences with them. Please let me know how it goes, but I think that's your path going forward.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
AMD will replace your CPU, at least I've heard of folks having decent experiences with them. Please let me know how it goes, but I think that's your path going forward.
Another interesting note: This morning I tried bumping up VDDCR from 1.1 to 1.15V and that caused OCCT to run clean for 3 minutes, then very suddenly start spitting out errors at the same rate as before. Quite strange.

Also disabled XMP so my RAM's running at 2400 MHz and that changed nothing, as expected.
 

Psycrow

Limp Gawd
Joined
Feb 26, 2010
Messages
446
Are you using the right RAM? Is it on the motherboard's qualified vendor list?
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
Are you using the right RAM? Is it on the motherboard's qualified vendor list?
I don't see the exact RAM model listed, but there are quite a few almost identical models listed from Crucial, and ASRock doesn't seem to do a lot of RAM testing (last time I looked at the list I think there were only 8 models listed from 2 manufacturers). RAM was definitely my first suspicion when I started troubleshooting, but the tests always come up clean so I have no reason to believe there's any problem with the sticks.
 

Nobu

Supreme [H]ardness
Joined
Jun 7, 2007
Messages
4,314
Could be bad local cache on the CPU causing corruption and crashes? Can you run a cache test on the CPU?
 

Nobu

Supreme [H]ardness
Joined
Jun 7, 2007
Messages
4,314
I'd also recommend running any future tests on a fresh install or live media, just to rule out corruption of OS files as a source of errors and crashes, since you've said it's gotten progressively worse.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
Could be bad local cache on the CPU causing corruption and crashes? Can you run a cache test on the CPU?
Yeah bad cache is my suspicion. I've been specifically running the small FFT test in Prime95 which is intended to test all 3 levels of CPU cache. I assume the OCCT test has a similar focus since it's intended to test the CPU specifically.
 

primetime

Supreme [H]ardness
Joined
Aug 17, 2005
Messages
6,605
At this point I would do a quick CPU RMA with AMD, as the results do somewhat indicate a faulty CPU.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
At this point I would do a quick CPU RMA with AMD, as the results do somewhat indicate a faulty CPU.
Yeah I did submit an RMA request to AMD today. Still waiting to hear back. Looking into a new CPU cooler too because the mounting system on the Gammaxx is horrendous and I'm not looking forward to trying to install it again. They seem to have fixed it with a new version of the cooler but I have the old model, unfortunately.
 

Psycrow

Limp Gawd
Joined
Feb 26, 2010
Messages
446
Get a Noctua air cooler.
Get the latest model. It's the best air cooler with the lowest noise... actually there is no noise.

Also ask Crucial whether your motherboard is compatible with your Ballistix RAM.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
Get a Noctua air cooler.
Get the latest model. It's the best air cooler with the lowest noise... actually there is no noise.

Also ask Crucial whether your motherboard is compatible with your Ballistix RAM.
Any particular Noctua you recommend? I know they're fantastic coolers but I have a hard time justifying the price.

And yes Crucial's website indicates that my RAM and motherboard are compatible.

CPU is in transit to AMD. Probably going to be a slow process since it's going FedEx Ground all the way across the country (US).
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
Ok I finally got my replacement CPU today (AMD's RMA process is a bit slow) and got my PC back up and running. Prime95 and OCCT are running clean, so looks like I was right about it being the CPU. Hopefully that's the end of my hardware problems for a few years.

I also bought a Scythe Mugen 5 so I wouldn't have to deal with trying to reinstall that nightmare Gammaxx 400. I was considering the Arctic 34 eSports too, but found several people who had to contact Arctic to get some missing mounting hardware and didn't want any more delays getting my PC back up and running.
 

primetime

Supreme [H]ardness
Joined
Aug 17, 2005
Messages
6,605
Ok I finally got my replacement CPU today (AMD's RMA process is a bit slow) and got my PC back up and running. Prime95 and OCCT are running clean, so looks like I was right about it being the CPU. Hopefully that's the end of my hardware problems for a few years.

I also bought a Scythe Mugen 5 so I wouldn't have to deal with trying to reinstall that nightmare Gammaxx 400. I was considering the Arctic 34 eSports too, but found several people who had to contact Arctic to get some missing mounting hardware and didn't want any more delays getting my PC back up and running.
How are you liking the new cooler? (Same one I use.) I even added a 2nd fan, but I don't know if it really makes a big difference in normal usage. I'm using a graphite pad instead of grease; it works great and is super easy. As soon as the new generation CPUs drop this year I'm going to unload my 2700+ real cheap to an [H] member.
 

sinisterDei

[H]ard|Gawd
Joined
Dec 1, 2004
Messages
1,221
I'm using a graphite pad instead of grease
These add a few degrees; I wouldn't recommend for a 3000 series (or higher). The localized temperature spikes are significantly higher on the 3000 series thanks to the 7nm process node, so you definitely don't want anything impeding cooling.
 

nimbulan

n00b
Joined
Jun 15, 2020
Messages
42
How are you liking the new cooler? (Same one I use.) I even added a 2nd fan, but I don't know if it really makes a big difference in normal usage. I'm using a graphite pad instead of grease; it works great and is super easy. As soon as the new generation CPUs drop this year I'm going to unload my 2700+ real cheap to an [H] member.
Seems to be working fine. From what I've seen adding a second fan won't do much - you might knock 1°C off the temps. I personally use Arctic MX-4 and also installed a Cougar Vortex PWM fan in place of the Scythe fan since it's more powerful but still extremely quiet (did the same with the Gammaxx). I was a bit worried since it seemed like the clamping force was rather low, but cooling performance appears normal so I'm not going to worry about it.

On another note I have run into one odd issue. When first booting the PC yesterday, and again when cold booting this morning, the BIOS apparently failed to set the RAM timings and reset them to default. Going into the BIOS immediately afterward and re-enabling XMP works just fine, however. I did notice that the infinity fabric clock was being set to 1500 MHz for some reason (was on auto) so I manually set it to 1600 MHz. I'm hoping that will take care of the problem though if it doesn't, does anybody have another suggestion? I do find it quite odd that I never had a single issue with RAM timings with the faulty CPU but am with the new working one.

Edit: Well it's actually worse than I thought. I absolutely cannot get it to POST with XMP enabled this morning.

Edit 2: Spent half an hour messing around with RAM settings this morning. Increasing voltage didn't help but I was finally able to determine that the RAS timings seemed to be the problematic ones and relaxing those slightly got it to POST in one try. Hopefully it keeps working. Makes me wonder if the new CPU has a low binned memory controller on it since I had no issues before.
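On the Infinity Fabric setting mentioned above: DDR memory transfers data twice per clock, so DDR4-3200 runs at a 1600 MHz memory clock, and on Ryzen 3000 the commonly recommended setup is a fabric clock matching that (1:1). The sketch below only does that arithmetic; the function names are made up for illustration, and it says nothing about what any particular BIOS picks on "auto".

```python
def memclk_mhz(ddr_rate):
    """Real memory clock in MHz: half the rated DDR transfer rate."""
    return ddr_rate / 2


def fclk_is_1_to_1(ddr_rate, fclk_mhz):
    """True if the Infinity Fabric clock matches the memory clock."""
    return fclk_mhz == memclk_mhz(ddr_rate)


def cas_latency_ns(ddr_rate, cl):
    """CAS latency in nanoseconds for a given DDR rate and CL value."""
    return cl * 1000 / memclk_mhz(ddr_rate)


print(memclk_mhz(3200))            # 1600.0
print(fclk_is_1_to_1(3200, 1500))  # False: the value the BIOS chose on auto
print(fclk_is_1_to_1(3200, 1600))  # True: the manually set value
print(cas_latency_ns(3200, 16))    # 10.0 ns for the kit's rated CL16
```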
 