Can an NVMe M.2 induce system instability?

Discussion in 'SSDs & Data Storage' started by Formula.350, Jun 12, 2018.

  1. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    Because that seems to be exactly what's happening to me, and the M.2 drive is the only newly introduced thing that I can put my finger on as the culprit :\ So I'm wondering if this is something other people have heard of happening? Beyond a situation where BClk is raised, that is, which on most Ryzen boards also increases the PCIe frequency (1:1)...


    Now for the "too-long" portion:

    System details that are different than my sig: Running at 3.7GHz 1.225V (CPU VID will drop to around 1.81V-1.91V), DRAM is at 3200 CL14 with fairly tight subtimings and set at 1.36V (1.37V actual operating), CPU-NB (SoC) is 0.9875V.
    My motherboard does not have BClk capability, so all buses are running at their intended default base-clock speeds. (Save for the Infinity Fabric, since on Ryzen it's tied to half the DRAM data rate, and the maximum official IMC support is DDR4-2667.)
    Windows 10 x64 build 1607 (14393.447) - Metered connection, so I can't afford updating heh
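    For reference, the Infinity Fabric relationship mentioned above is simple arithmetic. A minimal sketch, assuming first-gen Ryzen, where the fabric clock equals the memory clock (half the DDR4 transfer rate); the function name is made up for illustration:

```python
def fabric_clock_mhz(ddr_rating: int) -> float:
    """Memory clock in MHz for a DDR4-<rating> setting.

    Assumption: on first-gen Ryzen the Infinity Fabric runs 1:1 with
    this memory clock, i.e. at half the DDR transfer rate.
    """
    return ddr_rating / 2


# Official maximum, the downclocked setting, and the overclocked setting
for rating in (2667, 3066, 3200):
    print(f"DDR4-{rating}: fabric ~{fabric_clock_mhz(rating):.1f} MHz")
```

    So DDR4-3200 puts the fabric roughly 270 MHz above what it would run at with the officially supported DDR4-2667.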

    Stability was determined at these settings, but nothing extreme. Just HCI's MemTest at a few hundred percent and AIDA64's Stability Stress Test for a couple hours. Gaming, though, has been rock solid over the last year, and I even let my rig mine Monero for about 2 months straight with zero problems.

    During that time my boot drive was an old ass WD VelociRaptor 150GB SATA II, and then I have 2 other HDDs of various makes & sizes (sub TB), a Samsung 850 EVO 120GB, and an ASUS DVD-R/W drive also connected. Last week I also added a 6TB HGST so that I can start to migrate all my data over to it.

    A few months back, during an eBay 30% Off promo code sale, I picked up a Samsung OEM PM961 (the OEM counterpart of the 960 EVO) 240GB M.2 drive; it was a new pull from a laptop. I finally got around to cloning my system to it last week, and I can say with fair certainty that it was a brand new drive, as I had to initialize it after installing it.

    For the record, on my system it's installed on the m.2 that's directly under my R9 390. It has a "heatsink" cover, courtesy of MSI (built into the board) which does have a thermal pad inside; however, giving it the "finger" test after gaming, it's not at all hot or even warm.

    That being said...

    Nothing else has changed other than the new M.2 and the 6TB. The only other thing that's occurred is that I needed to remove my GPU in order to install the drive; otherwise no other hardware was removed or touched (open-bench system). As I said, last week I successfully cloned Win10 off the VelociRaptor and onto the M.2. It boots up just fine. I installed Samsung's NVMe driver that I downloaded from their website ("Samsung_NVM_Express_Driver_3.0.exe"), but I do not have the Magician software installed (it would probably only work with my 850 EVO anyways, due to this being an OEM drive, as their Backup software did not recognize the M.2).

    Yet ever since then I've been having random BSoDs for various reasons (only a couple seemed overclock related, citing a processor thread or memory management) and also sudden program crashes. For example, I'm typing this on my laptop while the desktop is idling on the screen behind me... yet Firefox just crashed while minimized. Out of the blue. The computer has been on and idle for 30 minutes now aside from very minimal interaction (checking Winver, and mousing to the corner to make the windows transparent to see if it was HCI MemTest I had been using), so it's not like there's any heat being generated by anything... I've also been playing Skyrim the past few days, running off the M.2 to test it out, and having virtually no load times for worldspaces is awesome! Yet a few nights ago it would CTD repeatedly, within the first, hmm, 45 seconds of playing? The system also eventually hung when I put it to sleep.

    2 days ago I began to increase some voltages just a tiny bit. Bumped up the VCore. BIOS set to 1.250V, but HWiNFO reads...
    CPU_VID 1.250V
    VDDCR_CPU 1.238V
    Motherboard Vcore 1.280V
    CPU VRM VOut 1.238V

    CPU_VID still drops to the 1.181-1.194V range under a stress-test load, but from my technical reading on how Ryzen behaves, CPU_VID is the Ryzen-controlled voltage, i.e. what the CPU is "requesting", and the VRM output is generally what you want to take as what's actually being fed to the CPU. CPU_VID is the only one that changes under load; the others hold their readings very well. The Motherboard Vcore will rarely, briefly, hit 1.288V, and VDDCR_CPU is actually 1.237V right now, but other than that it's fine.
    NB I eventually bumped up to 1V. HWiNFO reads 1.008V
    Even bumped RAM to 1.37V, which is 1.392V under load.

    The bumps to voltage SEEM to have helped, but not fixed the issue. :\
    In AIDA's stress test, running just the Cache test, it failed within probably 30 seconds. Running just the FPU test, it lasted about 4 minutes before I got a "MEMORY_MANAGEMENT" BSoD... :(


    I guess for now I'm just going to have to disable the overclock on the CPU and try loosening my RAM timings.
    That, or uninstall the Samsung nVME driver and/or the 6TB, and hope the cause somehow lies there. The driver is plausible. The 6TB I'm kinda doubting.


    I'm honestly open to any ideas, theories, suggestions, etc.
    (Re-installing Windows would be a last-resort, as I'm really happy with how I have it configured ATM, hence why I cloned in the first place. heh)

    Thanks :)
     
  2. TheSmJ

    TheSmJ 2[H]4U

    Messages:
    2,731
    Joined:
    Dec 20, 2006
    Do you see any storage related errors in Event Viewer/System?
     
  3. mwroobel

    mwroobel [H]ardness Supreme

    Messages:
    4,842
    Joined:
    Jul 24, 2008
    My first suggestion is a fresh install of your OS of choice; sometimes cloning existing machines and changing your boot type (e.g. AHCI to RAID, SATA to NVMe) can introduce issues. Do a fresh install and run it at stock, and see if the problems disappear. Run it for a bit and then try to clock it up, still with a fresh OS. See if that resolves the issues. Without a second system with the same components, it is impossible to rule out a hardware failure of some kind, but start at stock and see if that makes a difference. What is the BSOD error code? Is it the same code every time, and do you have a dump file to scan?
     
  4. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    Not that I can see. Fortunately, and unfortunately. heh It's nice that there aren't any, meaning things are otherwise working correctly... but it would sorta be nice to have some in order to have a lead on what the hangup is.

    All it provided beyond "Information" level alerts was the one "Critical" event stating the obvious: The system didn't shut down properly. I nominate M$ System Events for the next Hotels.com "Captain Obvious" actor..


    For clarification, whether it is of consequence, it's still not being treated as a UEFI drive.
    Definitely know the pitfalls of "recycling" a Windows install, but it has actually never come back to bite me in the ass yet... lol

    Anyways, in all my years of computing I've very rarely run into issues that had enough severity (or mystery) that required me to dig through Event Logs or have Crash Dumps... As such, I'm not entirely sure where one would even find them, but it doesn't seem to be where it says it should be (unless SystemRoot is not C:\, but it isn't on my PageFile drive, either).
    "Startup and Recovery" configuration is as follows:
    Code:
    System failure:
    [X] Write an event to the system log
    [X] Automatically restart
         Write debugging information:
            "Automatic memory dump"
         Dump file:
            %SystemRoot%\MEMORY.DMP
         [X] Overwrite any existing file
         [ ] Disable automatic deletion of memory dumps when disk space is low

    And C:\ (the m.2) is 170GB free of 238GB, so that last one being unchecked shouldn't be impacting anything :\

    According to the M$ Documentation, it's apparently in the actual PageFile?

    In other words, you all have my permission to treat me like an idiot child if needed :p

    Again, I'm prepared to do the reinstall, but I really hate trying to re-configure Windows these days, in order to get it setup just the way I like it. There's just way too much shit that M$ has setup in ways I don't like, and I haven't documented what I've done either, so it's a bit hard to anticipate I'd end up getting all the under-the-hood changes in place. Alas, I will ultimately do so (or at least make a temporary partition for it) if that's what it'd take to solve the instability.

    EDIT: I've changed it to create a "Full Memory Dump" and set the pagefile to my 6TB at >16GB so that it can actually store it all. Ah and I'll go change the dump location to there as well, since it's clearly not writing to C:\ as indicated.

    Actually... Now that I think about it, if the Dump is in the pagefile, there WAS a small PageFile.sys located on C:\, which I found odd since I have all paging disabled on the SSDs. Was only a few hundred MBs. Nevermind, this was SwapFile.sys

    Time to game and invoke a crash I guess... heh
     
    Last edited: Jun 12, 2018
  5. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    Took a while but it finally happened. This one was after Skyrim crashed, which looked to be due to the graphics driver taking a dump, though I'm not fully sure. All I do know is that it crashed because the audio device (on my R9 390, as I use the HDMI audio) was no longer recognized. Troubleshooter was useless, saying "Oh, it's just cuz your speakers aren't plugged in", and so I manually executed "audiodg.exe" from Run... which prompted the BSoD.
    IRQL NOT LESS OR EQUAL

    Before the BSoD I had checked Event Viewer and there was absolutely no mention of any issues within the last hour :|

    As far as ones after rebooting: there are 5 warnings. 4 are "Kernel-PnP" complaining about a USB driver failure, which I'm pretty certain is my thumb drive (hope it didn't kick the bucket, all my BIOSes are on it).
    Then 1 "Wininit" which says "Custom dynamic link libraries are being loaded for every application. The system admin should review the list of libraries to ensure they are related to trusted applications". That doesn't seem relevant, but there it be :p


    I see, there's why the crash dump wasn't writing. HAS to be on the Boot Drive -.- Guess I change it yet again.

    There are a number of Errors but that's a lot to try and convey...
    HOWEVER... This one is possibly relevant:
    Event 1001, BugCheck
    The computer has rebooted from a bugcheck. The bugcheck was:
    0x0000000a (0xffffe006fc85a80, 0x0000000000000002, 0x0000000000000000, 0xfffff803c22e3b30).
    A dump was saved in C:\WINDOWS\Minidump\061218-7828-01.dmp.
    Report id: 25765391-0ba5-4a09-b079-313f972d40f4
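
    For anyone reading along, the four values in parentheses have fixed meanings for bug check 0xA (IRQL_NOT_LESS_OR_EQUAL): the memory address referenced, the IRQL at the time, an access bitfield (bit 0 clear means read, set means write), and the address of the instruction that touched the memory. A rough sketch that splits the logged string apart; the helper names are invented for illustration, and the first parameter is copied exactly as it appears in the event text:

```python
import re

# Bug check string copied as it appears in the Event 1001 text
EVENT_TEXT = (
    "0x0000000a (0xffffe006fc85a80, 0x0000000000000002, "
    "0x0000000000000000, 0xfffff803c22e3b30)"
)


def parse_bugcheck(text):
    """Return (code, [p1, p2, p3, p4]) as integers from 'code (p1, p2, p3, p4)'."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9a-fA-F]+", text)]
    return values[0], values[1:]


def describe_0x0a(params):
    """Label the four parameters of bug check 0xA."""
    address, irql, access, instruction = params
    return {
        "memory_referenced": hex(address),
        "irql": irql,  # 2 is DISPATCH_LEVEL
        "operation": "write" if access & 1 else "read",
        "instruction_address": hex(instruction),
    }


code, params = parse_bugcheck(EVENT_TEXT)
print(hex(code), describe_0x0a(params))
```

    Here that works out to a read at IRQL 2. The labels alone don't say which driver was responsible; that still needs the dump file.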

    So mwroobel, here it be in all its miniature glory. If I manage to obtain a full dump I'll be sure to provide it!
     

    Attached Files:

  6. Denpepe

    Denpepe Gawd

    Messages:
    684
    Joined:
    Oct 26, 2015
    Any chance you can put the old drive in and see if it works as normal? Could help troubleshooting.
     
  7. sirmonkey1985

    sirmonkey1985 [H]ard|DCer of the Month - July 2010

    Messages:
    20,426
    Joined:
    Sep 13, 2008
    every time i've seen that bsod error it's almost always been tied to memory stability issues for me. if it was me, i'd default everything including turning off the XMP profile, see if you have the issue come up again.. if it does, try the old hard drive. if it doesn't crash then try adding the xmp profile back, and slowly add 1 thing at a time til it fails again. the other thing is check and see what sata ports on your board are tied to the cpu and which ones might be shared with the m.2 slots, see if you're possibly running one of your other drives off it.

    as far as voltages go if none of the above helps try bumping the SoC voltage to 1.1v and also check and see if your board has an LLC voltage option for the cpu to clean up the vdroop.
     
  8. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    There is, yes, and that was indeed going to be another of my "Last Resort" options before I reinstalled windows. Being such an old and small drive, I had no plans to repurpose it at this time, so it's just sitting on a shelf right now.

    Yea, just to clarify for everyone, I do not think the NVMe drive is specifically the direct cause of the instability, as in I don't think it's faulty or anything. My suspicion is that its presence on the bus is somehow causing it. Like, I know that with Ryzen everything is rather intricately tied together, versus (as I understand it) Intel, where they've more or less segregated all of the components on a chip level. For example, on Ryzen there's no baked-in way to raise the BClk without also raising the PCIe and SATA frequencies, unless the motherboard makers design in external clock circuits for each (IOW, AMD has taken a rather old-school approach, unfortunately). So I can kind of see how a high-speed primary drive like an M.2 NVMe SSD could possibly be affected by overclocking. It just seems peculiar that it hadn't manifested before and does now with minimal effort. Also peculiar is that it's very random.... borderline schizo.... in when it manifests. For example, two days ago I was running HCI MemTest again. At the beginning one of the 8 instances tripped a single error, but then there were zero errors afterwards for 250% on all instances. Yesterday was not the same outcome, producing quite a few errors.

    Anyways, I DO like the theory of it perhaps riding on the same shared SATA as another drive as somehow causing an issue, even if the SSD is running in PCIe mode. So just in case, looks like I'd want to free-up SATA port 5:
    [image: Yc7MMf4.png]

    As for LLC, I do have it set to Level 1, which MSI realllly dropped the ball IMO when it comes to LLC. Level 1 is the only one that (according to their information in BIOS) offsets with an increase in voltage under load. All the others essentially are varying degrees of lowering it... which I simply don't comprehend the logic behind as to me that would imply having to run a much higher voltage for IDLE. *heavy shrug*
    Anyways, as I've had this setup since Ryzen launched last year, I dialed it in long ago. I never even bothered trying to go higher than 3.7GHz (a couple attempts from Windows but that's it), so my goal was never to push it to the limit. I've also been blessed with being able to use DDR4-3200 from the shipping BIOS, and also never needed much on the SoC like others have. (Also read on Overclock.net of a guy having no issues with 3466 at ~0.94V more recently, so I think the SummitPI 1.0.0.6 AGESA and newer have made that higher SoC voltage... well, no longer needed heh)


    Just to start knocking things off the list, I've uninstalled the Samsung nVME drivers to try that. Update: Problem persists.
    If things still go sideways, I'll remove all the other HDDs (as they're not technically needed ATM), see if that helps. Update: Didn't help *sigh*
    Next will be removing my overclocks; CPU first (Update: Nope...), then RAM if needed.
    After that it'll be restoring said overclocks and booting off the old HDD, but still having the M.2 SSD installed (to determine if it's some issue with the clone).
    THEN it'll be booting off that drive but with the m.2 removed if things still are problematic.

    If none of those solve it, my last thoughts are that having removed my GPU to put the nVME in, is somehow to blame for all of this. Not quite sure how, but I've certainly encountered weirder gremlins in my day.


    I appreciate all the ideas, so by all means keep 'em coming if you have one :)

    EDIT: Speaking of 'all ideas'... Anyone think, considering I have a first-batch CPU, that somehow this could be related to the SegFault issue? I understand that it mainly impacts the Linux folks, but figured I'd pitch the idea...
    My BIOS does have the option to disable OPCache to try and mitigate the issue, so if anyone finds it plausible, I can add that to the "Shakedown" list since it's easy enough to do.


    Update: :( Skyrim just randomly crashed on me, so that would appear to rule out the Samsung driver. Unfortunate. Would've sure been nice if it was that simple. lol
     
    Last edited: Jun 17, 2018
  9. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    After yesterday with only the one Skyrim crash, I didn't have any other issues for hours. Today I was just mainly idling, referencing stuff on the laptop while changing the INIs on the desktop. Idle for a minute or so, moved the mouse and randomly threw a BSoD!
    KMODE_EXCEPTION_NOT_HANDLED
    What failed: Wdf01000.sys

    mwroobel ... With the dump file being 16GB though, I'll definitely have to carry out any sort of scanning on my end. What sort of thing(s) do you want me to run on it and figure out?

    That being said... if this is of use, here is some info from the Event Viewer on it just whining about shutting down improperly, but the other events didn't log this info:
    Code:
    EventData
      BugcheckCode: 30
      BugcheckParameter1: 0xffffffffc000001d
      BugcheckParameter2: 0xfffff8054c253eac
      BugcheckParameter3: 0xfffff801fd1dc940
      BugcheckParameter4: 0xffffd9087872e098
      SleepInProgress: 0
      PowerButtonTimestamp: 0
      BootAppStatus: 0
      Checkpoint: 0
      ConnectedStandbyInProgress: false
      SystemSleepTransitionsToOn: 1
      CsEntryScenarioInstanceId: 0
    Mainly it's the Bugcheck stuff that I hope is of use.
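
    A note on reading those fields: Event Viewer logs BugcheckCode in decimal, so 30 is 0x1E, and for that bug check the first parameter is the exception code that went unhandled (logged sign-extended to 64 bits). A small sketch of the conversion; the lookup tables are deliberately partial and only cover codes that have come up in this thread:

```python
# Partial tables: only the stop codes seen in this thread, plus one NTSTATUS
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1A: "MEMORY_MANAGEMENT",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}
NTSTATUS_NAMES = {0xC000001D: "STATUS_ILLEGAL_INSTRUCTION"}


def bugcheck_name(decimal_code):
    """Turn the decimal BugcheckCode from Event Viewer into hex plus a name."""
    name = BUGCHECK_NAMES.get(decimal_code, "unknown")
    return f"0x{decimal_code:02X} {name}"


def exception_status(param1):
    """Drop the sign extension and look up the 32-bit NTSTATUS value."""
    status = param1 & 0xFFFFFFFF
    return NTSTATUS_NAMES.get(status, hex(status))


print(bugcheck_name(30))
print(exception_status(0xFFFFFFFFC000001D))
```

    So this particular crash was kernel code hitting an illegal instruction, which by itself doesn't separate a bad driver from unstable hardware.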

    Downloaded the stupid-massive Debugging Tools for Windows package. Loaded up the 16GB dump. This one seems to indicate it was my Mouse and/or the driver for...the USB port? That is the cause, assuming I'm reading it right, that is. Which would indeed coincide with having moved the mouse and getting a sudden BSOD. Won't really have more data to compare it to until another BSOD turns up.
    Tried loading one of the MiniDumps but was complaining about "Kernel symbols are WRONG. Please fix symbols to do analysis", so it's not workin out. Not sure if it's due to loading it from the MiniDump folder versus the root C: directory, or what. The Mem Dump, it downloaded the symbols all on its own :\


    EDIT: Just got another BSOD. This time it's...
    DRIVER_IRQL_NOT_LESS_OR_EQUAL
    What failed: atikmpag.sys

    Looks like tomorrow I'll continue down the checklist.



    Update: *sigh* Was playing Skyrim for 6+hrs straight, zero issues with all the extra SATA components unplugged except for the 6TB drive... Then as my computer sat idle on the desktop whilst I helped a friend with their own computer issues... it randomly BSOD'd: "MEMORY MANAGEMENT" :|
    Update2: Boy! It must be trying to get my attention, doesn't like me working on the laptop, or something... It loaded the desktop and a moment later, another BSOD! Just a generic, non-driver "IRQL_NOT_LESS_OR_EQUAL"
     
    Last edited: Jun 15, 2018
  10. Formula.350

    Formula.350 Gawd

    Messages:
    929
    Joined:
    Sep 30, 2011
    Alright, small update. I've been crossing off the items on my checklist (post #8) and it looks like it may come down to something having thrown off my RAM's stability, but that's still not a certainty yet, given I'm using Skyrim of all things as the initial test. As we know, Bethesda games are not known for being bug-free, and Gamebryo based games love to Crash To Desktop randomly, even on completely stable machines.

    That being said, I have been running some real tests. Since it's late, it's not been a real stress test length of time, but thus far it has been doing fine at DDR4-3066 (exact same timings as 3200) for 1hr17min on the AIDA "Cache" stress test, and prior to that it passed 30 minutes on the "System Memory" stress test.

    Though, I may need to take a step backwards first, because I had initially given the CPU clock the sign-off with only one CTD of Skyrim before I restarted and downclocked my RAM. I may need to use BSOD or a stress test failure as a more empirical result, not a buggy-game.

    Regardless, I have a much easier time believing that it [the SSD] has caused my RAM/IMC to become unstable than I do that it has caused my CPU to be. Reason being, the CPU is a 1700X with water cooling, and I only had it manually overclocked to 3.7GHz (base is 3.4GHz). Pretty much all of the first gen Ryzen 7s can do 3.9GHz no problem, and I doubt the VCore (1.25V) is too low considering it had been stable at that voltage up until the NVMe install.


    EDIT: Alright the real test begi.... Sure, Steam, go ahead, now is a great time to download a Client update, nevermind the fact there's only 285MB of RAM free! *kills process*
    AS I was saying... lol A real test is underway with HCI Memtest.

    I know that ideally I should have 16 instances open and each on its own thread, but I'm running 8 instead. All set with an equal amount of RAM, 1925MB (15,400 total out of the 15,950, minus OS resources, leaving 250-300MB free currently). Unfortunately I got side-tracked on posting this, and after it's had time to run, I think one, maybe two instances are tapping into VMEM. The majority of them are at 163-171%, 'maybe' one is at only 151%, and the last one is at only 98% :| Guess perhaps I should've gone with my initial gut choice of 1875MB...
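
    The allocation math above, as a quick sketch. It only restates the figures from this post (8 instances, 1925MB or 1875MB each, 15,950MB usable); the function name is invented for illustration:

```python
def memtest_budget(instances, mb_per_instance, usable_mb):
    """Return (total MB handed to the test instances, MB left for everything else)."""
    allocated = instances * mb_per_instance
    return allocated, usable_mb - allocated


# Compare the size actually used with the "gut choice" size
for per_instance in (1925, 1875):
    allocated, left = memtest_budget(8, per_instance, 15950)
    print(f"8 x {per_instance}MB = {allocated}MB allocated, {left}MB left over")
```

    At 1925MB per instance only 550MB remains for Windows and everything else, versus 950MB at 1875MB, which fits with an instance or two spilling into virtual memory.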

    Will let this run for a few hours or until considerable errors show up. I'm seriously hoping the downclocked RAM will do the trick because I'm kinda over this whole fiasco, considering everything had been fine. I still find it plausible that, given Ryzen's architecture, an NVMe SSD on the PCIe bus could cause RAM to destabilize, assuming that the memory was on the literal edge of stability in the first place. I had indeed done my best to tighten the subtimings as far as I could, and am also running the kit at 14-14-14 instead of the advertised 15-15-15 (can't run odd CLs again due to the PinnaclePI AGESA, which was the same case with SummitPI early on until AMD 'unlocked' that around v1.0.0.4, but you have to have GearDownMode disabled).
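
    To put the CL14 vs CL15 and 3200 vs 3066 settings on one scale, CAS latency can be converted to nanoseconds: cycles times 2000, divided by the DDR transfer rate in MT/s. A minimal sketch of that standard conversion; it ignores every subtiming other than CL, and the function name is made up:

```python
def cas_latency_ns(cl, ddr_rating):
    """CAS latency in nanoseconds: CL cycles at a clock of ddr_rating / 2 MHz."""
    return cl * 2000 / ddr_rating


# Current overclock, the kit's advertised timings, and the downclocked test setting
for cl, rating in ((14, 3200), (15, 3200), (14, 3066)):
    print(f"DDR4-{rating} CL{cl}: {cas_latency_ns(cl, rating):.2f} ns")
```

    By this measure DDR4-3066 at CL14 still comes out slightly tighter than the kit's advertised CL15 at 3200.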


    UPDATE: Well isn't that interesting.... o_0
    I was planning to stop them at 270% as each reached it, but was caught up in a video so the majority hit >300%. The last one, that straggling instance, was the only one left running. I look up and see it had reached 310% and so I was going to stop it...

    Computer is frozen. Actually, according to the clock, it's been frozen for 22 minutes! :\

    All 8 instances reported zero errors this entire time, and yet I sit here looking at one heck of an error that's stemmed from SOMEwhere :( What confuses me the most is the fact that it happened when the system was not under a load! (I don't consider one thread at 100%, or what Windows calls 6% total CPU time, to be under "load")

    Don't think anyone is still reading this thread, but if so...
    Any ideas on where to look for the cause of "Windows freezing"?
     
    Last edited: Jun 19, 2018