NiceHash problems

Engr62 · Jul 5, 2018

I've been having a problem with my mining rig over the last week.

My rig consists of the following five GPUs: (2) GTX 1070s, (2) GTX 1060s (6GB), and (1) GTX 970. My motherboard is an ASRock H81 Pro BTC R2.0 with an i3-4130 CPU and 4GB of DDR3. The GPUs are connected to the PCI-E slots using Rebbic VER009S PCI-E risers. I'm running the latest NiceHash miner and my GeForce driver is 398. This rig has been performing flawlessly for about 6 months... until this past week.

When I start mining, things usually (though not always) run smoothly for a few minutes. Then, some error will flash in the output window in red, and the excavator will stop and relaunch. Sometimes it will then mine for a few minutes before the same thing repeats. When it first started, the error message said something to the effect of "CUDA memory error using Daggerhashimoto at line 1835." I opened a support ticket with NiceHash, and they told me to disable Daggerhashimoto for the GTX 970 because it didn't have enough memory for the algorithm. I tried this, but still had the same problem... the excavator flashes an error, then restarts.

So, I tried disabling the Daggerhashimoto algo for all of the cards, but I get the same behavior as before (although with different error messages). I've tried disabling each card individually in the miner and changing out PCI-E risers, but no luck.

Has anyone here had similar problems where the excavator closes, then restarts, and repeats? Is there a log file that will help trouble-shoot this better?

MrRuckus · Jul 5, 2018

Engr62 said:
When it first started, the error message said something to the effect of "CUDA memory error using Daggerhashimoto at line 1835."

I had this same issue in a rig in my garage that had a GTX 980 and a GTX 1080 in it. My only solution was to take out the 980 and put it back in my main box. I'm basically just using spare parts and experimenting with mining. I haven't taken the time to troubleshoot where the problem was coming from, as the 2 cards mine just fine apart. I was using risers in the one rig and have my 980 on a riser in my main rig. They worked flawlessly together for months before I updated either the miner or the drivers and that started occurring. Each card was running on its own 850W PSU, so I know it wasn't power.

I would take out a couple of cards and see how it goes. The 1080 works fine by itself without changing anything, so again, I'm not sure of the exact cause yet.

Parja · Jul 5, 2018

How big is your virtual memory? Try setting it to like 40GB and see if that eliminates the issue.

d.v. · Jul 6, 2018

there's only so many things that will cause those cuda errors:

1. make sure you have enough virtual memory set. the rule of thumb is 1:1 virtual memory to physical memory on the gpus.
2. not enough power from the psu for your overclock/underclock/tdp settings.
3. version of video drivers and/or nicehash miner not agreeing with one or all of your gpus for a certain algo, plus potentially your overclock/underclock settings.

i've had this issue many times and have spent countless hours trying to chase down the issue. i even went as far as to remove the gpu and put it in another rig like what MrRuckus mentioned. in the end, i learned that it's usually 1 or 2 gpus that are acting up. unfortunately, the rig works as a single unit and if one card goes bad, the whole rig drops.

for all my cases with this scenario, it was #3 above. when nicehash comes out with a new algo, they release a new miner and that algo is checked by default. the miner will benchmark the new algo and attempt to mine it when it becomes profitable enough. for me, it was lyra2z. i found out that some of my 1080 and 1080ti gpus did not like this algo for whatever reason. in the end i simply unchecked that algo for some of my gpus and it's been smooth solid since. weird but i guess that's the silicon lottery.

my advice is to leave your rig the way it is, don't mess with any settings. mine with 1 card at a time for about 15-20 min each and see if it crashes. eventually you'll come across the gpu that will crash on its own and that will be your culprit card. make sure you test them all in case you have more than 1 bad gpu. once you isolate the bad ones, repeat with each algo until you find out which algo the card doesn't like then uncheck it and life goes on, or try messing with your overclock/underclock settings until the algo likes it.
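To make that two-pass procedure concrete, here is a minimal Python sketch that only writes out a test schedule. It does not control the miner in any way. The function names and the card labels are made up for illustration, and the 20-minute figure is just the upper end of the 15-20 min suggested above.

```python
from itertools import product


def card_pass(cards, minutes_each=20):
    """First pass: one (card, minutes) run per card, each mined alone."""
    return [(card, minutes_each) for card in cards]


def algo_pass(suspect_cards, algos, minutes_each=20):
    """Second pass: every suspect card against every algo, one pairing at a time."""
    return [(card, algo, minutes_each)
            for card, algo in product(suspect_cards, algos)]


# Card labels are hypothetical; the models match the rig described in this thread.
cards = ["GTX 1070 #1", "GTX 1070 #2", "GTX 1060 #1", "GTX 1060 #2", "GTX 970"]

first = card_pass(cards)
print(sum(minutes for _, minutes in first))  # total minutes for pass one

# Assumption for the example only: pretend the GTX 970 crashed in pass one.
for run in algo_pass(["GTX 970"], ["daggerhashimoto", "lyra2z"]):
    print(run)
```

With five cards at 20 minutes each, the first pass alone is a bit over an hour and a half, which is worth knowing before starting.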

good luck, this is a PITA when it happens. part of the joys of being a miner.

Archaea · Jul 6, 2018

It’s likely your page file, if you never set that up. Early on I fixed some of my random error messages that way. I agree on 1:1 page file size in windows against your GPU memory AND System RAM. (Or at least half if you have huge memory cards like 1080ti - it’s probably actually just 1:1 with the RAM that’s in use vs the capacity — but that’s difficult to judge.)

Failing that:
Try turning off your overclock. Run them at stock core and memory settings and use MSI afterburner to turn them down to about a 75% power level. That ought to be a good stability test. Increase your overclock back slowly until you find your issue.

If you still have problems, fall back to the step of disabling all the cards and adding them back one at a time, in several hour intervals until you find the problem.

But start with increasing your page file. With your cards and System RAM you’ll want a 36GB Page file.

8gb x 2 for 1070
6gb x 2 for 1060
4gb x 1 for 970
4gb System RAM

= 36 GB total (GPU memory plus system RAM)

36x1024 = 36,864MB should be your page file size.
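For anyone redoing that arithmetic for a different rig, a minimal Python sketch of the 1:1 rule of thumb follows. The function name is made up for illustration, and the rule itself is only the rule of thumb described in this thread, not an official sizing formula.

```python
def page_file_mb(gpu_memory_gb, system_ram_gb):
    """Page file size in MB: 1:1 against total GPU memory plus system RAM."""
    total_gb = sum(gpu_memory_gb) + system_ram_gb
    return total_gb * 1024


# 2x GTX 1070 (8 GB), 2x GTX 1060 (6 GB), 1x GTX 970 (4 GB), 4 GB system RAM
gpus_gb = [8, 8, 6, 6, 4]
print(page_file_mb(gpus_gb, 4))  # 36864
```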

No, Windows doesn’t manage your page file well automatically in a mining rig. Yes, you should manually set it.

Engr62 · Jul 6, 2018

Thanks for all of the help.

I had not set my virtual memory size because it had been running fine for so long. I think the problems started with the latest version of NiceHash (2.0.2.5 I think is the version number), so there may be some conflict with the drivers. I updated the drivers from 388 to 398 and still had the problem.

So last night, I set up my virtual memory to be an initial size of 40,000 MB (Max = 45,000 MB). I had previously disabled the Daggerhashimoto algo for all of the cards, so I re-enabled it for all of the cards with the exception of the GTX 970. Setting the page file seemed to help a lot. The rig seemed to mine almost as normal (for an extended period at least).

I moved the excavator output window to a location on my screen other than the default so I'd know if it restarted (as it would be in the default location if it restarted). I checked on the rig a few times over the next hour, and the output window was still where I moved it. So, I left it running for a while longer (maybe about an hour more). When I checked on it again, the output window was at the default location--meaning it had restarted. However, it didn't appear to be restarting frequently because the temperature history of my GPUs (as shown in MSI Afterburner) was steady. Before, when the excavator was restarting frequently, I could see that the GPU temps were dropping, then returning to their normal mining temps over and over.
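The temperature pattern described above can also be checked in a few lines if the readings are exported somewhere. This is a rough Python sketch that counts how often a series of temperatures falls from mining level to below a chosen threshold. The function name, the threshold, and the sample readings are all invented for illustration and are not real data from this rig.

```python
def count_dips(temps_c, threshold_c):
    """Count falls from at-or-above the threshold to below it."""
    dips = 0
    was_hot = False
    for temp in temps_c:
        if temp >= threshold_c:
            was_hot = True
        elif was_hot:
            # First reading below the threshold after a hot stretch.
            dips += 1
            was_hot = False
    return dips


# Invented readings in degrees C: two drops, each followed by a recovery.
restarting = [65, 66, 65, 52, 58, 65, 66, 50, 60, 65]
steady = [65, 66, 65, 64, 65]

print(count_dips(restarting, 60))  # 2
print(count_dips(steady, 60))      # 0
```

A count near zero over a long window matches the "steady" history described above, while a high count matches the earlier behavior of frequent restarts.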

I found the NiceHash log file and did a search for "error" and the only thing that turned up was something about a connection loss. But no specific algo errors that were showing in the output window before setting the virtual memory. I'm not sure if the log file would have that information since I didn't find it before I changed my virtual memory.
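The same search can be scripted so it is easy to repeat after each restart. This is a minimal Python sketch, assuming the log is a plain text file. The function names are made up, the log path is left for the reader to supply, and the sample lines are invented rather than real NiceHash output.

```python
def find_matches(lines, keyword="error"):
    """Return (line_number, text) pairs for lines containing keyword, ignoring case."""
    needle = keyword.lower()
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if needle in line.lower()]


def scan_log(path, keyword="error"):
    """Run find_matches over a log file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_matches(handle, keyword)


# Invented sample lines, only to show the shape of the result.
sample = [
    "miner started\n",
    "ERROR: connection lost\n",
    "reconnected\n",
]
print(find_matches(sample))  # [(2, 'ERROR: connection lost')]
```

Searching for other keywords such as "cuda" or "restart" with the same function may turn up lines that a plain search for "error" misses.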

I guess my next step per d.v.'s advice is to start running it one card at a time.

Archaea · Jul 6, 2018

You may not have a problem anymore.

Perhaps the single restart you saw was related to a profit-switching change rather than an error.

Parja · Jul 6, 2018

d.v. said:
for all my cases with this scenario, it was #3 above. when nicehash comes out with a new algo, they release a new miner and that algo is checked by default. the miner will benchmark the new algo and attempt to mine it when it becomes profitable enough. for me, it was lyra2z. i found out that some of my 1080 and 1080ti gpus did not like this algo for whatever reason. in the end i simply unchecked that algo for some of my gpus and it's been smooth solid since. weird but i guess that's the silicon lottery.

Yeah, Lyra2z works the core a bit harder than the other "typical NVIDIA algos". I had to back down my overclock on a few of my cards to get them back to stable with 2z. It's a pretty profitable algo at times, though, so the slight drop in core clock was well worth it.

Archaea said:
Perhaps the single restart you saw was related to a profit-switching change rather than an error.

With NiceHash 2.0.X.X, everything GPU runs within Excavator and it switches algos on the fly, so that shouldn't really happen.

Engr62 · Jul 7, 2018

Parja said:
With NiceHash 2.0.X.X, everything GPU runs within Excavator and it switches algos on the fly, so that shouldn't really happen.

First, I forgot to mention in my previous posts that I'm not overclocking the GPUs or CPU in my rig. I have the GPUs set at 70% power through MSI Afterburner, and the memory is untouched (+0).

I think I'm still having a problem, although it's greatly reduced with manually setting up the virtual memory.

The excavator will run for quite some time (anywhere from 40 min. to 1 hour) before it restarts and runs again for an extended period. So, I'm not seeing the noticeable dips in the GPUs' temperature histories like I did before manually setting the virtual memory. I've tried looking through the log file, but the only thing I can find when searching for "error" is "connection lost." I've tried watching the output window to see if I can see an error message flash like before, but there doesn't appear to be an error in red like I was seeing before I increased the virtual memory. My plan was to start removing cards today to see if I could find the problematic one. However, I may put that on hold because...

I did a test with a computer that I mine on when not using it for analysis work. It is a dual Xeon E5-2670 system with 64 GB of RAM and an MSI GTX 1060 6GB. I have not changed the virtual memory settings from the default on this system (Windows 10). The GeForce driver is version 388 and the NiceHash miner release is 2.0.2.4 on this computer. Last night, I moved the excavator output window away from its default location on this computer. When I checked it this morning, the excavator must have restarted because it was in its default location.

There is something going on with both of these systems, and it seems to coincide with the Daggerhashimoto algo rising to prevalence when Equihash dropped off.

I would really like to get this fixed, but the BTC levels mined don't seem to be off very much if at all. Maybe it will sort out when a new version of NiceHash is released.
