Help me track down a [potentially] faulty GPU

I've been having stability problems with a trio of different renderers. All three crash and blame memory errors in their logs. All three software makers told me to reach out to Nvidia because that usually means bad hardware. Nvidia says they agree that this is probably due to bad hardware, but they need me to figure out which GPU is at fault and that has to be done via manual hardware swaps.

Since the GPUs are set up in pairs for the cooling, that's where I started. I seemed to get some signal initially and was pretty happy while thinking that I was about to solve the problem. Turns out that was premature. So, now I'm at a loss. Any suggestions on what to try next?

Here is what I've tried so far. Note that at no point has an NVLink bridge been present.
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: GPU 2
    • Slot 3: GPU 3
    • Slot 4: GPU 4
    • Result: Render crash (illegal memory access)
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: GPU 2
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render crash (illegal memory access)
  • Configuration:
    • Slot 1: GPU 3
    • Slot 2: GPU 4
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 4hrs)
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: Empty
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 8hrs)
  • Configuration:
    • Slot 1: GPU 2
    • Slot 2: Empty
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 8hrs)

Motherboard is Rampage VI Extreme, GPUs are Titan RTX, driver is the current Studio release (442.92 - though Nvidia tech support does not think this is driver-related).
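For anyone following along, a quick way to keep the bookkeeping straight while swapping cards is to dump the driver's index / bus ID / serial mapping before each test, so a crash that blames "GPU 1" can be tied back to a specific physical card. A minimal sketch, assuming nvidia-smi is on the PATH (worth double-checking the field names against nvidia-smi --help-query-gpu):

# Dump the index / PCI bus ID / serial mapping for every card the driver can
# see, so render-crash logs can be matched to a physical card after swaps.
import subprocess

fields = "index,name,pci.bus_id,serial,uuid"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    print(line)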
 
GPUs 1 & 2 crashing, 3 & 4 not.

First question: is this reproducible? I.e., try again.

If it is, then the problem is likely GPU 1 and/or GPU 2.

If so, test:

GPU 1 and 3
GPU 1 and 4
GPU 2 and 3
GPU 2 and 4

I'm guessing one of those pairs should crash, which should identify either GPU 1 or GPU 2 as the faulty card.
 
It seems that 1 & 2 crash only when they're both in the machine (although I've only run that config once so far). When GPU 1 or GPU 2 is the only card in the machine, they don't seem to crash.

I have GPU 1 in there solo right now re-running the same render. I will try repeating the 1 & 2 combo tomorrow to see if that crashes.
 
What kind of PSU are you using in this system? Also, are you using solo runs or are you daisy-chaining the 8-pin power connectors?
 

I'm running a Corsair AX1600i and 2x solo runs to each GPU (not the optional Y-cables).

I just put GPU 1 & 2 back into the system together and the renders instantly crash again. So, these two cards appear to work perfectly when solo but crash when together. Hmmm.
 
I've seen several people around here have issues with their overcurrent protection being too aggressive. Have another PSU to try?
Experience with the 1200i on my test bench tells me that the OCP on those behaves like any other PSU's, in the sense that the system just shuts down if the OCP gets tripped.

It's supposed to be configurable, though, if you use their software. I haven't tried it since they got rid of Corsair Link, because the replacement (iCUE or something) sucks massively.
 
This PSU is set up as single rail, so any OCP hiccup would hit all cards concurrently.

Tried any PCIe Gen link bifurcation manipulation?

I have not. How would that work?

Small testing update:
I tried swapping the slots GPUs 1 & 2 were in. Still crashed.

I'm going to retest GPUs 3 & 4 to see if maybe I just got lucky before when they worked.
 
Also try some other combinations. For instance:

1 and 3
1 and 4
2 and 3
2 and 4

Operating under the assumption that you have one flaky card, you need to narrow down which card it is, and you currently only know that some combination of 1 and 2 causes the problem.
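If it helps with the bookkeeping, the elimination logic is simple enough to script. A toy sketch in Python (the example results below are made up to mirror what's been reported so far): a card is only a clear suspect if it appears in every failing combo and in no passing one, which, with the data so far, comes out inconclusive, hence the extra combinations.

# Toy elimination logic for pairwise GPU testing (example results are made up).
# A card is a clear suspect only if it appears in every failing combination
# and in no passing combination.
results = {
    frozenset({1, 2}): "fail",
    frozenset({3, 4}): "pass",
    frozenset({1}): "pass",
    frozenset({2}): "pass",
}

failing = [set(c) for c, r in results.items() if r == "fail"]
passing = [set(c) for c, r in results.items() if r == "pass"]

suspects = set.intersection(*failing) if failing else set()
suspects -= set().union(*passing) if passing else set()

print("Suspect GPU(s):", suspects or "inconclusive - test more combinations")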
 

Your board should have BIOS options to change the PCIe lane configuration, such as forcing the link speed (Gen 1/2/3) as well as bifurcation (how the lanes are split between slots).

I'd definitely play with the lane configuration via the BIOS to make sure it's not the board.
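One way to confirm what the slots actually negotiated after a BIOS change (rather than trusting the setup screen) is to query the link state from the driver. A minimal sketch, assuming nvidia-smi is on the PATH; note that the link can train down to Gen 1 at idle, so check it while a render is running:

# Report the PCIe link generation and width each card actually negotiated,
# so BIOS lane/bifurcation changes can be verified from inside Windows.
import subprocess

fields = "index,pci.bus_id,pcie.link.gen.current,pcie.link.width.current"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)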
 

3 & 4 are still assembled together at the moment (via the water bridge). To save time on swapping that in and out, I'm going to try testing 1 + 3&4 next, and 2 + 3&4 after that. Each of those will have two slot configurations tested as well.

Going to let the re-test of 3 & 4 run for a bit longer, but it's been going for an hour now with no errors. When the errors have happened in other tests, it is normally within the first 5 minutes.
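Since the failures show up within the first few minutes when they happen at all, a crude logger that snapshots every visible card on a short interval can capture the state right before a crash, even if the render process dies with it. A rough sketch, same nvidia-smi assumption as above (it only covers cards the driver can enumerate):

# Crude per-GPU logger: append a timestamped power / temperature / utilization
# snapshot for every visible card to a CSV every 5 seconds.
import subprocess, time

FIELDS = "timestamp,index,power.draw,temperature.gpu,utilization.gpu"

with open("gpu_log.csv", "a") as log:
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        log.write(out.stdout)
        log.flush()
        time.sleep(5)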
 
It was brute-forced, but I believe I have narrowed this down to GPU #2. While it seemed to work OK solo, it does not work with any combination of the other GPUs and slots.

Going to give the remaining three cards an overnight test just to quadruple-check that they're fine as long as #2 isn't there.
 
SMFH. System crashed overnight and now GPU 3 is a zombie that Nvidia Control Panel cannot see, Iray cannot see, but which seems to consume power during Iray renders.

I think I'd give up right about now if there were some third party I could hand this off to.
 
Swap it to a different slot and run it solo, see if it shows back up. Might have to DDU as well... I've had that happen once a long time ago.
 
I haven't split 3 & 4 apart yet, but I did test them with GPU 3 in slots 1, 2, and 3 with the same results. Did a full driver uninstall via DDU, rebooted, downloaded a fresh 442.92 installer, and... the installer stops most of the way through. The drivers and control panels are there, but for some reason the installer can't finish no matter how many reboots and retries I throw at it. Even tried a second DDU uninstall with no luck. The drivers see GPU 4, but they do not see GPU 3. Connecting a display to GPU 3 does nothing. Changing the slot that GPU 3 is in does nothing. But are you ready for the weird part?

Iray does not see the GPU, but the GPU pulls 200+W during Iray renders.
 
How do you know it's pulling 200 watts during renders? Does it show up in the device manager? If you plug a monitor into it, do you get a display? What happens if you try to run a game on it?

I know it's a pain in the ass, but I think it's time to pull off the water jackets and put the stock heatsinks back on so you can test these things one at a time.
 

GPU-Z reports both GPUs as pulling 200+W and the PSU reports total system draw of ~500W.

When I plug a monitor into it, I get no display. With no display, I cannot run a game on it.
 
Clearly, it's time to yank the jacket and RMA that card.

You may still need to troubleshoot the other three, if #2 is suspect as well, but two failures in so short a time says to me that maybe the problem is the motherboard or power supply. When you run a render, what does the power supply say that the 12V rail voltage is?
 

Would love to send it back, but it appears that Nvidia is no longer offering tech support to customers under the guise of "the virus." The current process seems to be that I send them an update, wait a few days, send a reminder that they still need to reply, wait a day, send another reminder, and then maybe get a response the following day. They've shut off their phone support entirely (looks like AMD's phone support is live though 🤷‍♂️).


The renders crashing has actually been an issue for months, but I've mostly been able to work around it until recently. GPU 3 going on strike is brand new though.

When I [used to] run renders on all four GPUs, the 12V rail would hold steady at 12.05-12.09V. It is currently doing the same, but with only two GPUs on it so I don't know if it might fall apart at higher loading.
 
Do you have any other systems to drop these cards into, Thunderdolt?

Any friends / family / coworkers / [H] brethren nearby?

Unfortunately, I do not.

Good news though: finally got an answer from Nvidia last night. Just need to send them serials for GPUs 2 & 3 to confirm warranty status.

More good news: My GPU 1 & 4 sanity check did not lead to any new iffy results. They were able to complete a 14hr render without any trouble.
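(Side note for the warranty paperwork: for any card the driver can still enumerate, the board serial can usually be read in software with nvidia-smi --query-gpu=index,name,serial --format=csv, which saves pulling a water jacket just to read a sticker. GPU 3 will presumably need the physical sticker since the driver can't see it.)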
 