Help me track down a [potentially] faulty GPU

I've been having stability problems with a trio of different renderers. All three crash and blame memory errors in their logs. All three software makers told me to reach out to Nvidia because that usually means bad hardware. Nvidia says they agree that this is probably due to bad hardware, but they need me to figure out which GPU is at fault and that has to be done via manual hardware swaps.

Since the GPUs are set up in pairs for the cooling, that's where I started. I seemed to get some signal initially and was pretty happy while thinking that I was about to solve the problem. Turns out that was premature. So, now I'm at a loss. Any suggestions on what to try next?

Here is what I've tried so far. Note that at no point has an NVLink bridge been present.
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: GPU 2
    • Slot 3: GPU 3
    • Slot 4: GPU 4
    • Result: Render crash (illegal memory access)
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: GPU 2
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render crash (illegal memory access)
  • Configuration:
    • Slot 1: GPU 3
    • Slot 2: GPU 4
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 4hrs)
  • Configuration:
    • Slot 1: GPU 1
    • Slot 2: Empty
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 8hrs)
  • Configuration:
    • Slot 1: GPU 2
    • Slot 2: Empty
    • Slot 3: Empty
    • Slot 4: Empty
    • Result: Render completes successfully (approx 8hrs)

Motherboard is Rampage VI Extreme, GPUs are Titan RTX, driver is the current Studio release (442.92 - though Nvidia tech support does not think this is driver-related).
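For anyone following along, a quick way to keep the bookkeeping straight while swapping cards is to dump the driver's index / bus ID / serial mapping before each test, so a crash that blames "GPU 1" can be tied back to a specific physical card. A minimal sketch, assuming nvidia-smi is on the PATH (worth double-checking the field names against nvidia-smi --help-query-gpu):

# Dump the index / PCI bus ID / serial mapping for every card the driver can
# see, so render-crash logs can be matched to a physical card after swaps.
import subprocess

fields = "index,name,pci.bus_id,serial,uuid"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    print(line)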
 
GPUs 1 & 2 crashing, 3 & 4 not.

First question: is this reproducible? I.e., try again.

If it is, then the problem is likely GPU 1 and/or GPU 2.

If so, test:

GPU 1 and 3
GPU 1 and 4
GPU 2 and 3
GPU 2 and 4

I'm guessing one of those pairs should crash, which should identify either GPU 1 or GPU 2 as the faulty card.
 
It seems that 1 & 2 crash only when they're both in the machine (although I've only run that config once so far). When GPU 1 or GPU 2 is the only card in the machine, they don't seem to crash.

I have GPU 1 in there solo right now re-running the same render. I will try repeating the 1 & 2 combo tomorrow to see if that crashes.
 
What kind of PSU are you using in this system? Also, are you using solo runs or are you daisy-chaining the 8-pin power connectors?
 

I'm running a Corsair AX1600i and 2x solo runs to each GPU (not the optional Y-cables).

I just put GPU 1 & 2 back into the system together and the renders instantly crash again. So, these two cards appear to work perfectly when solo but crash when together. Hmmm.
 
I've seen several people around here have issues with their overcurrent protection being too aggressive. Have another PSU to try?
Experience with the 1200i on my test bench tells me that the OCP on those behaves like any other PSU's, in the sense that the system just shuts down if the OCP gets tripped.

It's supposed to be configurable, though, if you use their software. I haven't tried it since they got rid of Corsair Link, because the replacement (iCUE or something) sucks massively.
 
This PSU is set up as single rail, so any OCP hiccup would hit all cards concurrently.

Tried any PCIe Gen link bifurcation manipulation?

I have not. How would that work?

Small testing update:
I tried swapping the slots GPUs 1 & 2 were in. Still crashed.

I'm going to retest GPUs 3 & 4 to see if maybe I just got lucky before when they worked.
 
Also try some other combinations. For instance:

1 and 3
1 and 4
2 and 3
2 and 4

Operating under the assumption that you have one flaky card, you need to narrow down which card it is, and you currently only know that some combination of 1 and 2 causes the problem.
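If it helps with the bookkeeping, the elimination logic is simple enough to script. A toy sketch in Python (the example results below are made up to mirror what's been reported so far): a card is only a clear suspect if it appears in every failing combo and in no passing one, which, with the data so far, comes out inconclusive, hence the extra combinations.

# Toy elimination logic for pairwise GPU testing (example results are made up).
# A card is a clear suspect only if it appears in every failing combination
# and in no passing combination.
results = {
    frozenset({1, 2}): "fail",
    frozenset({3, 4}): "pass",
    frozenset({1}): "pass",
    frozenset({2}): "pass",
}

failing = [set(c) for c, r in results.items() if r == "fail"]
passing = [set(c) for c, r in results.items() if r == "pass"]

suspects = set.intersection(*failing) if failing else set()
suspects -= set().union(*passing) if passing else set()

print("Suspect GPU(s):", suspects or "inconclusive - test more combinations")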
 

Your board should have BIOS options to change the PCIe lane configuration, such as forcing the link speed (Gen 1/2/3) as well as bifurcation (how the lanes are split between slots).

I'd definitely play with the lane configuration via the BIOS to make sure it's not the board.
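One way to confirm what the slots actually negotiated after a BIOS change (rather than trusting the setup screen) is to query the link state from the driver. A minimal sketch, assuming nvidia-smi is on the PATH; note that the link can train down to Gen 1 at idle, so check it while a render is running:

# Report the PCIe link generation and width each card actually negotiated,
# so BIOS lane/bifurcation changes can be verified from inside Windows.
import subprocess

fields = "index,pci.bus_id,pcie.link.gen.current,pcie.link.width.current"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)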
 

3 & 4 are still assembled together at the moment (via the water bridge). To save time on swapping that in and out, I'm going to try testing 1 + 3&4 next, and 2 + 3&4 after that. Each of those will have two slot configurations tested as well.

Going to let the re-test of 3 & 4 run for a bit longer, but it's been going for an hour now with no errors. When the errors have happened in other tests, it is normally within the first 5 minutes.
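Since the failures show up within the first few minutes when they happen at all, a crude logger that snapshots every visible card on a short interval can capture the state right before a crash, even if the render process dies with it. A rough sketch, same nvidia-smi assumption as above (it only covers cards the driver can enumerate):

# Crude per-GPU logger: append a timestamped power / temperature / utilization
# snapshot for every visible card to a CSV every 5 seconds.
import subprocess, time

FIELDS = "timestamp,index,power.draw,temperature.gpu,utilization.gpu"

with open("gpu_log.csv", "a") as log:
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        log.write(out.stdout)
        log.flush()
        time.sleep(5)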
 
It was brute-forced, but I believe I have narrowed this down to GPU #2. While it seemed to work OK solo, it does not work with any combination of the other GPUs and slots.

Going to give the remaining three cards an overnight test just to quadruple-check that they're fine as long as #2 isn't there.
 
SMFH. System crashed overnight and now GPU 3 is a zombie that Nvidia Control Panel cannot see, Iray cannot see, but which seems to consume power during Iray renders.

I think I'd give up right about now if there were some third party I could hand this off to.
 
Swap it to a different slot and run it solo, see if it shows back up. Might have to DDU as well... I've had that happen once a long time ago.
 
I haven't split 3 & 4 apart yet, but I did test them with GPU 3 in slots 1, 2, and 3 with the same results. Did a full driver uninstall via DDU, rebooted, downloaded a fresh 442.92 installer, and... the installer stops most of the way through. The drivers and control panels are there, but for some reason the installer can't finish no matter how many reboots and retries I throw at it. Even tried a second DDU uninstall with no luck. The drivers see GPU 4, but they do not see GPU 3. Connecting a display to GPU 3 does nothing. Changing the slot that GPU 3 is in does nothing. But are you ready for the weird part?

Iray does not see the GPU, but the GPU pulls 200+W during Iray renders.
 
How do you know it's pulling 200 watts during renders? Does it show up in the device manager? If you plug a monitor into it, do you get a display? What happens if you try to run a game on it?

I know it's a pain in the ass, but I think it's time to pull off the water jackets and put the stock heatsinks back on so you can test these things one at a time.
 

GPU-Z reports both GPUs as pulling 200+W and the PSU reports total system draw of ~500W.

When I plug a monitor into it, I get no display. With no display, I cannot run a game on it.
 
Clearly, it's time to yank the jacket and RMA that card.

You may still need to troubleshoot the other three, if #2 is suspect as well, but two failures in so short a time says to me that maybe the problem is the motherboard or power supply. When you run a render, what does the power supply say that the 12V rail voltage is?
 

Would love to send it back, but it appears that Nvidia is no longer offering tech support to customers under the guise of "the virus." The current process seems to be that I send them an update, wait a few days, send a reminder that they still need to reply, wait a day, send another reminder, and then maybe get a response the following day. They've shut off their phone support entirely (looks like AMD's phone support is live though 🤷‍♂️).


The renders crashing has actually been an issue for months, but I've mostly been able to work around it until recently. GPU 3 going on strike is brand new though.

When I [used to] run renders on all four GPUs, the 12V rail would hold steady at 12.05-12.09V. It is currently doing the same, but with only two GPUs on it so I don't know if it might fall apart at higher loading.
 
Do you have any other systems to drop these cards into, Thunderdolt?

Any friends / family / coworkers / [H] brethren nearby?

Unfortunately, I do not.

Good news though: finally got an answer from Nvidia last night. Just need to send them serials for GPUs 2 & 3 to confirm warranty status.

More good news: My GPU 1 & 4 sanity check did not lead to any new iffy results. They were able to complete a 14hr render without any trouble.
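(Side note for the warranty paperwork: for any card the driver can still enumerate, the board serial can usually be read in software with nvidia-smi --query-gpu=index,name,serial --format=csv, which saves pulling a water jacket just to read a sticker. GPU 3 will presumably need the physical sticker since the driver can't see it.)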
 