Video card issues - advice needed

Ultima99 · Dec 11, 2010

OK guys, I am in need of some third-party advice on a system issue I've been dealing with for a couple of months now.

First the detailed system specs:

PSU: BFG LS680, as reviewed here at the [H], purchased late Nov 2008
http://hardocp.com/article/2008/06/09/bfg_ls_series_power_supplies/8

CPU: Intel Core i7 920 Rev C0. Currently not OC'd, though it has been lightly OC'd in the past.

Heatsink: Cogage TRUE Spirit w/ stock fan.

Motherboard: Asus P6T Deluxe running current BIOS

RAM: Corsair TR3X6G1333C9 which is 3 x 2GB DIMMs.
http://www.corsair.com/_datasheets/tr3x6g1333c9.pdf - Corsair's PDF with specs on the kit as a whole
http://www.corsair.com/_datasheets/TW3X4G1333C9.pdf - the corresponding 4GB kit PDF, which has more info on the individual modules (which are exactly the same)

Other:
Creative X-Fi Titanium Fatal1ty
3 HDD
2 Optical drives
Dell Ultrasharp 2407 1920x1200 (main monitor, all gaming is on this one)
Dell Ultrasharp 2005 1680x1050
APC Back-UPS XS 1500 about 1 yr old
Win7 x64

Now for the video card details. (Note: I am not OC'ing these cards, and heat does not appear to be an issue with any of them.)
I bought an EVGA GTX260 (216) SSC Edition card late in December 2008. All was well, with no known issues, until around 2 months ago, when I decided to give Metro 2033 a try after seeing the talk about its great graphics. Previously I was a hardcore WoW raider and had never had any issues. I had also played Far Cry 2, Just Cause 2, Starcraft 2, and Civ5 without issue. Then I tried Metro and things changed, quickly. The crashes were very predictable. For those who have played it: at the very beginning, when you go to the surface with Miller, you put the gas mask on and go out into the hall. My system would crash coming up the stairs about 8 or 9 times out of 10. It was a hard lockup, and the screen would be a solid color. So to test my card's stability I ran Furmark, and it crashed within 30-90 seconds almost every time. WoW, SC2, and Civ5 were still fine.

So my thought process was that Metro is a game known to push cards hard, and it was finding faults in my card that none of my other games could expose. I submitted a ticket with EVGA, we went through the standard stuff like reinstalling drivers, none of which helped, and they approved me for an RMA: another GTX 260. I got this card and still had some crashes in Metro and Furmark from the start, although less frequently than with the other card, so I emailed EVGA to let them know. Later that day I got a phone call from a friendly guy named Mike, an RMA manager at EVGA, to discuss my problem. We covered lots of things, including system specs, and he decided to send me another card, this time a standard-clock GTX280 that he himself would stress test first.

I got the GTX280 and put the pedal to the metal right away, and Metro, Furmark, and everything else ran great. Awesome, my faith in my system and EVGA was restored, for about a week. Then I was having issues again, but of a very different nature this time. Texture tearing was really bad in SC2, similar to how this screenshot looks under where it says "Lost Coast", only worse.
http://www.playtool.com/pages/artifacts/background010000.jpg

WoW had some tearing and a sprinkling of off pixels that appeared to be "stuck", most noticeable on the login screen. Civ5 had terrible tearing. Most games were unplayable. Sometimes it crashed when tearing was present, sometimes not. Very frustrating.

I informed Mike of these new developments, and we emailed back and forth about my system, specs, and testing. We checked my PSU's 12V rail, which seemed to be fine, and my RAM timings and voltage, which were fine. The problems grew worse over time, and soon it seemed that anything that stressed the video card at all, including Flash 10.1 accelerated video, had a chance to crash it. I even did a fresh install of Win7 on a spare HDD, which made no difference. So he agreed to test and send me another GTX280.

This is the card I currently have. I've had it for about 3 weeks. It was perfect at first as well, but it started messing up a week ago. These errors are different again: no texture tearing like before, and no crashes with solid color screens like the cards before that. When this one crashes, it exhibits the signs of a video RAM issue, looking very much like this:

http://www.playtool.com/pages/artifacts/graphicscardmemory3.gif

Here are some things I have noticed about the issue, especially with the current card. Initially it was an odd crash here and there, but it became worse over time. It is by far most likely to crash when running a game right after boot or after coming out of sleep mode. Once I get a solid boot with no crashes, I can game away on it all day with no issues. However, if I put it into sleep mode when I go out or go to bed for the night, it is extremely likely to crash when I wake it. The same applies if I shut it down instead. One night I decided to perform a test: I had been gaming for a few hours without issue and let the machine run all night while I slept, and when I woke up I was able to run games fine with no crashing. I did try turning on ACPI 2.0 in the BIOS, as well as Repost Video BIOS on S3 Resume, just to see if it made a difference coming out of sleep mode. It has not changed anything.

Here is the testing I have done to qualify my system's stability.

CPU:
Prime95 is stable on all 3 standard torture tests (Small FFT, In-place Large FFT, and Blend) for over an hour on each. This has been tested many times, and my CPU temps are good. I am quite sure my CPU is fine.

RAM:
Memtest86+ 4.10 reports no issues with my RAM. I have let several passes run on individual sticks, 2 at a time in alternating slots, and all 3 at a time.

My RAM is correctly programmed by SPD. It is running at 1.7V which is plenty for triple channel. My current timings are 8-8-8-20 @ 1066MHz which is more conservative than how Corsair tests them @ SPD settings of: 7-7-7-20 at 1066MHz (per the 2nd pdf link above for my RAM). I am quite certain my RAM is fine.

PSU:
In the BIOS my 12V rail reads 12.30V. At idle in Windows it is 12.21V. The lowest I've ever seen it go under extreme load such as OCCT, Furmark, or Prime95 (max heat and power consumption) + Furmark is 11.88V. I don't think that is any cause for concern. I know PSUs are common culprits when a system is suspected of killing video cards. I'm looking for opinions on this or suggestions for better testing.

Additional PSU note: I am able to see the load in Watts for my system on my UPS's display. The highest I've seen was while running Prime95 + Furmark, where it registered 520 Watts. My main monitor is on the UPS as well and was on at the time, but I have since measured it at 60-65 Watts on average, meaning my system's peak load on the UPS was 455-460 Watts, which should not be a problem for my 680W PSU.
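
For anyone who wants to check the arithmetic above, here is a rough Python sketch of the power budget. The 520 W peak, the 60-65 W monitor draw, and the 680 W rating come from this post. The 80% efficiency figure is purely an assumption for illustration: the UPS reports AC draw on the input side of the PSU, while the PSU rating is DC output, so the real DC load is lower than the UPS number by whatever the PSU loses as heat.

```python
# Figures taken from the post
UPS_PEAK_W = 520        # UPS display during Prime95 + Furmark
MONITOR_W = (60, 65)    # measured monitor draw, low and high
PSU_RATING_W = 680      # BFG LS680 rated output

# Assumption for illustration only, not a measured value
ASSUMED_EFFICIENCY = 0.80


def system_ac_draw(ups_peak_w, monitor_w):
    """Return (low, high) AC draw of the PC alone, with the monitor subtracted."""
    mon_low, mon_high = monitor_w
    return ups_peak_w - mon_high, ups_peak_w - mon_low


def estimated_dc_load(ac_watts, efficiency):
    """Estimate the DC load on the PSU from AC draw and an assumed efficiency."""
    return ac_watts * efficiency


low, high = system_ac_draw(UPS_PEAK_W, MONITOR_W)
dc_high = estimated_dc_load(high, ASSUMED_EFFICIENCY)

print(f"PC AC draw: {low}-{high} W")
print(f"Estimated DC load at the high end: {dc_high:.0f} W")
print(f"Headroom vs rating: {PSU_RATING_W - dc_high:.0f} W")
```

Treat the result as a ballpark only, since it inherits whatever error the UPS display has.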

Motherboard & Other:
For the heck of it I have also run some other benchmarks stressing the subsystems, such as Sandra's benchmarks. No issues with any of this testing. I am unsure of good ways to specifically test my PCI-E slot, but my motherboard is the next most likely thing that could be causing issues (if it turns out to be my system).

With all the history out of the way, I will say this. I have wavered back and forth between blaming their cards and my system. All the tests I know to run do not show any issues with my system. My initial GTX260 card was fine for over a year and a half on everything short of Metro and Furmark, and even then everything else I ran was fine up to the time when I sent it in. It showed none of the visible degradation over time that I've seen with their RMA cards. Their RMA cards were making my system unusable so often that I swapped them out for an old Geforce 7600 card, which over this whole ordeal has spent a good deal of time in my system and has yet to exhibit any issues, even in the light gaming I have done on it (with much reduced settings, of course). Furmark is 100% stable on the 7600 card, btw.

I am also aware of where repaired/refurbished RMA cards come from. They are someone else's card that failed and was repaired by the mfg. I know from these very HardForums that video RAM issues can often be caused by microfractures in the solder on the RAM chips, and that the mfg will bake a card, similar to what some [H] posters have done, to reliquefy the solder and fill in the microfractures, making the connection good again.

This leads me to 2 possible conclusions:

1. Either my PSU or motherboard is slowly killing cards, but given the testing and background above I have lots of doubts that this is the case. Speculation and further testing advice are VERY welcome.

OR

2. The cards they are sending me are at fault. Keep in mind that both EVGA's and my initial stress testing showed a good card. However, in my use, which I imagine is like that of many of you, I either turn my PC off at night or put it into sleep mode. The point I am making is this: during their stress testing the card is subjected to one long burn-in period. In my case it is powered up in the morning, heats up during play, and cools down at night. Their testing did not account for this at all. Over a period of weeks, is the process of heating and cooling causing the solder joints to refracture on these cards that they claim to have repaired? Speculation on this point is also very welcome.

This problem is driving me nuts and conflicting evidence is making it hard to figure out.

Thanks in advance to all who read this. I'm hoping the collective knowledge and experience present in these forums can help me nail this problem for good.

Dangman · Dec 11, 2010

Ultima99 said:
PSU:
In the BIOS my 12V rail reads 12.30V. At idle in Windows it is 12.21V. The lowest I've ever seen it go under extreme load such as OCCT, Furmark, or Prime95 (max heat and power consumption) + Furmark is 11.88V. I don't think that is any cause for concern. I know PSUs are common culprits when a system is suspected of killing video cards. I'm looking for opinions on this or suggestions for better testing.

Additional PSU note: I am able to see the load in Watts for my system on my UPS's display. The highest I've seen was while running Prime95 + Furmark, where it registered 520 Watts. My main monitor is on the UPS as well and was on at the time, but I have since measured it at 60-65 Watts on average, meaning my system's peak load on the UPS was 455-460 Watts, which should not be a problem for my 680W PSU.

Yeah, BIOS and any software PSU voltage readers are inherently very inaccurate. You need to use a digital multimeter to correctly identify the PSU voltages.

Also, the wattage readings you're getting from the UPS should be taken with a grain of salt, as those tend to be ballpark-only readings. In some cases those devices can be off by as much as 100W or so.

Ultima99 said:
Their RMA cards were making my system unusable so often that I swapped them out for an old Geforce 7600 card, which over this whole ordeal has spent a good deal of time in my system and has yet to exhibit any issues, even in the light gaming I have done on it (with much reduced settings, of course). Furmark is 100% stable on the 7600 card, btw.

To be fair, that 7600 card draws nowhere near the amount of power that the other cards you mention draw. So it could be a situation where your low-power card is not hitting the threshold where the PSU starts exhibiting problems.

Ultima99 · Dec 11, 2010

Danny Bui said:
Yeah, BIOS and any software PSU voltage readers are inherently very inaccurate. You need to use a digital multimeter to correctly identify the PSU voltages.

Also, the wattage readings you're getting from the UPS should be taken with a grain of salt, as those tend to be ballpark-only readings. In some cases those devices can be off by as much as 100W or so.

To be fair, that 7600 card draws nowhere near the amount of power that the other cards you mention draw. So it could be a situation where your low-power card is not hitting the threshold where the PSU starts exhibiting problems.

Yeah, I know they aren't any kind of replacement for a multimeter and Kill-A-Watt, but I don't have access to those, and even so I do have over 200W of clearance between the PSU rating and what my UPS says.

I'm also aware that the 7600 is not a gaming card, but it's the only other card I have access to right now.

Thanks for the reply.

sirmonkey1985 · Dec 11, 2010

a kill-a-watt is about as accurate as the UPS reading, so you have to take that with a grain of salt as well. the problem isn't the UPS or KaW meters, it's the active PFC used in modern PSUs that makes them inaccurate.

in the two pictures those are obvious signs of video memory failing. the odd thing is it's happened with 3 different cards, which leads me to believe it's something else in the system causing it. another way of testing it is downclocking the video memory. it should take the stress off the video memory and stop those problems. if it doesn't, then the cards have completely failed, but something else in the system is obviously causing them to fail. the only other suggestion i have is to test these cards in a different system and see what happens.

as far as the powering on and off goes, it's possible, but i highly doubt that's the cause of the card itself failing on its own. the only way this would be possible is if you're playing, say, civ 5 or metro 2033, turning the game off, and powering down right after it. then it's possible the card hasn't had enough time to cool down. but again i doubt it, since the card should be within idle temps in 30-45 seconds.

mhenley · Dec 11, 2010

I would recommend trying a different power supply. Find a local retailer that has a good one with a 30-day return policy and no restocking fees. See if Mike can send one more card. Replace the power supply first and check system stability (not video), then install the replacement video card and stress it. If it fails within a month, return the power supply for a full refund and move on to the next item for troubleshooting.

If it ends up not being the PSU, I would question the motherboard next.

Out of curiosity, when you return the cards, does Mike test them again to verify that he gets the same issues that you did?

SoulCreatr · Dec 11, 2010

mhenley said:
I would recommend trying a different power supply. Find a local retailer that has a good one with a 30-day return policy and no restocking fees. See if Mike can send one more card. Replace the power supply first and check system stability (not video), then install the replacement video card and stress it. If it fails within a month, return the power supply for a full refund and move on to the next item for troubleshooting.

If it ends up not being the PSU, I would question the motherboard next.

Out of curiosity, when you return the cards, does Mike test them again to verify that he gets the same issues that you did?

I agree with this. I did it a while back with a PSU from a Circuit City before they were bought out. It showed mine was faulty, so I sent it back to Antec for a replacement unit. I returned the store-bought PSU after I got the replacement and had it working and stress tested. The symptoms you are describing sound like the video card is starving for juice, because the 7600 seems to play fine. So either the PSU can't handle the higher loads now (leaky caps or something?) or the motherboard is not getting enough juice to the card through the PCI-E slot. The PSU is the easier thing to replace in this case.

Ultima99 · Dec 12, 2010

sirmonkey1985 said:
a kill-a-watt is about as accurate as the UPS reading, so you have to take that with a grain of salt as well. the problem isn't the UPS or KaW meters, it's the active PFC used in modern PSUs that makes them inaccurate.

in the two pictures those are obvious signs of video memory failing. the odd thing is it's happened with 3 different cards, which leads me to believe it's something else in the system causing it. another way of testing it is downclocking the video memory. it should take the stress off the video memory and stop those problems. if it doesn't, then the cards have completely failed, but something else in the system is obviously causing them to fail. the only other suggestion i have is to test these cards in a different system and see what happens.

as far as the powering on and off goes, it's possible, but i highly doubt that's the cause of the card itself failing on its own. the only way this would be possible is if you're playing, say, civ 5 or metro 2033, turning the game off, and powering down right after it. then it's possible the card hasn't had enough time to cool down. but again i doubt it, since the card should be within idle temps in 30-45 seconds.

I did try underclocking the RAM some, I think around 10% which didn't help, but I might go ahead and try it at an even lower speed just to see.

I know what you mean with the 3 cards failing and all, but how then did my original 260 card work fine for a year and a half, only failing on Metro and Furmark? Everything but those 2 programs ran fine right up until I sent the card in for the RMA. I never had to replace it with the 7600 due to WoW or browser crashes. If it wasn't for this fact, I'd have replaced the PSU long ago.

As for the power cycling, sometimes when it's late I do shut it down pretty quickly after exiting a game. Perhaps I ought to change this behavior and go check email or something before shutting down, just in case.

Thanks for the reply.

Ultima99 · Dec 12, 2010

mhenley said:
I would recommend trying a different power supply. Find a local retailer that has a good one with a 30-day return policy and no restocking fees. See if Mike can send one more card. Replace the power supply first and check system stability (not video), then install the replacement video card and stress it. If it fails within a month, return the power supply for a full refund and move on to the next item for troubleshooting.

If it ends up not being the PSU, I would question the motherboard next.

Out of curiosity, when you return the cards, does Mike test them again to verify that he gets the same issues that you did?

Good idea, I will try this for sure. Too bad BFG is gone, which makes the warranty on my PSU useless.

And yes, I should have mentioned their testing results on the returned cards. I just forgot; with so much in that post, it was bound to happen.

On my original card (only ever failed on Metro and Furmark) they found no issues. Not sure if they tried those 2 programs or not.

On the first RMA card (similar to the first) they reported no issues.

On the 2nd RMA card (lots of texture tearing) he was able to reproduce the texture tearing in several games. (I had also tested this card in another system, with the same results.)

Thanks for the reply and the PSU idea.

Ultima99 · Dec 12, 2010

SoulCreatr said:
I agree with this. I did it a while back with a PSU from a Circuit City before they were bought out. It showed mine was faulty, so I sent it back to Antec for a replacement unit. I returned the store-bought PSU after I got the replacement and had it working and stress tested. The symptoms you are describing sound like the video card is starving for juice, because the 7600 seems to play fine. So either the PSU can't handle the higher loads now (leaky caps or something?) or the motherboard is not getting enough juice to the card through the PCI-E slot. The PSU is the easier thing to replace in this case.

Unfortunately my PSU is BFG and we all know what that means. So I'll just have to buy one and keep it if the current one is indeed bad.

Luckily Asus is still around in case it does come down to the motherboard.

Thanks for the reply.

btw, anyone have a comment on the reported minimum of 11.88V on the 12V rail? I know it might not be accurate, but if it is correct, would that be acceptable?

mhenley · Dec 12, 2010

Ultima99 said:
I did try underclocking the RAM some, I think around 10% which didn't help, but I might go ahead and try it at an even lower speed just to see.

I know what you mean with the 3 cards failing and all, but how then did my original 260 card work fine for a year and a half, only failing on Metro and Furmark? Everything but those 2 programs ran fine right up until I sent the card in for the RMA. I never had to replace it with the 7600 due to WoW or browser crashes. If it wasn't for this fact, I'd have replaced the PSU long ago.

As for the power cycling, sometimes when it's late I do shut it down pretty quickly after exiting a game. Perhaps I ought to change this behavior and go check email or something before shutting down, just in case.

Thanks for the reply.

The only reason I can see for it working in some games and not others is that not every game can push a video card to 100% load, which also means the highest possible power draw from the PSU.

Please post back with the results of your testing.

Ultima99 · Dec 12, 2010

A quick update:

I made some phone calls and found out someone I know has a multimeter I can borrow.

I've never tested a PSU in this manner before.

Can anyone who has done this give me some tips? Or maybe a link to a good guide?

Thanks.

Dangman · Dec 12, 2010

From the old and now defunct BFG PSU testing guide:

Using a multi-meter to check voltages

If you are experiencing problems with your computer system and would like to find out if the problem may be caused by a power supply with voltages that are out of specification, it is recommended to use a digital multi-meter (DMM).

First, make sure your DMM is capable of reading DC voltage and that it can read voltages to the hundredths place (aka a "resolution" of 0.01V). Find an unused power connector and insert the probes into the appropriate pins. Apply the DMM's black probe (-) to a ground wire. Black is always ground, and it doesn't matter which ground you use, as all grounds in the power supply terminate to the same location. Apply the DMM's red probe (+) to a colored wire that corresponds with the voltage you want to test. Red is +5V, yellow is +12V and orange is +3.3V.

Voltages are within specification if they are within 5% of the labeled voltage. For example: Tolerance for +12V would be 11.4V to 12.6V, +5V would be 4.75V to 5.25V, etc.
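
The 5% rule above is easy to turn into a quick check. This is just a small Python sketch of that rule, not part of the original BFG guide; the function names are made up for illustration.

```python
def spec_limits(nominal_v, tolerance=0.05):
    """Return (low, high) voltage limits for a rail at +/- tolerance."""
    return (round(nominal_v * (1 - tolerance), 3),
            round(nominal_v * (1 + tolerance), 3))


def in_spec(nominal_v, measured_v, tolerance=0.05):
    """True if the measured voltage falls inside the tolerance band."""
    low, high = spec_limits(nominal_v, tolerance)
    return low <= measured_v <= high


# Limits for the three rails named in the guide
for rail in (12.0, 5.0, 3.3):
    print(rail, spec_limits(rail))

# Lowest software reading reported earlier in this thread
print(in_spec(12.0, 11.88))
```

By this rule the 11.88V software reading would still be inside the band, assuming the reading itself can be trusted.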

Another way to probe for voltages is to take the 20+4-pin power connector from the motherboard, remove the +4 and use the exposed pins in the motherboard connector for your DMM's probe. The exposed pin directly below the red wire on the main power connector is another +5V. Below that is a common ground. On the next column of pins, the pin below the last yellow wire is another +12V. The pin below this is +3.3V.

Using a multi-meter to determine voltage regulation

Often voltage regulation is mistakenly reported as how much above or below the median value a power supply's voltage is (for example: the +12V being at +12.1V, or the +5V being at +5.1V, etc.)
Voltage regulation is actually a gauge of how much or, preferably, how little the voltages drop going from a low to a high load. If one were to think about this in context, it would actually be easier for a power supply manufacturer to put all of the unit's voltages at a higher than normal value, so even under load with poor voltage regulation the unit would still be over the median value.
To properly measure voltage regulation, a digital multi-meter should be used on a lead that has no other components on it that could potentially cause enough resistance to sway the results. First, you should measure the power supply's voltages while the PC is idle. In the BIOS, after any hard drives have spun up to full RPM, is a good idle load. Next, measure the voltages again while the PC is under load. The difference between the two values is what you're looking for. For example: if the voltage on the +5V only drops 0.05V, you are effectively witnessing 1% voltage regulation.
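
The regulation figure described above is just the idle-to-load drop divided by the nominal rail voltage. Here is a minimal Python sketch of that calculation, not part of the original guide. The 5.00V and 4.95V readings are made-up values chosen only to reproduce the guide's 0.05V example.

```python
def regulation_percent(idle_v, load_v, nominal_v):
    """Voltage drop from idle to load as a percentage of the nominal rail."""
    return (idle_v - load_v) / nominal_v * 100


# Hypothetical readings matching the guide's example of a 0.05V drop on +5V
idle_reading = 5.00
load_reading = 4.95
print(round(regulation_percent(idle_reading, load_reading, 5.0), 2))
```

A smaller number is better, and a rail that sits high at idle can still regulate poorly if it sags a lot under load.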

Ultima99 · Dec 12, 2010

Thanks a ton for the guide Danny.

I got the multimeter and have done some fairly extensive testing on it. I want to see what you guys think about these numbers. In the first set I also took 5V just to check the accuracy of the internal sensors.

BIOS Readings

BIOS
5V 5.094
12V 12.25

MultiMeter
5V 5.14
12V 12.37

It looks like the true voltage is actually higher than what is reported by the system internally.

Windows Idle Readings
HWMon 12.21
MM 12.40

Prime95 maxheat/power test
HWMon 12.16
MM 12.37

Furmark
HWMon 11.88-11.93 (it seems the internal readings go in 0.05 increments; it was always 11.88 or 11.93, never in between)
MM 12.23

Prime95 + Furmark
HWMon 11.88-11.93
MM 12.17-12.23
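
As a rough cross-check of the 12V readings above, here is a short Python sketch that compares the software (HWMon) numbers to the multimeter (MM) numbers and works out the idle-to-load drop. Where a range was reported, the lowest value is used, and the 11.4V to 12.6V band is the 5% tolerance from the BFG guide quoted earlier in the thread. The helper names are made up for illustration.

```python
# (HWMon reading, multimeter reading) in volts; lowest value of each range
readings = {
    "Windows idle":      (12.21, 12.40),
    "Prime95":           (12.16, 12.37),
    "Furmark":           (11.88, 12.23),
    "Prime95 + Furmark": (11.88, 12.17),
}

NOMINAL_V = 12.0
LOW_LIMIT, HIGH_LIMIT = 11.4, 12.6  # +/-5% of 12V


def sensor_offset(software_v, meter_v):
    """How far the software reading sits below the multimeter reading."""
    return round(meter_v - software_v, 2)


def regulation_percent(idle_v, load_v, nominal_v=NOMINAL_V):
    """Idle-to-load drop as a percentage of the nominal rail voltage."""
    return round((idle_v - load_v) / nominal_v * 100, 2)


for name, (sw, mm) in readings.items():
    ok = LOW_LIMIT <= mm <= HIGH_LIMIT
    print(f"{name}: offset {sensor_offset(sw, mm):+.2f} V, in spec: {ok}")

idle_mm = readings["Windows idle"][1]
worst_mm = min(mm for _, mm in readings.values())
print(f"Idle-to-load regulation: {regulation_percent(idle_mm, worst_mm)}%")
```

Note that the software readings run low by a different amount at each load level, so they can't simply be corrected with one fixed offset.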

I ran these tests with the GTX280. It seemed to be somewhat more agreeable today, meaning Furmark ran fine for 15 minutes at 1440x900 windowed with 2x AA. However, when I ran it fullscreen at 1920x1200, which is what I game at, it crashed within a minute. GPU temp never exceeded 86C.

Also, the wattage readout on my UPS showed I had my system up to 490 Watts while running Prime + Furmark @ 1440.

I also checked the AC voltage at the wall and coming out of my UPS. Both were fine, hovering at 119.9-120.2V.

Ultima99 · Dec 12, 2010

Now I am going to conduct RL gaming tests that will likely result in many crashes. Please let me know what you guys think about the info I just posted.

Dangman · Dec 12, 2010

PSU voltages seem to be in spec.

Ultima99 · Dec 12, 2010

Wow this is driving me nuts.

Nothing is crashing now. Nothing. Furmark? Fine. Metro? Fine. It's all working like it should be...

I took it out of the machine for a couple of days when it started getting really bad and put in the 7600. Now it's working, except for the 1 Furmark crash I had earlier when testing my PSU with the multimeter.

This could be how I am sending cards back to EVGA only for them to say 2 of the 3 were fine, but what the hell is going on here? If I leave this card in for a few days of use and shutdown at night will I be back to crash city in a week?

As Danny said, my PSU seems to be fine, but those are normally the first suspect when this stuff happens. Is my PSU really fine? Is there a good way to test my motherboard, as that is the next most likely culprit?

Dangman · Dec 12, 2010

The way I test mobos is usually Prime95 and Furmark at the same time. This loads most of the motherboard's subsystems. At least that's my theory anyway.

Ultima99 · Dec 12, 2010

Well, if it's my motherboard, then is it the PCI-E slot? I've run Prime + Furmark, but I need a way to distinguish a motherboard issue from a video card failure.

Dangman · Dec 13, 2010

Ultima99 said:
Well, if it's my motherboard, then is it the PCI-E slot? I've run Prime + Furmark, but I need a way to distinguish a motherboard issue from a video card failure.

Swapping out the motherboard is usually the best indicator of a mobo issue.

Ultima99 · Dec 13, 2010

Danny Bui said:
Swapping out the motherboard is usually the best indicator of a mobo issue.

Lol, that's the answer I didn't want to see.

Ultima99 · Dec 13, 2010

Called Asus and got an RMA approved for my motherboard.

Still considering replacing the PSU just to make sure, even though all the tests appear to come back with great results.

Dangman · Dec 13, 2010

Good luck with the RMA.

Ultima99 · Dec 14, 2010

Thanks.

Already tore down my system and it's going back tomorrow. Installed my old Athlon64 board for use while I wait on the replacement. I feel like I've gone back in time with 1.5GB of RAM.

I will say one thing left me extremely pleased. Windows 7 has again proven itself to me. I popped in the new board and powered up and it handled the change beautifully. Normally I am against swapping a motherboard without a reinstall but MS is really on the ball with this.

Dangman · Dec 14, 2010

Ultima99 said:
Thanks.

Already tore down my system and it's going back tomorrow. Installed my old Athlon64 board for use while I wait on the replacement. I feel like I've gone back in time with 1.5GB of RAM.

I will say one thing left me extremely pleased. Windows 7 has again proven itself to me. I popped in the new board and powered up and it handled the change beautifully. Normally I am against swapping a motherboard without a reinstall but MS is really on the ball with this.

Yeah Windows 7 is surprisingly very robust when it comes to mobo changes and such.

Ultima99 · Dec 14, 2010

Danny Bui said:
Yeah Windows 7 is surprisingly very robust when it comes to mobo changes and such.

Windows 7 is quite robust about most everything. Yet so many, even here on the hardforum cling to XP as if it was the OS their great great great grandfather brought over with them on the boat from Europe and it would be a dishonor to discard such a prized family heirloom.
