Anyone have massive EUEs in the last week?

Sunin

About 3/4 of my farm went down last week to EUEs, for no reason that I can see. Anyone know what could have happened? New work units or a new client?
 
The real reason our GPU clients hang?

"Customers are apparently shouting at graphics chip makers to do more about correcting memory systems.

One of the downsides of the current generation of GPUs is a lack of error correction, which causes problems for high performance users. According to HPC Wire, graphics chip vendors are aware of the problem and it appears to be only a matter of time before GPUs get a memory makeover.

In the good old days graphics processors didn't really need to be concerned with error-prone memory. No one really cared if a pixel's colour was off by a bit or two. (Or if the pixel showed up at all. sub.ed.). GPU makers did not bother with error corrected memory; however, with general-purpose computing on graphics processing units, otherwise known as GPGPU, it started to become crucial.

Once you start to use the GPU as a math accelerator and a memory bit flips in a data value, the computer becomes unreliable. The reason that general-purpose computing can be done on GPUs at all is that errors on standard graphics hardware are still rare. From a programming point of view the safest way to tackle the problem is to run the code twice, which is unfortunately a bit slow.

Patricia Harrell, AMD's director of Stream Computing, said that there was a need for more robust data protection in GPUs. Error corrected memory is a requirement for a number of customers, especially those looking to deploy GPUs at scale. She pointed out that although individual memory error rates are low, as you add more GPUs to the system and run applications for longer periods of time, the chances of hitting a flipped memory bit increase proportionally. The AMD FireStream 9270 board uses GDDR5 memory, so data protection is already in place at the memory interface in this product. The memory controller sends and receives data to and from the DRAM, and buffers the data locally while the DRAM calculates the integrity. If there is a problem the memory controller does the retry automatically.

Harrell said that AMD was taking a cautious approach to error correcting GPUs because you could end up with kit that is too big and hot. You also lose all the performance advantages GPGPU was originally intended for.

Andy Keane, general manager of the GPU computing business unit at Nvidia, said that his outfit would be doing something about the problem soon. ECC memory is a hard requirement in datacentres and so Nvidia has to build that kind of support into its roadmap. He was not sure how long it will take but Nvidia already has a pretty good idea of the timeline. A pretty good guess will be one to two years."
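
For anyone curious what "run the code twice" and "increases proportionally" look like in practice, here is a rough Python sketch. It is only an illustration, not how the folding clients actually work: `run_twice` and `prob_at_least_one_flip` are made-up names, the exact-equality check assumes a deterministic calculation, and the probability formula assumes bit flips are independent and arrive at a constant rate (a Poisson model). Any error rate you plug in is hypothetical, since the article gives no numbers.

```python
import math


def run_twice(compute, *args):
    """Run compute twice and accept the result only if both runs agree.

    Assumes compute is deterministic, so a mismatch points at a transient
    fault (for example a flipped memory bit) rather than normal variation.
    """
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise RuntimeError("results differ; possible memory error, redo the work")
    return first


def prob_at_least_one_flip(rate_per_gpu_hour, n_gpus, hours):
    """Chance of at least one bit flip across n_gpus over the given hours.

    Assumes independent flips at a constant rate per GPU-hour (Poisson).
    For small rates this is roughly rate * n_gpus * hours, which is the
    "increases proportionally" behaviour described in the article.
    """
    return 1.0 - math.exp(-rate_per_gpu_hour * n_gpus * hours)
```

Calling `run_twice(sum, [1, 2, 3])` just returns the sum, at double the runtime, which is the slowdown the article complains about. With the second function, doubling either the number of GPUs or the hours roughly doubles the chance of a flip as long as the overall probability stays small.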
 
I've been running smooth lately.

Except for a slight power outage this afternoon, but that is just my power company sucking like usual.

 
My GTX260 crashes at least every other day with the current 190.62 drivers. It wasn't a problem at all with the 185s. They worked well in Vista, but I haven't tried the old drivers in Win 7 yet. My 8800GTs never seem to have any problems.
 
Most of my GPUs are 8800GTs. They almost never have issues unless Stanford releases unstable WUs, which doesn't happen often. I'm still using the v177.89 drivers in Win XP. No errors in the past week except for one or two, and those weren't client or driver related.
 
All my GPUs seem to have been running well for the past several weeks. The only real issue I have seen lately was with the SMP clients.
 
I get them with the ATI client when it repeatedly downloads the same bad WU 5 times until it pauses. But that's only happened once in the past week. I think it's been about six months since my NV clients did that.
 