GPUGRID - Really?

wareyore

Tell me you don't want me to crunch anymore.


1707851159590.png
 
Oof. That don't seem like a whole lotta work. Admittedly I'm not very familiar with that particular project but still..128 WUs?
 
To bypass this, I fired up new instances and went back to the original host after 24 hours had elapsed. You've got to work around some of the idiosyncrasies that plague some of the BOINC projects. Yeah, they don't make it an easy set-and-forget thing. Also, this daily quota limit is usually imposed on hosts generating (significant) errors, at least on some projects. In GPUGRID's case, most errors are due to insufficient GPU memory.
 
Titan V sucks these days. 10% errors is what I saw yesterday. I haven't looked today.
 
Yep, if only greedy NVIDIA hadn't disabled one of the memory chips. Thanks a lot NVIDIA...
 
At least the P100 has 16GB and is chewing through these quantum chemistry tasks smoothly without any issue. ;)
 
I'm not sure I'd call it an Nvidia issue. I mean I have several NV GPUs crunching PG and after nearly 100k tasks collectively not a single error. On the other hand, my AMD VII Pro turned just over 10% error on E@H, that's a card with ECC memory on a board with ECC memory. It is what it is I guess. Just like life, it's a crap shoot lol.
 
I just mean that specifically on the Titan V, NVIDIA decided to disable one of the four HBM modules because it hurts their soul to give consumers fully functional hardware. On these WUs 16GB memory = zero errors, and 12GB memory = ~15-20% errors.
 
15% errors seems kind of high. In the GPUGRID forum someone reported about 5% errors on the Titan V. Maybe run nvidia-smi and check whether any other background processes are consuming GPU memory. See the screenshot below from my P100. Firefox consumes about 143MB. Not much, but worth stopping that process, for example. Can't do much about Xorg or gnome-shell.
1707922614469.png
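If you'd rather not eyeball the nvidia-smi table every time, here is a rough Python sketch of the same check. It assumes nvidia-smi is on the PATH and uses its `--query-gpu=name,memory.used,memory.total` CSV output; the function names (`parse_gpu_memory`, `free_mib`, `query_gpu_memory`) are my own, not anything from GPUGRID or BOINC. It only reports total memory in use per card, so it won't tell you which process is holding it.

```python
import subprocess

# Ask nvidia-smi for per-GPU memory figures as bare CSV (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into (name, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        rows.append((name, int(used), int(total)))
    return rows


def free_mib(row):
    """Return the memory not currently in use on one GPU, in MiB."""
    _, used, total = row
    return total - used


def query_gpu_memory():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Run it before BOINC picks up a task, e.g. `for row in query_gpu_memory(): print(row[0], free_mib(row), "MiB free")`, to see how much the desktop is already holding on each card.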
 
It idles at 86MB (or put another way, there are 86MB of processes that aren't the task python script).
 
It's an issue with their app... ATM Beta anyway; not sure if it's the same issue on the Quantum Chemistry one. Many of us have complained about it, especially on the WUs that run for a significant amount of time. I don't think the dev is capable of fixing the known issue. There are thread(s) on their forum about the number of errors.
 