GPUGRID - Really?

wareyore · Feb 13, 2024

Tell me you don't want me to crunch anymore.

Holdolin · Feb 13, 2024

Oof. That don't seem like a whole lotta work. Admittedly I'm not very familiar with that particular project, but still... 128 WUs?

pututu · Feb 13, 2024

wareyore said:
Tell me you don't want me to crunch anymore.

To bypass this, I fire up new instances and go back to the original host after 24 hours have elapsed. You have to work around some of the idiosyncrasies that plague some of the BOINC projects. Yeah, they don't make it a set-and-forget thing. Also, this daily quota limit is usually imposed on hosts generating (significant) errors, at least on some projects. In GPUGRID's case, most errors are due to insufficient GPU memory.

wareyore · Feb 13, 2024

pututu said:
To bypass this, I fire up new instances and go back to the original host after 24 hours have elapsed. You have to work around some of the idiosyncrasies that plague some of the BOINC projects. Yeah, they don't make it a set-and-forget thing. Also, this daily quota limit is usually imposed on hosts generating (significant) errors, at least on some projects. In GPUGRID's case, most errors are due to insufficient GPU memory.

Titan V sucks these days. 10% errors is what I saw yesterday. I haven't looked today.

EXT64 · Feb 13, 2024

Yep, if only greedy NVIDIA hadn't disabled one of the memory chips. Thanks a lot NVIDIA...

pututu · Feb 14, 2024

At least the P100 has 16GB and is chewing through these quantum chemistry tasks smoothly without any issue.

Holdolin · Feb 14, 2024

I'm not sure I'd call it an Nvidia issue. I mean I have several NV GPUs crunching PG and after nearly 100k tasks collectively not a single error. On the other hand, my AMD VII Pro turned just over 10% error on E@H, that's a card with ECC memory on a board with ECC memory. It is what it is I guess. Just like life, it's a crap shoot lol.

EXT64 · Feb 14, 2024

I just mean that specifically on the Titan V, NVIDIA decided to disable one of the four HBM modules because it hurts their soul to give consumers fully functional hardware. On these WUs 16GB memory = zero errors, and 12GB memory = ~15-20% errors.

pututu · Feb 14, 2024

EXT64 said:
I just mean that specifically on the Titan V, NVIDIA decided to disable one of the four HBM modules because it hurts their soul to give consumers fully functional hardware. On these WUs 16GB memory = zero errors, and 12GB memory = ~15-20% errors.

A 15% error rate seems kind of high. On the GPUGRID forum someone reported about 5% errors on the Titan V. Maybe run nvidia-smi and check whether any other background processes are consuming GPU memory. In the screenshot from my P100, Firefox consumes about 143MB. Not much, but worth stopping that process, for example. Can't do much about Xorg or gnome-shell.
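If you'd rather script the check than eyeball it, here is a rough sketch that asks nvidia-smi for used and total memory per GPU. The function names are made up for illustration, and the sample numbers in the comments are just examples, not readings from anyone's card. It only reports totals, so you'd still need the regular nvidia-smi process table to see which process is holding the memory.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Turn 'used, total' lines (values in MiB) into a list of (used, total) ints.

    Hypothetical helper: expects the output of
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    """
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def free_memory(rows):
    """Free MiB per GPU, computed as total minus used."""
    return [total - used for used, total in rows]
```

Run `query_gpu_memory()` with BOINC suspended to see how much memory is already taken before a task starts, then pass the result to `free_memory()` to see what is left for the WU.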

EXT64 · Feb 14, 2024

It idles at 86MB (or put another way, there are 86MB of processes that aren't the task python script).

bluestang · Feb 14, 2024

It's an issue with their app... the ATM Beta one anyway; not sure if it's the same issue on the Quantum Chemistry one. Many of us have complained about it, especially about the WUs that run for a significant amount of time. I don't think the dev is capable of fixing the known issue. There are thread(s) on their forum about the number of errors.
