World Community Grid

SmokeRngs

[H]ard|DCer of the Month - April 2008
Joined
Aug 9, 2001
Messages
17,134
Lol, a bird in the hand is worth two in the bush. Also, have you tried running more than two in Windows? I would keep increasing the number of tasks until you reach 100% GPU usage in Afterburner; you should get into that 150 range you're talking about and get some good points. I mean, I assume good points since I don't have an RX570, lol. Best utilization, maybe.
In Windows, the power usage didn't change whether I was running a single WU or two. The only difference was that there was very little downtime on the GPU between processing. When a work unit was processing on the GPU, the GPU was at 100% utilization, and because I was running more than one work unit, the GPU wasn't idle (or at least not often) while the CPU portion was being processed. There are two possible explanations for this: either the Polaris architecture is more efficient at running these, or the card simply isn't powerful enough to avoid running at 100% utilization on these work units. It's exactly the same under Manjaro. That's why I haven't bothered to set it up to run a third work unit simultaneously: when crunching a work unit the GPU is already at 100% usage, and with two work units running, GPU downtime during the CPU portion is kept to a minimum.
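
In case anyone wants to set up the same thing: running two work units per GPU in BOINC is typically done with an app_config.xml in the project directory. Here's a minimal Python sketch that writes one; the GPU app name ("opng") and the project path are assumptions, so check your own client's event log and app list before using it.

```python
# Minimal sketch: write a BOINC app_config.xml that runs two GPU work units
# at once (gpu_usage 0.5 means each task claims half a GPU).
# Assumptions: the project directory path and the GPU app name ("opng") are
# examples only -- check your BOINC event log / the project's app list first.
from pathlib import Path

APP_CONFIG = """<app_config>
  <app>
    <name>opng</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

# Typical Linux data directory; adjust for your install (and run with
# sufficient permissions to write there).
project_dir = Path("/var/lib/boinc-client/projects/www.worldcommunitygrid.org")
(project_dir / "app_config.xml").write_text(APP_CONFIG)
print("Wrote app_config.xml; use 'Read config files' in the BOINC manager to apply.")
```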

That said, the points look good enough. The GPU points are blowing away what my 5800X and 2600X have been doing combined on CPU. It will take a few days for the points output to stabilize, but so far I've seen a 400k+ point increase at the halfway mark for the day's stats, and the GPU wasn't crunching anything for several hours of that half day. Not bad for an extra 50W-55W of power over an idle GPU (idle is usually around 32W-33W for the GPU, and it's currently using 85W-89W while crunching).
 

SmokeRngs

[H]ard|DCer of the Month - April 2008
Joined
Aug 9, 2001
Messages
17,134
Looks like with less than a full day of GPU crunching (plus several hours of downtime while gaming), a lowly RX570 is putting out about 1.3 million points. After things stabilize, and maybe with a bit less gaming, I figure it could be doing around 1.5-1.6 million a day. Normal output for the 2600X and 5800X combined is around 200k-220k per day.
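
For what it's worth, here's the rough math on what that works out to per unit of extra power, using the figures above. This is back-of-the-envelope only: ~53W is the midpoint of the 50W-55W delta over idle, and 1.5 million points/day is the projected, not yet measured, output.

```python
# Back-of-the-envelope only, using the figures quoted above.
# Assumptions: ~53 W is the midpoint of the 50W-55W delta over idle, and
# 1.5 million points/day is the projected (not yet measured) daily output.
incremental_watts = 53
points_per_day = 1_500_000

kwh_per_day = incremental_watts * 24 / 1000
print(f"Extra energy per day: {kwh_per_day:.2f} kWh")
print(f"Points per kWh of incremental power: {points_per_day / kwh_per_day:,.0f}")
```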
 

pututu

[H]ard DCOTM x2
Joined
Dec 27, 2015
Messages
2,064
May 2021 update: OpenPandemics

27 May 2021

Summary

The project has added GPU power to the existing strong CPU power that supports research for potential COVID-19 treatments.



Background
OpenPandemics - COVID-19 was created to help accelerate the search for potential COVID-19 treatments. The project also aims to build a fast-response, open source toolkit that will help all scientists quickly search for treatments in the event of future pandemics.
In late 2020, the researchers announced that they had selected 70 compounds (from an original group of approximately 20,000) that looked promising enough to be investigated as potential inhibitors of the virus that causes COVID-19. Lab testing is currently underway for 25 of these compounds.
GPU work units
We recently completed beta testing and have released GPU work units for this project. Currently, the project is sending out 1,700 new work units every 30 minutes. We expect to be sending out GPU work at this pace for the foreseeable future.
We will continue to create and release regular work units that use CPU power. This will help keep the work going at a good pace, and will ensure that everyone who wants to contribute computing power can participate.
Stress test of World Community Grid's technical infrastructure
Earlier this month, the World Community Grid tech team wanted to determine the upper limit of computational power for the program, and to find out if the current infrastructure would be able to support the load if we provided enough GPU work to meet the demand.
The scientists for OpenPandemics - COVID-19 provided us with approximately 30,000 batches of GPU work (equal to the amount of work done in about 10 months by CPUs), and we let these batches run until they were fully processed.
The stress test took eight days to run, from April 26 through May 4, 2021. Thank you to everyone who participated in this important test. We expect to have a forum post from the tech team soon to summarize what they learned about World Community Grid's current capabilities and limitations.
Current status of work units
CPU
  • Available for download: 1,322 batches
  • In progress: 6,240 batches
  • Completed: 44,810 batches
    5,596 batches in the last 30 days
    Average of 186.5 batches per day
  • Estimated backlog: 7.1 days*

    *The research team is building more work units.
GPU
  • In progress: 2,391 batches
  • Completed: 37,569 batches
    35,296 batches in the last 30 days
    (largely due to the stress test)
    Average of 1,176.5 per day
    (again, largely due to the stress test)
 

pututu

[H]ard DCOTM x2
Joined
Dec 27, 2015
Messages
2,064
OpenPandemics GPU stress test (April) update: the WCG team identified bottlenecks in their infrastructure and discusses potential future changes to enhance overall performance.

OpenPandemics GPU Stress Test

Background
In March 2021, World Community Grid released a GPU version of the Autodock research application. Immediately, there was strong demand for work from volunteer machines; in fact, demand was considerably higher than the supply of GPU work units.

The World Community Grid tech team wanted to determine the upper limit of computational power for the program, and to find out if the current infrastructure would be able to support the load if enough GPU work was provided to meet the demand.

Additionally, the OpenPandemics - COVID-19 scientists and their collaborators at Scripps Research are exploring promising novel target sites on the spike protein of the SARS-CoV-2 virus that could be vulnerable to different ligands, and they were eager to investigate this target as quickly and thoroughly as possible. They provided World Community Grid with approximately 30,000 batches of work (equal to the amount of work done in about 10 months by CPUs), and we let these batches run until they were fully processed.

The stress test took 8 days to run, from April 26 through May 4, 2021.

The results outlined below represent World Community Grid's current technical capabilities. This information could help active and future projects make decisions about how they run work with us, keeping in mind that they have varying needs and resources.

Summary
The key findings of the stress test revealed the following points:
  • We had previously determined that in 2020, from CPUs alone, the volunteers contributing to World Community Grid delivered computing power similar to a cluster of 12,000 computing nodes running at 100% capacity 24x7 for the entire year, where each node contains one Intel Core i7-9700K CPU @ 3.60GHz. We can now further state that the volunteers are able to provide an additional 8x that computing power from GPUs.
  • The current World Community Grid infrastructure is able to meet the load generated by this computing power with this particular mix of research projects. However, the infrastructure was pushed to its limit, and any further growth or possibly a different mix of research projects would require increased infrastructure.
  • The OpenPandemics - GPU workunits consisted of many small files, which created high IO load on both the volunteers' computers and the World Community Grid infrastructure. Combining these small files into a few larger files may reduce the IO load on both sides; this change would likely allow the infrastructure to handle a greater load from the volunteers and improve the volunteer experience (see the sketch after this list).
  • At the back end of the pipeline, backing up the data and sending results to the Scripps server does not appear to be a bottleneck. However, running OpenPandemics at a higher speed would force the research team to spend the majority of their time and energy preparing input data sets and archiving returned data rather than analyzing the results and moving the interesting ones to the next step in the pipeline. As a result, the project will remain at its current speed for the foreseeable future.
  • Now that we are able to quantify the capabilities of World Community Grid, scientists can use this information as a factor in their decision-making process in addition to their labs' resources and their own data analysis needs.
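
As an illustration of the small-files point above (a sketch only, not World Community Grid's actual tooling; directory and file names are hypothetical), bundling a workunit's many small inputs into one compressed archive looks roughly like this:

```python
# Illustrative sketch only (not World Community Grid's actual tooling):
# pack a workunit's many small input files into a single compressed archive
# so it ships as one file, cutting per-file IO and filesystem metadata load.
# Directory and file names here are hypothetical.
import tarfile
from pathlib import Path

def bundle_inputs(job_dir: str, archive_path: str) -> int:
    """Pack every file under job_dir into one gzip-compressed tar archive."""
    files = [p for p in sorted(Path(job_dir).rglob("*")) if p.is_file()]
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=str(f.relative_to(job_dir)))
    return len(files)

# Example usage: thousands of small ligand/receptor inputs become one download.
# n = bundle_inputs("batch_0001_inputs", "batch_0001_inputs.tar.gz")
# print(f"Bundled {n} files into one archive")
```
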
Bottlenecks identified
During the test, there were three major points at which the system became unstable until we identified and resolved the bottleneck.

Prior to Launch
Before the launch of the stress test, while we were creating the individual workunits to send to volunteers, we exhausted the available inodes on the filesystem. This prevented new files or directories from being created, which caused an outage for our back-end processes and prevented results from being uploaded from volunteer machines. We resolved this issue by increasing the maximum number of inodes allowed, and we then added a monitor to warn us if we start approaching the new limit.
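
A minimal sketch of the kind of monitor described here (illustrative only, not the actual implementation; the mount points and threshold are placeholders):

```python
# Illustrative sketch of such a monitor (not the actual implementation):
# warn when a filesystem's inode usage crosses a threshold.
# The mount points and threshold below are placeholders.
import os

def inode_usage(path: str) -> float:
    """Return the fraction of inodes in use on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode counts
        return 0.0
    return 1.0 - (st.f_ffree / st.f_files)

WARN_AT = 0.85  # example threshold

for mount in ("/", "/data"):
    try:
        used = inode_usage(mount)
    except OSError:
        continue  # mount point does not exist on this machine
    status = "WARNING" if used >= WARN_AT else "ok"
    print(f"{mount}: inode usage {used:.0%} ({status})")
```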

Launch
Shortly after releasing the large supply of workunits, we experienced an issue where the connections from our load balancer to the back-end servers reached their maximum configured limits and blocked new connections. This appears to have been caused by clients that opened connections and then stalled out, or downloaded work very slowly. We implemented logic in the load balancer to automatically close those connections. Once this logic was deployed, the connections from the front end became stable and work was able to flow freely.
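
As a generic illustration of the idea only (this is not the actual HAProxy configuration), a server can protect its connection slots by closing any connection that makes no progress within a timeout:

```python
# Generic illustration of the idea only (not the actual HAProxy setup):
# a server that closes any connection whose client makes no progress
# within a timeout, so stalled or very slow clients cannot pin
# connection slots indefinitely. Host, port, and timeout are examples.
import asyncio

IDLE_TIMEOUT = 30  # seconds without progress before the connection is dropped

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    try:
        while True:
            chunk = await asyncio.wait_for(reader.read(4096), timeout=IDLE_TIMEOUT)
            if not chunk:           # client closed cleanly
                break
            writer.write(chunk)     # placeholder for real request handling
            await writer.drain()
    except asyncio.TimeoutError:
        pass                        # stalled client: fall through and close
    finally:
        writer.close()
        await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```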

Packaging ramps up
The next obstacle occurred when batches started to complete and packaging became a heavy load on the system. Several changes were made to address this:

  • The process of marking batches as complete, which starts the packaging process, originally ran only every 8 hours. We changed this so that batches are marked complete and packaged every 30 minutes.
  • Our clustered filesystem had configuration options that were sub-optimal. We researched how the configuration could be improved to increase performance and then made those changes.

Following these changes, the system was stable even with packaging and building occurring at a high level. The time to package a batch dropped from 9 minutes to 4.5 minutes, and the time to build a batch dropped by a similar amount. Uploads and downloads performed reliably as well. However, we were only able to run in this modified configuration during the final 12 hours of the stress test; it would have been useful to run with these settings for longer to confirm that they result in a stable system over an extended period of time.
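
A toy sketch of the cadence change described above (not the real pipeline; the function body is a placeholder): run the mark-complete-and-package pass every 30 minutes instead of every 8 hours.

```python
# Toy sketch of the cadence change described above (not the real pipeline;
# the function body is a placeholder): run the mark-complete-and-package
# pass every 30 minutes instead of every 8 hours.
import time

PASS_INTERVAL = 30 * 60  # seconds; the pass previously ran every 8 hours

def mark_and_package_completed_batches():
    """Placeholder: find batches whose results are all in and package them."""
    print("checking for completed batches...")

def run_forever():
    while True:
        started = time.monotonic()
        mark_and_package_completed_batches()
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, PASS_INTERVAL - elapsed))

# run_forever()
```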

Potential future changes to further enhance performance

Clustered filesystem tuning

The disk drives that back our clustered filesystem only reached about 50% of their rated throughput. It is possible that the filesystem configuration could be tuned further to increase performance. This is not certain, but if the opportunity exists to engage an expert with deep experience optimizing high-performance clustered filesystems built on IBM Spectrum Scale, this could be a worthwhile avenue to explore.


Website availability
We identified an issue where high IO load on the clustered filesystem causes problems with the user experience and performance of the website. These two systems are logically isolated from each other but share physical infrastructure due to the system design. This degradation of website performance should not have happened, but it clearly did. We want to determine for certain why this issue exists, but at this time we believe it stems from the way our load balancer, HAProxy, is configured.

We have HAProxy running as a single instance with one front end for all traffic, passing data back to multiple back ends for each of the different types of traffic. We could instead run multiple HAProxy instances on the same system, provided there is a separate IP address for each instance to bind to. If we ran one instance for website traffic and a second instance for all BOINC traffic, we expect that website traffic would perform reliably even when the BOINC system is under heavy load.


Thank you to everyone who participated in the stress test.
 

AgrFan

[H]ard DCOTM October 2012
Joined
Sep 29, 2007
Messages
545
No, this does not include resends. Final tasks will be sent out towards the end of July. MIP should be finished by July 31st.
 