
WCG?

WCG is back! I've got over 8000 MCMs to validate that are backlogged, but it's running smoothly again. The latest:

December 4, 2025

  • BOINC feeder/scheduler reporting "tasks committed to other platforms" is resolved - details are further down about the resolution and future plans to keep this issue from coming back.
  • Validation Backlog has begun for workunits that were held over the break, and workunits that fell through our new validation logic unvalidated. We intend to ramp up these passes in the coming days, and will report on progress and project expected dates for fully backfilling all such cases and finally catching up validations to in flight work next week, now that we know our scripting works to backfill validations.
  • We will not restart the file_deleter or db_purge BOINC services until we have validated every file we possess that was uploaded before/after the break, including sending resends for some cases of "orphans".
  • What was the workaround for the feeder/scheduler blockage due to hr_class mismatch between results for the same workunit? The resolution we chose for now was simply to purge stale feeder entries, effectively resetting their hr_class (homogeneous redundancy class) to 0 and allowing any host/platform to download the result if the result sits in memory for too long. The feeder can be started with a CLI option that specifies how long a result may occupy a slot before the feeder takes this course.
  • What does resetting hr_class=0 as a workaround accomplish? The hr_class=0 reset matches the value assigned to fresh workunit results being sent out for the first time, essentially telling the scheduler that any host/platform may claim and compute this result (i.e., _0 and _1 results have hr_class=0, while resends consult the hr_class of the host that has already reported results). There is some computational overhead: for purged resends that had their hr_class reset to 0, a second tier of validation is required to confirm that the exact gene signatures and their scores are "the same" between results computed on different platforms. We intend to disable hr_class (homogeneous redundancy) completely for MCM1 at some point in the future and instead rely directly on this currently secondary validation, recording the delta between exact scores and verifying that equivalent gene signatures were found for results sent to different platforms, to ensure as a rule that they are within a reasonable error bound/tolerance.
  • Does this workaround affect the integrity of MCM1 results? No, but it does introduce new edge cases to account for. The score can vary within the upper and lower bound of possible floating point error between platforms for the same workunit. Ensuring that the floating point calculations are not different enough to invalidate the computational result is a vastly easier problem when using the hr_class mechanism. However, because MCM1 produces a list of genes as well as a score, the only additional validation criteria we incur by disabling hr_class are ostensibly the "score is just below the threshold on this system" exclusion and the "score is just above the threshold on this system" inclusion, for specific signatures very close to the configured threshold. In these cases, we can take the union of these additional results slightly above or below the threshold score, between all results for a workunit, provided the rest of the results above the threshold are equivalent.
  • Why have hr_class at all for MCM1 then? Indeed. We intend to track the above cases, and any other validation failures where we can discern an unforeseen effect of allowing resends to go to different platforms. We also intend to try this "disable hr_class if the feeder gets stuck" system for MAM1, which does have a numerical optimization routine to explore the signature search space that could change the actual signatures under test due to floating point error, and so may not be a good candidate for this (and yet the calculations are valid, so any reasonable overlap or a "canary" or "spike-in" validation system might be considered sufficient validation...). If we are satisfied with the outcome of post-processing results that came from different platforms, we can disable it. This will accelerate throughput and discovery for MCM1 and possibly MAM1 while buying time to resolve this issue more permanently for applications such as ARP1 that this thinking does not apply to, where the floating point calculations must be byte-wise equivalent between results or the result is simply invalid. Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts are the only source of this hr_class confusion bug, and possibly of the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix.
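
For anyone trying to picture the feeder workaround described in the December 4 update, here is a minimal Python sketch of the idea. It is not WCG's or BOINC's actual feeder code (the real feeder is part of the C++ BOINC server); the `Slot` class, the `purge_stale_slots` function and the `max_age_s` parameter are hypothetical names made up for illustration. It only shows the rule as described: a result that has waited in a feeder slot longer than a configured time has its hr_class reset to 0 so any platform can claim it.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One result waiting in the feeder's slot array (hypothetical model)."""
    result_name: str
    hr_class: int       # 0 means "any host/platform may take this result"
    entered_at: float   # time in seconds at which the result entered the slot


def purge_stale_slots(slots, now, max_age_s):
    """Reset hr_class to 0 for results that have occupied a slot too long.

    Returns the names of the results that were reset.
    """
    reset = []
    for slot in slots:
        age = now - slot.entered_at
        if slot.hr_class != 0 and age > max_age_s:
            slot.hr_class = 0
            reset.append(slot.result_name)
    return reset


slots = [
    Slot("wu_a_0", hr_class=0, entered_at=0.0),     # fresh result, already open to all
    Slot("wu_b_2", hr_class=3, entered_at=100.0),   # resend stuck waiting for one platform
    Slot("wu_c_2", hr_class=5, entered_at=3500.0),  # resend that only just arrived
]
reset = purge_stale_slots(slots, now=3600.0, max_age_s=1800.0)
print(reset)  # ['wu_b_2']
```

Only the resend that has been waiting longer than the limit is opened up; the recently queued resend keeps its hr_class and still waits for a matching platform.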
 
I still have a backlog of work from before November 7th awaiting validation. My plan is to switch back after they're all validated and the purge process is operational. My three Dell boxes are getting 130K+ PPD running FAH CPU work. That's good enough for me.
 
  • December 15, 2025
    • Forum service restored. Degraded service starting at roughly 03:00 UTC on December 15th, 2025 led to a crash at roughly 12:30 UTC the same day, and service was restored at approximately 20:00 UTC. We have seen this before: database connections wait indefinitely, or for so long that the thread pool eventually reaches its maximum, causing an OOM kill of the forum application under WAS. This is the meaning of the "ForumUserServlet unable to initialize" message displayed while the application was down. Previously, poor parameterization of the thread pool under WAS caused connections to the database to stay open instead of timing out under lock contention. We thought we had ameliorated the issue on the WAS side, but clearly this needs another look.
    • At the very least, we will deploy alerts for the specific WARN and ERROR messages logged by WAS leading up to the crash. These should provide a window of many hours within which we can fix this manually in the future, before the forum application crashes, pending a confirmed fix or workaround (e.g. better parameters and logic for managing the thread pool).
    • After announcing we would be accelerating validations "in the coming days", validations stalled again last week. Unfortunately, we continue to face issues with MCM1 validations. There are multiple categories of missed validations: orphaned "singles", one or both results mis-routed, incorrectly invalidated result pairs, a missing resend condition, and now a floating point tolerance that is too stringent for hr_class reset workunits. The hr_class reset was the workaround to the impossible platform logjam issue, at the expense of having to validate workunits based on similarity of scientific results within a tolerance instead of strict equivalence of scientific results. The concept of validity for MCM1 became result pairs that have equivalent gene signature membership for signatures above the threshold score, and an equivalent list of gene signatures above the threshold score, and a similar score within a configurable error bound passed to the validator on startup, when the two workunits run with the same parameters and random seed value. These cases should therefore be validated by the secondary validator, which we subscribed to the "validation failure" queue downstream of the primary, checksum-based validator. However, our tolerance for floating point error was too stringent, so we will be replaying the failure queue from an earlier offset to catch these cases for recently resent workunits.
    • Regarding our approach to crediting workunits held during the downtime by scanning the filesystem and checking the database: the process began last week after we indexed the locations of result files for all workunits across all filesystems on the backend, so that validations involving file transfers could avoid walking remote filesystems and simply fetch the required remote result from wherever it had been uploaded or archived. Initial testing suggested we would catch most if not all missed validations using this approach, though the scripts running on each worker node would have to run for some time. Early on we thought the timestamp-based approach was perhaps just passing through periods with few missing validations of any kind, but clearly we are not making the expected progress. We are reviewing logs and stats on what has been processed so far to figure out what we missed, and how to adjust. Some validations have occurred for each case, just nowhere near the throughput/hit rate we projected. So, we are tentatively hopeful we can fix this quickly and finally start making a dent.
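
The tolerance-based notion of validity described in this update (and the near-threshold union from the December 4 notes) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the WCG validator: the data layout (a dict mapping a signature, stored as a frozenset of gene names, to its score), the function names, the placeholder gene names and all the numbers are made up. Scores are compared with an absolute tolerance, and a signature reported above the threshold by only one host is accepted only when it is borderline.

```python
import math


def above_threshold(result, threshold):
    """Signatures whose score meets the threshold in one result."""
    return {sig for sig, score in result.items() if score >= threshold}


def results_agree(a, b, threshold, tol):
    """Compare two results for the same workunit within a score tolerance."""
    above_a = above_threshold(a, threshold)
    above_b = above_threshold(b, threshold)

    # Signatures both hosts report above the threshold must have close scores.
    for sig in above_a & above_b:
        if not math.isclose(a[sig], b[sig], rel_tol=0.0, abs_tol=tol):
            return False

    # A signature above the threshold on only one host is acceptable only if
    # it sits within tol of the threshold and any score the other host
    # reported for it is within tol as well.
    for sig in above_a ^ above_b:
        scores = [r[sig] for r in (a, b) if sig in r]
        if max(scores) - threshold > tol or max(scores) - min(scores) > tol:
            return False
    return True


def accepted_signatures(a, b, threshold):
    """Union of above-threshold signatures, used once the pair agrees."""
    return above_threshold(a, threshold) | above_threshold(b, threshold)


s1 = frozenset({"geneA", "geneB"})
s2 = frozenset({"geneC", "geneD"})
host_x = {s1: 0.9120, s2: 0.8004}   # s2 lands just above the threshold here
host_y = {s1: 0.9121, s2: 0.7998}   # and just below it on the other platform

print(results_agree(host_x, host_y, threshold=0.8, tol=1e-3))  # True
print(results_agree(host_x, host_y, threshold=0.8, tol=1e-5))  # False
```

The second call shows the failure mode mentioned in the update: with a tolerance that is too stringent, a pair that differs only by platform-level floating point noise is rejected.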
 
How very frustrating. Still, I'm getting plenty of workunits and about 2 in 3 are validating, but the backlog is growing. Have almost 11,000 MCMs waiting for validation - pretty good unintended bunkering. Edit: checking results of yesterday, December 16, I see that a huge number suddenly validated. Got 12 days worth of MCMs validating on a quad core machine, for example.
 
I now have about 13,400 MCMs waiting to validate, but if what they are saying is true, it looks like WCG will be fully back soon. Oddly, my 64 bit machines are getting pretty good validation rates day to day (about 60% validating), but not so the 32 bit ones (about 10%). How odd.

December 24, 2025

  • Pushed changes to BOINC transitioner to fix bug with upload bucket calculation for resends last night. This should drastically increase validations from resends going forward. We have also rebuilt the combined validation and assimilation pipeline for MCM1 and MAM1, which will finally enable us to start going through the validation backlog and clearing it out. We will try to leave a period of roughly 1-2 weeks for results to remain visible in a validated state (undeleted from the result table) as we validate and purge PV jail, which means removing the files from the in-memory cache and the result records from the database. We expect the process to take about 3 weeks.
  • Working to resolve a reported issue with profile changes again not propagating from the website to BOINC clients. The issue has a different presentation this time.
  • MAM1 beta workunits, and soon small smoke tests of the production pipeline, are being released - last week, we began smoke testing the MAM1 beta project (beta30) again, and we are working on the Windows and GPU-enabled builds in preparation for beginning the production run, keeping in mind the application we are using to run MAM1 will be backported to MCM1 as well so that we can take advantage of the modern features of the PyTorch/LibTorch backend. Thank you to volunteers for reporting outcomes in the forums across multiple threads. In the new year as we begin daily beta testing of MAM1 to roll out the initial production runs for the project and add platforms and GPU compat, we will have a dedicated thread for reporting issues, outcomes, and asking questions about the beta30 application and results on the forums.
  • Thank You for supporting open science through WCG. Happy Holidays and Happy New Year to all volunteers!
 
Backlogged MCMs are validating quickly. Down to just over 10,000 still waiting, from 13,400 on Wednesday. Looks like about 1 per minute now. It would seem their last post, saying they've mostly sorted it out, is true.
 
The 64 bit bonus! I noticed that some of my machines seemed to be folding MCMs slower than they should be - all of them being 32 bit Win XP. Out of a sense of finding-what-the-flying-F was going on, I put 64 bit Linux on an older quad (a Phenom II), and voilà!, the folding time went from ~3.7 hours to ~2.6 hours. Amazing. Edit: it did not increase the heat output of the CPU - still hanging around 45 C, like before.
 
I'm surprised that anyone is still using XP for anything these days...
 
It still does the job for some things. I'd rather not go to the trouble of upgrading machines without cause. But the 32 bit thing is a problem for MCMs, no question.
 
You will still get 32-bit work on a 64-bit OS. Adding the no_alt_platform option to your cc_config.xml file will stop getting 32-bit work. This may limit how much work gets downloaded since you will be limited to 64-bit work only. This setting will stop x86 work for other BOINC projects also.

How to stop receiving 32-bit tasks

intelx86 = 32-bit
x86_64 = 64 bit
 
Careful with that setting as it can cause you not to get GPU work from some Projects because their GPU app is 32bit.
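
For reference, no_alt_platform goes inside the options section of cc_config.xml. Below is a small, hedged Python sketch that adds or updates the setting in the text of an existing config; the helper name `set_no_alt_platform` is made up, and the sketch works on a string rather than touching any real file, so point it at your own cc_config.xml (after backing it up) only if the result looks right.

```python
import xml.etree.ElementTree as ET


def set_no_alt_platform(xml_text, enabled=True):
    """Return cc_config.xml text with <no_alt_platform> set to 1 or 0."""
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # The setting lives under <options>; create that section if it is missing.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    flag = options.find("no_alt_platform")
    if flag is None:
        flag = ET.SubElement(options, "no_alt_platform")
    flag.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


existing = "<cc_config><options></options></cc_config>"
updated = set_no_alt_platform(existing)
print(updated)
# <cc_config><options><no_alt_platform>1</no_alt_platform></options></cc_config>
```

The client has to restart or re-read its config files before the change takes effect. Given the warning above about 32-bit GPU apps, passing enabled=False (or deleting the element) is the way back if GPU work dries up.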
 
This anti-XP ism is just getting toxic.
Well I wasn't being toxic, just surprised. I can see a niche case where a legacy app/program req's 32 bit but even then I'd opt for Win10 or 7 or even Vista for chrissakes over XP. Maybe there's not as many bots crawling on the internet looking for XP anymore since it's so out of date but when it first went unsupported you'd be found and hacked in SECONDS. There was a registry hack to emulate Server 2003 to get protection updates for awhile longer but those are long gone now as well. Risky business to still be running it.
 
Sorry, a bit of joke about anti-Xp ism, Toconator. Didn't mean it badly. Irony doesn't come off well in text, often. I get what people say about XP getting hacked, but I run the virus scanners and stuff and it keeps turning up clean. But, I use a modernish browser and ruthless ad blocking, too. I figure XP is a small target anymore.
 
Ah, yes sarcasm doesn't translate in text very well especially without italics or some such and yes the overwhelming majority of malicious actors have most likely left XP behind.. The Kernel is much more secure on Vista on up tho. Just sayin' ...
 
Yeah, I have computers with 7, XP, 11, and now one running Linux. XP is really snappy with spinner drives (yes, I still have a bunch). Same machines with 7 or Vista run like they have leg chains - everything sloooooows down. Likely upgrade to Linux on new build this year... thinking of going for a Ryzen 7000 series, but with RAM prices... ugh.
 
  • February 19, 2026
    • We are rolling out BOINC server release 1.6.1 in production, as we successfully re-based our BOINC server build onto recently released upstream v1.6.1 and cherry-picked changes from legacy and new WCG development, passing all tests and adding new ones for WCG-specific customizations. The scheduler, transitioner, and feeder will now run alongside the create_work daemons, Redpanda brokers, file_upload_handlers, and combined validator_assimilator daemons on the BOINC backend cluster and serve only the node local partition, handling only BOINC client requests routed to that server by the load balancer.
    • We are migrating the SQLite instances across the BOINC backend nodes to a Citus-Data v14.0 PostgreSQL database cluster. This completes our transition to a horizontally scalable architecture for the BOINC backend via the Redpanda cluster, the PostgreSQL database cluster, and our changes to enforce end-to-end data locality in workunit processing within individual partitions: signed URL generation now includes buckets owned by a specific partition/node on the backend, and the load balancer is given a mapping of buckets to the correct backend node. While we have had bugs in multiple components as we made changes to the backend, all components now consistently apply the same hashing function to a given workunit ID to achieve correct routing. As we loop the database and scheduler into this design, we expect to finally be able to communicate the value of the new architecture in results as we restart ARP1, roll out MAM1, and seek to backport the MAM1 application to the MCM1 project for BOINC clients running on devices that can handle it. We will not close the path for less powerful devices to contribute, only add the path for more powerful devices to contribute more than they currently can.
    • MAM1 v7.0.8 for Linux and Windows, and ARP1, will release as soon as the BOINC server v1.6.1 and distributed PostgreSQL database cluster rollouts are complete. We will then also enable stats export. We will be able to gather the remaining backlog for validation from the Postgres cluster more easily than from the currently bogged down legacy BOINC database, and perform the final phase of validation backlog reconciliation.
    • Results page under My contribution: as correctly guessed in the forums, the botched rollout of the new results page was a direct attempt to reduce load on the BOINC database from the APIs now that result tables have grown so large, while attempting to preserve the ability of volunteers to inspect their results and confirm for themselves the validation state of their PV jails, which we have finally made progress on. After restoring basic functionality of the new server-side in-memory caching approach, we left several issues unresolved and proceeded to work on the above items. A fix is incoming to enable downloading all results instead of the 500 currently displayed, to fix the conditions that cause the summary stats to show more results in the numerator than in the denominator, and to address the behaviour of filtering, among others.
    • Fixed Apache mod-security rules blocking Team Challenge and Invite to Challenge functionality, and we suspect multiple other 403 forbidden issues to be related to the upgrades to the Apache webserver and HAProxy load balancer at the start of the transition. More fixes are incoming for the website, once we have completed the rollout of the final pieces of our new horizontally scalable architecture, removing the BOINC database as a single point of contention, and launching all remaining applications as described above.
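
The routing idea in the February 19 update, where every component applies the same hash to a workunit ID so that uploads land on the node that owns the bucket, can be sketched like this. Everything here is an assumption for illustration: the bucket count, the node names, the path layout and the choice of SHA-256 are made up, and real URL signing is left out entirely. The only point is that the side generating the URL and the load balancer must agree on one deterministic function.

```python
import hashlib

NUM_BUCKETS = 24  # hypothetical bucket count
# Hypothetical ownership map: six backend nodes, four buckets each.
BUCKET_OWNER = {b: f"node-{b % 6}" for b in range(NUM_BUCKETS)}


def bucket_for(workunit_id):
    """Deterministically map a workunit ID to a bucket.

    A stable digest is used instead of Python's built-in hash(), which is
    randomized per process for strings and would break cross-component routing.
    """
    digest = hashlib.sha256(str(workunit_id).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def upload_path(workunit_id, result_name):
    """Path the URL generator would hand to the client (signing omitted)."""
    return f"/upload/bucket-{bucket_for(workunit_id):02d}/{result_name}"


def route(path):
    """What the load balancer does: read the bucket, look up its owner."""
    bucket = int(path.split("/")[2].split("-")[1])
    return BUCKET_OWNER[bucket]


path = upload_path(123456, "wu123456_0")
print(path, "->", route(path))
```

Because both sides derive the bucket from the workunit ID the same way, a result uploaded through that path always reaches the node whose partition holds the workunit.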
 
Sounds like some good progress being made. Let's hope for a smooth transition so that people can have some of their faith in this project restored.
 
  • March 6, 2026
    • Upcoming Database Migration Will Require Taking BOINC Database Offline Between March 10th-11th - we have been testing the distributed Postgres architecture in our new staging environment. We intend to switch over to our Postgres build of the BOINC server v1.6.1 release next week, between March 10th-11th. The cutover will require a brief pause while we deploy the updated scheduler and feeder to each Postgres worker node, and roll out CDC via Kafka in production to match the staging deployment. With CDC, we will keep the legacy databases in sync with workunit state managed by the Postgres cluster.
    • CRUD operations from create_work and validator_assimilator daemons will be pipelined by a journaling process to keep workunit state eventually consistent in the legacy BOINC database, so that the current API layer and business logic will continue to work as usual. What we gain is that the IOPS available to the BOINC database will scale with the number of servers we deploy in the Postgres cluster, and batch updates to the database will preserve data locality, completing the horizontally scalable design.
    • This should resolve the long-standing validation issues and the high disk utilization of the current BOINC database, enable us to return to strict homogeneous redundancy classes for MCM1 workunits, solve the os_name and os_version buffer overflow and truncation bug, and give us access to the new features implemented in the more recent BOINC releases, such as running containers as applications on the grid.
    • We are able to produce and run MCM1 and MAM1 workunits through this system in staging, and will launch ARP1 shortly after we confirm it delivers as expected and described above. We also should then be able to release a larger quantity of workunits than we were previously able to for ARP1 and accelerate the completion of the project.
 
  • March 16, 2026
    • Data transfer from legacy MariaDB to Postgres is ongoing for the largest partitioned table in the BOINC database, the result table. All other tables were transferred successfully to the postgres cluster.
    • We tested every component in production, and we expect to bring the system back online for users at the load balancer as soon as the data transfer and final integrity check between the legacy and new database are complete.
 
  • March 23, 2026
    • Many issues identified and fixed after bringing the WCG backend online last Thursday, March 19th, 2026, running the new Postgres-enabled BOINC v1.6.1 backend in production: We found multiple missing indexes in our Postgres schema that the small scale of our staging environment did not reveal as necessary. We also had to set up the PgBouncer connection pooler on each Postgres worker node to pool the large number of connections, and address multiple oversights in our CDC pipeline that resulted in incorrectly decoding user prefs coming from MariaDB into Citus reference tables, and in no stats updates for the first few update windows on the website after we came back online. The credit_flusher, which gathers from the Citus coordinator the global view of which results need their state updated in MariaDB, was omitting key columns from those updates. We also had to iterate on MariaDB binlog settings together with Kafka to get CDC working in prod. The bottleneck load is now spread across the IOPS available to 6 servers without requiring a full migration of the legacy BOINC database, which is under 100% disk utilization for the first time at the new site while under production load. We are therefore now going to be able to clear the validation backlog, launch and resume projects, fix the daily stats on the website, resume the stats dump, and fix the results page so that Volunteers can verify the state of their backlogs. A server status page has been implemented in our staging environment, and we plan to push it to the live site this week or next.
    • We have one major issue with one of the six backend servers at the moment (3pm UTC), and we have reached out to hosting for assistance: We were able to recover the Postgres cluster on the morning of March 23rd, 2026 (EST) from the crash yesterday, March 22nd, but this one of the six Postgres workers is behaving differently from the other partitions despite identical config and executables running on it. It has become unresponsive again, and we reached out to hosting about this in case it is a hardware or cloud issue. We have blocked that server at the load balancer from receiving scheduler requests, so the download and reporting path should be functional for all requests now, but uploads mapped to that server will continue to display HTTP transient errors until we resolve the problem.
 
For the last two days a good number have been validating. If it keeps up like this for a couple weeks, all the backlog might be cleared up. Been really getting frustrated at the promises to get WCG running properly, but am fairly optimistic at this point it will finally work again. No new work units yet, but that's ok with me, as the winter has ended early.
 
Good stuff but winter ended early? had almost a foot of snow yesterday and about 3-4 inches this morning. Sun's out this afternoon but I'm not taking my winter tires off for a few weeks yet...
 
Heh, it was 95F here today. Quite a bit warmer than usual for the end of March. Don't normally hit temps like this before June or July.
 
We had an all-time record here for the whole of March. Hit 94. Crazy. That's higher than the historical average for July.
 
Pretty sure we've had new all time highs for at least three days out of the past week and a half or so.
All right guys time for me to switch to Canadian. Was -11C when I went to work this morning but sun came out and it got up to 6C. It'll only get up to -1C tomorrow (~30F) and there's chances of more snow maybe in a few days but it should improve into April hopefully. In recognition and respect of the temps you're having to endure I've acquired some tallboy beers in your honor as I manage my boxen. Sorry, no tamales up here but cheers anyways. 🍻
 
You Canadians are really lucky to have this folding advantage over us. I might just move up there to join in.
 
WCG is sending me work units again, but nothing has been validating for 2 days. More unintended bunkering. I'm taking advantage of an over-the-night cold snap to fold again.
 
Had a bunch send this afternoon. Might be fixed now. 1 or 2 boxen crunch WCG and fold at all times but just a few threads to BOINC for power and heat reasons.
 
  • April 9, 2026
    • BOINC web traffic has been blocked at the load balancer for maintenance. All BOINC scheduler requests, downloads and uploads will be met with HTTP 503 codes until maintenance is completed - we expect completion between April 10th and April 11th, but no earlier than 18:00 UTC on April 10th, to allow projected file transfers and migration of sharded database table records between citus postgres workers to complete. We will update here and on the forums if we expect extended maintenance over the weekend. Volunteers should expect that a successful rollout during this maintenance window will increase workunit availability, and put another dent in the validation backlog. https://www.cs.toronto.edu/~juris/jlab/wcg.html
 
April 13, 2026

We are aware of the web site and forum issues - looking into it. Our certificates are valid. We are back online, yet we are still waiting for some answers from UT and UHN about the cause of this issue. We will update once we have some answers. ME: their domain may have expired.
 
  • April 17, 2026
    • More problems - this time issues with the data center.
    • Website/Forum Outage - possibly due to a service interruption in our cloud environment that hosting is working to fix and prevents us from accessing our running instances for maintenance, and may be responsible for other issues although this is currently unclear. Hosting provided an ETA of 1h at 19:00 UTC today April 17th, 2026, and we will keep volunteers posted as we get information and attempt to come back online.
    • BOINC Backend Outage Ongoing after brief success window on Wednesday, April 15th, 2026 - we enabled the BOINC stats dump after seeing the new architecture handle load and fixing the 503 upload issue. However, in attempting to fix the 404s on the download path by rebuilding the input files and writing them to the tmpfs cache from which downloads are served, we pushed several nodes into various failure states. These included SUnreclaim slab memory exhaustion due to the overhead of writing each file to tmpfs en masse, and ill-advised queries run against postgres before EXPLAIN and without paging results to disk, which backed up everything else waiting on postgres and caused a soft lockup on the node. There were also issues with the io_method = 'io_uring' setting in combination with our network-attached volumes for the postgres datadir, a setting we may have to change and evaluate before restarting. The naive "should be backup tonight" note in the forums by the WCG Tech was based on having recovered from one of these soft lockups many times in the past few weeks, and was written before understanding the reason for the initial crash or causing the later crashes on other nodes while attempting various methods of safely regenerating workunits that were throwing 404s on download after being assigned by the scheduler. We will bring the system online as soon as we are confident the results the scheduler will assign have download URLs with files at those paths, and the cluster is stable again.
I'm getting really close to retiring from WCG. Too many problems since they took over from IBM. I don't see how any beneficial science is getting done. They're obviously not capable of running a large-scale BOINC project.
 
I am, too. They've had months to get it right, and haven't fixed it yet. I'm pretty much done with folding until next winter anyway. If it's still giving problems, next winter means a new project.
 
With so few bio/medical projects out there, it is hard to not just set and forget the project. Let it crunch when it has work and then crunch something else when it doesn't. I have a lot of Android devices and have even fewer choices to run on those. It is a sad state when researchers can find it more attractive and in many cases cheaper to just do the research in house or in cloud resources...
 