Bigadv heads up

Grandpa_01

[H]ard|DCer of the Year 2013
Joined
Jun 4, 2011
Messages
1,175
Just thought I would give a heads up on some really big bigadv WUs that have been popping up recently. I had one of them today, a 6903 that was 10,000,000 steps. My 4P ran for a couple of hours on it before I noticed, and it did not even do 1%. The log said the completed file was going to be over 7 GB in size, and another person reported getting a 6904 that was 13,000,000 steps. So you might want to watch your boxes and delete these units if you get them; they could put a serious hurt on your production. http://foldingforum.org/viewtopic.php?f=19&t=20692
 
Because they are bad is the point, I assume; a 6903 should not take a couple of hours to do 1% ;)
 
I had picked up a p6903 unit that was 500k steps. Took a while to figure out that the slow TPF was due to double the normal number of steps.
 
I will repeat the question.

The work units are borked; at some point something has been corrupted, and when you download one, instead of doing one WU you will be doing many.

Each 6903/6904 is made up of 250,000 steps - this is a normal WU and you all know how long it takes to run on your system(s).
In these cases you are folding not just your WU but the WU that preceded it. I.e. if you are on gen 51 of P6903 R2 C9 and you have a 500,000-step WU, you are also folding gen 50, which should already have been completed, uploaded, and formed into the new WU; the forming has not happened for some reason.

In Grandpa's case he was attempting to fold all 52 gens of that particular PRC within the normal timeframe for a 6904, which just isn't going to happen; hence the need to delete.
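The arithmetic above makes for a quick sanity check (using the 250,000-steps-per-gen figure from this thread):

```shell
# Each normal 6903/6904 generation is 250,000 steps, so the step count in
# the log tells you how many generations a broken unit is trying to fold.
echo $((500000 / 250000))     # doubled unit: 2 generations
echo $((10000000 / 250000))   # the 10,000,000-step 6903: 40 generations
echo $((13000000 / 250000))   # the reported 13,000,000-step 6904: 52 generations
```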
 
That explains it! :eek:

Yesterday I had a 6903 that doubled, yes doubled, the TPF of usual 6903s. I spent a couple of hours troubleshooting, restarting, and fiddling with Linux. AAARGHH. There was nothing wrong with my system, it appears.

Thanks for the post, Grandpa.
 
I've had 18 EUEs today on my SR-2 dual hex. I was running 6903s the last time I looked. Do you think there is a connection?
 
I don't know; look for this in the log. The one I had would continually start over, but I did not see any EUEs:
Code:
WARNING: This run will generate roughly 7425 Mb of data

starting mdrun 'Overlay'
10000000 steps,  40000.0 ps.
[02:19:25] Completed 0 out of 10000000 steps  (0%)
 
My last 6903 on my 4p would just crash the client and let the computer sit idle.
 
How do I delete a work unit in linux ubuntu?
 
If you seriously want to delete it :D ... then stop the fah client, then:

1- rm -R work (this will remove the work folder)
2- rm unitinfo.txt queue.dat (optional: also remove FAHlog.txt)

That's what I would do
 
Pocatello, the graphical user interface method is:
1. stop the client
2. open the "fah" folder and delete/move to trash the work folder and the queue.dat and unitinfo.txt files.

sbinh, some of us have not yet discovered the great pleasure of command line interface. :)
 
Thanks.

CLI worked fine. I deleted some stuff.

But my dual hex is only running on one CPU with a new work unit. Is that because I had 18 EUE's?
 
Personally, I don't think EUE would cause that issue.
You might want to check your system to make sure the other CPU is still working right.
 
One needs to remove machinedependent.dat as well, to reduce the likelihood of getting such a (broken) unit again...
 
I was wondering why tpfs were sucky and now I know why. Thanks for the heads up.
 
On one box I use the GPU FAH tracker, as I have a video card that crunches; as of late I have had some WUs for the SMP client that do not show the PPD.
 
Is it safe to go back into the -bigadv world?
 
Is it safe to go back into the -bigadv world?

Keep an eye on the units you are assigned. I was assigned my 4th bad one this morning.

Code:
[09:46:58] Project: 6904 (Run 0, Clone 31, Gen 39)
[09:46:58] 
[09:46:58] Assembly optimizations on if available.
[09:46:58] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_00.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
[09:47:07] Mapping NT from 48 to 48 
Starting 48 threads
Making 2D domain decomposition 8 x 6 x 1

WARNING: This run will generate roughly 7421 Mb of data

starting mdrun 'Overlay'
10000000 steps,  40000.0 ps.
[09:47:13] Completed 0 out of 10000000 steps  (0%)
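A quick way to spot one of these units before it burns hours is to check the step count mdrun reports; a minimal check, assuming the FAHlog.txt format shown in the excerpts above:

```shell
# Flag any mdrun step count above the normal 250,000 per generation.
# The "N steps," line format matches the FAHlog.txt excerpts in this thread.
awk -F'[ ,]' '/^[0-9]+ steps,/ && $1 > 250000 \
    { print "Oversized WU:", $1, "steps" }' FAHlog.txt
```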
 
I was also assigned a new 6903 that was bad. I removed it and got another one. Removed that one, got a bad 6904, which I removed, and then picked up a 6901.
 
Let me know when it is safe to go back in the water.
 
I got one of these buggers this morning, so it is not safe yet.
 
Have two 6903's picked up last night/early today.
No problem with either.......so far
 
Did SF change the preferred deadline for 6903/6904? Two of my 6903s completed within 4 days but received only 20k+ points each.
 
According to the psummary they are still at 5 and 5.6 days, so not yet. You can post over at the FF and have one of the mods look them up.
Code:
Project  Server IP        Name         Atoms    Pref(d)  Final(d)  Credit    Frames  Core    Descr        Contact  k-factor
6903     130.237.232.237  ha_shooting  2533797  5.00     12.00     22706.00  100     GRO-A5  Description  kasson   38.05
6904     130.237.232.237  ha_shooting  2533797  5.60     17.00     31541.00  100     GRO-A5  Description  kasson   37.31
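Those credit, deadline, and k-factor numbers feed the quick-return bonus. Assuming the standard QRB formula, credit = base × sqrt(k × final_deadline / days_taken) (an assumption here, not something quoted in this thread), a 6903 returned in 4 days should earn far more than 20k:

```shell
# Rough QRB estimate for a 6903: base 22706 points, k-factor 38.05,
# final deadline 12 days, returned in 4 days. The formula used is an
# assumption (standard F@H quick-return bonus), not quoted from this thread.
awk 'BEGIN { base = 22706; k = 38.05; deadline = 12; days = 4
             printf "%.0f\n", base * sqrt(k * deadline / days) }'
# prints roughly 240,000, so a ~20k return suggests the bonus was not applied
```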
 
I just dumped another one of these bad units - that is 5 of them in 5 days for me.
Over what period of time do I need to worry about losing the bonus for an 80% completion rate?
 
Just curious, but have you noted the run, clone, and gen of those WUs? Are they the same each time or different ones? If they are the same one each time, try deleting machinedependent.dat, queue.dat, and unitinfo.txt if you have not already done that. I do not understand why you are getting so many of them; I have 5 bigadv rigs running and have only picked up the one so far. I know that the Stanford servers will reassign a WU to a rig if that WU has not been returned in time, so that may be what is happening in your case. This is just a guess, but it may be worth a try.
 
Each one has been a different unit, none of them previously reported on FF, and all of them on the same machine. Work directory and machinedata.dat deleted after each one.

Stanford wants to move to big-16, but they do not seem to be addressing an issue that takes these same machines out until there is direct user intervention - not to mention that the machines are uselessly working and consuming electricity at 100% the entire time.

Stanford - Ready, Fire, Aim. TM
 
I will see if I can get in touch with some of the people at Stanford and hopefully get a resolution. Kasson has been quiet lately; he may be travelling or involved in something at the moment. He is usually pretty good at fixing issues on his end.
 
Got one yesterday on my 4p, crippled that damn thing and I didn't know till today. So much for making a run at the top20....

Grrrr
 
Let me know when it is safe to go back in the water.
 
I just grabbed a 6903, i did rm machinedependent.dat as well just to make sure
 
[image: 2011_11_22_tagline_jaws.jpg]
 
I just dumped another one of these bad units - that is 5 of them in 5 days for me.
Three in the last 24 hours here...

Not sure if the completion rate is lifetime, but it seems to be; otherwise those who dump WUs would lose their QRB fairly quickly.


Looks like no swimming for Pocatello just yet. ;)
 
Just an update.

I did get ahold of Kasson and they are working on it. It sounds like the only other option would be to shut down bigadv. Below is his response.
Oversized 6903 / 6904
Sent: Fri Feb 10, 2012 1:09 pm

From: Grandpa_01
To: kasson

Hi Dr. Kasson

Were you aware that there are still quite a few of the extra-large WUs floating around in the wild? In the last 24 or so hours there have been several of them reported over at the [H]. I just thought I would let you know, in case you were not aware. http://hardforum.com/showthread.php?t=1671174

Have a Good Day
Grandpa_01
Rick
Oversized 6903 / 6904
Sent: Fri Feb 10, 2012 3:37 pm
From: kasson
To: Grandpa_01

Yes, I forget where I posted about this, but I'm afraid it's just the ~30 that we've identified. We're in the process of trying to "rehabilitate" them, but the current work server software doesn't let us nuke the WU's the way we used to. (If I did that, it would keep trying to assign empty WU packets, giving the 512-byte download + missing file issue.) We apologize for the problems and are working to fix the WU's as best we can.
 
Just got another one... going regular SMP until they fix this. I can't babysit this rig while Stanford gets it right, and it's either SMP or deleting WUs, so I'll stick to SMP.

[20:43:40] Project: 6904 (Run 0, Clone 43, Gen 37)
 