Bigadv heads up

Grandpa_01

[H]ard|DCer of the Year 2013
Joined
Jun 4, 2011
Messages
1,175
Just thought I would give a heads up on some really big bigadv WUs that have been popping up recently. I had one of them today, a 6903 that was 10,000,000 steps. My 4P ran for a couple of hours on it before I noticed, and it did not even do 1%. The log said the completed file was going to be over 7 GB in size, and another person reported getting a 6904 that was 13,000,000 steps. So you might want to watch your boxes and delete these units if you get them; they could put a serious hurt on your production. http://foldingforum.org/viewtopic.php?f=19&t=20692
 
Because they are bad is the point, I assume; a 6903 should not take a couple of hours to do 1% ;)
 
I had picked up a p6903 unit that was 500k steps. Took a while to figure out that the slow TPF was due to double the normal number of steps.
 
I will repeat the question.

The work units are borked; at some point something has been corrupted, and when you download one, instead of doing one WU you will be doing many.

Each 6903/6904 is made up of 250,000 steps - this is a normal WU and you all know how long it takes to run on your system(s).
In these cases you are folding not just your WU but the WU that preceded it. I.e. if you are on gen 51 of P6903 R2 C9 and you have a 500,000-step WU, you are also folding gen 50, which should already have been completed, uploaded, and formed into the new WU; the forming has not happened for some reason.

In Grandpa's case he was attempting to fold all 52 gens of that particular PRC within the normal timeframe for a 6904, which just isn't going to happen; hence the need to delete.
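The arithmetic above makes for a quick sanity check (using the 250,000-steps-per-gen figure from this thread):

```shell
# Each normal 6903/6904 generation is 250,000 steps, so the step count in
# the log tells you how many generations a broken unit is trying to fold.
echo $((500000 / 250000))     # doubled unit: 2 generations
echo $((10000000 / 250000))   # the 10,000,000-step 6903: 40 generations
echo $((13000000 / 250000))   # the reported 13,000,000-step 6904: 52 generations
```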
 
That explains it! :eek:

Yesterday I had a 6903 that doubled, yes doubled, the TPF of usual 6903s. I spent a couple of hours troubleshooting, restarting, and fiddling with Linux. AAARGHH. There was nothing wrong with my system, it appears.

Thanks for the post, Grandpa.
 
I've had 18 EUEs today on my SR-2 dual hex. I was running 6903s the last time I looked. Do you think there is a connection?
 
I don't know; look for this in the log. The one I had would continually start over, but I did not see any EUEs:
Code:
WARNING: This run will generate roughly 7425 Mb of data

starting mdrun 'Overlay'
10000000 steps,  40000.0 ps.
[02:19:25] Completed 0 out of 10000000 steps  (0%)
 
My last 6903 on my 4p would just crash the client and let the computer sit idle.
 
How do I delete a work unit in linux ubuntu?
 
If you seriously want to delete it :D ... then stop the fah client, then:

1- rm -R work (this will remove the work folder)
2- rm unitinfo.txt queue.dat (optional: also remove FAHlog.txt)

That's what I would do
 
Pocatello, the graphical user interface method is:
1. stop the client
2. open the "fah" folder and delete/move to trash the work folder and the queue.dat and unitinfo.txt files.

sbinh, some of us have not yet discovered the great pleasure of command line interface. :)
 
Thanks.

CLI worked fine. I deleted some stuff.

But my dual hex is only running on one CPU with a new work unit. Is that because I had 18 EUE's?
 
Personally, I don't think EUE would cause that issue.
You might want to check your system to make sure the other CPU is still working right.
 
One needs to remove machinedependent.dat as well, to reduce the likelihood of getting such a (broken) unit again...
 
I was wondering why tpfs were sucky and now I know why. Thanks for the heads up.
 
On one box I use the GPU FAH tracker, as I have a video card that crunches; as of late I have had some WUs for the SMP client that do not show the PPD.
 
Is it safe to go back into the -bigadv world?
 
Is it safe to go back into the -bigadv world?

Keep an eye on the units you are assigned. I was assigned my 4th bad one this morning.

Code:
[09:46:58] Project: 6904 (Run 0, Clone 31, Gen 39)
[09:46:58] 
[09:46:58] Assembly optimizations on if available.
[09:46:58] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_00.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
[09:47:07] Mapping NT from 48 to 48 
Starting 48 threads
Making 2D domain decomposition 8 x 6 x 1

WARNING: This run will generate roughly 7421 Mb of data

starting mdrun 'Overlay'
10000000 steps,  40000.0 ps.
[09:47:13] Completed 0 out of 10000000 steps  (0%)
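A quick way to spot one of these units before it burns hours is to check the step count mdrun reports; a minimal check, assuming the FAHlog.txt format shown in the excerpts above:

```shell
# Flag any mdrun step count above the normal 250,000 per generation.
# The "N steps," line format matches the FAHlog.txt excerpts in this thread.
awk -F'[ ,]' '/^[0-9]+ steps,/ && $1 > 250000 \
    { print "Oversized WU:", $1, "steps" }' FAHlog.txt
```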
 
I was also assigned a new 6903 that was bad. I removed it and got another one. Removed that one, got a bad 6904, which I removed, and then picked up a 6901.
 
Let me know when it is safe to go back in the water.
 
I got one of these buggers this morning, so it is not safe yet.
 
Have two 6903's picked up last night/early today.
No problem with either.......so far
 
Did SF change the preferred deadline for 6903/6904? Two of my 6903s completed within 4 days but received only 20k+ points each.
 
According to the psummary they are still at 5 and 5.6 days, so not yet. You can post over at the FF and have one of the mods look them up.
Code:
Project  Server IP        Name         Atoms    Pref(d)  Final(d)  Credit    Frames  Core    Descr        Contact  k-factor
6903     130.237.232.237  ha_shooting  2533797  5.00     12.00     22706.00  100     GRO-A5  Description  kasson   38.05
6904     130.237.232.237  ha_shooting  2533797  5.60     17.00     31541.00  100     GRO-A5  Description  kasson   37.31
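Those credit, deadline, and k-factor numbers feed the quick-return bonus. Assuming the standard QRB formula, credit = base × sqrt(k × final_deadline / days_taken) (an assumption here, not something quoted in this thread), a 6903 returned in 4 days should earn far more than 20k:

```shell
# Rough QRB estimate for a 6903: base 22706 points, k-factor 38.05,
# final deadline 12 days, returned in 4 days. The formula used is an
# assumption (standard F@H quick-return bonus), not quoted from this thread.
awk 'BEGIN { base = 22706; k = 38.05; deadline = 12; days = 4
             printf "%.0f\n", base * sqrt(k * deadline / days) }'
# prints roughly 240,000, so a ~20k return suggests the bonus was not applied
```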
 
I just dumped another one of these bad units - that is 5 of them in 5 days for me.
Over what period of time do I need to worry about losing the bonus for an 80% completion rate?
 
Just curious, but have you noted the run, clone, and gen of those WUs? Are they the same each time or different ones? If they are the same one each time, try deleting machinedependent.dat, queue.dat, and unitinfo.txt if you have not already done that. I do not understand why you are getting so many of them; I have 5 bigadv rigs running and have only picked up the one so far. I know that the Stanford servers will reassign a WU to a rig if that WU has not been returned in time, so that may be what is happening in your case. This is just a guess, but it may be worth a try.
 
Each one has been a different unit, none of them previously reported on FF, and all of them on the same machine. Work directory and machinedata.dat deleted after each one.

Stanford wants to move to big-16, but they do not seem to be addressing an issue that takes these same machines out until there is direct user intervention - not to mention that the machines are uselessly working and consuming electricity at 100% the entire time.

Stanford - Ready, Fire, Aim. TM
 
I will see if I can get in touch with some of the people at Stanford and hopefully get a resolution. Kasson has been quiet lately; he may be travelling or involved in something at the moment. He is usually pretty good at fixing issues on his end.
 
Got one yesterday on my 4p, crippled that damn thing and I didn't know till today. So much for making a run at the top20....

Grrrr
 
Let me know when it is safe to go back in the water.
 
I just grabbed a 6903, i did rm machinedependent.dat as well just to make sure
 
[image: 2011_11_22_tagline_jaws.jpg]
 
I just dumped another one of these bad units - that is 5 of them in 5 days for me.
Three in the last 24 hours here...

Not sure if the completion rate is lifetime, but it seems to be; otherwise those who dump WUs would lose their QRB fairly quickly.


Looks like no swimming for Pocatello just yet. ;)
 
Just an update.

I did get ahold of Kasson and they are working on it. It sounds like the only other option would be to shut down bigadv. Below is his response.
Oversized 6903 / 6904
Sent: Fri Feb 10, 2012 1:09 pm

From: Grandpa_01
To: kasson

Hi Dr. Kasson

Were you aware that there are still quite a few of the extra-large WUs floating around in the wild? In the last 24 or so hours there have been several of them reported over at the [H]. I just thought I would let you know, in case you were not aware. http://hardforum.com/showthread.php?t=1671174

Have a Good Day
Grandpa_01
Rick
Oversized 6903 / 6904
Sent: Fri Feb 10, 2012 3:37 pm
From: kasson
To: Grandpa_01

Yes, I forget where I posted about this, but I'm afraid it's just the ~30 that we've identified. We're in the process of trying to "rehabilitate" them, but the current work server software doesn't let us nuke the WU's the way we used to. (If I did that, it would keep trying to assign empty WU packets, giving the 512-byte download + missing file issue.) We apologize for the problems and are working to fix the WU's as best we can.
 
Just got another one... going regular SMP until they fix this. I can't babysit this rig while Stanford gets it right, and it's either SMP or deleting WUs, so I'll stick to SMP.

[20:43:40] Project: 6904 (Run 0, Clone 43, Gen 37)
 