Server reports problem with unit.

-alias- · Oct 2, 2012

The "Server reports problem with unit" error is fixed as far as I know. I've shipped 8 WUs to Stanford without getting this error message, so I consider the problem solved. The problem came only on 810x WUs.

Core32 · Oct 2, 2012

-alias- said:
.......The problem came only on 810x WUs......

Which makes it doubly frustrating IMO.

tear · Oct 2, 2012

Say what you want, I actually like the way this issue has been handled.
Infinitely more professional than ever.

Tobit · Oct 2, 2012

tear said:
Say what you want, I actually like the way this issue has been handled.
Infinitely more professional than ever.

Agreed. Dr. Kasson was relatively on top of everything given it was the weekend. Much better than the days of old; communication was there even if the problem could not be fixed instantly.

Grandpa_01 · Oct 2, 2012

freeloader1969 said:
I've had two 8101's go bad for half a million points. I'll let this one finish and if it fails, I'll be shutting down my folding rigs until Stanford fixes their "problem". My latest one just failed this morning.

freeloader1969, what do you mean by fails? Are you getting the "server has a problem with the unit" message, or are they getting EUE errors or 0x8b errors? You should not be getting the server error. If you are, it needs to be reported over at the FF; they cannot fix an issue if they do not know about it. All of the messed-up WUs should have been completed by now.

freeloader1969 · Oct 2, 2012

Grandpa_01 said:
freeloader1969, what do you mean by fails? Are you getting the "server has a problem with the unit" message, or are they getting EUE errors or 0x8b errors? You should not be getting the server error. If you are, it needs to be reported over at the FF; they cannot fix an issue if they do not know about it. All of the messed-up WUs should have been completed by now.

I got the "server has a problem with the unit" last night.

Code:

[02:16:55] Completed 242500 out of 250000 steps  (97%)
[02:45:20] Completed 245000 out of 250000 steps  (98%)
[03:13:48] Completed 247500 out of 250000 steps  (99%)
[03:42:18] Completed 250000 out of 250000 steps  (100%)
[03:42:31] DynamicWrapper: Finished Work Unit: sleep=10000
[03:42:41] 
[03:42:41] Finished Work Unit:
[03:42:41] - Reading up to 64340496 from "work/wudata_04.trr": Read 64340496
[03:42:42] trr file hash check passed.
[03:42:42] - Reading up to 31616784 from "work/wudata_04.xtc": Read 31616784
[03:42:42] xtc file hash check passed.
[03:42:42] edr file hash check passed.
[03:42:42] logfile size: 203100
[03:42:42] Leaving Run
[03:42:42] - Writing 96321256 bytes of core data to disk...
[03:43:14] Done: 96320744 -> 91568336 (compressed to 5.8 percent)
[03:43:14]   ... Done.
[03:43:25] - Shutting down core
[03:43:25] 
[03:43:25] Folding@home Core Shutdown: FINISHED_UNIT
[03:43:27] CoreStatus = 64 (100)
[03:43:27] Sending work to server
[03:43:27] Project: 8101 (Run 22, Clone 1, Gen 60)


[03:43:27] + Attempting to send results [October 2 03:43:27 UTC]
[04:01:56] - Server reports problem with unit.
[04:01:56] - Preparing to get new work unit...
[04:01:56] Cleaning up work directory

Amaruk · Oct 3, 2012

I have been of the opinion that 128.143.231.201 had an in silico 'brain fart' at approximately 22:45 UTC on September 29th, resulting in instantaneous and complete amnesia regarding any and all work it had issued. As a result, all work started before that point but finished after it was rejected. With a 2.4-day preferred deadline for 8101, that would put the latest possible affected QRB-eligible return at about 08:20 UTC October 2nd.
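The cutoff above is simple date arithmetic, sketched here in Python. The 22:45 UTC figure is only the approximate time suggested in this post, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Approximate time the work server is assumed to have lost its records
# (an estimate from this thread, not a confirmed figure).
loss_time = datetime(2012, 9, 29, 22, 45, tzinfo=timezone.utc)

# Preferred deadline quoted for project 8101: 2.4 days = 57 h 36 min.
preferred_deadline = timedelta(hours=57, minutes=36)

# A WU assigned just before the loss could still be returned within the
# preferred deadline up to this moment.
latest_affected_return = loss_time + preferred_deadline
print(latest_affected_return.isoformat())
```

This lands on October 2nd at 08:21 UTC, which matches the "about 08:20" figure.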

freeloader1969, your WU was rejected October 2nd at 03:43:27 UTC, which is during that time-frame. However, using a TPF of 00:28:28 (based on the last 3 frames) puts assignment at around 04:16 UTC on September 30th. Not only is this after the WS potentially lost its mind at 22:45 UTC on September 29th, it's also several hours after Grandpa was reissued Project: 8102 (Run 0, Clone 18, Gen 65) at 00:39:31 on the same day (which I have reason to believe was uploaded for full credit).
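For anyone who wants to check the estimate, here is a rough Python sketch using the four frame timestamps from the posted log. It assumes all 100 frames ran at the same pace as the last three and ignores download and startup time, so the result is only approximate. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def oct2(hh, mm, ss):
    """Timestamp on October 2nd, 2012 (UTC), as printed in the client log."""
    return datetime(2012, 10, 2, hh, mm, ss, tzinfo=timezone.utc)

# Completion times of the 97%, 98%, 99% and 100% frames from the log.
frames = [oct2(2, 16, 55), oct2(2, 45, 20), oct2(3, 13, 48), oct2(3, 42, 18)]

# Average time per frame (TPF) over the last three frame intervals.
intervals = [later - earlier for earlier, later in zip(frames, frames[1:])]
tpf = sum(intervals, timedelta()) / len(intervals)

# Assumption: all 100 frames took the same time as the last three.
estimated_start = frames[-1] - 100 * tpf

# Reference points mentioned in the thread.
suspected_loss = datetime(2012, 9, 29, 22, 45, tzinfo=timezone.utc)
known_reissue = datetime(2012, 9, 30, 0, 39, 31, tzinfo=timezone.utc)

print("TPF:", tpf)
print("Estimated assignment:", estimated_start)
print("After suspected loss:", estimated_start > suspected_loss)
print("After known re-issue:", estimated_start > known_reissue)
```

The averaged TPF rounds to 00:28:28 and the estimated assignment falls at about 04:16 UTC on September 30th, after both reference points.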

While a single anomaly does not necessarily discredit the above hypothesis, I would be curious to know if you were in fact issued a WU that subsequently failed after someone else was (re)issued a WU which uploaded successfully.

tear said:
Say what you want, I actually like the way this issue has been handled.
Infinitely more professional than ever.

Agreed for the most part, with a few niggles... For example, pointing the blame at the AS/Stanford didn't sit too well. V7 clients report to assign3.stanford.edu, whereas 6.34 clients report to assign.stanford.edu - this would mean that two different AS, which both communicate with multiple WS, each simultaneously developed the same issue with just a single WS. William of Ockham and common sense both agree that's not very likely.

But I certainly appreciate both Peter and Joe looking into the matter, especially over the weekend. Enough to even tell them so... if they hadn't locked the thread...

-alias- · Oct 3, 2012

The thread "128.143.231.201 or Bigadv Collection server broken" was locked 02 Oct 2012, 12:19.

Grandpa_01 · Oct 3, 2012

I posted the issue over at the FF http://foldingforum.org/viewtopic.php?f=18&t=22596

freeloader1969 said:

I got the "server has a problem with the unit" last night.

Code:

[02:16:55] Completed 242500 out of 250000 steps  (97%)
[02:45:20] Completed 245000 out of 250000 steps  (98%)
[03:13:48] Completed 247500 out of 250000 steps  (99%)
[03:42:18] Completed 250000 out of 250000 steps  (100%)
[03:42:31] DynamicWrapper: Finished Work Unit: sleep=10000
[03:42:41] 
[03:42:41] Finished Work Unit:
[03:42:41] - Reading up to 64340496 from "work/wudata_04.trr": Read 64340496
[03:42:42] trr file hash check passed.
[03:42:42] - Reading up to 31616784 from "work/wudata_04.xtc": Read 31616784
[03:42:42] xtc file hash check passed.
[03:42:42] edr file hash check passed.
[03:42:42] logfile size: 203100
[03:42:42] Leaving Run
[03:42:42] - Writing 96321256 bytes of core data to disk...
[03:43:14] Done: 96320744 -> 91568336 (compressed to 5.8 percent)
[03:43:14]   ... Done.
[03:43:25] - Shutting down core
[03:43:25] 
[03:43:25] Folding@home Core Shutdown: FINISHED_UNIT
[03:43:27] CoreStatus = 64 (100)
[03:43:27] Sending work to server
[03:43:27] Project: 8101 (Run 22, Clone 1, Gen 60)


[03:43:27] + Attempting to send results [October 2 03:43:27 UTC]
[04:01:56] - Server reports problem with unit.
[04:01:56] - Preparing to get new work unit...
[04:01:56] Cleaning up work directory

tear · Oct 3, 2012

Amaruk said:
While a single anomaly does not necessarily discredit the above hypothesis, I would be curious to know if you were in fact issued a WU that subsequently failed after someone else was (re)issued a WU which uploaded successfully.

The earliest successful re-issue that I know of was at 00:39:31 UTC, Sep 30th.

Amaruk said:
Agreed for the most part, with a few niggles... For example, pointing the blame at the AS/Stanford didn't sit too well. V7 clients report to assign3.stanford.edu, whereas 6.34 clients report to assign.stanford.edu - this would mean that two different AS, which both communicate with multiple WS, each simultaneously developed the same issue with just a single WS. William of Ockham and common sense both agree that's not very likely.

True

That was a rushed theory on Peter's part.

Amaruk said:
But I certainly appreciate both Peter and Joe looking into the matter, especially over the weekend. Enough to even tell them so... if they hadn't locked the thread...

Agreed completely. They need to work a little bit in that department.
