[H]ard|Forum

  #1  
Old 07-31-2008, 10:28 AM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
6.22 SMP - stability still a problem

So far I have tested the 6.22 client with the -SMP flag set on 4 SMP systems, and all have EUEed. Three were running stable on 2x5.91s + 1xGPU2 for weeks. The other was running 1x5.91 and 1xGPU2, again stable for weeks.

Last night I continued testing on one box (Vista x64, Q6600 at 3GHz and 1.25 vcore, 8GB, 8800GT). At 100% load the CPU temps max out at 55C and the GPU at 57C.

Two 5.91 clients with a GPU2 client will run stable on this system for weeks.
On the same system a solitary 6.22 client will crash with the old:
Folding@home Core Shutdown: MISSING_WORK_FILES...Client-core communications error: ERROR 0x1
Usually it would fail about 8-10% into a WU.

Now in the past this was stated to be a problem with the system or the WU. The WU is a 2665, nothing strange there. The system is very mildly OCed, so I stepped it down to stock, but I still encountered the MISSING_WORK_FILES error with 6.22. (Not that the OC mattered here; as stated earlier, TWO 5.91s + GPU2 will run without issue for weeks on end.)

So I tried turning the CPU usage down. Bingo. At the 90% setting it seems to be stable again. So now I am running two 6.22 -SMP clients set to 90% and one GPU2. Obviously this still stresses the CPU to 100%. Definitely a 6.22 client problem, not a WU or system error, as the same WU on the same system works in 5.91 at 100% settings.

Now Stanford is well known for finger-pointing and blaming everything but their own software, or using the never-ending excuse "it's beta code" (have you noticed that almost all of F@H's code is beta until it's too old to be useful?), so I expect this craptacular client to be released "as is" on Saturday and the 5.91 to go away even though it's more stable.

Now you're forewarned.
We're getting another piece of garbageware, and we have to figure out how to get it to run.
At least we have a band-aid for the poor programming.
  #2  
Old 07-31-2008, 10:32 AM
Xilikon [H]ard|DCer of the Year 2008, 5.3 Years
 
Xilikon is offline
You are correct. I'm seeing a few reports of this issue, so I'll go kick a few asses. This is especially disturbing since it's very stupid to release a very new client version a few days before the deadline and make it the official one, basically stuffing it down the throats of all users. This is bad beta-testing practice (good practice would be to avoid releasing a beta client close to a deadline and instead offer a choice of either version until all the kinks get worked out).

Thanks for shedding light on the stupid solution.

__________________
| Fold for the [H]orde !!! Infos in www.hardfolding.com

| Intel Core i7 920 D0 - 4.0 GHz (21x191 MHz)
| Heatkiller 3.0CU nickel plated - Black Ice SR-1 360 - MCP355 - D-Tek Fuzion GPU v2 + UNi-Sink
| EVGA X58 SLI LE - G.Skill Ripjaw 3x2Gb 8-8-8-24
| EVGA GTX 260 Superclocked @ 602/2052/1512 - Dell 2007WFP 20" LCD
| Intel X25-M 80GB G2 - Western Digital WD1600AAJS - Pioneer DVR-212D
| Corsair Obsidian 800D - Corsair VX550W PSU
  #3  
Old 07-31-2008, 10:36 AM
Sunin [H]ard|DCer of the Month - August 2008, 4.1 Years
 
Sunin is offline
Very interesting.

All I can say is 2xSMP 6.22 MPI = 2800-3100 PPD without a single EUE on:

Q6600 @ 3.4Ghz
True HSF
EVGA 780i mobo
2x 88GTX with GPU Clients

I may be the exception rather than the rule though... not really sure!

I'll go double check those clients tonight and make sure they are not EUE'n... They show up green though... not that Fahmon is flawless.

__________________
130,000+PPD For the [H]orde! Team 33 - - 1 OC'd Q6600s & 27 GPUs - 3k+ KW/mo
  #4  
Old 07-31-2008, 10:43 AM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Quote:
Originally Posted by Xilikon View Post
You are correct. I'm seeing a few reports of this issue, so I'll go kick a few asses. This is especially disturbing since it's very stupid to release a very new client version a few days before the deadline and make it the official one, basically stuffing it down the throats of all users. This is bad beta-testing practice (good practice would be to avoid releasing a beta client close to a deadline and instead offer a choice of either version until all the kinks get worked out).

Thanks for shedding light on the stupid solution.
At least we have a workaround, Xil.
(It's what we've always had to do....nothing new.)

If all we need to do is cut the client down to 90% to fix the EUE issue then we have a way to keep frustrated folks running. I know it's goofy, but it will keep people with us while the client is being fixed.

Thanks for going to rattle the cages.
  #5  
Old 07-31-2008, 10:47 AM
Wheresatom [H]ard|Gawd, 2.9 Years
 
Wheresatom is online now
Quote:
Originally Posted by relic View Post

So I tried turning the CPU usage down. Bingo. At the 90% setting it seems to be stable again. So now I am running two 6.22 -SMP clients set to 90% and one GPU2. Obviously this still stresses the CPU to 100%. Definitely a 6.22 client problem, not a WU or system error, as the same WU on the same system works in 5.91 at 100% settings.
Are you pretty much saying you offer the GPU client the last 10% and it takes it all then?

__________________
Asus P5N-E-SLI || Intel E-6420-3.2 GHz || 4 GB RAM || Windows Vista Ultimate 64
EVGA GTX 260

Folding for the [H]orde!

www.energymodelhelp.com
  #6  
Old 07-31-2008, 10:49 AM
Xilikon [H]ard|DCer of the Year 2008, 5.3 Years
 
Xilikon is offline
A PM has been sent to Peter Kasson and Vijay Pande about the poor execution and decisions around the new beta client. I hope they realize this and make the necessary changes.

This is like having a new Windows version out and telling everyone to upgrade within a few weeks with no chance to stay with the old one. That's poor practice...

  #7  
Old 07-31-2008, 11:03 AM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Quote:
Originally Posted by Sunin View Post
Very interesting.

All I can say is 2xSMP 6.22 MPI = 2800-3100 PPD without a single EUE on:
I'm sure some systems run fine or the client would never have been released at all.
Well, at least I would hope that is true.
All data points, even "it works for me", are important when troubleshooting.

I have redone the other systems with 6.22 at 90% to see if the "fix" is as reproducible as the error condition. If yes, we know for certain that there is a serious issue in 6.22 that didn't exist in 5.91. At least on some systems.

Surprisingly, I am able to reproduce the error on a variety of systems/OSs and on both Intel and NVidia chipsets. The only thing in common on all systems is an NVidia GPU (3 8800GTs and an 8700M GT) and the CUDA driver (although slightly differing versions).

I have not tried 6.22 on the P4 (non SMP) yet to see if I can reproduce the EUE condition; perhaps I'll do that for another data point.

It is also possible that, through an amazing streak of bad luck, all of my WUs have been bad ones only when testing the 6.22 client. Very unlikely, but possible. I had never seen an SMP client error until loading the 6.22 client, while others have seen them fairly regularly even with the old client.
  #8  
Old 07-31-2008, 11:09 AM
jws2346 [H]ard|Gawd, 3.0 Years
 
jws2346 is offline
I don't quite understand all I know about this BS about client v5.91 vs client v6.22. (which is very little, I haven't even upgraded yet, you know the old sayin' "good to the last WU", err... drop )

It seems to me "old timers" like Mr relic or BillR are very seldom wrong about anything folding wise. ("old timers" = experience folding, not age, except Mr relic, he's got both )

As "Smoke" said (another pretty good authority on folding along with several other "old timers" on this team) they shouldn't have released the latest 6.22 client so close to the deadline of the 5.91 client.

My question is: can you set the system clock back to before the expiration date of the v5.91 client and still use it for a while, or at least until some of the "bugs" that I've read about in the v6.22 client are worked out?

If these questions aren't particularly on the bright side I apologise in advance

Thanks for any answers

FOLD ON!

__________________
WC'ed Q6600 G0, 8800GS, 24/7 DCing
WC'ed Q6600 GO, 9600GTS, 24/7 DCing
Also one cat semi DCing, her output depends on the amount of "cat nip" she's had, how much sleepin' she does, how much eatin' she does, how much cleanin' she does and of course, how many trips to the "cat box" she makes. (ATM her output ain't much because she seems to be stuck in the sleeping/cleaning and cat box mode)
  #9  
Old 07-31-2008, 11:14 AM
Xilikon [H]ard|DCer of the Year 2008, 5.3 Years
 
Xilikon is offline
I've heard back from Peter Kasson and Vijay Pande. They will discuss the clients today and come up with a plan. Chances are that they will end up repackaging 5.91/5.92 with a new expiration date, then let all 3 clients coexist for a short while so we can iron out the bugs in the 6.22 client.
  #10  
Old 07-31-2008, 11:36 AM
Wheresatom [H]ard|Gawd, 2.9 Years
 
Wheresatom is online now
Quote:
Originally Posted by Xilikon View Post
I've heard back from Peter Kasson and Vijay Pande. They will discuss the clients today and come up with a plan. Chances are that they will end up repackaging 5.91/5.92 with a new expiration date, then let all 3 clients coexist for a short while so we can iron out the bugs in the 6.22 client.
If they re-package the 5.91 and 5.92 clients with a new expiration date, does that mean we will still have to re-download and install them? I would imagine so, right?

  #11  
Old 07-31-2008, 11:37 AM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Quote:
Originally Posted by Wheresatom View Post
Are you pretty much saying you offer the GPU client the last 10% and it takes it all then?
Wheresatom,
I run 3 clients on one system normally.
Up until 6.22 I could run GPU at "slightly higher" priority, and two SMPs at "idle" priority and set to 100% CPU usage.

With 6.22 the client crashes even when only the 1 client is running, unless set to 90% CPU usage. This is a very strange error as three clients should stress the system much more than just one.

I'm hoping that someone can reproduce the error and we can compare notes to see what is the cause of the 6.22 client instability....and avoid the problem.

Common to my systems able to produce this error:
Intel Core2 CPUs
NVidia GPUs
CUDA Video drivers

RAM varies from 3-8GB
OC's vary from none to 20%
OSs Vista x86, Vista x64 and XP pro x86
Chipsets are 3 Intel and one NVidia
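
The "what do the failing boxes have in common" comparison above can be sketched as a set intersection. This is only a minimal Python illustration: the per-box assignments below are hypothetical placeholders (the post gives only the aggregate picture, such as three Intel chipsets and one NVidia), and the names `systems` and `common_factors` are made up for the example.

```python
# Hypothetical per-box attributes. The thread only states the aggregate mix,
# so which OS/chipset belongs to which box is an assumption for illustration.
systems = {
    "box1": {"cpu": "Intel Core2", "gpu_vendor": "NVidia", "driver": "CUDA",
             "os": "Vista x64", "chipset": "Intel"},
    "box2": {"cpu": "Intel Core2", "gpu_vendor": "NVidia", "driver": "CUDA",
             "os": "Vista x86", "chipset": "Intel"},
    "box3": {"cpu": "Intel Core2", "gpu_vendor": "NVidia", "driver": "CUDA",
             "os": "XP Pro x86", "chipset": "Intel"},
    "box4": {"cpu": "Intel Core2", "gpu_vendor": "NVidia", "driver": "CUDA",
             "os": "Vista x86", "chipset": "NVidia"},
}


def common_factors(boxes):
    """Return the attribute/value pairs shared by every box."""
    pairs = [set(attrs.items()) for attrs in boxes.values()]
    return dict(set.intersection(*pairs))


# Attributes that differ between boxes (OS, chipset) drop out of the result.
print(sorted(common_factors(systems)))
```

Anything that survives the intersection is only a candidate cause, of course; it narrows down where to look, it does not prove anything.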
  #12  
Old 07-31-2008, 11:43 AM
MixManSC [H]ardness Supreme, 5.5 Years
 
MixManSC is online now
Quote:
Originally Posted by Wheresatom View Post
If they re-package the 5.91 and 5.92 clients with a new expiration date, does that mean we will still have to re-download and install them? I would imagine so right?
You would have to redownload them but then just get the main exe file and replace it on each box. Just have to stop each client for a little bit to do it.



__________________
Compromise - Let's agree to respect each others views, no matter how wrong yours might be.
  #13  
Old 07-31-2008, 11:58 AM
Xilikon [H]ard|DCer of the Year 2008, 5.3 Years
 
Xilikon is offline
Yes, regardless of the outcome, we must redownload and replace the current SMP client. My own opinion is to give the 6.22 beta client a try, since its expiration is in 6 months and not the usual 3 months. Only fall back to 5.91/5.92 if you cannot resolve an issue with 6.22.

The goal they want to reach is to have 6.22 exclusively soon, since it implements a few necessary changes to better handle newer cores and to get rid of any trace of v5. For this, we need to use it and report issues, since that is what beta testing is about. If you don't want to deal with beta testing, you should not use a beta client and should use a non-beta client instead (for SMP, the Linux/OSX 6.02 is out of beta in case you care).

We are the [H]orde and we won't give up over such a petty issue as a beta bug. relic proved himself by finding the band-aid fix. Peter also found out that the extra safety checks in 6.22 are overreacting with an A1 EUE, so expect a new version shortly.

  #14  
Old 07-31-2008, 12:43 PM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Quote:
Originally Posted by jws2346 View Post
If these questions aren't particularly on the bright side I apologise in advance
All questions are always good ones...it's those who don't ask that make the bigger mistakes.

I think the best plan, since we have a workaround, is as Xilikon suggested. Go to 6.22 if possible. I would suggest dropping back to the 90% CPU setting if you have an EUE; otherwise leave it at 100%. The 5.91/5.92 repackaging isn't really helping except to keep some people running while they sort out the 6.22 errors. So use the repackaged 5.9x clients as a last resort.

I have no idea how this reduced CPU usage fixes the client crashing problem. I only know that it is reproducible and effective. The programmers have to take it from there, but at least they have something to work with.
  #15  
Old 07-31-2008, 02:08 PM
BillR [H]ard|DCer of the Month - March 2007, 8.0 Years
 
BillR is offline
Quote:
Originally Posted by jws2346 View Post
I don't quite understand all I know about this BS about client v5.91 vs client v6.22. (which is very little, I haven't even upgraded yet, you know the old sayin' "good to the last WU", err... drop )

It seems to me "old timers" like Mr relic or BillR are very seldom wrong about anything folding wise. ("old timers" = experience folding, not age, except Mr relic, he's got both )

As "Smoke" said (another pretty good authority on folding along with several other "old timers" on this team) they shouldn't have released the latest 6.22 client so close to the deadline of the 5.91 client.

My question is: can you set the system clock back to before the expiration date of the v5.91 client and still use it for a while, or at least until some of the "bugs" that I've read about in the v6.22 client are worked out?

If these questions aren't particularly on the bright side I apologise in advance

Thanks for any answers

FOLD ON!

Heh, you keep making allusions to (Mr.) relic’s age vs. mine. The truth is relic is but a teenager in the big scope of things; I’m all but old enough to be his daddy, and that, my friend, is one disgusting thought.

I believe I have an answer to how to make the new program run reliably, although I can take no responsibility for relic’s hardware building skills.




__________________
"The best laid schemes o' Mice an' Men, gang aft agley." ~R. Burns
I really don’t respect the boundaries of the laws because I was not consulted when the laws were made.
"Alcohol, tobacco & firearms should be a convenience store, not a government agency."
“The Constitution is not an instrument for the government to restrain the people, it is an instrument for the people to restrain the government - lest it come to dominate our lives and interests.” – Patrick Henry

Make no attempt to alter the lies that have become the new truth
  #16  
Old 07-31-2008, 02:28 PM
BillR [H]ard|DCer of the Month - March 2007, 8.0 Years
 
BillR is offline
I just installed, then reinstalled the 6.22 program successfully three different times with repeatable results.

Step one, uninstall all instances of the old SMP client. If you were running two in Windows before, uninstall both. Done properly, each uninstall should require a standard reboot to free up the files in use. At that point manually delete all the folders containing anything about SMP that had been in use before.

Now, download the combo 32-64 bit client 6.22; it is the least buggered of the two.

Double click to install and it may well choose to use one of the directories you used before even though by now you should have deleted it. This is all good. Allow it to do so.

Prior to this in Vista you had to disable User Account Control (UAC). Well, don’t do that this time. As long as you are the admin in your OS and have to log in with a password, you are all set.

Double click the install bat and enter your username and password. You will get the normal “if you see this twice” crap and “press any key to close screen”; again, this is good.

Now, make a shortcut to your new .exe file and add -smp -configonly to the shortcut.

Click on your shortcut and do your configuration. I did use advanced options and allowed the size of work units to remain “Normal”, core priority idle, CPU usage 100%, no -advanced methods, and I left the two mystery config lines empty. I allowed machine #1 to be the default.

Now, go back to your shortcut, edit it through Properties so that it reads -smp -forceasm, then save and close.

Click or double click your shortcut (depending on where you put it, I like quick launch) and bingo you should be up and running.
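
For anyone who would rather script the two shortcut phases than edit shortcut properties by hand, here is a minimal Python sketch of the same idea. The flags are the ones given above; the executable path and the names `CLIENT_EXE` and `client_command` are hypothetical placeholders, so substitute your own install location.

```python
# Hypothetical install path and executable name; adjust to your own setup.
CLIENT_EXE = r"C:\FAH\smp1\fah6.exe"

# Flags for the two phases described in the steps above.
PHASE_FLAGS = {
    "configure": ["-smp", "-configonly"],  # first pass: configuration only
    "run": ["-smp", "-forceasm"],          # normal folding afterwards
}


def client_command(exe, phase):
    """Build the argument list for one phase of the procedure."""
    return [exe] + PHASE_FLAGS[phase]


for phase in ("configure", "run"):
    print(" ".join(client_command(CLIENT_EXE, phase)))
```

The list returned by `client_command` is in the shape that `subprocess.run` accepts, so it could be launched from a script, but the sketch only prints the command lines and does not start anything.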

I’m finding temps to be about the same, and at 3.2 GHz a 2665 is hitting 14 min per frame and a 3065 12 min per frame.

Give it a shot, lemme know.
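
To put those frame times in perspective, a rough back-of-the-envelope sketch in Python. It assumes a work unit is 100 frames (one frame per 1% of progress); that figure is an assumption, not something stated in this thread, and `FRAMES_PER_WU` and `wu_duration` are names made up for the example.

```python
from datetime import timedelta

FRAMES_PER_WU = 100  # assumption: one frame equals 1% of a work unit


def wu_duration(minutes_per_frame, frames=FRAMES_PER_WU):
    """Estimate the wall-clock time for a whole work unit."""
    return timedelta(minutes=minutes_per_frame * frames)


# Frame times quoted above for a 3.2 GHz box.
for project, minutes_per_frame in (("2665", 14), ("3065", 12)):
    print(project, wu_duration(minutes_per_frame))
```

Under that assumption a 2665 takes a bit under a day and a 3065 about 20 hours per unit on this box.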

Luck


  #17  
Old 07-31-2008, 05:56 PM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Bill,

Almost exactly as I installed. I disabled UAC permanently on my systems as I have no use for it, but otherwise we had similar approaches.

One of the 90% CPU usage systems EUEed today, again at 8% into WU completion.
Apparently our workaround isn't 100% reliable. This doesn't bode well....we'll likely be going back to 5.91 unless a patch is developed quickly.

This kind of instability isn't acceptable.

As a side note, this is again Project: 2665 (Run 0, Clone 207, Gen 16) WU that is having a problem in 6.22.

Code:
[20:45:52] Completed 20000 out of 250000 steps  (8 percent)
[21:01:16] Warning:  long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17] Going to send back what have done.
[21:01:17] logfile size: 0
[21:01:17] Warning: Core could not open logfile.
[21:01:17] - Writing 536 bytes of core data to disk...
[21:01:17]   ... Done.
[21:01:17] - Failed to delete work/wudata_05.sas
[21:01:17] - Failed to delete work/wudata_05.goe
[21:01:17] Warning:  check for stray files
[21:01:17] 
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17] 
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
[21:01:20] Client-core communications error: ERROR 0x7b
[21:01:20] This is a sign of more serious problems, shutting down.
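
If you want to compare where several boxes die, a small script can pull the last progress line and the shutdown reason out of a log like the one above. This is a minimal Python sketch that relies only on the line formats visible in this post; the function name `summarize_log` and the shortened sample text are made up for the example, and real logs may contain variations it does not handle.

```python
import re

# Patterns taken from the log lines shown in this post.
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")
SHUTDOWN = re.compile(r"Folding@home Core Shutdown: (\w+)")


def summarize_log(text):
    """Return (last_percent_seen, shutdown_reason) pairs, one per shutdown."""
    events = []
    last_percent = None
    for line in text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last_percent = int(match.group(3))
            continue
        match = SHUTDOWN.search(line)
        if match:
            event = (last_percent, match.group(1))
            # The log above shows the shutdown line twice; keep one copy.
            if not events or events[-1] != event:
                events.append(event)
    return events


sample = """\
[20:45:52] Completed 20000 out of 250000 steps  (8 percent)
[21:01:16] Warning:  long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
"""

print(summarize_log(sample))
```

One caveat of the de-duplication: two separate failures with the same reason at the same percentage, back to back, would be collapsed into one entry.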

Last edited by relic; 07-31-2008 at 06:02 PM..
  #18  
Old 07-31-2008, 06:08 PM
Xilikon [H]ard|DCer of the Year 2008, 5.3 Years
 
Xilikon is offline
Peter Kasson mentioned that 6.22 includes some safety checks which are too aggressive, causing EUEs for nothing and then failing to clean up the shitty mess after an EUE (MISSING_WORK_FILES). Expect a new version very soon.

  #19  
Old 07-31-2008, 06:16 PM
MixManSC [H]ardness Supreme, 5.5 Years
 
MixManSC is online now
Like by tomorrow evening?

  #20  
Old 07-31-2008, 06:50 PM
relic [H]ard|DCer of the Month - August 2007, 8.9 Years
 
relic is offline
Too funny.

Our favorite idiot has chimed in....
Finger pointing as usual.
Why do they put up with this moron?

Quote:

Re: BAD_CORE_FILES
by 7im on Thu Jul 31, 2008 7:19 pm

Are you overclocking? NaN errors tend to be hardware related, and p2665s are extray chewey.
Dumber than a brick. I doubt I've met a more useless individual.
All times are GMT -5. The time now is 08:59 AM.

