#1
6.22 SMP - stability still a problem
So far I have tested the 6.22 client on 4 SMP systems with the -SMP flag set, and all have EUEed. Three were running stable on 2x5.91s + 1xGPU2 for weeks. The other was running 1x5.91 and 1xGPU2, again stable for weeks.
Last night I continued testing on one box (Vista x64, Q6600 3GHz 1.25 vcore, 8GB, 8800GT). CPU temps max out at 55C at 100% load; the GPU maxes out at 57C. Two 5.91 clients with a GPU2 client will run stable on this system for weeks. On the same system a solitary 6.22 client will crash with the old:
Folding@home Core Shutdown: MISSING_WORK_FILES...Client-core communications error: ERROR 0x1
Usually it would fail about 8-10% into a WU. Now in the past this was stated to be a problem with the system or the WU. The WU is a 2665, nothing strange there. The system is very mildly OCed, so I stepped it down to stock but still encountered the MISSING_WORK_FILES error with 6.22. (Not that the OC mattered here; as stated earlier, TWO 5.91s + GPU2 will run without issue for weeks on end.) Same problem without the OC.
So I tried turning the CPU usage down. Bingo. At the 90% setting it seems to be stable again. So now I am running two 6.22 -SMP clients set to 90% and one GPU2. Obviously this stresses the CPU to 100%. Definitely a 6.22 client problem, not a WU or system error, as the same WU on the same system works in 5.91 at 100% settings.
Now Stanford is well known for finger-pointing and blaming everything but their own software, or using the never-ending excuse "it's beta code" (have you noticed that almost all of F@H's code is beta until it's too old to be useful?), so I expect this craptacular client to be released "as is" on Saturday and the 5.91 to go away even though it's more stable. Now you're forewarned. We're getting another piece of garbageware, and we have to figure out how to get it to run. At least we have a band-aid for the poor programming.
#2
You are correct. I'm seeing a few reports of this issue, so I'll go kick a few asses. This is especially disturbing since it's very stupid to release a brand-new client version a few days before the deadline and make it the official one, basically stuffing it down the throats of all users. This is bad beta testing practice (good practice would be to avoid releasing a beta client close to a deadline and instead offer a choice of either version until all the kinks get worked out).
Thanks for shedding light on the stupid solution.
#3
Very interesting.
All I can say is 2xSMP 6.22 MPI = 2800-3100 PPD without a single EUE on:
Q6600 @ 3.4GHz
True HSF
EVGA 780i mobo
2x 88GTX with GPU clients
I may be the exception rather than the rule though... not really sure! I'll go double check those clients tonight and make sure they are not EUE'n... They show up green though... not that FahMon is flawless.
#4
Quote:
(It's what we've always had to do... nothing new.) If all we need to do is cut the client down to 90% to fix the EUE issue, then we have a way to keep frustrated folks running. I know it's goofy, but it will keep people with us while the client is being fixed. Thanks for going to rattle the cages.
#5
Quote:
#6
A PM has been sent to Peter Kasson and Vijay Pande about the poor execution and decisions around the new beta client. I hope they realize this and make the necessary changes.
This is like having a new Windows version out and telling everyone to upgrade within a few weeks with no chance to stay with the old one. That's poor practice...
#7
Quote:
Well, at least I would hope that is true. All data points, even "it works for me", are important when troubleshooting. I have redone the other systems with 6.22 at 90% to see if the "fix" is as reproducible as the error condition. If yes, we know for certain that there is a serious issue in 6.22 that didn't exist in 5.91, at least on some systems.
Surprisingly, I am able to reproduce the error on a variety of systems/OSs and on both Intel and NVidia chipsets. The only things in common on all systems are an NVidia GPU (3 8800GTs and an 8700M GT) and the CUDA driver (although slightly differing versions). I have not tried 6.22 on the P4 (non-SMP) yet to see if I can reproduce the EUE condition; perhaps I'll do that for another data point.
It is also possible that, through an amazing streak of bad luck, all of my WUs have been bad ones only when testing the 6.22 client. Very unlikely, but possible. I had never seen an SMP client error until loading the 6.22 client, while others have seen them fairly regularly even with the old client.
#8
I don't quite understand all I know about this BS about client v5.91 vs client v6.22 (which is very little, I haven't even upgraded yet; you know the old sayin', "good to the last WU", err... drop). It seems to me "old timers" like Mr relic or BillR are very seldom wrong about anything folding-wise ("old timers" = folding experience, not age, except Mr relic, he's got both).
As "Smoke" said (another pretty good authority on folding, along with several other "old timers" on this team), they shouldn't have released the latest 6.22 client so close to the deadline of the 5.91 client. My question is: can you set the system clock back to before the expiration date of the v5.91 client and still use it for a while, or at least until some of the "bugs" that I've read about in the v6.22 client are worked out?
If these questions aren't particularly on the bright side, I apologise in advance. Thanks for any answers.
FOLD ON!
#9
I've heard back from Peter Kasson and Vijay Pande. They will discuss the clients today and come up with a plan. Chances are that they will end up repackaging 5.91/5.92 with a new expiration date, then let all 3 clients coexist for a short while so we can iron out the bugs in the 6.22 client.
#10
Quote:
#11
Quote:
I run 3 clients on one system normally. Up until 6.22 I could run GPU at "slightly higher" priority, and two SMPs at "idle" priority and set to 100% CPU usage. With 6.22 the client crashes even when only the 1 client is running, unless set to 90% CPU usage. This is a very strange error, as three clients should stress the system much more than just one. I'm hoping that someone can reproduce the error and we can compare notes to see what is the cause of the 6.22 client instability... and avoid the problem.
Common to my systems able to produce this error:
Intel Core2 CPUs
NVidia GPUs
CUDA video drivers
RAM varies from 3-8GB
OCs vary from none to 20%
OSs: Vista x86, Vista x64 and XP Pro x86
Chipsets are 3 Intel and one NVidia
#12
Quote:
#13
Yes, regardless of the outcome, we must redownload and replace the current SMP client. My own opinion is to give the 6.22 beta client a try, since its expiration is in 6 months and not the usual 3 months. Only fall back to 5.91/5.92 if you cannot resolve an issue with 6.22.
The goal they want to reach is to have 6.22 exclusively soon, since it implements a few necessary changes to better handle newer cores and to get rid of any trace of v5. For this, we need to use it and report issues, since that is what beta testing is about. If you don't want to deal with beta testing, you should not use a beta client; use a non-beta client instead (for SMP, the Linux/OSX 6.02 is out of beta in case you care). We are the [H]orde and we won't give up over a petty issue like a beta bug. relic proved himself by finding the band-aid for the issue. Peter also found out that the extra safety checks in 6.22 are overreacting with an A1 EUE, so expect a new version shortly.
#14
Quote:
I think the best plan, since we have a workaround, is as Xilikon suggested: go to 6.22 if possible. I would suggest dropping back to the 90% CPU setting if you have an EUE; otherwise leave it at 100%. The 5.91/5.92 repackaging isn't really helping, except to keep some people running while they sort out the 6.22 errors, so use the repackaged 5.9x clients as a last resort. I have no idea how this reduced CPU usage fixes the client crashing problem; I only know that it is reproducible and effective. The programmers have to take it from there, but at least they have something to work with.
#15
Quote:
I believe I have an answer to how to make the new program run reliably, although I can take no responsibility for relic's hardware building skills.
#16
I just installed, then reinstalled the 6.22 program successfully three different times with repeatable results.
Step one, uninstall all instances of the old SMP client. If you were running two in Windows before, uninstall both. Done properly, each uninstall should require a standard reboot to free up the files in use. At that point, manually delete all the folders containing anything about SMP that had been in use before.
Now, download the combo 32-64 bit client 6.22; it is the least buggered of the two. Double click to install, and it may well choose to use one of the directories you used before even though by now you should have deleted it. This is all good. Allow it to do so.
Prior to this, in Vista you had to disable user control. Well, don't do that this time. As long as you are the admin in your OS and have to enter with a password, you are all set. Double click the install bat and enter your username and passwords. You will get the normal "if you see this twice, press any key to close screen" crap; again, this is good.
Now, make a shortcut to your new .exe file and add -smp -configonly to the shortcut. Click on your shortcut and do your configuration. I did use advanced options and allowed the size of work units to remain "Normal", with CPU idle and 100%, no -advanced methods, and I left the two mystery config lines empty. I allowed machine #1 to be the default.
Now, go back to your shortcut, edit it through properties and make it read -smp -forceasm, then save and close. Click or double click your shortcut (depending on where you put it, I like quick launch) and bingo, you should be up and running.
I'm finding temps to be about the same, and at 3.2 a 2665 is hitting 14 min per frame and a 3065 12 min a frame, again at 3.2. Give it a shot, lemme know. Luck
#17
Bill,
Almost exactly as I installed. I disabled UAC permanently on my systems as I have no use for it, but otherwise we had similar approaches.
One of the 90% CPU usage systems EUEed today, again at 8% into WU completion. Apparently our workaround isn't 100% reliable. This doesn't bode well... we'll likely be going back to 5.91 unless a patch is developed quickly. This kind of instability isn't acceptable.
As a side note, this is again a Project: 2665 (Run 0, Clone 207, Gen 16) WU that is having a problem in 6.22.
Code:
[20:45:52] Completed 20000 out of 250000 steps (8 percent)
[21:01:16] Warning: long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17] Going to send back what have done.
[21:01:17] logfile size: 0
[21:01:17] Warning: Core could not open logfile.
[21:01:17] - Writing 536 bytes of core data to disk...
[21:01:17] ... Done.
[21:01:17] - Failed to delete work/wudata_05.sas
[21:01:17] - Failed to delete work/wudata_05.goe
[21:01:17] Warning: check for stray files
[21:01:17]
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17]
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
[21:01:20] Client-core communications error: ERROR 0x7b
[21:01:20] This is a sign of more serious problems, shutting down.
Last edited by relic; 07-31-2008 at 06:02 PM.
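For anyone comparing notes across several boxes, here is a rough Python sketch that pulls the interesting bits out of a log like the one above: the last progress line, the core shutdown reason(s) and the CoreStatus code. It is not part of the F@H client or any official tool. The names `summarize_log` and `SAMPLE_LOG` are made up for illustration, and the patterns are only inferred from the log lines pasted in this thread, so other client versions may word things differently.

```python
import re

# One "[HH:MM:SS] message" entry. The lookahead lets this work whether the
# entries sit on separate lines or were pasted together on a single line.
ENTRY = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\]\s*(.*?)(?=\s*\[\d{2}:\d{2}:\d{2}\]|\s*\Z)",
    re.DOTALL,
)
# Patterns inferred from the pasted log (an assumption, not a documented format).
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
SHUTDOWN = re.compile(r"Folding@home Core Shutdown: (\w+)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

# Excerpt of the log posted above.
SAMPLE_LOG = """\
[20:45:52] Completed 20000 out of 250000 steps (8 percent)
[21:01:16] Warning: long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17]
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17]
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
[21:01:20] Client-core communications error: ERROR 0x7b
"""


def summarize_log(text):
    """Return the last progress, shutdown reasons and CoreStatus found in text."""
    summary = {"last_progress": None, "shutdowns": [], "core_status": None}
    for _timestamp, message in ENTRY.findall(text):
        progress = PROGRESS.search(message)
        if progress:
            # (steps completed, total steps) from the most recent progress line
            summary["last_progress"] = (int(progress.group(1)), int(progress.group(2)))
        shutdown = SHUTDOWN.search(message)
        if shutdown:
            summary["shutdowns"].append(shutdown.group(1))
        status = STATUS.search(message)
        if status:
            # (hex code as written, decimal value as written)
            summary["core_status"] = (status.group(1), int(status.group(2)))
    return summary


if __name__ == "__main__":
    print(summarize_log(SAMPLE_LOG))
```

Running it over the logs from each affected machine gives a quick way to check whether they all die at the same point in the WU and with the same shutdown reason.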
#18
Peter Kasson mentioned that 6.22 includes some safety checks which are too aggressive, causing EUEs for nothing and then failing to clean up the shitty mess after an EUE (MISSING_WORK_FILES). Expect a new version very soon.
#19
#20
Too funny.
Our favorite idiot has chimed in.... Finger pointing as usual. Why do they put up with this moron? Quote: