Anyone else getting client core errors?


Kendrak

[H]ard|DCer of the Year 2009
I came home and my 4P had had 5 errors and had "gone to sleep." Anyone else getting hit with this?

It was on a 6903.

I switched to normal SMP and so far so good.

Bad WU or is my 4P throwing fits?
 
I have had a few 6903s that sit at the NT mapping, suck up ALL memory (32GB physical and 4GB swap), and then crash. Just delete queue.dat, machinedependent.dat, unitinfo.txt, and the work directory and try again.
 
I deleted the unitinfo, but not the queue...

I will run SMP for a bit and retry.
 
I had a 6904 fail 27 times while I was out of town last week. I think we are getting to the bottom of the barrel on the 690x units.
 
Kendrak, can you post or paste the log?

This could partially be explained by the 0.7-pre Kraken, which restarts
the core more often than it otherwise would restart by itself due to a bad WU.

Working on improvements here. We could, naturally, drop the support
for restarting bad WUs (since, as it turns out, it's not offering any gains),
but I'd like to do better...
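
For illustration, here is a minimal sketch of the restart-on-failure idea being discussed. This is not thekraken's actual C implementation; the MAX_RESTARTS cap and the bare subprocess invocation are assumptions:

Code:
import subprocess
import sys

MAX_RESTARTS = 5  # assumed cap; thekraken's real limit (if any) may differ

def supervise(cmd):
    """Re-run the core on failure, up to MAX_RESTARTS attempts."""
    rc = 0
    for attempt in range(1, MAX_RESTARTS + 1):
        rc = subprocess.call(cmd)
        if rc == 0:
            return 0  # clean exit: WU finished normally
        print("core exited with %d (attempt %d/%d)" % (rc, attempt, MAX_RESTARTS),
              file=sys.stderr)
    return rc  # persistent failure: most likely a bad WU, stop retrying

if __name__ == "__main__":
    sys.exit(supervise(["./thekraken-FahCore_a5.exe"]))

The argument against keeping this behavior is visible in the loop itself: a genuinely bad WU fails every attempt, so the retries just burn time.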
 
Probably a bad WU. If you tell me which one it is, I can report it as bad in the db and email Kasson to have him remove it.

You need to delete the work folder, queue.dat, and machinedependent.dat to get out of these situations.
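
In case it helps, a minimal cleanup sketch along those lines (Python, Linux; the FAH_DIR path is an assumption, point it at your own v6 client directory):

Code:
import os
import shutil

# Files the posts above say to delete to recover from a stuck/bad WU.
FAH_DIR = os.path.expanduser("~/fah")  # assumed client location

for name in ("queue.dat", "machinedependent.dat", "unitinfo.txt"):
    path = os.path.join(FAH_DIR, name)
    if os.path.exists(path):
        os.remove(path)
        print("removed", path)

work = os.path.join(FAH_DIR, "work")
if os.path.isdir(work):
    shutil.rmtree(work)  # drop the cached WU so the client fetches a fresh one
    print("removed", work)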
 
Not sure how WUs are "handed out," but this must be more than coincidence or luck.
Two 4Ps (one 6166HE, one 6128) consistently get 6904, almost without fail, day after day for weeks.
The third rig (6166HE) gets 90% 6903 and 10% 6900/1. Never a 6904.
And to stay somewhat on topic, I've had maybe 3-4 6904s fail multiple times in the past month; only deleting the files mentioned above would solve it.
 
The problematic WUs I had were:

Project: 6903 (Run 3, Clone 8, Gen 117)
Project: 6904 (Run 2, Clone 19, Gen 84)

They kept haunting me even after I deleted everything and started anew.

The thekraken log files were these:

Code:
thekraken: The Kraken 0.7-pre11 (compiled Fri May 25 10:29:44 EDT 2012 by eylem@FAH-H8QGL)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2885
thekraken: launch binary: ./thekraken-FahCore_a5.exe
thekraken: config file: ./thekraken.cfg
thekraken: Forked 2886.
thekraken: child: ptrace(PTRACE_TRACEME) returns 0
thekraken: child: Executing...
thekraken: 2886: initial attach
thekraken: 2886: Continuing.
thekraken: 2886: logfile fd: 5
thekraken: waitpid() returns 2886 with status 0x0000057f
thekraken: 2886: stopped with signal 0x00000005
thekraken: 2886: Continuing (unhandled trap).
thekraken: waitpid() returns 2886 with status 0x00006500
thekraken: 2886: exited with 101

and

Code:
thekraken: The Kraken 0.7-pre11 (compiled Fri May 25 10:29:44 EDT 2012 by eylem@FAH-H8QGL)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2890
thekraken: launch binary: ./thekraken-FahCore_a5.exe
thekraken: config file: ./thekraken.cfg
thekraken: Forked 2891.
thekraken: child: ptrace(PTRACE_TRACEME) returns 0
thekraken: child: Executing...
thekraken: 2891: initial attach
thekraken: 2891: Continuing.
thekraken: 2891: logfile fd: 5
thekraken: waitpid() returns 2891 with status 0x0000057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: Continuing (unhandled trap).
thekraken: waitpid() returns 2892 with status 0x0000137f
thekraken: 2892: stopped with signal 0x00000013
thekraken: 2892: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2892
thekraken: 2891: binding 2892 to cpu 0
thekraken: 2891: talkative FahCore process identified (2892); listening to syscalls
thekraken: 2891: Continuing.
thekraken: waitpid() returns 2893 with status 0x0000137f
thekraken: 2893: stopped with signal 0x00000013
thekraken: 2893: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2893
thekraken: 2891: Continuing.
thekraken: waitpid() returns 2894 with status 0x0000137f
thekraken: 2894: stopped with signal 0x00000013
thekraken: 2894: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2894
thekraken: 2891: Continuing.
thekraken: 2890: (sighandler) got signal 0x00000001
thekraken: waitpid() returns 2894 with status 0x00000009
thekraken: 2894: terminated by signal 9
thekraken: 2894: ignoring clone termination
thekraken: waitpid() returns 2893 with status 0x00000009
thekraken: 2893: terminated by signal 9
thekraken: 2893: ignoring clone termination
thekraken: 2892: terminated by signal 9
thekraken: 2892: ignoring clone termination
thekraken: waitpid() returns 2891 with status 0x00000009
thekraken: 2891: terminated by signal 9

I hope it helps ;)
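
For anyone curious what the "binding 2892 to cpu 0" lines mean: thekraken pins each cloned FahCore thread to a single core. A minimal sketch of that idea using Python's Linux-only scheduler wrappers (the PID and CPU values here are just illustrative):

Code:
import os

def bind_to_cpu(pid, cpu):
    # Restrict the given process/thread to exactly one core.
    os.sched_setaffinity(pid, {cpu})

bind_to_cpu(os.getpid(), 0)               # pin ourselves to cpu 0
print(os.sched_getaffinity(os.getpid()))  # -> {0}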
 
This is -pre11, which doesn't contribute to the issue.

Logs look "normal" as in:
- first: core stopped manually (EDIT: nope, something else happened here -- do you have FAHlog.txt corresponding to this run?)
- second: core killed due to OOM

I'd need them from Kendrak :)
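
For reference, the raw waitpid() statuses in those logs decode as follows; a quick sketch using Python's standard wait-status macros (the three hex values are taken from the logs above):

Code:
import os

# Decode the waitpid() statuses printed by thekraken.
for status in (0x0000057f, 0x00006500, 0x00000009):
    if os.WIFSTOPPED(status):
        print("%#010x: stopped by signal %d" % (status, os.WSTOPSIG(status)))
    elif os.WIFEXITED(status):
        print("%#010x: exited with %d" % (status, os.WEXITSTATUS(status)))
    elif os.WIFSIGNALED(status):
        print("%#010x: terminated by signal %d" % (status, os.WTERMSIG(status)))

# 0x0000057f -> stopped by signal 5 (SIGTRAP, the ptrace stop)
# 0x00006500 -> exited with 101 (the first log's core exit)
# 0x00000009 -> terminated by signal 9 (SIGKILL, consistent with the OOM killer)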
 
I will restart and add the flag when I get home and see if the problem pops up again. I wanted to run SMP for a bit just to verify I was stable. I was hoping we didn't have another OC issue.

I also have other goodies to play with tonight if I have time.
 
You're stable, all right. Gotta start bringing BAs in so EVGA's gap stops narrowing :)
 
The problematic WUs I had were:

Project: 6903 (Run 3, Clone 8, Gen 117)
Project: 6904 (Run 2, Clone 19, Gen 84)

Thank you ChelseaOilman.

The WU (P6903,R3,C8,G117) has been reported as a bad WU. Note that WUs on the reported list are stopped daily at 8am Pacific time.

Thank you ChelseaOilman.

The WU (P6904,R2,C19,G84) has been reported as a bad WU. Note that WUs on the reported list are stopped daily at 8am Pacific time.

I reported both WUs as bad.
 
Let's always include log excerpts to make sure we don't report failures due to OC or whatnot...
I think it's good practice.
 