Anyone else getting client core errors?


Kendrak

[H]ard|DCer of the Year 2009
I came home and my 4P had had 5 errors and had "gone to sleep." Anyone else getting hit with this?

It was on a 6903.

I switched to normal SMP and so far so good.

Bad WU or is my 4P throwing fits?
 
I have had a few 6903s that sit at the NT mapping, suck up ALL memory (32GB physical and 4GB swap), and then crash. Just delete queue.dat, machinedependent.dat, unitinfo.txt, and the work directory and try again.
 
I deleted the unitinfo, but not the queue...

I will run SMP for a bit and retry.
 
I had a 6904 fail 27 times while I was out of town last week. I think we are getting to the bottom of the barrel on the 690x units.
 
Kendrak, can you post or paste the log?

This could partially be explained by the 0.7-pre Kraken, which restarts
the core more often than it otherwise would restart by itself due to a bad WU.

Working on improvements here. We could, naturally, drop the support
for restarting bad WUs (since, as it turns out, it's not offering any gains),
but I'd like to do better...
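
For illustration, here is a minimal sketch of the restart-on-failure idea being discussed. This is not thekraken's actual C implementation; the MAX_RESTARTS cap and the bare subprocess invocation are assumptions:

Code:
import subprocess
import sys

MAX_RESTARTS = 5  # assumed cap; thekraken's real limit (if any) may differ

def supervise(cmd):
    """Re-run the core on failure, up to MAX_RESTARTS attempts."""
    rc = 0
    for attempt in range(1, MAX_RESTARTS + 1):
        rc = subprocess.call(cmd)
        if rc == 0:
            return 0  # clean exit: WU finished normally
        print("core exited with %d (attempt %d/%d)" % (rc, attempt, MAX_RESTARTS),
              file=sys.stderr)
    return rc  # persistent failure: most likely a bad WU, stop retrying

if __name__ == "__main__":
    sys.exit(supervise(["./thekraken-FahCore_a5.exe"]))

The argument against keeping this behavior is visible in the loop itself: a genuinely bad WU fails every attempt, so the retries just burn time.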
 
Probably a bad WU. If you tell me which one it is, I can report it as bad in the db and email Kasson to have him remove it.

You need to delete the work folder, queue.dat, and machinedependent.dat to get out of these situations.
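
In case it helps, a minimal cleanup sketch along those lines (Python, Linux; the FAH_DIR path is an assumption, point it at your own v6 client directory):

Code:
import os
import shutil

# Files the posts above say to delete to recover from a stuck/bad WU.
FAH_DIR = os.path.expanduser("~/fah")  # assumed client location

for name in ("queue.dat", "machinedependent.dat", "unitinfo.txt"):
    path = os.path.join(FAH_DIR, name)
    if os.path.exists(path):
        os.remove(path)
        print("removed", path)

work = os.path.join(FAH_DIR, "work")
if os.path.isdir(work):
    shutil.rmtree(work)  # drop the cached WU so the client fetches a fresh one
    print("removed", work)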
 
Not sure how WUs are "handed out," but this must be more than coincidence or luck.
Two 4Ps (one 6166HE, one 6128) consistently get 6904, almost without fail, day after day for weeks.
The third rig (6166HE) gets 90% 6903 and 10% 6900/1. Never a 6904.
And to stay somewhat on topic, I've had maybe 3-4 6904s fail multiple times in the past month; only deleting the files mentioned above would solve it.
 
The problematic WUs I had were:

Project: 6903 (Run 3, Clone 8, Gen 117)
Project: 6904 (Run 2, Clone 19, Gen 84)

They kept haunting me even after I deleted everything and started anew.

The thekraken log files were these:

Code:
thekraken: The Kraken 0.7-pre11 (compiled Fri May 25 10:29:44 EDT 2012 by eylem@FAH-H8QGL)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2885
thekraken: launch binary: ./thekraken-FahCore_a5.exe
thekraken: config file: ./thekraken.cfg
thekraken: Forked 2886.
thekraken: child: ptrace(PTRACE_TRACEME) returns 0
thekraken: child: Executing...
thekraken: 2886: initial attach
thekraken: 2886: Continuing.
thekraken: 2886: logfile fd: 5
thekraken: waitpid() returns 2886 with status 0x0000057f
thekraken: 2886: stopped with signal 0x00000005
thekraken: 2886: Continuing (unhandled trap).
thekraken: waitpid() returns 2886 with status 0x00006500
thekraken: 2886: exited with 101

and

Code:
thekraken: The Kraken 0.7-pre11 (compiled Fri May 25 10:29:44 EDT 2012 by eylem@FAH-H8QGL)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2890
thekraken: launch binary: ./thekraken-FahCore_a5.exe
thekraken: config file: ./thekraken.cfg
thekraken: Forked 2891.
thekraken: child: ptrace(PTRACE_TRACEME) returns 0
thekraken: child: Executing...
thekraken: 2891: initial attach
thekraken: 2891: Continuing.
thekraken: 2891: logfile fd: 5
thekraken: waitpid() returns 2891 with status 0x0000057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: Continuing (unhandled trap).
thekraken: waitpid() returns 2892 with status 0x0000137f
thekraken: 2892: stopped with signal 0x00000013
thekraken: 2892: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2892
thekraken: 2891: binding 2892 to cpu 0
thekraken: 2891: talkative FahCore process identified (2892); listening to syscalls
thekraken: 2891: Continuing.
thekraken: waitpid() returns 2893 with status 0x0000137f
thekraken: 2893: stopped with signal 0x00000013
thekraken: 2893: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2893
thekraken: 2891: Continuing.
thekraken: waitpid() returns 2894 with status 0x0000137f
thekraken: 2894: stopped with signal 0x00000013
thekraken: 2894: Continuing.
thekraken: waitpid() returns 2891 with status 0x0003057f
thekraken: 2891: stopped with signal 0x00000005
thekraken: 2891: cloned 2894
thekraken: 2891: Continuing.
thekraken: 2890: (sighandler) got signal 0x00000001
thekraken: waitpid() returns 2894 with status 0x00000009
thekraken: 2894: terminated by signal 9
thekraken: 2894: ignoring clone termination
thekraken: waitpid() returns 2893 with status 0x00000009
thekraken: 2893: terminated by signal 9
thekraken: 2893: ignoring clone termination
thekraken: 2892: terminated by signal 9
thekraken: 2892: ignoring clone termination
thekraken: waitpid() returns 2891 with status 0x00000009
thekraken: 2891: terminated by signal 9

I hope it helps ;)
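
For anyone curious what the "binding 2892 to cpu 0" lines mean: thekraken pins each cloned FahCore thread to a single core. A minimal sketch of that idea using Python's Linux-only scheduler wrappers (the PID and CPU values here are just illustrative):

Code:
import os

def bind_to_cpu(pid, cpu):
    # Restrict the given process/thread to exactly one core.
    os.sched_setaffinity(pid, {cpu})

bind_to_cpu(os.getpid(), 0)               # pin ourselves to cpu 0
print(os.sched_getaffinity(os.getpid()))  # -> {0}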
 
This is -pre11, which doesn't contribute to the issue.

Logs look "normal" as in:
- first: core stopped manually (EDIT: nope, something else happened here -- do you have FAHlog.txt corresponding to this run?)
- second: core killed due to OOM

I'd need them from Kendrak :)
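
For reference, the raw waitpid() statuses in those logs decode as follows; a quick sketch using Python's standard wait-status macros (the three hex values are taken from the logs above):

Code:
import os

# Decode the waitpid() statuses printed by thekraken.
for status in (0x0000057f, 0x00006500, 0x00000009):
    if os.WIFSTOPPED(status):
        print("%#010x: stopped by signal %d" % (status, os.WSTOPSIG(status)))
    elif os.WIFEXITED(status):
        print("%#010x: exited with %d" % (status, os.WEXITSTATUS(status)))
    elif os.WIFSIGNALED(status):
        print("%#010x: terminated by signal %d" % (status, os.WTERMSIG(status)))

# 0x0000057f -> stopped by signal 5 (SIGTRAP, the ptrace stop)
# 0x00006500 -> exited with 101 (the first log's core exit)
# 0x00000009 -> terminated by signal 9 (SIGKILL, consistent with the OOM killer)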
 
I will restart and add the flag when I get home and see if the problem pops up again. I wanted to run SMP for a bit just to verify I was stable. I was hoping we didn't have another OC issue.

I also have other goodies to play with tonight if I have time.
 
You're stable, all right. Gotta start bringing BAs in so EVGA's gap stops narrowing :)
 
The problematic WUs I had were:

Project: 6903 (Run 3, Clone 8, Gen 117)
Project: 6904 (Run 2, Clone 19, Gen 84)

Thank you ChelseaOilman.

The WU (P6903,R3,C8,G117) has been reported as a bad WU. Note that WUs on the reported list are stopped daily at 8am Pacific time.

Thank you ChelseaOilman.

The WU (P6904,R2,C19,G84) has been reported as a bad WU. Note that WUs on the reported list are stopped daily at 8am Pacific time.

I reported both WUs as bad.
 
Let's always include log excerpts to make sure we don't report failures due to OC or whatnot...
I think it's good practice.
 