Repetitive error after completed WU

Core32 · Apr 17, 2012

I have noticed this happening with some regularity over the last few days.
This is on a 4P SuperMicro with 4 x 6166HEs, OC with the BIOS at 245 and an OV of 1.0125V.
Temps on all cores are below 50C. HT-retries are at 0.
A 69xx will finish and upload with no problems. Then on restart I get this:

[20:10:16] Folding@Home Gromacs SMP Core
[20:10:16] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:10:16]
[20:10:16] Preparing to commence simulation
[20:10:16] - Looking at optimizations...
[20:10:16] - Not checking prior termination.
[20:10:23] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:10:23] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:10:24] - Digital signature verified
[20:10:24]
[20:10:24] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:10:24]
[20:10:24] Assembly optimizations on if available.
[20:10:24] Entering M.D.
[20:10:33] Mapping NT from 48 to 48
[20:10:38] Completed 0 out of 250000 steps (0%)
[20:25:38] Completed 2500 out of 250000 steps (1%)
[20:28:47] /O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.mdrun returned 3
[20:28:47] Gromacs detected an invalid checkpoint. Restarting...fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.Resuming from checkpoint
[20:28:47] fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.
[20:28:55] Folding@home Core Shutdown: UNKNOWN_ERROR
[20:28:56] CoreStatus = 62 (98)
[20:28:56] + Restarting core (settings changed)
[20:28:56]
[20:28:56] + Processing work unit
[20:28:56] Core required: FahCore_a5.exe
[20:28:56] Core found.
[20:28:56] Working on queue slot 03 [April 17 20:28:56 UTC]
[20:28:56] + Working ...
[20:28:56]
[20:28:56] *------------------------------*
[20:28:56] Folding@Home Gromacs SMP Core
[20:28:56] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:28:56]
[20:28:56] Preparing to commence simulation
[20:28:56] - Looking at optimizations...
[20:28:56] - Not checking prior termination.
[20:29:03] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:29:03] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:29:04] - Digital signature verified
[20:29:04]
[20:29:04] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:29:04]
[20:29:04] Assembly optimizations on if available.
[20:29:04] Entering M.D.
[20:29:13] Mapping NT from 48 to 48
[20:29:18] Completed 0 out of 250000 steps (0%)

And repeat.
So far the only remedy I have found is to wipe out the /work folder and the queue.dat and machinedependent.dat files.
Then, on restarting fah, it runs fine for a while, maybe for multiple WUs, and then this happens again.
I thought I had read about this error, or a similar one, in another thread but could not locate it with search.
Any ideas?
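
For reference, the workaround described above can be sketched as a small shell function. This is only an illustration, not part of the client: the name reset_fah_state is hypothetical, the client is assumed to be stopped, and the file is assumed to be spelled machinedependent.dat on disk. It moves the state aside instead of deleting it, so the broken files can still be inspected later.

```shell
#!/bin/sh
# Hypothetical helper (not part of the FAH client): move the client's
# state files into a timestamped backup directory instead of deleting them.
# Assumes the client is stopped and $1 is the client's install directory.
reset_fah_state() {
    fah_dir=$1
    backup_dir="$fah_dir/state-backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    # The three items named in the post; missing ones are skipped.
    for item in work queue.dat machinedependent.dat; do
        if [ -e "$fah_dir/$item" ]; then
            mv "$fah_dir/$item" "$backup_dir/" || return 1
        fi
    done
    # Print where the old state went so it can be examined or removed.
    echo "$backup_dir"
}
```

Usage would be something like `reset_fah_state ~/fah` (the path is an assumption), followed by starting the client again.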

tear · Apr 17, 2012

Very odd error.

dir=0 means reading.

What I find strange is that it attempts to read during normal
operation... (something I wouldn't expect -- all state information
should already be in memory).

Any I/O or filesystem errors in 'dmesg'? Does 'df -hT' report
enough amount of disk space?

Resumption from broken checkpoint yields similar messages
but it's clear it's not what you're doing here.... (or are you?)

Messages don't make much sense to me. Unit starts at 0%
on one hand but tries to resume from checkpoint on the other?

No idea...

Core32 · Apr 17, 2012

tear said:
Very odd error.

dir=0 means reading.

What I find strange is that it attempts to read during normal
operation... (something I wouldn't expect -- all state information
should already be in memory).

Any I/O or filesystem errors in 'dmesg'? Does 'df -hT' report
enough amount of disk space?

Resumption from broken checkpoint yields similar messages
but it's clear it's not what you're doing here.... (or are you?)

Messages don't make much sense to me. Unit starts at 0%
on one hand but tries to resume from checkpoint on the other?

No idea...

Will this command suffice to find errors? dmesg | grep "error" ?
df -hT shows 133G available.
This IS the HDD that I cloned from my other 4P rig .......

extide · Apr 17, 2012

Sounds like a disk error. Anything ugly looking in dmesg?

Core32 · Apr 17, 2012

dmesg has a LARGE output.
Anything in particular I should grep for?
Like I mentioned above?
I do see this one line in dmesg: [ 9.606817] PM: Resume from disk failed.

tear · Apr 17, 2012

Skip the grep, Core32. Just run 'dmesg' and go back starting from the end.

Like extide says, errors should be evident...(you'll be able to tell right away that
they don't belong there).

extide · Apr 17, 2012

Yeah, it's a log from the entire time the box has been booted up. A recent error will be in the last few lines. The number in brackets at the left is a timestamp, seconds since boot.
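
Since the bracketed number is seconds since boot, one way to look only at recent kernel messages is to filter on it. A minimal sketch, assuming the default dmesg timestamp format shown in this thread; the function name recent_kmsg is made up for illustration:

```shell
#!/bin/sh
# Hypothetical filter: read dmesg-style lines on stdin and print only those
# whose "[ seconds.micros]" timestamp is greater than the value given as $1.
# Lines without a leading timestamp are skipped.
recent_kmsg() {
    awk -v since="$1" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2)   # text between the brackets
            if (ts + 0 > since + 0) print
        }'
}
```

For example, `dmesg | recent_kmsg 3600` would show only messages logged more than an hour after boot. Compare the cutoff against the uptime in seconds (first field of /proc/uptime) to pick a sensible value.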

Core32 · Apr 17, 2012

Nothing stands out. Lots of lines about ECC being disabled:

EDAC MC: Ver: 2.1.0 Sep 19 2010
[ 18.841230] EDAC amd64_edac: Ver: 3.3.0 Sep 19 2010
[ 18.850348] EDAC amd64: This node reports that Memory ECC is currently disabled, set F3x44[22] (0000:00:18.3).
[ 18.850937] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[ 18.850938] Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
[ 18.850940] (Note that use of the override may cause unknown side effects.)
[ 18.851166] amd64_edac: probe of 0000:00:18.2 failed with error -22

I have re-booted since the last failure if that matters.

extide · Apr 17, 2012

When a failure occurs, check in there.

You can use dmesg -c to clear it so each time you run dmesg -c it will only show new lines.
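
To know when to look, something like the following could count the failure lines in the client log. A rough sketch only: count_restore_failures is a hypothetical name, and the log file name in the usage line (FAHlog.txt) is an assumption, so point it at whatever file your client writes.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of the given log file contain
# the I/O failure message quoted in the first post.
count_restore_failures() {
    grep -c 'fcSaveRestoreState: I/O failed' "$1"
}
```

Usage might look like `[ "$(count_restore_failures FAHlog.txt)" -gt 0 ] && dmesg | tail -n 40`, which prints the last kernel messages only once the error has shown up in the client log.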

Pocatello · Apr 17, 2012

Have you tried turning it off and on again?

Core32 · Apr 18, 2012

Pocatello said:
Have you tried turning it off and on again?

Yes. And a restart, and also a controlled shutdown and power-on cycle.
All of those "clear" the issue but it comes back again, eventually.
If it reappears after this round I will format and rebuild the drive without doing the clone.
Re-install everything from scratch, that is.
The cloning did not go as smoothly as I expected, and then I created some files/folders as root instead of as user, which caused some R/W permissions problems as well.

Pocatello · Apr 18, 2012

Hmm.

http://www.youtube.com/watch?v=nn2FB1P_Mn8
