Repetitive error after completed WU

Core32

I have noticed this happening with some regularity over the last few days.
This is on a 4P SuperMicro with 4 x 6166HEs, OC with the BIOS at 245 and an OV of 1.0125V.
Temps on all cores are below 50C. HT-retries are at 0.
A 69xx will finish and upload with no problems. Then on restart I get this:
[20:10:16] Folding@Home Gromacs SMP Core
[20:10:16] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:10:16]
[20:10:16] Preparing to commence simulation
[20:10:16] - Looking at optimizations...
[20:10:16] - Not checking prior termination.
[20:10:23] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:10:23] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:10:24] - Digital signature verified
[20:10:24]
[20:10:24] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:10:24]
[20:10:24] Assembly optimizations on if available.
[20:10:24] Entering M.D.
[20:10:33] Mapping NT from 48 to 48
[20:10:38] Completed 0 out of 250000 steps (0%)
[20:25:38] Completed 2500 out of 250000 steps (1%)
[20:28:47] /O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.mdrun returned 3
[20:28:47] Gromacs detected an invalid checkpoint. Restarting...fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.Resuming from checkpoint
[20:28:47] fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.
[20:28:55] Folding@home Core Shutdown: UNKNOWN_ERROR
[20:28:56] CoreStatus = 62 (98)
[20:28:56] + Restarting core (settings changed)
[20:28:56]
[20:28:56] + Processing work unit
[20:28:56] Core required: FahCore_a5.exe
[20:28:56] Core found.
[20:28:56] Working on queue slot 03 [April 17 20:28:56 UTC]
[20:28:56] + Working ...
[20:28:56]
[20:28:56] *------------------------------*
[20:28:56] Folding@Home Gromacs SMP Core
[20:28:56] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:28:56]
[20:28:56] Preparing to commence simulation
[20:28:56] - Looking at optimizations...
[20:28:56] - Not checking prior termination.
[20:29:03] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:29:03] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:29:04] - Digital signature verified
[20:29:04]
[20:29:04] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:29:04]
[20:29:04] Assembly optimizations on if available.
[20:29:04] Entering M.D.
[20:29:13] Mapping NT from 48 to 48
[20:29:18] Completed 0 out of 250000 steps (0%)

And repeat.
So far the only remedy I have found is to wipe out the /work folder and the queue.dat and machinedependent.dat files.
Then when I restart fah it runs fine, maybe for multiple WUs, and then this happens again.
I thought I had read about this error, or a similar one, in another thread, but I could not locate it with search.
Any ideas?
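
For anyone wanting to repeat that reset without losing the evidence, here is a minimal sketch that moves the state aside instead of deleting it. The function name reset_fah_state and the backup folder name are made up for illustration, the client directory is whatever you pass in, and the file names are taken from the post above, so check the exact spelling in your own client directory first.

```shell
# Sketch only: move the client's work state into a timestamped backup folder.
# $1 = client directory (hypothetical; pass wherever fah lives on your box).
# File names come from the post above; verify them before relying on this.
reset_fah_state() {
  dir=$1
  backup="$dir/state-backup-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$backup" || return 1
  for item in work queue.dat machinedependent.dat; do
    # Skip anything that is not there, move the rest into the backup.
    if [ -e "$dir/$item" ]; then
      mv "$dir/$item" "$backup/" || return 1
    fi
  done
  # Print where the old state went so it can be inspected or restored.
  echo "$backup"
}
```

Stop the client before running it, then call it with the client directory as the only argument. Keeping the old /work folder around means a broken checkpoint file can still be looked at afterwards.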
 
Very odd error.

dir=0 means reading.

What I find strange is that it attempts to read during normal
operation... (something I wouldn't expect -- all state information
should already be in memory).

Any I/O or filesystem errors in 'dmesg'? Does 'df -hT' report
enough free disk space?

Resumption from broken checkpoint yields similar messages
but it's clear it's not what you're doing here.... (or are you?)

Messages don't make much sense to me. Unit starts at 0%
on one hand but tries to resume from checkpoint on the other?


No idea...
 
Will this command suffice to find errors? dmesg | grep "error" ?
df -hT shows 133G available.
This IS the HDD that I cloned from my other 4P rig .......
 
dmesg has a LARGE output.
Anything in particular I should grep for?
Like I mentioned above?
I do see this one line in dmesg: [ 9.606817] PM: Resume from disk failed.


 
Skip the grep, Core32. Just run 'dmesg' and go back starting from the end.

Like extide says, errors should be evident...(you'll be able to tell right away that
they don't belong there).
 
Yeah, it's a log from the entire time the box has been booted up. A recent error will be in the last few lines. The number in brackets at the left is a timestamp, seconds since boot.
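
Since the bracketed number is seconds since boot, one way to cut the output down is to keep only lines newer than some point. A rough sketch, assuming the timestamps look like the ones quoted in this thread; dmesg_since is a made-up helper name, not a real dmesg option:

```shell
# Sketch only: keep kernel-log lines whose [seconds-since-boot] stamp is >= $1.
# Reads dmesg-style text on stdin. Lines without a leading timestamp are dropped.
dmesg_since() {
  awk -v min="$1" '
    match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
      # Strip the surrounding brackets and convert to a number.
      ts = substr($0, 2, RLENGTH - 2) + 0
      if (ts >= min) print
    }'
}
```

Used as `dmesg | dmesg_since 3600`, that would show only what was logged after the first hour of uptime. Plain `dmesg | tail -n 50` does much the same job if you just want the most recent lines.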
 
Nothing stands out. Lots of lines about ECC being disabled:
EDAC MC: Ver: 2.1.0 Sep 19 2010
[ 18.841230] EDAC amd64_edac: Ver: 3.3.0 Sep 19 2010
[ 18.850348] EDAC amd64: This node reports that Memory ECC is currently disabled, set F3x44[22] (0000:00:18.3).
[ 18.850937] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[ 18.850938] Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
[ 18.850940] (Note that use of the override may cause unknown side effects.)
[ 18.851166] amd64_edac: probe of 0000:00:18.2 failed with error -22

I have re-booted since the last failure if that matters.
 
When a failure occurs, check in there.

You can use dmesg -c to clear it so each time you run dmesg -c it will only show new lines.
 
Have you tried turning it off and on again?
 
Yes. And a restart, and also a controlled shutdown and power-on cycle.
All of those "clear" the issue but it comes back again, eventually.
If it reappears after this round I will format and rebuild the drive without doing the clone.
Re-install everything from scratch, that is.
The cloning did not go as smoothly as I expected, and then I created some files/folders as root instead of as user, which caused some R/W permission problems as well.
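
If leftover root-owned files are a suspect, a quick check along these lines might help before reformatting. It is only a sketch: list_suspect_files is a made-up name, the directory is whatever you pass in, and there is no proof in this thread that permissions are what causes the I/O failure.

```shell
# Sketch only: list entries under $1 that are not owned by the current user,
# or that lack owner read/write permission (octal 600 bits).
list_suspect_files() {
  dir=$1
  find "$dir" \( ! -user "$(id -un)" -o ! -perm -600 \) -print
}
```

Run it as the folding user against the client directory. Anything it prints is a file or folder the client may not be able to read or write.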

 