I have noticed this happening with some regularity over the last few days.
This is on a 4P SuperMicro with 4 x 6166HEs, OC with the BIOS at 245 and an OV of 1.0125V.
Temps on all cores are below 50C. HT-retries are at 0.
A 69xx will finish and upload with no problems. Then on restart I get this:
[20:10:16] Folding@Home Gromacs SMP Core
[20:10:16] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:10:16]
[20:10:16] Preparing to commence simulation
[20:10:16] - Looking at optimizations...
[20:10:16] - Not checking prior termination.
[20:10:23] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:10:23] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:10:24] - Digital signature verified
[20:10:24]
[20:10:24] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:10:24]
[20:10:24] Assembly optimizations on if available.
[20:10:24] Entering M.D.
[20:10:33] Mapping NT from 48 to 48
[20:10:38] Completed 0 out of 250000 steps (0%)
[20:25:38] Completed 2500 out of 250000 steps (1%)
[20:28:47] /O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.mdrun returned 3
[20:28:47] Gromacs detected an invalid checkpoint. Restarting...fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
.
.
.
.
.
.
.
.
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.Resuming from checkpoint
[20:28:47] fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.fcSaveRestoreState: I/O failed dir=0, var=00000000010F3810, varsize=21120
[20:28:47] Can't restore state.
[20:28:55] Folding@home Core Shutdown: UNKNOWN_ERROR
[20:28:56] CoreStatus = 62 (98)
[20:28:56] + Restarting core (settings changed)
[20:28:56]
[20:28:56] + Processing work unit
[20:28:56] Core required: FahCore_a5.exe
[20:28:56] Core found.
[20:28:56] Working on queue slot 03 [April 17 20:28:56 UTC]
[20:28:56] + Working ...
[20:28:56]
[20:28:56] *------------------------------*
[20:28:56] Folding@Home Gromacs SMP Core
[20:28:56] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[20:28:56]
[20:28:56] Preparing to commence simulation
[20:28:56] - Looking at optimizations...
[20:28:56] - Not checking prior termination.
[20:29:03] - Expanded 57245510 -> 71846524 (decompressed 50.4 percent)
[20:29:03] Called DecompressByteArray: compressed_data_size=57245510 data_size=71846524, decompressed_data_size=71846524 diff=0
[20:29:04] - Digital signature verified
[20:29:04]
[20:29:04] Project: 6903 (Run 1, Clone 16, Gen 46)
[20:29:04]
[20:29:04] Assembly optimizations on if available.
[20:29:04] Entering M.D.
[20:29:13] Mapping NT from 48 to 48
[20:29:18] Completed 0 out of 250000 steps (0%)
And repeat.
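To confirm that a log shows this same loop, one option is to count the failure markers. The sketch below is only an illustration: `summarize_log` is a hypothetical helper, not part of the FAH client, and it assumes the log text contains the same "Can't restore state." and "CoreStatus = ..." strings seen above.

```python
import re

# Marker strings copied from the log above; other core versions may word them differently.
RESTORE_FAILURE = "Can't restore state."
CORE_STATUS = re.compile(r"CoreStatus = (\w+) \((\d+)\)")


def summarize_log(text):
    """Count checkpoint-restore failures and collect CoreStatus codes from log text."""
    return {
        "restore_failures": text.count(RESTORE_FAILURE),
        "core_statuses": CORE_STATUS.findall(text),
    }
```

A high failure count followed by the same CoreStatus code on every pass would suggest the core keeps retrying the same bad checkpoint.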
So far the only remedy I have found is to wipe out the /work folder and the queue.dat and machinedependent.dat files.
After restarting FAH it runs fine, maybe for multiple WUs, and then this happens again.
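For anyone who wants to script that reset, here is a rough sketch. It moves the stale items into a timestamped backup folder instead of deleting them, so they can still be inspected afterwards. The helper name `reset_client_state` and the backup folder name are made up for illustration. It assumes the client is stopped and that the work folder and the two .dat files sit directly in the client directory, with the second file spelled machinedependent.dat.

```python
import shutil
import time
from pathlib import Path

# Items named in the post; the layout is an assumption, adjust to your install.
STALE_ITEMS = ("work", "queue.dat", "machinedependent.dat")


def reset_client_state(client_dir, backup_root=None):
    """Move stale client state into a timestamped backup folder.

    Returns the backup folder and the list of item names that were moved.
    """
    client_dir = Path(client_dir)
    backup_root = Path(backup_root) if backup_root else client_dir
    backup_dir = backup_root / time.strftime("fah-backup-%Y%m%d-%H%M%S")
    backup_dir.mkdir(parents=True, exist_ok=False)

    moved = []
    for name in STALE_ITEMS:
        src = client_dir / name
        if src.exists():  # skip anything the client has not created yet
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return backup_dir, moved
```

Run it only while the client is not running, for example `reset_client_state("/path/to/fah")`, then start the client again.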
I thought I had read about this error, or a similar one, discussed in another thread, but I could not locate it with search.
Any ideas?

