ZFS pool question

Chandler

Code:
nas4free: ~ # zpool status -v fastpool1
  pool: fastpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub canceled on Fri Feb 20 10:17:17 2015
config:

        NAME        STATE     READ WRITE CKSUM
        fastpool1   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     9
            da0     ONLINE       0     0    80
            da1     ONLINE       0     0    72
          mirror-1  ONLINE       0     0     5
            da2     ONLINE       0     0   105
            da3     ONLINE       0     0    73
          mirror-2  ONLINE       0     0     7
            da4     ONLINE       0     0    68
            da5     ONLINE       0     0    67
          mirror-3  ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0

errors: No known data errors


I have 4 mirrored pairs of 256GB SSDs in a pool I created. I had a bad power outage (tripped a breaker) while data was being written to and read from the pool. It takes about two hours to scrub the pool, and I get these errors returned each time. Do I need to manually delete the pool and re-create it? Does this show integrity issues with the disks themselves?

There is a 600GB ZFS volume on the pool, exported as an iSCSI LUN to an ESXi box.

I manually mounted the datastores in ESXi, since vSphere refused to see them on the LUN. I did this via shell access on the ESXi host. I was able to power on two virtual machines and they appear to be working fine - but I don't want to proceed with my project if I should just rebuild the array. The only thing important on the volume is the domain controller VM, but I can recreate it easily in an hour or so since I never promoted it to be an FSMO holder...

How can I further check the integrity of the data? If I clear the errors and run a new scrub, I get errors again.
 
If you do a zpool clear, rescrub, and the errors reappear, I think you need to recreate the pool from scratch and restore whatever you can. It's a good sign that it says there are no known data errors. But your data can be fine while the pool itself has problems, and you don't want to keep going like that.

If the errors don't reappear, you're fine. There was enough redundancy.
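
For reference, that clear-and-rescrub cycle is just three commands (pool name taken from the status output above):
Code:
zpool clear fastpool1       # reset the error counters
zpool scrub fastpool1       # re-read and verify every block in the pool
zpool status -v fastpool1   # see whether the CKSUM counters climb again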
 
If you are set to anything but sync=always, there is a good chance you lost data. Either roll back to a previous snapshot or restore from backups and recreate the pool.
 
If you are set to anything but sync=always, there is a good chance you lost data. Either roll back to a previous snapshot or restore from backups and recreate the pool.
Citation needed.
 
I have 4 mirrored pairs of 256GB SSDs in a pool I created. I had a bad power outage (tripped a breaker) while data was being written to and read from the pool

Also: do yourself a favor and invest in a UPS. This would have been a non-issue if you had one of those to take over and give the box time for a clean shutdown. (or enough time to reset the breaker)
 
Agreed, a UPS is really cheap insurance. I run a UPS on my ZFS server.

Cyberpower UPSes with AVR offer good value.
 
Do you have a snapshot from before the power issue?

No, I have no snapshots. When I was done setting everything up (hardware settings, installed images, iSCSI setup, etc.), I came in the next day, did two installations of Server 2012 R2 with AD DS, and called it a day.

If you do a zpool clear, rescrub, and the errors reappear, I think you need to recreate the pool from scratch and restore whatever you can. It's a good sign that it says there are no known data errors. But your data can be fine while the pool itself has problems, and you don't want to keep going like that.

If the errors don't reappear, you're fine. There was enough redundancy.

Maybe I will just export my VMs and hope they are okay, then recreate the pool.

ONLY the SSD pool had issues, and ONLY the MX100 drives - not the EVO 840s I had in the system. Interesting, right?

If you are set to anything but sync=always, there is a good chance you lost data. Either roll back to a previous snapshot or restore from backups and recreate the pool.

Since I am using a few mirrored pairs and they are SSDs over iSCSI, yes, I used sync=always - but there still appear to be some flipped bits going on..?
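
As a sanity check, the effective setting can be read back for every dataset and zvol in the pool:
Code:
zfs get -r sync fastpool1   # the zvol backing the iSCSI LUN should report sync=always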

Agreed, a UPS is really cheap insurance. I run a UPS on my ZFS server.

Cyberpower UPSes with AVR offer good value.

Also: do yourself a favor and invest in a UPS. This would have been a non-issue if you had one of those to take over and give the box time for a clean shutdown. (or enough time to reset the breaker)

I have a few APC Smart-UPS 1500 boxes for other servers - but that isn't enough to power all of this. I am looking to purchase an SMX3000LV + SMX120BP (extra battery), totaling around $1,600.00 for the package. Is there a cheaper alternative that I can create shutdown scripts for AND run at full load for 15-20 minutes? (1600 watts)
 
You have a chance of running into these types of problems if you use SSDs without power-loss protection.

The sync setting is mainly relevant for what sits on top of ZFS. In any case, ZFS has to trust the underlying devices, which are not always reliable if you use consumer SSDs.
 
Maybe I will just export my VMs and hope they are okay, then recreate the pool.
I suspect that will work, since there seems to be no data errors.

ONLY the SSD pool had issues, and ONLY the MX100 drives - not the EVO 840s I had in the system. Interesting, right?

Yes. My guess is the SSDs were rewriting completely unrelated blocks in the middle of the power failure, and this corrupted 'old' data. If it had corrupted 'new' data, ZFS would normally notice that upon restart and roll back a few seconds' worth of activity automatically. But it doesn't realize that SSDs can scramble old, 'safe' data, and of course, wouldn't know what to do anyway.

Other SSDs might fail in less bad ways, but the only way to be sure is to have a capacitor.
 
I suspect that will work, since there seems to be no data errors.



Yes. My guess is the SSDs were rewriting completely unrelated blocks in the middle of the power failure, and this corrupted 'old' data. If it had corrupted 'new' data, ZFS would normally notice that upon restart and roll back a few seconds' worth of activity automatically. But it doesn't realize that SSDs can scramble old, 'safe' data, and of course, wouldn't know what to do anyway.

Other SSDs might fail in less bad ways, but the only way to be sure is to have a capacitor.

At the time, I thought the MX100 had some sort of capacitor in there to finish writing data after a power loss - but I was mistaken, and I've already opened all the ones I purchased. Oh well, I can't go back now! I will just keep backups on cheaper spindles, as was my plan anyway.
 
I was trying to figure out what could have permanent checksum errors, yet show no data errors: the Free Space Map.

Possibly ZFS would assume an invalid section of the free space map to be full, meaning your pool has some amount of unusable space but could operate forever. The other possibility is that your system will crash when it attempts to allocate in the damaged area.

This also means the damage may have been to old or newly written data.
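
For anyone who wants to poke at that directly, zdb can dump per-metaslab space map information. A minimal sketch, read-only, and the verbosity flags vary a bit between platforms:
Code:
zdb -mm fastpool1   # metaslab and space map details for every vdev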
 
I was trying to figure out what could have permanent checksum errors, yet show no data errors: the Free Space Map.

Possibly ZFS would assume an invalid section of the free space map to be full, meaning your pool has some amount of unusable space but could operate forever. The other possibility is that your system will crash when it attempts to allocate in the damaged area.

This also means the damage may have been to old or newly written data.

I wish I had a before-and-after snapshot so I could compare file hashes, etc. Oh well. The only thing I can do now is re-create the pool and hope for the best. I didn't get to it today but will tomorrow.
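
For what it's worth, the rebuild itself is only a handful of commands. A rough sketch assuming the same four-mirror layout; the zvol name and size are placeholders, so adjust before running anything:
Code:
zpool destroy fastpool1
zpool create fastpool1 mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7
zfs create -V 600G fastpool1/vmvol   # zvol to back the iSCSI LUN (placeholder name)
zpool scrub fastpool1                # sanity check after copying the data back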
 
You could manually import the pool with a rollback to an older TXG as a last resort, but this may be detrimental to the pool health and is irreversible, so make block level copies first.
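
A sketch of what that last-resort import looks like. The -F recovery mode is standard; the -T <txg> rewind is undocumented on many builds, so treat it as an assumption and only try it against copies:
Code:
zpool export fastpool1
zpool import -F -n fastpool1              # dry run: reports whether discarding the last TXGs would help
zpool import -o readonly=on -F fastpool1  # read-only recovery import
# some builds also accept: zpool import -o readonly=on -F -T <txg> fastpool1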
 
It's also possible the checksum errors didn't overlap, e.g. block A was bad on one side of a mirror and block B on the other side. I thought I remembered that if all copies were bad, it would tell you the specific files that were affected? Maybe only after a scrub, though?
 
It's also possible the checksum errors didn't overlap, e.g. block A was bad on one side of a mirror and block B on the other side. I thought I remembered that if all copies were bad, it would tell you the specific files that were affected? Maybe only after a scrub, though?

Yeah, three of the mirrored pairs had errors and would not repair themselves accordingly. I was hoping it would resolve itself in a scrub, but that did not work. I am just recreating it today - hopefully the drives themselves don't have any issues. I can't imagine why they would...
 
I have a few APC Smart-UPS 1500 boxes for other servers - but that isn't enough to power all of this. I am looking to purchase an SMX3000LV + SMX120BP (extra battery), totaling around $1,600.00 for the package. Is there a cheaper alternative that I can create shutdown scripts for AND run at full load for 15-20 minutes? (1600 watts)

While I don't know your full load details, Cyberpower and Tripp Lite can often beat APC pricing.


For example,

The PR3000LCDRTXL2U may do the job
CyberPower Selector

Here may be the winner, ~$970
Tripp Lite 3000VA SmartPro Rackmount UPS System (SM3000RM2U)

See the Tripp Lite product selector.
 
I did purchase a Cyberpower product (OL3000RTXL2U + BP72V60ART2U + RMCARD303). I am plugging it into a 110V/30A circuit via a NEMA 5-30P plug.

I destroyed the pool, recreated it, set up a zvol and an iSCSI target, connected it to ESXi, created a datastore, ran a scrub (no issues), then copied a 40GB VM to the datastore using WinSCP (SFTP).

I powered on the VM, began copying a second VM, and ran a scrub (while the second VM was still copying).

The scrub was running slow, 1.2Mb/s. About thirty minutes after the last time I ran zpool status ssdpool, I got this:

Code:
  pool: ssdpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Thu Feb 26 05:53:41 2015
        5.57G scanned out of 64.6G at 4.18M/s, 4h0m to go
        228K repaired, 8.63% done
config:

        NAME        STATE     READ WRITE CKSUM
        ssdpool     DEGRADED     0     0     0
          mirror-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0  (repairing)
            da1     ONLINE       0     0     0  (repairing)
          mirror-1  DEGRADED     0     0     0
            da2     FAULTED      6     4     0  too many errors  (repairing)
            da3     ONLINE       0     0     0  (repairing)
          mirror-2  ONLINE       0     0     0
            da4     ONLINE       0     0     0  (repairing)
            da5     ONLINE       0     0     0  (repairing)
          mirror-3  DEGRADED     0     0     0
            da6     FAULTED    140   127     1  too many errors
            da7     ONLINE       0     0     0

errors: No known data errors

What is going on? How do multiple mirrors fail like this? I understand one drive having an error and the mirror then attempting to repair itself. But why are the drives with no errors stating they are being repaired?
 
Scrub finished:

Code:
nas4free: ~ # zpool status ssdpool
  pool: ssdpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 536K in 1h3m with 0 errors on Thu Feb 26 06:56:43 2015
config:

        NAME        STATE     READ WRITE CKSUM
        ssdpool     DEGRADED     0     0     0
          mirror-0  ONLINE       0     0     0
            da0     ONLINE       0     0     1
            da1     ONLINE       0     0     0
          mirror-1  DEGRADED     0     0     0
            da2     FAULTED      6     4     0  too many errors
            da3     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     1
          mirror-3  DEGRADED     0     0     0
            da6     FAULTED    140   127     1  too many errors
            da7     ONLINE       0     0     0

errors: No known data errors
 
So many parallel, distributed errors on a newly created pool? You should check the SMART data of the disks and your power supply. Maybe the sudden power loss damaged the PSU - unlikely if it is a quality unit, but I had a dead Seasonic PSU after a blackout some years ago.
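
Pulling the SMART data on NAS4Free is a one-liner per disk:
Code:
smartctl -a /dev/da2   # full SMART report; repeat for da0 through da7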
 
da2
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX100/M500/M510/M550 Client SSDs
Device Model:     Crucial_CT256MX100SSD1
Serial Number:    14370D3DB535
LU WWN Device Id: 5 00a075 10d3db535
Firmware Version: MU01
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 26 08:30:41 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  445) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   3) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       822
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2159
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       1
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   054   000    Old_age   Always       -       36 (Min/Max 24/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       41973
202 Percent_Lifetime_Used   0x0031   100   100   000    Pre-fail  Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       341572795
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       11497234
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       66106230

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)       Completed without error       00%       819         -
# 2  Vendor (0xff)       Completed without error       00%       796         -
# 3  Vendor (0xff)       Completed without error       00%       676         -
# 4  Vendor (0xff)       Completed without error       00%       621         -
# 5  Vendor (0xff)       Completed without error       00%       576         -
# 6  Vendor (0xff)       Completed without error       00%       548         -
# 7  Vendor (0xff)       Completed without error       00%       504         -
# 8  Vendor (0xff)       Completed without error       00%       440         -
# 9  Vendor (0xff)       Completed without error       00%       413         -
#10  Vendor (0xff)       Completed without error       00%       371         -
#11  Vendor (0xff)       Completed without error       00%       327         -
#12  Vendor (0xff)       Completed without error       00%       284         -
#13  Vendor (0xff)       Completed without error       00%       240         -
#14  Vendor (0xff)       Completed without error       00%       197         -
#15  Vendor (0xff)       Completed without error       00%       142         -
#16  Vendor (0xff)       Completed without error       00%        99         -
#17  Vendor (0xff)       Completed without error       00%        79         -
#18  Vendor (0xff)       Completed without error       00%        48         -
#19  Short offline       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


da6

Code:
=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 EVO 250GB
Serial Number:    S21NNSAFC16514Z
LU WWN Device Id: 5 002538 da0009afa
Firmware Version: EMT01B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 26 08:30:41 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 133) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       821
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       50
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   068   061   000    Old_age   Always       -       32
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   087   087   000    Old_age   Always       -       12454
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       12
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       240439216

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

There are just the two drives that show as faulted.
The other drives all passed as well. I reviewed each one paying special attention to these two. Nothing seems fishy to me.
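
Since the EVO's log shows no self-tests at all, it might be worth queueing one and reading the log back afterwards; a minimal example:
Code:
smartctl -t short /dev/da6      # ~2 minute short self-test per the drive's own estimate
smartctl -l selftest /dev/da6   # check the result once it completes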
 
I just cleared the errors and ran another scrub; no more faults after it resilvered...? That seems odd to me. But I did get checksum errors.

Code:
nas4free: ~ # zpool status -x
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 30.9G in 0h3m with 0 errors on Thu Feb 26 08:48:41 2015
config:

        NAME        STATE     READ WRITE CKSUM
        ssdpool     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da2     ONLINE       3     0   259
            da3     ONLINE       0     0     2
          mirror-2  ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     1
          mirror-3  ONLINE       0     0     0
            da6     ONLINE       0     0   207
            da7     ONLINE       0     0     0

errors: No known data errors
 
If you've never had these problems before, then either the power failure or something you did after that is causing this.

Are the SSDs connected individually, or through cables/enclosures? Have you run a memory test?

Notice those two SSD SMART results show lots of UDMA CRC errors.
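
A quick way to watch just that counter across all eight disks (simple shell loop, device names assumed from the pool layout):
Code:
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
  echo -n "$d  "; smartctl -A /dev/$d | grep UDMA_CRC_Error_Count
done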
 
Yes, the UDMA CRC errors are way too high for a single power loss, and they are probably still increasing if you take the checksum errors into account.
Check SATA cables, power cables, and the PSU.
It could also be the controller or controller ports that have issues.
 
I believe we lost power two or three times total when the breakers tripped - I wasn't here for it - the box didn't really have any critical in-use data on it.

The SSDs are in a Dell T610 using the hot-swap bays. The power there checks out okay; I was able to check with a multimeter.

The console has some error messages when scrubbing/reading data:

[Three console screenshots of the error messages were attached here.]
 
You have a couple of drives that seem good, and a couple that seem bad. Shut the system down, swap them between drive bays, and see whether the problem moves.

Why did the breakers trip? And why two or three times? Possibly your PSU is damaged in a way that is not apparent on the multimeter, but is causing problems for the SSD electronics.
 
If your hardware is hot plug capable:
- pool export
- move disk
- pool import

- scrub

On a production server or with critical data:
- move the pool to another working/tested machine, scrub there
to decide if the disks or the server is the problem.
 
If your hardware is hot plug capable:
- pool export
- move disk
- pool import

- scrub

On a production server or with critical data:
- move the pool to another working/tested machine, scrub there
to decide if the disks or the server is the problem.

Okay, I have done the export and swap.

Before Export or drive swap:
Code:
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 11.5M in 0h46m with 0 errors on Thu Feb 26 10:29:48 2015
config:

	NAME        STATE     READ WRITE CKSUM
	ssdpool     ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     3
	    da0     ONLINE       0     0    37
	    da1     ONLINE       0     0    21
	  mirror-1  ONLINE       0     0     3
	    da2     ONLINE       0     0    14
	    da3     ONLINE       0     0    12
	  mirror-2  ONLINE       0     0     2
	    da4     ONLINE       0     0    14
	    da5     ONLINE       0     0    11
	  mirror-3  ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da7     ONLINE       0     0     0

errors: No known data errors

I unmounted the datastore in ESXi and disabled iSCSI.

Now I execute (via SSH):
Code:
zpool export ssdpool
Code:
zpool export saspool
Successful, no error returned.
I exported saspool so I had extra bays in the external storage box in case I need to put some of the SSDs in there. I am just going to swap the drives around internally first.

I swapped the drives around, offsetting each drive by one bay (moved 0 to 1, 1 to 2, etc.). All 8 bays are full.
Execute:
Code:
zpool import ssdpool
Successful, no error returned.

Status output directly after import:
Code:
  pool: ssdpool
 state: ONLINE
  scan: scrub repaired 11.5M in 0h46m with 0 errors on Thu Feb 26 10:29:48 2015
config:

	NAME        STATE     READ WRITE CKSUM
	ssdpool     ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	  mirror-2  ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	  mirror-3  ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da1     ONLINE       0     0     0

errors: No known data errors

Now I execute a scrub:
Code:
zpool scrub ssdpool

Status during Scrub:
Code:
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 27 08:54:16 2015
        6.14G scanned out of 85.3G at 17.2M/s, 1h18m to go
        1.65M repaired, 7.20% done
config:

	NAME        STATE     READ WRITE CKSUM
	ssdpool     ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     1  (repairing)
	    da6     ONLINE       0     0     1  (repairing)
	  mirror-1  ONLINE       0     0     0
	    da3     ONLINE       0     0     0  (repairing)
	    da5     ONLINE       0     0     1  (repairing)
	  mirror-2  ONLINE       0     0     0
	    da4     ONLINE       0     0     1  (repairing)
	    da2     ONLINE       0     0     2  (repairing)
	  mirror-3  ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da1     ONLINE       0     0     0

errors: No known data errors
Status after scrub:
Code:
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 17.5M in 1h7m with 0 errors on Fri Feb 27 10:01:17 2015
config:

        NAME        STATE     READ WRITE CKSUM
        ssdpool     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     4
            da0     ONLINE       0     0    20
            da6     ONLINE       0     0    19
          mirror-1  ONLINE       0     0     2
            da3     ONLINE       0     0    13
            da5     ONLINE       0     0    10
          mirror-2  ONLINE       0     0     0
            da4     ONLINE       1     0     9
            da2     ONLINE       0     0    10
          mirror-3  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da1     ONLINE       0     0     0

errors: No known data errors

SMART data:
CRC errors are now 1,760 higher than when I originally posted the SMART data for this drive.

OLD
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       822
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       40
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2159
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       1
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   054   000    Old_age   Always       -       36 (Min/Max 24/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       41973
202 Percent_Lifetime_Used   0x0031   100   100   000    Pre-fail  Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       341572795
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       11497234
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       66106230

NEW
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX100/M500/M510/M550 Client SSDs
Device Model:     Crucial_CT256MX100SSD1
Serial Number:    14370D3DB535
LU WWN Device Id: 5 00a075 10d3db535
Firmware Version: MU01
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Feb 27 09:03:28 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		( 1190) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   3) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       847
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       42
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       41
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2159
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   054   000    Old_age   Always       -       35 (Min/Max 24/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       43733
202 Percent_Lifetime_Used   0x0031   100   100   000    Pre-fail  Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       373788971
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       12504232
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       66183663


da6, now da7:
CRC errors are now 1,592 higher than when I originally posted the SMART data for this drive.
Old

Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       821
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       50
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   068   061   000    Old_age   Always       -       32
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   087   087   000    Old_age   Always       -       12454
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       12
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       240439216

SMART Error Log Version: 1
No Errors Logged

NEW
Code:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       845
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       51
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   064   061   000    Old_age   Always       -       36
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   086   086   000    Old_age   Always       -       14046
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       13
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       273638386

SMART Error Log Version: 1
No Errors Logged
 
UDMA CRC errors usually mean a faulty connection between drive and motherboard/controller.
Those increased UDMA CRC errors could still be from before rotating the drives. You should check for UDMA CRC errors on the drives that are now connected to the SATA ports that were previously connected to the drives which had the errors.

If the new drives are now affected, the best case would be a loose SATA cable.
You can try a couple of things to find the fault by checking whether the CRC error count still rises after each step:
- reattach the cable (loose connection)
- swap in a cable that works without errors elsewhere (defective cable)
- attach the SATA/SAS cable to a different working port on the motherboard/controller (defective SATA port)
If the CRC count still rises on the new drives after those steps, it probably means the backplane has a fault or the SATA connector on the backplane is not clean.

If the new drives are not affected but the old drives still show rising CRC errors, the hot-swap drive caddy might be keeping the drive from making good contact with the SATA backplane. You could try re-seating the drive in the caddy or, better, swapping the caddy with one from an error-free drive. If that doesn't help, the best case now would be a dirty SATA connector on the drive.
 
I just rotated the drives in the hot-swap bays on the server. I did not change how they interface with the motherboard. There is a backplane going to an LSI SAS9207-8i card. They were originally connected to a PERC H700, but that won't work with ZFS since it can't be cross-flashed and only works in RAID mode.

I will try to add them to the external storage bay now. Unfortunately if I do that I will WAY cap my bandwidth on that adapter due to the expander in the 3U SGI Rackable (SE3016) box.

I have the SE3016 enclosure attached to an LSI SAS9200-8i (6Gb/s). I could always get another one and replace the expander too... but they aren't 2.5" drive friendly, so I have to get adapters...
 
I swapped over to my SE3016

I exported the pool first:
Code:
zpool export ssdpool
I swapped out the drives, then imported:
Code:
zpool import ssdpool
I ran a status: no errors...
I ran a scrub; here are the results after it finished:
Code:
  pool: ssdpool
 state: ONLINE
  scan: scrub repaired 0 in 0h11m with 0 errors on Mon Mar 2 06:23:41 2015
config:

        NAME        STATE     READ WRITE CKSUM
        ssdpool     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da7     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da6     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors

Interesting how the errors went away altogether. I expected to have to rebuild the pool. Should I destroy it and rebuild it anyway?
 
You can't fake a clean scrub. At that moment, your pool is fine.

Possibly the data is being corrupted as it is read out of the SSDs. Could be the SSDs, could be the power, could be your backplane, cables, or controller/ports. Only testing can answer the question.
 
You have some problems with your hardware, that's for sure. Your data should be safe if the scrub is clean.

Be happy you use ZFS, because it detects the slightest problem. If you had used NTFS, you would never have been notified and your data would have been slowly but surely corrupted. Read this story, which is quite similar to yours:
https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta

"... I've been using ZFS internally for awhile now. For someone who used to administer several machines with Solaris Volume Manager (SVM), UFS, and a pile of aging JBOD disks, my experience so far is easily summed up: "Dude this so @#%& simple, so reliable, and so much more powerful, how did I never live without it??"

So, you can imagine my excitement when ZFS finally hit the gate. The very next day I BFU'ed my workstation, created a ZFS pool, setup a few filesystems and (four commands later, I might add) started beating on it.

Imagine my surprise when my machine stayed up less than two hours!!

No, this wasn't a bug in ZFS... it was a fatal checksum error. One of those "you might want to know that your data just went away" sort of errors. Of course, I had been running UFS on this disk for about a year, and apparently never noticed the silent data corruption. But then I reached into the far recesses of my brain, and I recalled a few strange moments -- like the one time when I did a bringover into a workspace on the disk, and I got a conflict on a file I hadn't changed. Or the other time after a reboot I got a strange panic in UFS while it was unrolling the log. At the time I didn't think much of these things -- I just deleted the file and got another copy from the golden source -- or rebooted and didn't see the problem recur -- but it makes sense to me now. ZFS, with its end-to-end checksums, had discovered in less than two hours what I hadn't known for almost a year -- that I had bad hardware, and it was slowly eating away at my data.

Figuring that I had a bad disk on my hands, I popped a few extra SATA drives in, clobbered the disk and this time set myself up a three-disk vdev using raidz. I copied my data back over, started banging on it again, and after a few minutes, lo and behold, the checksum errors began to pour in:

Code:
elowe@oceana% zpool status
  pool: junk
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        junk        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     1
            c3d1    ONLINE       0     0     0

A checksum error on a different disk! The drive wasn't at fault after all.
I emailed the internal ZFS interest list with my saga, and quickly got a response. Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? A faulty power supply.

Since my data is still intact, and the performance isn't hampered at all, I haven't bothered to fix the problem yet. I've been running over a week now with a faulty setup which is still corrupting data on its way to the disk, and have yet to see a problem with my data, since ZFS handily detects and corrects these errors on the fly.

Eventually I suppose I'll get around to replacing that faulty power supply... "
 
SafelyRemove has it right. It sounds to me like you may have a weak controller someplace that the power problems brought to light. I would look into controller cards, backplanes, and power paths.
 
I am going to have to eliminate the backplane, then the controller, then the bus, I suppose.

I have 3 15.7K SAS drives that are not being used. I will put data on them and see what happens.

Thanks guys!
 