Extremely high SMART failure rate in my FreeNAS

Abhaxus

I have the following setup built in January of this year:

FX-8320
Asus Crosshair V Formula Z
32GB ECC RAM
5x Seagate 4TB NAS drives
1x Seagate 4TB DM drive pulled from an external case
4x Seagate 500GB 7200.11 drives
Fractal Define R4
Plenty of airflow, room has dedicated A/C
Corsair CX750M
APC NS1250 UPS

I have now had 3 of the 5 4TB NAS drives report pending sectors and fail self tests. Two have been replaced by Seagate under warranty, and the third has been pulled and will be RMA'd tomorrow. Is there anything I could be doing wrong or have set up incorrectly? None of the drives are reporting high temps. I have the box set to do short tests and long tests every 3 days, and the pool is scrubbed every 14th and 29th of the month.

I guess the next question is whether pending/offline uncorrectable sectors are even worth an RMA inquiry. It was only 8 logical sectors (one physical) in the two most recent failures. The first drive that failed reached 32 within a few days but never failed a short or long test. The most recent drive actually started giving warnings a few months ago, but after a reboot and a scrub all the counts returned to 0, so I didn't RMA it. Then yesterday it warned again, and it failed a long test this morning.

I emailed Seagate when I did the second RMA and they said they would gladly swap it if I sent them a smartctl output. I haven't had a problem with the RMAs yet, but I am getting pretty annoyed by the shipping costs, and obviously I'm worried that something besides bad drives is causing these errors. A 50% failure rate is nuts.
 
How were these shipped to you? Were they OEM drives or retail? Did the packaging seem lackluster?

Are all the drives from the same batch or different batches? If they all have the same manufacture dates, it could have been a bad batch or a bad delivery.

Just some thoughts.
 
" long tests every 3 days, and the pool is scrubbed every 14th and 29th of the month. "

As an outside observer (I don't own a NAS), isn't this a lot of extra wear and tear on those drives? Not that it's in any way related to your problems.
 
-All OEM drives. Two of the drives were from the same batch. All were shipped in the standard Seagate packaging.
-I am following the standard recommendations from the FreeNAS forums for scrubs and self tests.
-The point about the PSU is what I am worried about. I know it gets terrible reviews. Could that cause such minor pre-fail issues?
 
I have now had 3 of the 5 4TB NAS drives report pending sectors and fail self tests.
This is normal, and it has been communicated to you by the hard drive specifications under uBER (uncorrectable Bit Error Rate), which is specified at 10^-14. That works out to roughly one bad sector per day at 100% duty cycle, down to about one per 6 months under light usage.
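
As a quick sanity check on that claim, here is a back-of-the-envelope calculation in Python (just a sketch; the 150 MB/s sustained read rate is an assumed figure, not something from this thread):

# Rough estimate of how often a 10^-14 uBER spec predicts an unreadable sector.
UBER = 1e-14                      # unrecoverable read errors per bit read (spec sheet value)

bits_per_error = 1 / UBER         # ~1e14 bits read per error
tb_per_error = bits_per_error / 8 / 1e12
print("~%.1f TB read per unreadable sector" % tb_per_error)                 # ~12.5 TB

# At the assumed 150 MB/s sustained read rate, running 24 hours a day:
bytes_per_day = 150e6 * 86400
days_per_error = (bits_per_error / 8) / bytes_per_day
print("~%.1f days between errors at 100%% duty cycle" % days_per_error)     # ~1 day

At light, intermittent usage the same math stretches out to months between hits, which is where the once-per-6-months end of the range comes from.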

Is there anything I could be doing wrong or have set up incorrectly?
No. If you use ZFS in a redundant configuration, you will be immune to bad sectors. That is why you use a modern filesystem rather than legacy filesystems, which cannot cope with this problem and where even one unreadable sector can cause havoc.

I have the box set to do short tests and long tests every 3 days, and the pool is scrubbed every 14th and 29th of the month.
SMART self-tests are little more than surface reads. They are largely unnecessary and can flag unreadable sectors in areas that have not been written to for a very long time. Only sectors in use by ZFS are relevant, and unreadable sectors will disappear the moment you write to them. So only the ZFS scrub is necessary; I suggest you stop using SMART self-tests. Only the SMART data is relevant. Current Pending Sector, Reallocated Sector Count, and UDMA CRC Error Count are the three most important SMART attributes.
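
If you just want to keep an eye on those three attributes without scheduling self-tests, something like this works (a rough sketch only; /dev/ada0 is an assumed FreeBSD/FreeNAS device name, and attribute labels can vary between drives):

import subprocess

# The three attributes called out above.
WATCH = ("Current_Pending_Sector", "Reallocated_Sector_Count", "UDMA_CRC_Error_Count")

def check_smart(device="/dev/ada0"):
    # 'smartctl -A' prints the vendor attribute table; the raw value is the last column.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH:
            print("%s: %s = %s" % (device, fields[1], fields[9]))

check_smart()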

A 50% failure rate is nuts.
A drive has only failed when it is operating outside its specifications. Bad sectors are absolutely normal according to the 10^-14 uBER spec, so a drive with bad sectors is NOT FAILED. Only a very large number of unreadable sectors, or bad sectors that have been swapped with reserve sectors because they were physically damaged, would count as a failure.
 
So from what you are saying, everything I have previously read is wrong, including Seagate saying they will swap out the drive with 8 uncorrectable offline/pending sectors? I am using RAIDZ2, so I do have redundancy. Unfortunately my backup box (6x 2TB drives in md RAID5) had a drive physically fail last night (it completely went offline until a reboot), so I have no redundancy there. It might not even rebuild, if I am to believe the RAID5 fearmongers out there.

I swapped the 4TB drive with a spare, but I will keep it for now and not spend the money on an RMA. Thanks.
 
Yes, there is a lot of disinformation about this issue. The tendency is for people to think that a disk with bad sectors means the device is faulty. But that is only because in the past 90% of all bad sectors were due to physical damage, whereas now only 10% of bad sectors are due to physical damage. The other 90% are uBER bad sectors, which means the physical sectors are just fine; there is simply insufficient ECC bit correction available to read the data without errors. After these sectors are overwritten, they work just fine again.

And yes, your manufacturer will accept an RMA for disks with pending sectors (Current Pending Sector > 0). The funny thing is that if you format your disk, which overwrites all sectors, the disk will no longer have bad sectors (Current Pending Sector = 0), and at that point the manufacturer will not accept your RMA request.

What happens is that your disk with bad sectors is simply overwritten, recertified, and shipped to another customer who applied for an RMA. So another customer gets your disk, and you might get the disk of another customer which had bad sectors. The customer is happy, costs are reduced for the manufacturer, and the customer doesn't switch brands; everyone is happy. :D
 
I swapped the 4TB drive with a spare, but I will keep it for now and not spend the money on an RMA. Thanks.

Watch out for growing reallocated and pending sector counts, since growth tells you your HD is not in good shape.

RMA drives that have many reallocated and pending sectors, since those drives will likely die soon.

My little story on the Seagate Barracuda:
I have RMA'd 4 Seagate Barracuda 3TB drives.
3 drives are still running, with no reallocated or pending sectors.
1 died 3 months ago, and its warranty expired in March 2015.
The reason for the RMAs was many reallocated and pending sectors :D that kept growing.
Resilvering in ZoL is easy; I have done it many times. I run a RAIDZ2 of 9x 3TB drives, all Seagate Barracudas with APM disabled.
I was lucky when I bought them; at that time the warranties were mostly 3 years, with some at 2 years.

Spend a little on an RMA if needed.
 
ST3000DM001's are the plague on all your houses.
ST4000DM000's have proven to be good, and don't succumb to pre-natal death or infant death syndrome.

Do not whimsically buy drives from *anywhere* but your local shop, and make a very safe drive home; take a cab with leaf springs if you don't have a car.

Do not buy any drive you aren't sure about. Never trust individual reports of drive health and statistics from anyone, unless they are running large groups of drives to provide accurate statistics.

I only buy drives after they have passed the 2yr mark @ Backblaze:

https://www.backblaze.com/blog/hard-drive-reliability-stats-for-q2-2015/

Tx.
 