Where did my bad block go?

RAIDZ-2, one drive failed a smartctl test. Bad sector, test terminated at 10% done. I took the drive offline and ran badblocks -n. After that there was still one pending bad sector. I resilvered the RAID set and now it is gone. Where did it go? It can't just disappear, can it? The reallocated sector count remains 0. Maybe SMART is not that smart after all? Perhaps someone with deeper knowledge of hard drives can explain it? I'm just curious.
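For anyone wanting to follow along, the counters I'm talking about are the SMART attributes, checked roughly like this (sdX stands in for the actual device):

smartctl -t long /dev/sdX       # extended self-test, the one that aborted at 10%
smartctl -l selftest /dev/sdX   # self-test log, shows the LBA of the first error
smartctl -A /dev/sdX            # attribute table: Current_Pending_Sector and Reallocated_Sector_Ct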
 
It is possible that the trouble sector is just not in use right now, or if it is in use it was overwritten and is presently readable. I would still treat the drive as suspect. I had a drive that showed a bad sector pending, and then it had a few of them. It continued to work and lasted almost another year in 24/7 operation, with a caution SMART status. Then maybe six or eight weeks ago the SMART status turned to pending failure and it became very difficult to read some data from the drive. Some data was read just fine, other files or folders would take forever to copy. I ended up pulling the drive from the array to get all the data off the array, so I could do a wipe and rebuild.

TL;DR Have a plan in place to replace the drive
 
Thanks for the replies. Yes, I ordered a new drive as soon as I saw that sector. This drive may still get used somewhere else. In my experience drives fail when they are new or when they get really old. This one has 30k hours on it, so not that old yet.
 
That's my experience as well. Drives die quickly, or they last a really long time, like 5-6+ years (usually 10+ years it seems though). This holds for drives that are always on.

However, I've noticed that for whatever reason drives in the 3-4TB range seem to not follow this pattern. I'm not sure why. Those seem to start dropping at the 3ish year / 25k hour mark, just from normal wear.
 
Me poking my head in where I don't belong, but...

Wasn't that about the size that had been the density limit, beyond which drives needed to be He(lium) filled? Or was it before that new fancy perpendicular recording tech began getting used?
Either way, I guess what I'm asking/saying is that this would seem to be why there are more failures there: those drives were at the edge of the old tech's capability but could still be made cheaply.

Again, I'm out of my depth here... I... just came from the front page on account of the thread title giving me a chuckle :shame: lol
Not that it isn't a valid question, though! :)
And can't you do a low-level format on it to potentially restore the sector in question (in this case, as a somewhat more reliable fix than just accepting that it now reports being OK)? Or has the fact that it was bad, ever so briefly, condemned it to a lonesome life of non-RAID operation?
 

If a disk is showing any type of errors like this, you should replace it and treat it as failed for RAID purposes. You only have two disks' worth of redundancy to lose before all the data goes poof (and if you leave it in, one disk fails outright, and then another disk throws errors during the rebuild, you can lose the array that way too).

Also avoid SMR disks ("archive" disks, and now the norm in external HDDs; after writing enough data they drop to floppy-disk speeds), as you will likely kill them in less than a year and they will be slow. They have to write data in blocks across 3-4 overlapping tracks at a time, so they really suck for random writes or RAID. PMR or bust. A Synology NAS will actually drop an SMR drive from an SHR array because it gets so slow the software thinks the HDD is failing when it's really just going slowly on writes.

https://blag.nullteilerfrei.de/2018/05/31/pmr-smr-cmr-i-just-want-a-hdd-mr/
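And for a RAIDZ-2 pool like the OP's, the actual swap is just a replace plus resilver, roughly like this (pool and device names here are only placeholders):

zpool status -v tank          # check for read/write/checksum errors first
zpool replace tank da3 da5    # swap the suspect disk for the new one; the resilver starts automatically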
 
Wow, lol. That just made me feel like a noob, but thank you! All these years I wasn't aware of the shingled method, and thus the drawbacks it carries...

I recently picked up a 6TB, and since I've been super reluctant to go large (my previous biggest being 1TB), I decided to splurge and went with an HGST UltraStar instead of the DeskStar (brand new, not one of those BS "Recertified" 25k-hour server pulls). Pretty sure she's an He6. But even while shopping I don't recall seeing any mention of CMR (PMR) or SMR :confused:

So with that said, how does one even determine that? Given the speed this one has shown, I'm assuming it's not shingled, but for future reference it'd sure be nice to know if there's any way to tell, based on the sticker or something?
 
One case where SMR drives work OK in a RAID situation is a media server. They are fine if most of the workload is reading data, especially if data tends to remain constant once it is written. I've used SMR drives in a RAID for a movie library. I only get 30-35 MB/s when writing to the array, but otherwise there's no noticeable difference compared to PMR drives. I used Windows Storage Spaces to create the array.

As far as telling which drives are SMR and which are PMR, I don't think there's a definitive list. A good tell is the speed. SMR drives are all going to be 5,400 or 5,900 rpm. I'm pretty sure that if a drive is helium filled then it is also not SMR.

Thankfully, 3.5" drives are mostly PMR models. It is when you look at 2.5" drives that you find a larger percentage are SMR (especially 1TB and larger models). For home use, though, most 2.5" drives are in laptops and should be SSDs these days.
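If it helps, one rough check is to see what the drive reports for model and spindle speed, then cross-reference the model number against the manufacturer's datasheet (sdX is a placeholder):

smartctl -i /dev/sdX | grep -E 'Device Model|Rotation Rate'

Going by the above, a 7,200 rpm or helium-filled model is very unlikely to be shingled; for a 5,400/5,900 rpm model you'd want the datasheet to be sure.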
 
Sectors can go pending when the drive is having trouble reading them, especially if they're found during a SMART test or a background scan. If the drive is later able to read the sector cleanly, or a write to that sector succeeds, it's no longer pending. It does seem unusual to still have a pending sector after badblocks -n, which is expected to write to every sector, but perhaps single-sector writes were not doing it for the drive, and the resilver wrote in large enough chunks to satisfy it. That's especially possible if it's a 4k-sector drive and badblocks was reading and writing 512-byte sectors.

If you have a good monitoring system in place and your data is easy to recover, my threshold is 100 sectors before considering replacement. But if you don't have good monitoring, or your data is difficult to recover, a lower threshold is reasonable; I'd probably pull the drive around ten sectors in home use (where I don't check very often and have generally poor data practices). I understand the philosophy of pulling the drive on a single sector error too, but that seems too tight for me, personally.
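To test that 4k idea, one way would be to rerun badblocks with the block size forced to the physical sector size, something like this (sdX is a placeholder; -n is the non-destructive read-write mode, so only run it on a drive that's offlined from the pool, as the OP did):

badblocks -n -s -v -b 4096 /dev/sdX

If the firmware only clears the pending flag on a full physical-sector write, that should reproduce what the resilver did.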
 

I just would not trust a disk that has pending or reallocated sectors, as it typically will develop more or flat-out fail at random (I have had drives that report in SMART that everything is OK, but they are stuck and can't remap a sector). One failed sector could be OK, but a lot of the time I find that is what is preventing a system from booting, or making it run super slow as the drive goes into retry/ECC overdrive trying to eventually read a failing sector. That's why I tell my customers: if the PC is going slow or taking a long time to open stuff, don't ignore it, call me, as it's easier to clone the drive while it's working than when it's dead or no longer booting.
 