How do SSDs act when errors occur?
Surly73
08-22-2009, 08:40 AM
So, I see lots of paranoia about "over using" SSD blocks - disabling search, pagefile, indexing, blah blah blah....
I think it's all a little silly, when no one does anything to preserve much more failure-prone mechanical disks. Furthermore, by the time you've reached the rated number of cycles the drive will be far cheaper to replace, far faster options will be out there, you'll likely think the current drive is far too small and much higher capacity options will be out there.
Anyways:
A big difference I see with SSD is that it seems that more than 50% of the time mechanical disks fail in a "one day it went whack-whack-whack and now it doesn't work at all" fashion. Clearly this wouldn't have to be the case for SSDs with no moving parts.
If cells in an SSD started failing parity checks the data could simply be relocated to spare space (like mechanical disks do with individual blocks) and the random access performance would suffer no ill. Other than massive physical damage or total controller failure, I don't see any likely cases for SSDs to suddenly become completely inoperative in the same way that mechanical drives just go "whack" or put scratches on their media.
Does anyone have good info on how current SSDs are programmed to handle errors? I see some decent likelihood that they could be FAR superior to mechanical disks and have far longer lifetimes despite the FUD.
SockMan!
08-22-2009, 09:05 AM
I'm curious to know about this too; unfortunately most SSD manufacturers must not find reliability and longevity to be marketable aside from a vague figure for data retention, largely irrelevant shock and vibration thresholds, and MTBF. It seems that only M-Tron has any mention of ECC and error rates on their product descriptions.
xbeemer
08-22-2009, 11:24 AM
From what I've read (this is from memory, so I can't give citations), when SSDs fail they fail to erase, which means they can't write fresh data, but they still can read. So instead of a brick you get a read only device.
Roman79
08-22-2009, 12:50 PM
From what I've read (this is from memory, so I can't give citations), when SSDs fail they fail to erase, which means they can't write fresh data, but they still can read. So instead of a brick you get a read only device.
If that's the case, that's amazing! I wouldn't have to feel as guilty for not doing a backup regularly :D
Megalomaniac
08-22-2009, 03:15 PM
wow, really? that would be awesome. Most comments about this seem to be along the lines of regular HDDs. That is that the sector is marked as bad and data is moved to a good sector. but the amount of data you can write to a drive actually decreases.
I am guessing, that if all of a sudden you realize that your drive does not store as much data as it used to and the byte count is off, then you might want to replace the drive. but it seems that having it as a read-only device would really be great +1 to SSDs since they would rarely "lose" data
Surly73
08-22-2009, 04:56 PM
From what I've read (this is from memory, so I can't give citations), when SSDs fail they fail to erase, which means they can't write fresh data, but they still can read. So instead of a brick you get a read only device.
A read only device or a read-only cell? I don't see why a bad cell would render the entire device read-only unless firmware was programmed that way.
If the filesystems became aware of how SSDs work, they could work hand-in-hand and basically all that would happen as your SSD wore out is that you would have less available filesystem space. No special reserved blocks, no more unrecoverable errors etc... I'm sure with a scheme like that and even today's MTBFs, the drives would far outlive their usefulness in terms of capacity.
xbeemer
08-22-2009, 09:38 PM
Here is one citation. I think I read about it going ROM instead of Brick somewhere else, but I can't find it now. Still, Anandtech is a pretty good source, and they are reporting a fairly graceful end of life for SSDs.
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=5
"What Happens When Your SSD Fails?
"When your hard drive dies we all know what happens. You go to turn on your machine one day and your OS doesn't boot, or your drive stops getting detected. But with SSDs their lifespan is far more predictable, so what does happen as they near the end of their life? A well designed SSD will have a good enough wear leveling algorithm to make sure that all blocks in the device get equal usage, so that when they fail, they do so at the same time.
"Intel's SSDs are designed so that when they fail, they attempt to fail on the next erase - so you don't lose data. If the drive can't fail on the next erase, it'll fail on the next program - again, so you don't lose existing data. You'll try and save a file and you'll get an error from the OS saying that the write couldn't be completed."
Krycek1
08-22-2009, 10:32 PM
How long before they typically fail? They're very expensive so I would hope it takes a long long time.
Elledan
08-23-2009, 01:02 AM
How long before they typically fail? They're very expensive so I would hope it takes a long long time.
Depends completely on how much data you write to them. With an average of 20 GB/day it'll be measured in years, with two writes a second to random 512k blocks, less than an hour.
SockMan!
08-23-2009, 01:45 AM
I'm a lot more concerned about silent corruption than outright failure regardless of storage technology. You know you have a problem if a drive goes kaput; however, it may take a long time for you to realize that a flipped bit corrupted your data - if you ever find out at all.
Yoda4561
08-23-2009, 07:49 AM
Well if you get a good one they're estimated to last like 10 or 20 years before the flash memory completely fails. That's unlikely though so they put a 3 year warranty on the things ;) That number only applies to normal desktop usage, heavy server loads with lots of writes can reduce it significantly hence the need for SLC enterprise ssds. A number of those fail early too, even though they should easily last years. Most of the failures look more like drive controller failures, which IIRC is how most platter drives end up dying too. You should expect it to last at least as long as a platter based drive, typically 3-10 years.
Elledan
08-23-2009, 11:07 AM
Well if you get a good one they're estimated to last like 10 or 20 years before the flash memory completely fails. That's unlikely though so they put a 3 year warranty on the things ;) That number only applies to normal desktop usage, heavy server loads with lots of writes can reduce it significantly hence the need for SLC enterprise ssds. A number of those fail early too, even though they should easily last years. Most of the failures look more like drive controller failures, which IIRC is how most platter drives end up dying too. You should expect it to last at least as long as a platter based drive, typically 3-10 years.
That sounds fairly optimistic to me. Perhaps the old SLC drives could make those numbers, but MLC Flash already has only 10% of the write cycles of SLC, and the exact count keeps going down with each new process size (45, 32 nm). There is even Flash memory on the market right now which only has a few hundred write cycles.
Data retention times are also dropping, with 10 years on old-style SLC drives, yet around 1 year on current-gen Flash-based SSDs. I fear what we'll see for numbers in a few years time. 1 month data retention? 100 write cycles?
drgnfang
08-23-2009, 01:56 PM
Not if they want it to go truly mainstream. They will find a way to extend the various life times.
xbeemer
08-23-2009, 02:03 PM
....but MLC Flash already has only 10% of the write cycles of SLC, and the exact count keeps going down with each new process size (45, 32 nm).
...
Data retention times are also dropping, with 10 years on old-style SLC drives, yet around 1 year on current-gen Flash-based SSDs.
This has the ripe stench of unfounded FUD to me. Given the same quality, how could MLC have 10% the write cycles of SLC? Current 2 level MLC requires twice the r/w access, but wear is on the erase, not on the read or write. Also, as die size shrinks so does design sophistication go up. I doubt very much, again given the same quality, that you can conclude small die size = less durability.
Would you kindly back up your assertions?
TheSpoon
08-23-2009, 02:55 PM
CITATIONS NEEDED!
This goes for everyone.
Krycek1
08-23-2009, 06:56 PM
I was planning on getting one of these sometime soon (next few months) but now I've been scared away. They're just too expensive to worry about failing for my taste.
schizo
08-23-2009, 07:31 PM
Hard drives are precision mechanisms with moving parts. They fail all the time, with less notice, and tend to be totally unusable upon failure. You boot, your computer doesn't recognize the drive any more, and it's trash.
SSDs have no moving parts and fail more gracefully. It's enthusiast-level technology right now, no doubt, but it's advancing at or near Moore's law, unlike hard drives. Over time prices will continue to drop dramatically and storage volume will cross a threshold where it ceases to be an issue. Magnetic storage is a dying technology, like CRTs in 1999. You may not have an SSD now, but your next computer will.
xbeemer
08-24-2009, 12:08 AM
As a software developer I need to make backups several times a day. It turns out that a dual SSD setup, one for the main working drive and one for the backups works great. Fast, don't have that 2-4 second wait while the backup disk spins up, and I know the data is safe so I only need to archive the backups once or twice a month. SSDs are a great improvement over HDs, at least for me.
I paid for good ones, Corsair for the backup and Intel for the main drive, so I can be fairly certain I won't be losing data - as has happened (to most of us) on occasion with HDs.
The point being, if you avoid the junk SSDs, you probably don't need to worry about them failing. Nothing is 100% safe, but these are getting there, and are far, far safer than mechanical HDs.
Martel-fs
08-24-2009, 01:56 AM
In flash drives (basically the same technology) what really becomes the problem IS controller failure (or PCB failure, not uncommon in magnetic HDs also). And because of all of the nifty wear-leveling algorithms spreading your data all over the media, recovering anything from a dead controller can be a BIG pain.
Elledan
08-24-2009, 06:33 AM
This has the ripe stench of unfounded FUD to me. Given the same quality, how could MLC have 10% the write cycles of SLC? Current 2 level MLC requires twice the r/w access, but wear is on the erase, not on the read or write. Also, as die size shrinks so does design sophistication go up. I doubt very much, again given the same quality, that you can conclude small die size = less durability.
Would you kindly back up your assertions?
MLC Flash is generally quoted as having 10,000 write cycles versus 100,000 for SLC. This is hardly a secret. Having two bits per cell is the probable cause for this reduced number of write cycles.
Flash is based on capacitance (each cell has to maintain a charge), thus reduced feature sizes will increase leakage and other negative effects.
And a link I found very informative: http://techon.nikkeibp.co.jp/article/HONSHI/20090528/170917/
xbeemer
08-24-2009, 10:44 AM
You're right on the raw chip specs, but wrong and FUD-mongering on the SSDs. Current quality SSDs are MORE reliable than HDs, not only in that they fail gracefully, going Read Only rather than just dropping dead, but the overall expected lifespan is very favorable compared to HDs:
MLC technology isn’t perfect. Its write performance is slower than its read speed. Also, while all Flash memory cells deteriorate over time, the more complex voltage levels of MLCs mean that they’ll endure ten times fewer write cycles than SLCs.
To combat this, Intel’s controller also makes efficient use of this reduced lifespan, so it can still quote a MTBF of 1.2 million hours. This is lower than the two million hours often quoted for SLC-based SSDs, but equal to most enterprise-level conventional hard disks and twice the usual value for consumer disks.
http://articles.directorym.com/Inside_Intels_SSD-a1068362.html
The SSD's controller manages both the flash memory and the data flow to and from the host. To write 1GB of data, competing SSDs need to write 20 to 40 times that amount of data to actually complete the 1GB write. Data gets written in blocks into both DRAM and the flash memory, and by the time you're done with one operation, you've actually written, in a common example, 32GB of data to change 1GB of data. And the complex process bogs down the movement of data through the SATA II bus controller, too.
With its new SSDs, Intel has changed the write strategy by introducing its write amplification technology. Write amplification is defined as the amount of NAND flash writes performed for a requested amount of data writes from the host computer. Instead of requiring 32 times the write cycles, as in the example above, the multiplier is now only 1.1 (or slightly less, according to Intel)--and the amount of overhead has been dramatically decreased, too.
Intel rates its drives for five years of useful life, assuming up to 20GB of data written each day. The company also rates its drives for 1.2 million hours mean time between failure--a spec that hard drive companies typically release only for their enterprise-class drives.
http://www.pcworld.com/article/150771/intels_new_ssd_drive_delivers_blazing_fast_performance.html
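For a rough sense of what those write amplification figures mean, here is a back-of-the-envelope sketch in Python. It assumes an idealized model (perfect wear leveling, every cell reaching its rated cycle count) and borrows the 10,000-cycle MLC figure and the 80GB capacity mentioned elsewhere in this thread. The function name and the model are illustrative assumptions, not Intel's rating method.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles,
                             host_writes_gb_per_day, write_amplification):
    """Idealized endurance estimate; assumes perfect wear leveling."""
    # Total data the NAND could absorb if every cell reached its rated cycles.
    total_nand_writes_gb = capacity_gb * pe_cycles
    # Data actually written to NAND per day for the given host workload.
    nand_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_writes_gb_per_day / 365.0


# 80GB drive, 10,000 cycles, 20GB/day of host writes (figures from the thread).
low_wa = estimated_lifetime_years(80, 10_000, 20, 1.1)
high_wa = estimated_lifetime_years(80, 10_000, 20, 32)

print(f"write amplification 1.1: {low_wa:.1f} years")
print(f"write amplification 32:  {high_wa:.1f} years")
```

Under these assumptions the 1.1 multiplier works out to roughly 100 years and the 32 multiplier to about 3.4 years. The idealized number sits far above the five-year rating quoted above, which is what you would expect from a simple upper bound that ignores everything except cycle count.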
Elledan
08-24-2009, 12:08 PM
You're right on the raw chip specs, but wrong and FUD-mongering on the SSDs. Current quality SSDs are MORE reliable than HDs, not only in that they fail gracefully, going Read Only rather than just dropping dead, but the overall expected lifespan is very favorable compared to HDs:
I find it cute that you are willing to state that so forcefully when Flash-based SSDs have only been in widescale use for the past few years, whereas HDDs have seen over 50 years of use. We can not say with any kind of certainty what the durability of SSDs is like.
jimhsu
08-24-2009, 01:21 PM
I've looked at this topic somewhat in my spare time.
Basically, there are 3 failure modes:
1. Fail on erase - when you run out of spare blocks and every block is used. The SSD turns into a "read-only" device (like ROM). Unlike HDDs which typically fail on read (hence bad sectors, etc). The key reason why this is NOT a concern for data reliability is that failure is deterministic -- the SSD knows EXACTLY when it has run out of spare reallocation blocks, and the blocks reallocated can be viewed with SMART. At the very end of its lifetime, the number of reallocated blocks increases exponentially; one could, for example, use a conservative threshold where the SSD would be replaced if 10% of the reallocation blocks are used.
2. Unrecoverable bit errors - occurs 1 out of 10^15 bits as Intel claims (similar to typical hard drives) and results in a "silent" error on that read (a 1 gets accidentally read as a zero). This is because of inherent noise in depositing electrons into each flash cell, where an electron distribution "in-between" two states (e.g. 10 and 11) could be read as either one. Thus, MLC has an order of magnitude worse error rate than SLC, and requires more bits for ECC to obtain the 1 in 10^15 UBE figure. A solution would be to use more parity/error correction, or to retry the read.
3. Controller failure - this is generally poorly documented, but this refers to any failure outside of the NAND itself. HDD have a similar analogue. Could be caused by extreme temperature/humidity ranges, extreme shock, impaling the drive with a screwdriver, etc. Little info on how to solve this.
Because of 3, I would recommend backups regardless. And that's why RAID can be so dangerous for data stability - controller failure is unobvious, underappreciated, and very difficult to recover from.
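The "replace at 10% of spare blocks used" idea from failure mode 1 could be sketched like this in Python. This is only an illustration: `reallocated_blocks` would come from whatever the drive reports through SMART, and `spare_blocks_total` is a hypothetical input, since a drive may not report the size of its spare pool at all.

```python
def should_replace(reallocated_blocks, spare_blocks_total, threshold=0.10):
    """Return True once the used share of spare blocks reaches the threshold.

    spare_blocks_total is an assumed value; many drives do not expose it.
    """
    if spare_blocks_total <= 0:
        raise ValueError("spare_blocks_total must be positive")
    return reallocated_blocks / spare_blocks_total >= threshold


# Example with made-up numbers: 1,000 spare blocks, 120 already used.
print(should_replace(120, 1000))
```

A lower threshold trades some usable drive life for a wider safety margin before the reallocation count starts climbing quickly.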
jimhsu
08-24-2009, 01:26 PM
I find it cute that you are willing to state that so forcefully when Flash-based SSDs have only been in widescale use for the past few years, whereas HDDs have seen over 50 years of use. We can not say with any kind of certainty what the durability of SSDs is like.
Flash memory in general has been used for almost 30 years since invention by Dr. Fujio Masuoka (in mostly military applications at the time). Granted it's SLC not MLC, but failure characteristics of NAND are in general pretty well understood. Simply because hard drives are mechanical, their failure characteristics are typically more complex than NAND's, if we take the controller out of the equation (since that is extremely manufacturer dependent).
Surly73
08-24-2009, 01:59 PM
I've looked at this topic somewhat in my spare time.
Basically, there are 3 failure modes:
1. Fail on erase - when you run out of spare blocks and every block is used. The SSD turns into a "read-only" device (like ROM). Unlike HDDs which typically fail on read (hence bad sectors, etc). The key reason why this is NOT a concern for data reliability is that failure is deterministic -- the SSD knows EXACTLY when it has run out of spare reallocation blocks, and the blocks reallocated can be viewed with SMART. At the very end of its lifetime, the number of reallocated blocks increases exponentially; one could, for example, use a conservative threshold where the SSD would be replaced if 10% of the reallocation blocks are used.
To be clear, does this mean that they HAVE implemented a system where block/cell reallocation takes place when a failure (erase, excess read, write, whatever) takes place?
From your language, I assume this is a reserved section of the device just for this purpose - just like on mechanical disks for decades. I'm also assuming that the entire device doesn't go RO until the bitter end.
I made a comment in a previous message in this thread that if the filesystems used became more aware of underlying SSD physiology that there's no reason they couldn't just copy the data to any other block and mark that block "bad" at a filesystem level too. Since there's no physical head moving, there's no performance-robbing fragmentation caused by this so it could be done indefinitely until the volume free-space shrinks so much that it's useless.
All of this is "in theory", of course.
2. Unrecoverable bit errors - occurs 1 out of 10^15 bits as Intel claims (similar to typical hard drives) and results in a "silent" error on that read (a 1 gets accidentally read as a zero). This is because of inherent noise in depositing electrons into each flash cell, where an electron distribution "in-between" two states (e.g. 10 and 11) could be read as either one. Thus, MLC has an order of magnitude worse error rate than SLC, and requires more bits for ECC to obtain the 1 in 10^15 UBE figure. A solution would be to use more parity/error correction, or to retry the read.
I would hope that CRCs and parity error DETECTION (at a minimum) are being used in all block storage devices, SSD or mechanical. Silent corruption is a huge PITA and I usually blame this on controller bugs that don't properly check, or let errors slip through. I had two Seagate drives suffer from simultaneous silent corruption in the family. My dad's WinXP install just kept screwing up and grinding itself into worse and worse trouble. Of course he kept defragging which kept corrupting more and more files. SMART found it, but it never once flagged a read or write error.
3. Controller failure - this is generally poorly documented, but this refers to any failure outside of the NAND itself. HDD have a similar analogue. Could be caused by extreme temperature/humidity ranges, extreme shock, impaling the drive with a screwdriver, etc. Little info on how to solve this.
Because of 3, I would recommend backups regardless. And that's why RAID can be so dangerous for data stability - controller failure is unobvious, underappreciated, and very difficult to recover from.
Indeed, controller failure is not unique to SSDs and no particular precautions or concerns should arise that are specific to SSDs.
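On the detection point raised in this post, a tiny Python sketch shows why a stored checksum turns a silent bit flip into a visible error. It uses `zlib.crc32` from the standard library purely as a stand-in; real drives do their checking in the controller with their own ECC, so the `store` and `is_intact` helpers here are hypothetical.

```python
import zlib


def store(block: bytes):
    """Return the block together with a CRC32 computed at write time."""
    return block, zlib.crc32(block)


def is_intact(block: bytes, stored_crc: int) -> bool:
    """Recompute the CRC at read time and compare with the stored value."""
    return zlib.crc32(block) == stored_crc


data, crc = store(b"family photos and tax records")

# Simulate a single flipped bit somewhere in the stored block.
corrupted = bytearray(data)
corrupted[3] ^= 0x01

print(is_intact(data, crc))              # unmodified block
print(is_intact(bytes(corrupted), crc))  # block with one flipped bit
```

Detection alone does not get the data back, but it means the read fails loudly instead of handing the flipped bit to the filesystem.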
jimhsu
08-24-2009, 10:31 PM
1. See this from the anandtech article:
Intel actually includes additional space on the drive, on the order of 7.5 - 8% more (6 - 6.4GB on an 80GB drive) specifically for reliability purposes. If you start running out of good blocks to write to (nearing the end of your drive's lifespan), the SSD will write to this additional space on the drive. One interesting sidenote, you can actually increase the amount of reserved space on your drive to increase its lifespan. First secure erase the drive and using the ATA SetMaxAddress command just shrink the user capacity, giving you more spare area.
...
Intel's SSDs are designed so that when they fail, they attempt to fail on the next erase - so you don't lose data. If the drive can't fail on the next erase, it'll fail on the next program - again, so you don't lose existing data. You'll try and save a file and you'll get an error from the OS saying that the write couldn't be completed.
---
I can't speak for the other manufacturers, but to be competitive they have to have similar failure modes.
PS Block reallocation on SSDs is silent, automatic, and done by hardware. Each reallocated block increments the SMART count ID 5 by 1.
2. This is particularly interesting for an SSD because generally unrecoverable bit errors are not correlated across adjacent cells (low spatial autocorrelation), typically, while for HDDs and especially CDs they are (high spatial autocorrelation). This can be appreciated when you think of how large a scratch or a single dust particle is relative to a hard drive platter (thousands/millions of bits). For example (greatly exaggerated):
V = valid cell
F = failed cell
HDD: VVVVVVVFFFFFFFFFFFVVVVVVVVVVVVVVVFFFFFFFVVVVV
SSD: VVVVFVVVVVVFVFVVVVVFVVVFVFFVVVVVVFVVFVVVFVVV
By mathematics alone, a parity scheme with small blocks has a smaller probability of failing for an SSD.
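To make that last claim concrete, here is a small Python sketch. It assumes a simple parity scheme that can rebuild at most one failed cell per block, and it uses two synthetic patterns with the same number of failed cells rather than the exaggerated strings above. The function name and the patterns are made up for illustration.

```python
def unrecoverable_blocks(cells: str, block_size: int) -> int:
    """Count blocks holding more than one failed cell ('F').

    Assumes parity can rebuild at most one failed cell per block.
    """
    blocks = [cells[i:i + block_size]
              for i in range(0, len(cells), block_size)]
    return sum(1 for block in blocks if block.count("F") > 1)


# Same number of failures (8 of 32 cells), different spatial pattern.
hdd_like = "V" * 8 + "F" * 8 + "V" * 16   # one contiguous burst
ssd_like = "VVVF" * 8                     # failures spread out

print("burst pattern:    ", unrecoverable_blocks(hdd_like, 4))
print("scattered pattern:", unrecoverable_blocks(ssd_like, 4))
```

With 4-cell blocks the burst wipes out two whole blocks, while the scattered pattern never puts more than one failure in any block, so every block stays recoverable under the assumed scheme.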