Avoiding SSD data corruption with backup battery

mux · Sep 1, 2012

I've got a pretty technical question, I hope somebody can give me more insight into this.

A couple of days ago somebody on another forum asked me if I could design a backup battery for SSDs. When an SSD with a write cache suddenly loses power, there is a pretty high probability that either the file system or the user data on the drive gets corrupted. Some SSDs mitigate this problem by including supercapacitors in the drive itself, but this increasingly seems to be an enterprise feature with an enterprise price.

Most people with a ZFS (+ARC or L2ARC) or Intel SRT setup have to instead rely on a line voltage UPS to shield them from corruption or damage due to power loss. So what he proposed was to make a small, inexpensive board to just sustain power to the SSD and nothing else and avoid data corruption that way. So I made this, and it seems like I can make such a device for a price that makes it interesting to add to consumer SSDs.

Now comes the question, and it's a damned broad one: will this actually work?

If we break this down, the first question would be: is this a useful device to have? I believe it could be, if it works. Nevertheless, any comments you might have on this part are useful.

Now, will the device do what it's intended to do? When power is suddenly lost on an SSD while it is performing a write operation, any data left in the write cache is assumed by the OS or file system implementation to have been written, but in reality it is corrupted or lost entirely. Is it a fair assumption that when the computer suddenly powers down but the SSD stays on, the SSD properly flushes its write cache to flash and then goes into an idle, safe-to-disconnect state?

Also, how much time will the SSD need? As I understand it from the SATA specification, upon a PHY error the device will try to reconnect after a timeout. The timeout is not specified, but any timeouts I can find are on the order of milliseconds, certainly not seconds. I have currently designed my device to power the SSD for 60 seconds after sudden power loss. Is this enough? Is this too much?

Third, how would I go about testing and verifying this? Does anybody have reading material on the probability of data corruption on power down? I know a fair number of people have posted around the internet that they managed to corrupt their disk (and even the firmware, interestingly enough) by suddenly disconnecting and reconnecting the drive just a couple of times. It doesn't seem to take tens or hundreds of tries. But that is hardly scientific evidence.
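One common way to test this kind of thing is a write-and-verify harness: write numbered, checksummed records, record which ones the OS acknowledged as synced, cut the power, and afterwards check that every acknowledged record is still intact. The following is only a minimal sketch of that idea. The record layout, the function names and the `report` callback are made up for illustration, and it assumes that `os.fsync` actually reaches the drive on the system under test, which depends on the OS and file system configuration.

```python
import os
import struct
import zlib

RECORD_SIZE = 512                      # one record per 512-byte block (arbitrary choice)
HEADER = struct.Struct("<QI")          # sequence number, CRC32 of the payload
PAYLOAD_SIZE = RECORD_SIZE - HEADER.size


def make_record(seq):
    """Build one fixed-size record whose payload is derived from seq."""
    payload = (struct.pack("<Q", seq) * (PAYLOAD_SIZE // 8 + 1))[:PAYLOAD_SIZE]
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def write_records(path, report, max_records=None):
    """Write records one by one; call report(seq) only after fsync returns.

    With max_records=None this runs until power is cut. report() should
    send the number somewhere that survives the power cut, for example
    a second machine.
    """
    seq = 0
    with open(path, "wb", buffering=0) as f:
        while max_records is None or seq < max_records:
            f.write(make_record(seq))
            os.fsync(f.fileno())       # ask the OS to push the data to the device
            report(seq)
            seq += 1


def count_intact_records(data):
    """Return how many leading records are complete, in order and CRC-valid."""
    count = 0
    for offset in range(0, len(data) - RECORD_SIZE + 1, RECORD_SIZE):
        record = data[offset:offset + RECORD_SIZE]
        seq, crc = HEADER.unpack_from(record)
        if seq != count or zlib.crc32(record[HEADER.size:]) != crc:
            break
        count += 1
    return count
```

After a power cut, the file is read back and `count_intact_records` is compared with the last sequence number that was reported: if the intact count is lower than the number of acknowledged records, acknowledged data was lost. Repeating this many times with and without the backup board would give a rough corruption rate instead of anecdotes.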

I'd highly appreciate insights into this problem. My aim is to, ideally, make a device that basically transforms any SSD into an SSD with supercap-feature, albeit external.

mwroobel · Sep 1, 2012

There is a chat about this here. For a basic motherboard-controller-based SSD you have more to worry about than the SSD's SDRAM cache (that article goes into some controller models and whether they use SDRAM and/or NAND to do the caching), and in some cases you need to worry about OS caching as well. If you are truly worried about this, most decent disk controller cards (HW RAID) have battery backup options available: they use the card for the cache (with the battery backing up any unflushed transactions) and disable the onboard caching of the drives themselves.

mux · Sep 1, 2012

Thank you very much for that link, that is very valuable. It seems like especially for journaling file systems and ZFS, which are the filesystems I am targeting this device at, my idea may very well work. Most of the OS-related problems in the xtremesystems-link are not really relevant to SSD data corruption.

I'm now going to try to set up a trial run with a prototype board design to see whether I'm right. In the meantime I am still very much open to insights and suggestions on the subject.

Vincent Tempus · Sep 2, 2012

interesting stuff... i look forward to your results.

mwroobel · Sep 2, 2012

mux said:
Thank you very much for that link, that is very valuable. It seems like especially for journaling file systems and ZFS, which are the filesystems I am targeting this device at, my idea may very well work. Most of the OS-related problems in the xtremesystems-link are not really relevant to SSD data corruption.

I'm now going to try to set up a trial run with a prototype board design to see whether I'm right. In the meantime I am still very much open to insights and suggestions on the subject.

If I understand what you are doing, you are just looking to provide an extra 1-3 s (or whatever the cache flush commit time is) of ~0.5 W on the 5 V line AFTER the power supply has been cut. You might want to check out these (I love them for desktop cases, they have the right spacing for 4-drive stacks). They have power-leveling caps in the connector (which supposedly also give you a 5 ms run should power drop completely, but generally just stabilize the level). Maybe you could use that as a starting point with larger caps.
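As a rough sketch of the sizing arithmetic for that approach: a capacitor discharging from one voltage to another delivers E = 0.5 * C * (V_start^2 - V_min^2), so the capacitance needed for a given power and hold time follows directly. The function name is made up, and the example assumes the capacitor sits directly on the 5 V rail with no converter and that the drive keeps working down to 4.75 V (a 5% tolerance, which is an assumption here and not something checked against any particular drive).

```python
def required_capacitance(power_w, hold_s, v_start, v_min):
    """Capacitance in farads needed to deliver power_w for hold_s seconds
    while the capacitor voltage sags from v_start to v_min."""
    if not v_start > v_min > 0:
        raise ValueError("need v_start > v_min > 0")
    energy_j = power_w * hold_s                      # energy the load consumes
    return 2.0 * energy_j / (v_start ** 2 - v_min ** 2)


# 0.5 W for 3 s, rail allowed to drop from 5.0 V to 4.75 V (assumed limits)
print(f"{required_capacitance(0.5, 3.0, 5.0, 4.75):.2f} F")
```

With these assumed numbers the result is about 1.23 F, far more than the small leveling caps in a connector. Most of the stored energy goes unused because the voltage window is so narrow; a boost converter behind the capacitor would widen that window at the cost of extra parts.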

mux · Sep 2, 2012

Well, unfortunately I cannot guarantee compatibility with any SSD using just capacitors or ultracapacitors. I solved this by using a small sealed lithium-ion cell. Interestingly, in the end I didn't even have to relax my temperature/endurance ratings for the battery: as it turns out, ultracapacitors are pretty horrible when it comes to endurance, roughly the same as lithium-ion cells.
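For a sense of scale, here is a back-of-the-envelope estimate of the cell capacity needed for the 60-second hold time. It is only a sketch: the 0.5 W load is the figure suggested earlier in the thread, 3.7 V is the usual nominal lithium-ion cell voltage, and the 85% converter efficiency and the function name are assumptions, not measurements from the actual board.

```python
def battery_mah(power_w, hold_s, cell_v=3.7, efficiency=0.85):
    """Cell capacity in mAh needed to supply power_w for hold_s seconds
    through a converter with the given efficiency (0 < efficiency <= 1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    energy_j = power_w * hold_s / efficiency   # energy drawn from the cell
    charge_c = energy_j / cell_v               # coulombs at nominal cell voltage
    return charge_c / 3.6                      # 1 mAh = 3.6 C


print(f"{battery_mah(0.5, 60):.2f} mAh")
```

Under these assumptions the energy requirement is only a few mAh, so in practice the cell would be chosen for its peak current capability, temperature range and lifetime rather than for capacity.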

If anyone is interested and can read Dutch (heh...), you can follow the project here: http://gathering.tweakers.net/forum/list_messages/1515131

I'm currently really in a testing/research phase: the prototype hardware has been designed and I'm looking for at most 10 people who are willing to go out of their way and test the thing. I'm not necessarily expecting perfect results, but whatever the findings they will certainly be interesting.

mwroobel · Sep 2, 2012

mux-
I thought this might be of interest to you. It gives you the circuit for internal use but could be easily adapted for external use.

mux · Sep 8, 2012

I have put the prototype/test hardware into production and expect to start testing around the end of September.

mwroobel · Sep 18, 2012

Just thought you might be interested in this
