I've got a pretty technical question; I hope somebody can give me more insight into it.
A couple of days ago somebody on another forum asked me if I could design a backup battery for SSDs. When SSDs with a write cache are suddenly powered down, there is a pretty high probability that either the file system or user data on the drive gets corrupted. Some SSDs mitigate this problem by including super-capacitors in the drives themselves, but this increasingly seems to be an enterprise feature with an enterprise price.
Most people with a ZFS (+ARC or L2ARC) or Intel SRT setup instead have to rely on a line-voltage UPS to shield them from corruption or damage due to power loss. So what he proposed was a small, inexpensive board that sustains power to the SSD and nothing else, avoiding data corruption that way. So I made one, and it seems I can build such a device at a price that makes it attractive as an add-on for consumer SSDs.
Now comes the question, and it's a damned broad one: will this actually work?
If we break this down, the first question would be: is this a useful device to have? I believe it could be, if it works. Nevertheless, any comments you might have on this part are welcome.
Now, will the device do what it's intended to do? When an SSD suddenly loses power during a write operation, any data still sitting in its write cache is assumed by the OS or file system to have been written, but in reality it is corrupted or lost entirely. Is it a fair assumption that if the computer suddenly powers down but the SSD stays powered, the drive will properly flush its write cache to flash and then go into an idle, safe-to-disconnect state?
Also, how much time will the SSD need? As I understand it from the SATA specification, upon a PHY error the device will try to reconnect after a timeout. The timeout is not specified, but the timeouts I can find are on the order of milliseconds, certainly not seconds. I have currently designed my device to power the SSD for 60 seconds after a sudden power loss. Is this enough? Is it too much?
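One way to weigh the 60-second figure is to work out what it costs in stored energy. The sketch below is a back-of-the-envelope calculation, not a measured design: the 2 W drive draw, the 5.0 V to 2.5 V usable window of the capacitor bank, the 85% converter efficiency and the helper name `required_capacitance` are all assumptions for illustration. A capacitor discharging from V_start to V_min releases ½C(V_start² − V_min²), so the required capacitance grows linearly with hold-up time.

```python
def required_capacitance(power_w: float, hold_s: float,
                         v_start: float, v_min: float,
                         efficiency: float) -> float:
    """Farads needed to supply power_w for hold_s seconds.

    Assumes (hypothetically) the SSD is fed through a DC-DC converter of the
    given efficiency that keeps regulating until the bank falls to v_min.
    """
    energy_out = power_w * hold_s                      # joules the SSD consumes
    energy_in = energy_out / efficiency                # joules drawn from the capacitors
    usable_per_farad = 0.5 * (v_start**2 - v_min**2)   # from E = 1/2 C V^2
    return energy_in / usable_per_farad


if __name__ == "__main__":
    # Assumed figures only; substitute the drive's measured worst-case draw.
    for hold in (1, 10, 60):
        c = required_capacitance(2.0, hold, 5.0, 2.5, 0.85)
        print(f"{hold:>3} s hold-up -> {c:.2f} F")
```

With these assumed numbers, 60 seconds needs roughly 15 F while one second needs about 0.25 F, so if the drive turns out to need only a fraction of a second to flush, shortening the hold-up time could shrink the capacitor bank considerably.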
Third, how would I go about testing and verifying this? Does anybody have reading material on the probability of data corruption on power-down? A fair number of people have posted online that they managed to corrupt their drive (and, interestingly enough, even its firmware) by suddenly disconnecting and reconnecting it just a couple of times. It doesn't seem to take tens or hundreds of tries. But that is hardly scientific evidence.
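A common way to test this kind of claim is to write numbered, checksummed records, acknowledge each one only after `fsync()` returns, report every acknowledgement to a second machine that stays powered, then cut power and check after reboot whether every acknowledged record survived intact. The Python below is a minimal sketch of that idea under stated assumptions: `write_records`, `verify`, the 4 KiB record size and the file layout are hypothetical choices, and the `on_ack` callback stands in for whatever network link carries acknowledgements off the machine under test. For a rigorous run the test file should already exist (with its directory synced) before the power-cut runs begin.

```python
import hashlib
import os
import struct

RECORD_SIZE = 4096                 # assumed: one 4 KiB record per write
HEADER = struct.Struct("<Q32s")    # sequence number + SHA-256 digest


def _digest(seq: int, payload: bytes) -> bytes:
    return hashlib.sha256(struct.pack("<Q", seq) + payload).digest()


def make_record(seq: int) -> bytes:
    """Random payload, prefixed with its sequence number and checksum."""
    payload = os.urandom(RECORD_SIZE - HEADER.size)
    return HEADER.pack(seq, _digest(seq, payload)) + payload


def write_records(path: str, count: int, on_ack) -> None:
    """Append records, calling on_ack(seq) only once fsync() has returned.

    In a real test on_ack should send seq to another, still-powered machine;
    anything recorded locally would be lost along with the power.
    """
    with open(path, "wb") as f:
        for seq in range(count):
            f.write(make_record(seq))
            f.flush()                 # push Python's buffer to the OS
            os.fsync(f.fileno())      # ask the OS (and drive) to make it durable
            on_ack(seq)


def verify(path: str, last_acked: int) -> list:
    """Return a description of every acknowledged record that is bad."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        data = b""                    # the whole file was lost
    problems = []
    for seq in range(last_acked + 1):
        rec = data[seq * RECORD_SIZE:(seq + 1) * RECORD_SIZE]
        if len(rec) < RECORD_SIZE:
            problems.append(f"record {seq}: missing or truncated")
            continue
        stored_seq, digest = HEADER.unpack_from(rec)
        if stored_seq != seq or _digest(seq, rec[HEADER.size:]) != digest:
            problems.append(f"record {seq}: corrupt")
    return problems
```

Running the same loop many times with and without the board, and counting the runs where `verify` reports problems for acknowledged records, would give numbers rather than anecdotes. Note that `fsync()` only guarantees durability to the extent the drive honours flush commands, which is exactly the behaviour under test.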
I'd highly appreciate any insights into this problem. My aim is, ideally, to make a device that turns any SSD into an SSD with supercap protection, albeit external.