SSD without super capacitor as ZIL device ?

geppi

n00b
Joined
Aug 25, 2012
Messages
12
I'm still wondering whether an SSD used as a ZIL device really must have a super capacitor, or some other means of ensuring that data in its volatile DRAM buffer reaches stable NAND flash memory in case of a power loss.

According to this blog post:

http://milek.blogspot.de/2010/05/zfs-synchronous-vs-asynchronous-io.html

with the setting for proper POSIX-compliant behaviour:

>> sync=standard
>>
>> This is the default option. Synchronous file system transactions
>> (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
>> and then secondly all devices written are flushed to ensure
>> the data is stable (not cached by device controllers).

the data should be flushed from the ZIL drive's volatile DRAM cache before ZFS returns the COMMIT.

So even if the system lost power and the synchronous file system transactions were lost from the intent log SSD's volatile DRAM buffer, it should not cause any problem on the application side, since the write operation was never committed.
It would only have been committed after the buffer had been flushed to stable NAND flash memory.
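(For reference, the property that blog post is talking about can be checked and set per dataset from the command line. Just a sketch, with 'tank' as a placeholder pool/dataset name:)

Code:
# Show the current sync setting; the default is "standard"
zfs get sync tank

# Possible values:
#   standard - honour synchronous semantics via the intent log (default)
#   always   - treat every write as synchronous
#   disabled - acknowledge sync writes from RAM, never wait for stable storage
zfs set sync=standard tank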


The following blog post further supports this:

https://blogs.oracle.com/roch/entry/nfs_and_zfs_a_fine

saying that:

>> ZFS is designed to work correctly whether or not the disk write caches are enabled.
>> This is acheived through explicit cache flush requests,
>> which are generated (for example) in response to an NFS COMMIT.
>> Enabling the write caches is then a performance consideration,
>> and can offer performance gains for some workloads.


So it sounds to me as if it's not so important that an SSD has a super capacitor, but rather that it is honest about having flushed its dirty buffers to stable media.
Of course, I would expect that an SSD with a super capacitor is honest, or can at least afford to be dishonest, because it will still be able to flush its dirty buffers in an emergency.

It would be nice if we could feel safe using one of the better-performing cheap MLC SSDs without a super capacitor, heavily overprovisioned, as a ZIL device.
However, I'm not 100% sure that my conclusion is correct, or that the information it is based on is still up to date.
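(In case it helps to be concrete, the kind of setup I have in mind would look roughly like this on FreeBSD/FreeNAS. This is only a sketch: the device name ada4, the label slog0 and the 8G partition size are placeholders, and the small partition is what I mean by heavy overprovisioning.)

Code:
# Use only a small slice of the SSD so the controller keeps plenty of spare area
gpart create -s gpt ada4
gpart add -t freebsd-zfs -s 8G -l slog0 ada4

# Attach that partition to the pool as a dedicated log (SLOG) device
zpool add tank log gpt/slog0
zpool status tank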
 
It's still early and I haven't googled this, but:

- Isn't the point of a ZIL/SLOG that the client write does not have to wait for the write to hit stable media before finishing? Isn't that what makes NFS so slow unless you add a ZIL (or change the sync setting)?

- Wouldn't cheaper MLC-based SSDs die off a lot quicker, since really all they will do as a ZIL is sustain a huge amount of writes (with reads only coming if your server crashed and it needed to replay the transactions in the ZIL)?

I'm with you though, I'd like to see cheaper ZIL options. I know people are talking data integrity, but, as a crude example: my stupid cat sometimes hits the damn power button on my home rig and turns it off (or the button on the power strip), and I don't lose any data, lol. But seriously, when you're talking about something like a critical database file, losing power during a write to the file would not necessarily corrupt it, and even if it did, that's why you have the log file, so it can roll back/forward and fix partial writes, etc. I'm sure there is always the chance that this really could corrupt something, and I'd hate to be on the losing percentage of that when it does, but just sayin'.
 
I'm not an expert on ZFS by any means, but from what I've found previously, it's a pretty narrow danger window. You have to have a power failure right after a write is confirmed to the system but before it's committed to the ZIL device, which is typically a window of milliseconds. If you have a server with redundant power supplies on redundant power feeds (i.e. 2 PSUs fed from 2 UPSes), you'd be hard pressed to hit it. With a single PSU or feed, you might be taking chances.
 
Newer MLC drives don't die off too quickly, unless you're putting gigabit+ sustained write loads on them all day and night. My average write is 5 MB/sec, and the estimate is that I will wear through an MLC drive in approximately 3 years at that write rate. Note this assumes I'm dedicating a 100 GB SSD when the SLOG will only ever use about 1 to 5 GB of it in a burst.
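(A quick back-of-the-envelope check of that estimate, with rough arithmetic of my own and assuming a flat 5 MB/sec around the clock:)

Code:
# 5 MB/s * 86400 s/day             ~ 432 GB written per day
# 432 GB/day * 365 days * 3 years  ~ 473 TB in total
# 473 TB on a 100 GB drive         ~ 4,700 full-drive writes,
# which is the right ballpark for MLC rated at a few thousand P/E cycles
echo "$((5 * 86400 / 1000)) GB/day, $((5 * 86400 * 365 * 3 / 1000000)) TB over 3 years"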

The supercap is there because when you write to the SSD, it normally buffers the data in its RAM chip and later flushes it out to the flash chips; in that window of time you could lose data. As dave99 says, that is normally a small window.
 
You can lose ~0-5 seconds of transactions.

One of the good things about a supercap SSD is that it can (and does) safely ignore cache flush requests, allowing it to be faster in the ZIL use case.

Good, larger, under-provisioned MLC drives aren't much of an issue anymore; they can write petabytes.

The biggest problem is that you're trusting the disk to comply with the system's request, and betting your data on it. We know a lot of SSDs ignore other commands (secure erase, for example), so who knows whether it will actually wait for the data to be committed before reporting it as such.
 
Thank you for your comments so far, but in all honesty the intention of my post was not to discuss the probability of losing data when running with the sync=disabled setting.

Let's simply assume that there are people out there paranoid enough to assume that shit will hit the fan.

@stevebaynet:
No, the point of the ZIL/SLOG is not that the client write does not have to wait for the write to hit stable media.
The ZIL/SLOG is the stable media, or at least it is supposed to be on stable media, and that is the core of my question about SSDs without a super capacitor.

With sync=standard, ZFS will always use a ZIL for synchronous write requests. However, as long as you don't specify a dedicated device, it will just use a small part of your zpool for this intent log.
As far as I understand it, the reason this is still faster than immediately writing every synchronous request to its final place on disk is that the intent log is written to a contiguous region of the disk,
i.e. multiple random write requests are gathered into one sequential write instead of performing the real random writes. Those happen later, when the whole transaction group is written to disk, which occurs about every 5 seconds.
For that, the data from the intent log is not used at all. Instead the copy still in RAM is used, and ZFS presumably does further optimization to make that write as sequential as possible.
The whole procedure is described nicely in this blog post:

http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance

under the heading "Keeping the Cake and Eating it Too".
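(On FreeBSD-based systems the transaction group interval mentioned above is a tunable. As far as I know it defaults to 5 seconds, but it is easy to check on your own box:)

Code:
# Transaction group timeout in seconds
sysctl vfs.zfs.txg.timeout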

So in essence, the client has to wait for the ZIL write operation to finish, as long as you don't specify sync=disabled.
If you do, ZFS does not write to any intent log and just keeps the data in RAM when it returns the COMMIT for a synchronous write request.
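(Just to be explicit about what that means in practice, with 'tank/nfs' as a placeholder dataset:)

Code:
# Skip the intent log entirely for this dataset; sync writes are acknowledged from RAM
zfs set sync=disabled tank/nfs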

However, if sync=standard and a dedicated device is used for the ZIL to speed up random synchronous writes, it would be pointless to use an SSD that could itself lose the data in case of a power loss.
In that case we could have set sync=disabled right from the beginning, saved the money for the SSD and gained even more performance. There is no difference between losing the data from the server's DRAM and losing it from the SSD's DRAM buffer.
Gone is gone.

But DRAM is also used as the write buffer in ordinary hard disks. So even if we don't use an SSD as a dedicated ZIL device, we would have to make sure our zpool disks don't have their write buffers enabled if the intent log is written to the pool disks.
What saves the day here is that, according to the blog posts cited in my initial post, ZFS is well aware of device write buffers and their volatility, and it issues a flush command every time it needs to ensure that data is on stable media.
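(At least the current state of a pool disk's write cache can be inspected on FreeBSD; 'ada0' is a placeholder device here:)

Code:
# The identify output shows whether the drive's volatile write cache is supported and enabled
camcontrol identify ada0 | grep -i "write cache"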

If this works and is OK for the intent log on the pool disks, why should there be a difference when using a dedicated SSD for the ZIL, i.e. why should it require a super capacitor?

@obrith:
You seem to confirm my assumption and at the same time put on the table the very question that scares me as well.
If the SSD ignores the flush command because it has been tweaked by the manufacturer to show better benchmark numbers, it would be a no-go.

Does anybody have information on which SSDs ignore or honor the flush command?
 