To use ZIL or not to use ZIL?

ponky

I have 10GbE on my server and desktop. I would like to be able to saturate the whole link, or get as close to 1000MB/s writes as possible.

I have 4x 128GB Samsung 850 PRO SSDs which I'd like to put to use.

Now, my server has two pools, TANK and VMs. TANK consists of 8x 4TB spindles in one raidz2 vdev, so obviously write operations are ridiculously slow. VMs consists of Intel X25-M G2s, which are faster but not fast enough. I'm planning on replacing these Intels with the 850 Pros.

I was thinking of creating a 40GB partition on each and adding two mirror vdevs to TANK as a dedicated ZIL (SLOG), then moving all VMs to TANK and using the SSDs only for the ZIL.
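I guess the command would be something roughly like this (the c3t*d0s0 names are just placeholders for the four 40GB slices):

# placeholder device names for the four 40GB slices
zpool add TANK log mirror c3t0d0s0 c3t1d0s0 mirror c3t2d0s0 c3t3d0s0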

If I understand it correctly, ZFS sync writes are single-threaded per application, so multiple mirrored ZIL vdevs won't boost my performance as much as I'd like? Can someone correct me if I'm mistaken.

Here's my full spec list. If there is a better way to achieve higher write performance, please let me know. ZFS is on virtualized OmniOS, ESXi 5.5.

CPU: Intel Xeon E3-1230v3
MoBo: Supermicro X10SLH-F
Ram: 32GB (4x8GB) Kingston ValueRam ECC UDIMM 1600MHz CL11
SSDs: 2x Intel SSD 520 Series 60GB
4x Samsung 850 PRO 128GB
HDDs: 4x Seagate NAS HDD 4TB
4x Hitachi Deskstar NAS 4TB
HBAs: 2x IBM M1015 (cross-flashed to 9211-8i IT)
Intel X520-DA2 10GbE



Thanks in advance!
 
Mirrored SLOG devices will never improve performance - striping them maybe. Dunno if the 850 will be good for an SLOG device. I use an Intel S3700 and it works perfectly...
 
Mirrored SLOG devices will never improve performance - striping them maybe. Dunno if the 850 will be good for an SLOG device. I use an Intel S3700 and it works perfectly...

Yes, mirroring won't help, but it will keep writes safe. However, in theory, multiple mirrors should help, if ZFS stripes the writes between both mirrors. The S3700 is a bit too expensive, so I'd really prefer to use the 850 Pros unless they suck hard.
 
My understanding of ZIL performance is that the biggest factor that will hurt you is latency. So I am not sure striping would help either; most likely it would hurt. Since most SSDs are designed around the SATA interface, latency is not their top concern, as the memory latency is far lower than the SATA controller latency in most cases. This is one of the big advantages of PCIe SSDs.
 
My understanding of ZIL performance is that the biggest factor that will hurt you is latency. So I am not sure striping would help either; most likely it would hurt. Since most SSDs are designed around the SATA interface, latency is not their top concern, as the memory latency is far lower than the SATA controller latency in most cases. This is one of the big advantages of PCIe SSDs.

This is completely correct. You can never improve your ZIL performance by adding more SLOG devices. Even if you had a stripe of 10 drives, the ZIL would never start writing the second write IO before the first one is completely written to disk. There is absolutely no parallelism here.
 
Note you would want to mirror your ZIL device for added protection. If you lose the ZIL, all uncommitted writes will be lost as well.
 
A ZIL is a log device. It is not read during normal operation; only in case of a crash is it read on the next boot to restore the last uncommitted writes.

If a dedicated ZIL fails during regular use, there is no data loss; the on-pool ZIL is used instead.
A dedicated ZIL failure is only a problem if your system crashes together with a failure of the dedicated ZIL.
 
A ZIL will be there in any case, whether you add an SLOG device or not - just so we don't confuse the terms.
And even if you add an SLOG device, for large writes ZFS will use the ZIL on the data disks.
The ZIL is only read in case of system crashes and power losses. You can easily check this by adding an SLOG and watching the IO.
It is an intent log, not a cache. Everything that goes to the ZIL will be committed on the next TXG writeout from RAM.
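For example, watching something like this (pool name is a placeholder) should show the log vdev staying nearly idle for normal async writes, while sync writes hit it immediately:

# replace <pool> with your pool name
zpool iostat -v <pool> 1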

On the suitability of the 850 Pro: it is generally suggested to get a capacitor-backed SSD as an SLOG device.
While I have not seen any proof that the 850 Pro or 840 Pro lose data on power loss, a capacitor-backed SSD can acknowledge writes even faster.
So your alternative would be the capacitor-backed version of the 850 Pro: the 845DC PRO. But this does not exist in 100GB or 200GB versions.

EDIT: Well, Gea was faster.

EDIT2: It would be better to just build a new dedicated SSD pool for the VMs.
You will not benefit from the 850 Pros' main advantage, their very high random read speed, if you "waste" them as SLOG.
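Just as a rough sketch of what I mean, with made-up pool and device names:

# two mirrored SSD vdevs as a dedicated VM datastore (placeholder names)
zpool create vmpool mirror c4t0d0 c4t1d0 mirror c4t2d0 c4t3d0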
 
Well, considering an 850 is likely only going to get you 100-200MB/sec max speed.

The ZeusRAM caps out around 300MB/sec.

So the only way to get anywhere near 10Gbit sync write speeds would be to have several writers, so that each writer would use a different SLOG device (each capped between 100-300MB/sec).

It sounds like you will only have one writer, so there isn't going to be any hope of filling a 10Gbit link using sync writes.
 
Well, considering an 850 is likely only going to get you 100-200MB/sec max speed.

The ZeusRAM caps out around 300MB/sec.

So the only way to get anywhere near 10Gbit sync write speeds would be to have several writers, so that each writer would use a different SLOG device (each capped between 100-300MB/sec).

It sounds like you will only have one writer, so there isn't going to be any hope of filling a 10Gbit link using sync writes.

Aww, too bad :(. Now I need to figure out what to do with those 850s. Maybe just run them as striped mirrors and use that as VM storage.
 
The Samsung Pros are among the fastest SSDs.
I would overprovision them by around 10% to keep performance high under load.
You can use them in a Raid-Z1 config or buy some more for a Raid-Z2 SSD-based datastore.

Mirrors are not needed with SSDs as IOPS is no longer a problem.
A single fast HD is at around a hundred IOPS. Even a massive multi-vdev Raid-10 is at around 1000 IOPS.
A single cheap SSD is sold with 80000 IOPS, with the "problem" that this can go down to 4000 IOPS under load.
You can reduce this drop (like with enterprise SSDs) when you overprovision them.
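As an example only (device names are placeholders; use slices of about 90% of each SSD for the overprovisioning):

# raidz1 over four overprovisioned slices (placeholder names)
zpool create ssdpool raidz1 c4t0d0s0 c4t1d0s0 c4t2d0s0 c4t3d0s0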
 
The Samsung Pros are among the fastest SSDs.
I would overprovision them by around 10% to keep performance high under load.
You can use them in a Raid-Z1 config or buy some more for a Raid-Z2 SSD-based datastore.

Mirrors are not needed with SSDs as IOPS is no longer a problem.
A single fast HD is at around a hundred IOPS. Even a massive multi-vdev Raid-10 is at around 1000 IOPS.
A single cheap SSD is sold with 80000 IOPS, with the "problem" that this can go down to 4000 IOPS under load.
You can reduce this drop (like with enterprise SSDs) when you overprovision them.

I'm not 100% sure but I think they are already overprovisioned by ~8%? A 4-disk raidz1 sounds interesting tbh. I'd also get more space.
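If I did the math right (assuming there is 128 GiB of raw NAND behind the 128 GB advertised capacity), that would be about 137.4 GB raw vs 128 GB usable, so roughly 7% built-in spare area.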

Btw Gea, did you check email yet?
 
I'm not 100% sure but I think they are already overprovisioned by ~8%? A 4-disk raidz1 sounds interesting tbh. I'd also get more space.

Btw Gea, did you check email yet?


Have I missed one??

Btw
The Samsung 1024GB has the same built-in capacity as a 960GB SanDisk Extreme Pro; the difference is overprovisioning. Under heavy load the SanDisk is faster, otherwise the Samsung wins.
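As a rough number: 1024 GB - 960 GB = 64 GB, so the SanDisk reserves around 6% more of the same flash as spare area.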
 
One should always use an SSD with power loss protection for a ZFS log device, or it defeats the device's fundamental purpose: to provide stable storage for the ZIL in case of host failure (crash/power loss). All SSDs have some amount of volatile memory on-board, either as a dedicated cache or integrated within the controller itself.

There's an interesting paper presented at FAST '13 detailing the behavior of flash-based SSDs under power faults:

"Understanding the Robustness of SSDs under Power Fault"
http://www.ddrdrive.com/fast13_final80.pdf

Excerpt from the paper's conclusion:

"Our experimental results with fifteen SSDs from five different vendors show that most of the SSDs we tested did not adhere strictly to the expected semantics of behavior under power faults. We observed five out of the six expected failure types, including bit corruption, shorn writes, unserializable writes, metadata corruption, and dead device."

Christopher George
www.ddrdrive.com
 
Yes, I know that paper, but it is almost useless to us because they do not name names. Additionally, even some SSDs with power loss protection failed some of the tests.
The 840 EVO was most probably not tested. I would like to know whether the 840 Pro was part of the test.

BTW, people build all-SSD pools all the time; those share the same vulnerability as an HDD pool with an SSD as SLOG.
And a lot of consumer HDDs seem not to be properly protected against power losses.
 
Mirrored SLOG devices will never improve performance - striping them maybe. Dunno if the 850 will be good for an SLOG device. I use an Intel S3700 and it works perfectly...

Which S3700 did you get? I was thinking about buying the 100GB version and using a ~10 GB slice.
 
Here's my use case - setting up an all-in-one ESXi & ZFS (OmniOS) box, and I plan to use 4x 256GB MX100 SSDs as the primary VM datastore. Questions:

1) Would adding in an Intel S3700 (100GB w/ a 15GB partition) as the SLOG improve performance? Any benchmarks I could run to determine exactly how much practical benefit there is?

2) When using SSDs as a datastore for VMs, what makes the most sense, a 4 drive RAIDz1 or a 4 drive RAID 1+0? I'm not overly concerned about wasting potential storage space, but if an SSD-based RAIDz1 is fast enough, then I'd like to go that route. Maybe with a SLOG?
 
1. I found the S3700 (over-provisioned) to be pretty close to sync=disabled. The best way to tell for sure is to run some kind of benchmark both ways. I add a virtual disk to a Win7 VM and run CrystalDiskMark.

2. Well, a raidz* vdev has the IOPS of a single disk, so for spinners this sucks as a VM datastore, but if it's all SSDs, you're probably okay. I'd be concerned about not being able to lose more than one disk. With a raid10, your chance of a second failed disk hosing you is only 33%...
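Something like this is how I do the "both ways" comparison (dataset name is just an example; don't forget to set it back afterwards):

zfs set sync=disabled tank/vms
# run CrystalDiskMark against the virtual disk in the VM, note the results
zfs set sync=standard tank/vms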
 
What amount of over-provisioning do you use? I would think just the 100GB S3700 would be more than enough as a dedicated SLOG, using only a single partition of 10 or 15GB. Thoughts? Is that what you did?

I figure I'll be running multiple VMs and getting very close to fully using my 32GB RAM budget (motherboard max). I think with that in mind, an SLOG would definitely be a benefit. I can't justify a ZeusRAM setup at this level, so the s3700 seems like the next best thing.
 
Oh for sure. I think I have sliced off 20GB of the 100GB. I am only servicing one gigabit NIC so that is way more than enough...
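If it helps, attaching the slice is just something like this (pool and slice names are placeholders for my setup):

zpool add tank log c5t0d0s0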
 
Why not disable the ZIL entirely with zfs set sync=disabled <pool>?

If the server is on a UPS?
 
Why not disable the ZIL entirely with zfs set sync=disabled <pool>?

If the server is on a UPS?

Because that still doesn't cover the 5 seconds a sync write sits only in memory if you get a kernel panic.

If you have a ZIL and sync is enabled, the write goes to the ZIL immediately and is only dropped from the log once it has been flushed to disk. That means even in the event of a crash a completed sync write will be recovered from the ZIL.
 
Because that still doesn't cover the 5 seconds a sync write sits only in memory if you get a kernel panic.

If you have a ZIL and sync is enabled, the write goes to the ZIL immediately and is only dropped from the log once it has been flushed to disk. That means even in the event of a crash a completed sync write will be recovered from the ZIL.

This is all true, but assuming that this is a home setup, is this really such an important risk?

Having no SLOG saves you from wearing your SSD(s) down. Disabling the ZIL gives you good performance, at an increased risk that you lose some data if for some reason the machine loses power.
 
This is all true, but assuming that this is a home setup, is this really such an important risk?

Having no SLOG saves you from wearing your SSD(s) down. Disabling the ZIL gives you good performance, at an increased risk that you lose some data if for some reason the machine loses power.

IMHO if it's a home setup you are far more likely to have a random shutdown or kernel panic. At home you don't usually have a strictly temperature-controlled room, a dual-conversion UPS with online service, and triple backup power. On top of that, a home setup is usually running on older, sometimes more failure-prone, sometimes consumer-grade hardware.

It all comes down to "What is your time worth?". In my case, lack of downtime and not having to restore from backups is well worth my time. That being said, everything is virtualized, including my workstations. I also manage several of my friends' networks, and in the perfect storm it would mean downtime and/or a slow network for them too.

I always ask the question: "Do you want to spend money now or spend time later?"
 
Ok, makes sense, but I hope you are running with ECC memory too then (you're talking about older, consumer grade hardware).
 
I've only just upgraded to SSDs in the last 10 months.

I'm not even going to act like I know what a ZIL or SLOG is to begin with.

*future reading*
 
Ok, makes sense, but I hope you are running with ECC memory too then (you're talking about older, consumer grade hardware).

ECC memory is also a good mention here; it's one of the reasons I always recommend using last-generation server hardware over current-generation consumer hardware. (There are some exceptions to that; some workstation boards with desktop processors support ECC.)

ECC allows for correction of single-bit errors, which are infrequent but do happen.

Just another question of spending money now or time later.
 
Agreed. The price differential between good non-ECC and ECC RAM is not *that* much. And running an error-correcting FS like ZFS is kind of dubious if you can have in-memory corruption (which won't be detected...)
 