Would a Faster ZFS Slog drive for ZIL make sense?

Zarathustra[H]

Hey all,

I have a ZFS setup as follows:

Code:
     raidz2-0       
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
     raidz2-1       
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
       WD RED 4TB
    logs
     mirror-2                  
       Intel S3700 100GB
       Intel S3700 100GB
    cache
     Samsung 850 Pro 512GB
     Samsung 850 Pro 512GB

So, I have two 100GB Intel S3700s mirrored as my SLOG device.

This allows me to do sync writes at about 119MB/s as tested locally with random data.

If I do async writes I am able to write at about 400MB/s

(these are imperfect benchmarks, as I did not shut down other actively running things on the server for this test)
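Roughly what I mean by "tested locally": pre-generate incompressible data in RAM, then write it once to a sync=always dataset and once to a sync=disabled one. The dataset paths below are placeholders, not my real layout:

Code:
    # generate incompressible test data in RAM so the source isn't the bottleneck
    dd if=/dev/urandom of=/dev/shm/random.bin bs=1M count=2048
    # sync path: a dataset with sync=always (placeholder path)
    dd if=/dev/shm/random.bin of=/file_storage/sync_area/test.bin bs=1M conv=fsync
    # async path: a dataset with sync=disabled (placeholder path)
    dd if=/dev/shm/random.bin of=/file_storage/async_area/test.bin bs=1M conv=fsync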

One of my biggest pet peeves with this system has always been the relatively slow sync writes over 10-gigabit Ethernet.

I don't run any VMs off of this pool; it is strictly for storage of large files that occasionally get accessed. When I do write large files to the pool, however, I get impatient.

The S3700's were essentially the best thing on the market for a SLOG when I got them in 2014, but I have not kept up as much lately regarding what may have changed.

I have seen lately that 8GB ZeusRAM devices have gotten a lot cheaper (can be had for $400 a piece on eBay right now) and there are also a number of PCIe and M.2 solutions.

Would a ZeusRAM 8GB unit be a significant upgrade? One would think they would be since they are RAM based, but on the other hand I have heard that 100-150MB/s sync writes is the most you can expect from a pool when using them as SLOG devices.

Is there any drive out there for under $500 (so I can get two for a mirror under $1000) that performs notably better than the S3700 does for this purpose these days?

Key requirements are:

- Low latency writes
- Sustained high speed writes
- Relatively high write endurance
- Battery or capacitor backed cache, so all data is committed if power goes out

The size can be tiny, I don't care. Does not need to be larger than 10GB.

Appreciate any input!
 
Since you use the pool as a filer for larger files, I doubt that an SLOG makes any sense.
ZFS itself is a crash-resistant copy-on-write filesystem, which means that a crash during a write does not lead to a corrupt raid or a corrupt filesystem. A large file that is being written during a crash is damaged in any case, regardless of sync settings. The writing program, e.g. Microsoft Word, must handle this itself with a temp file.

This is different from a conventional raid with its write hole problem. Because a raid stripe is written sequentially to the disks there, a crash leads to corrupt raid stripes, and since you need two transactions for a write (modify data, then modify metadata), a crash can corrupt not only the raid but your filesystem as well. Neither of these problems exists on ZFS.

Sync write with a ZIL/SLOG is only needed for transaction-critical use cases, e.g. VMs (and older non-CoW filesystems) or databases, where you need transaction-safe behaviour, i.e. a committed single write must be on disk. Otherwise simply disable sync and let ZFS do what's needed.

If you really need sync write behaviour and the S3700 is too slow, you need something like a P3700 (NVMe with high IOPS and powerloss protection).
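For reference, disabling sync per filesystem is a one-liner; "tank/share" below is only a placeholder name:

Code:
    # turn off sync handling for one filesystem (placeholder name)
    zfs set sync=disabled tank/share
    # verify the current setting
    zfs get sync tank/share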
 


I disagree with this. Mirrors are certainly great in some circumstances, but they are not the end all for everyone.

For an active pool or a small pool they make sense. For a larger pool, primarily used for long term data storage, they don't IMHO.

Everything is a tradeoff between performance, performance while degraded, fault resilience, and storage efficiency.

In a high-availability storage situation, where a few hours of performance impact during a resilver would be a huge deal, I'd probably opt for a mirror configuration, knowing full well that for the same amount of fault tolerance I'm sacrificing a significant amount of storage efficiency. That means spending more on more (or bigger) servers, with more controllers (or bigger SAS expanders) and bigger backplanes with more drives.

In the real world cost matters, so you have to apply the solution best suited to the needs at hand. If the pool is used for an office file server, for instance, or a home NAS like mine, where you can run a resilver of up to 16 hours during non-business hours without hurting anyone (a 24-hour day minus an 8-hour workday, and you probably won't need that much), a RAIDz2 setup will be cheaper from a server and drive perspective and fill the same purpose, especially since a degraded state and the need to resilver are relatively rare. I've only had to do it 3 times since I first started using ZFS in ~2010 or so. It seems silly to go to that much extra expense for a resilver problem that only comes up every few years and that I can handle overnight while sleeping.

It's all about the needs of the application, and the tradeoff in terms of performance, performance while degraded, fault resilience and cost.


The layout above is not the entirety of the drives in my server. Just the portion I was asking questions about. Here is the full "zpool status" edited to show the drives in the pools:

Code:
  pool: VM-datastore
 state: ONLINE
  scan: scrub repaired 0 in 0h4m with 0 errors on Sun Jun 11 00:28:51 2017
config:

    NAME                            STATE     READ WRITE CKSUM
    VM-datastore                    ONLINE       0     0     0
     mirror-0                       ONLINE       0     0     0
       Samsung 850 EVO 500GB        ONLINE       0     0     0
       Samsung 850 EVO 500GB        ONLINE       0     0     0

errors: No known data errors

  pool: file_storage
 state: ONLINE
  scan: scrub repaired 0 in 16h5m with 0 errors on Sun Jun 11 16:29:47 2017
config:

    NAME                            STATE     READ WRITE CKSUM
    file_storage                    ONLINE       0     0     0
     raidz2-0                       ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
     raidz2-1                       ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
       WD Red 4TB                   ONLINE       0     0     0
    logs
     mirror-2                       ONLINE       0     0     0
       Intel S3700 100GB            ONLINE       0     0     0
       Intel S3700 100GB            ONLINE       0     0     0
    cache
       Samsung 850 Pro 512GB        ONLINE       0     0     0
       Samsung 850 Pro 512GB        ONLINE       0     0     0

errors: No known data errors

As you can see above, I have different pool configurations for different applications.

I have a small pool I use to boot the server and keep my active VM images on. I require relatively high performance out of it, which is why it is made up of two SSDs. Since my servers don't do heavy writes here, I've opted to go with EVO drives to save money. This has proven to be the correct choice. A year and two months in, I still have 95% remaining life according to SMART, which means I'll get at least 23 more years out of them as far as write cycles go (in other words, they'll last until way after they are functionally obsolete).

This proved to be a good application for a single two-drive mirror for me, as the data is easily replaceable (the images are backed up locally) and I didn't need much space. It also really doesn't need a separate SLOG, as it is a mirror of two SSDs.

The big pool with dual RAIDz2 vdevs of 6 drives each is used for file storage. It doesn't need high IOPS in either read or write, as it doesn't handle that type of workload. I do appreciate fast sequential reads and writes to it though, and I want a little more fault resilience: while it is backed up to a remote cloud backup, restoring it would be a major, long, drawn-out pain due to its size. In this application I value the higher storage efficiency and thus lower cost per TB of RAIDz2, because resilver times are not an issue. I can start a resilver before I go to bed, and it will be done in a couple of hours. If it's a particularly slow one, it will always be done by the time I get home from work the next day, so it won't impact my actual use of the pool.

For this file server application, RAIDz2 is a great choice.

The L2ARC is there to speed up work on large files (like when editing large video files from my desktop) and to help reduce the risk of stutter during playback from my media library if multiple things happen to kick off all at once.

The SLOG drives are there to speed up sequential writes when I am copying or saving files to the portion of the pool that has sync writes enabled. I only enable sync writes in certain areas that hold files that would be a huge pain to lose to write errors.
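For reference, attaching those vdevs is just the following (the device names below are placeholders, not my actual disk IDs):

Code:
    # attach a mirrored log vdev and two striped cache devices (placeholder names)
    zpool add file_storage log mirror ata-INTEL_S3700_A ata-INTEL_S3700_B
    zpool add file_storage cache ata-Samsung_850_PRO_A ata-Samsung_850_PRO_B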

Anyway, long story short, ZFS is great for many reasons, and one of those is how flexible and configurable it is, so you can make the most out of it to fit your budget, needs and application. There is no such thing as a good "one size fits all" approach like using mirrors for everything.
 
Since you use the pool as a filer for larger files, I doubt that an SLOG makes any sense.
ZFS itself is a crash-resistant copy-on-write filesystem, which means that a crash during a write does not lead to a corrupt raid or a corrupt filesystem. A large file that is being written during a crash is damaged in any case, regardless of sync settings. The writing program, e.g. Microsoft Word, must handle this itself with a temp file.

This is different from a conventional raid with its write hole problem. Because a raid stripe is written sequentially to the disks there, a crash leads to corrupt raid stripes, and since you need two transactions for a write (modify data, then modify metadata), a crash can corrupt not only the raid but your filesystem as well. Neither of these problems exists on ZFS.

Sync write with a ZIL/SLOG is only needed for transaction-critical use cases, e.g. VMs (and older non-CoW filesystems) or databases, where you need transaction-safe behaviour, i.e. a committed single write must be on disk. Otherwise simply disable sync and let ZFS do what's needed.

If you really need sync write behaviour and the S3700 is too slow, you need something like a P3700 (NVMe with high IOPS and powerloss protection).


I see your point. I should have phrased my question differently.

I have already decided that I want to keep sync=always in the areas of my pool where I run it that way; there are less critical areas I run with sync=disabled. I have also decided that I am willing to spend as much as ~$1,000 for a mirrored pair of SLOG devices if they can significantly improve my write speeds above my current measured 119MB/s.

At $400 apiece, will a pair of 8GB ZeusRAMs do it, or are they old enough now that they won't make a huge impact?

I hear good things about Intel's server PCIe SSDs in SLOG applications, but they only come in enormous (and expensive) sizes, which are both out of my budget and seem wasteful for a pair of little SLOGs.

I read a post on ServeTheHome about using a RAID card's battery-backed cache as a poor man's RAM device, but I keep reading warnings about using ZFS on top of RAID cards, and I'm not convinced this would be a good idea. It would be very cost effective, though, to use such a RAID card to mirror one's SLOG devices and set the SLOG size to the same size as (or slightly smaller than, for a margin) the RAID controller's battery-backed cache...

Maybe there are new M.2 devices that I am unaware of that would work well in this application?

Any input on what I can get that fits the budget ($500 or less per drive) and might be an improvement in SLOG performance over my aging S3700s would be appreciated.
 
The L2ARC is there to speed up work on large files (like when editing large video files from my desktop) and to help reduce the risk of stutter during playback from my media library if multiple things happen to kick off all at once.

The SLOG drives are there to speed up sequential writes when I am copying or saving files to the portion of the pool that has sync writes enabled. I only enable sync writes in certain areas that hold files that would be a huge pain to lose to write errors.

The L2ARC for video files is perfectly OK, since only on the L2ARC can you enable prefetching to cache a sequential workload. You must enable it, as it is disabled by default.

The second point is a common misunderstanding. The SLOG is not a write cache. The ZFS write cache is made up of RAM (around 4GB by default). The SLOG is a device that logs all writes that are in the cache but not yet on disk, so they can be redone on the next power-up. The SLOG is only used in this special case. Disable sync for your editing device.
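On ZFS on Linux that switch is the l2arc_noprefetch module parameter; a rough sketch (the illumos equivalent goes in /etc/system):

Code:
    # allow prefetched/streaming reads to be written to the L2ARC (runtime)
    echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
    # persist across reboots
    echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf
    # illumos: add "set zfs:l2arc_noprefetch=0" to /etc/system and reboot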
 
Also, sync writes always seem to be much slower than async, even with a ZeusRAM device. I once even created an 8GB ramdisk and used that as a SLOG (for testing), and saw the same behavior. There seems to be some kind of ZIL throttling or other suboptimal behavior.
 
The L2ARC for video files is perfectly OK, since only on the L2ARC can you enable prefetching to cache a sequential workload. You must enable it, as it is disabled by default.

The second point is a common misunderstanding. The SLOG is not a write cache. The ZFS write cache is made up of RAM (around 4GB by default). The SLOG is a device that logs all writes that are in the cache but not yet on disk, so they can be redone on the next power-up. The SLOG is only used in this special case. Disable sync for your editing device.


I know exactly how the SLOG works. I never said it was a write cache; what I said was that it could speed up sync writes, and this it does.

The SLOG is simply a separate log device where the ZIL, which otherwise resides in the main pool, is placed.

During async writes, writes are accepted into the RAM cache and immediately reported back to the source as having been committed to non-volatile disk. This is essentially a lie, in order to keep writes moving faster. If power is lost or the system crashes before this data is ACTUALLY committed to non-volatile storage, you could wind up with data loss.

During sync writes, a log is written to the ZIL, and once that log is written the system reports the write as committed to non-volatile storage. The data is then written to the main pool from RAM, and once this is done, the log of that data is purged from the ZIL. If anything happens (crash, power loss, etc.) before the data is written from RAM to the main pool, then on the next boot or mount the ZIL is replayed, the data is recovered and written to the pool, and nothing is lost. This is thus more secure than async writes. The ZIL is never read from unless something goes wrong and this boot/mount-time replay is needed.

When we get a SLOG, all we do is add a dedicated (faster) drive for the ZIL, which speeds up sync writes, as the data can be quickly committed to the ZIL and reported back as written.

So, as I stated, yes, a good SLOG speeds up sync writes.

That doesn't mean I think it is a cache. It isn't.
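If you want to see it happening, watching per-vdev activity during a sync-heavy copy shows the log mirror absorbing the writes; something along these lines:

Code:
    # per-vdev I/O statistics, refreshed every second; the "logs" section
    # only shows write traffic while sync writes are in flight
    zpool iostat -v file_storage 1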
 
Also, sync writes always seem to be much slower than async, even with a ZeusRAM device. I once even created an 8GB ramdisk and used that as a SLOG (for testing), and saw the same behavior. There seems to be some kind of ZIL throttling or other suboptimal behavior.

Interesting.

I thought it had more to do with inefficiencies in the SAS -> ZeusRAM configuration.

The reason I say this is that I have seen benchmarks (which I now can't seem to find) with Intel's PCIe P3700 outperforming STEC's 8GB ZeusRAM by a wide margin when used as a SLOG. Presumably this is due to the lower latency of being connected directly to the PCIe bus rather than having to go through a SAS controller.

I've also heard reports (but without data) suggesting that the new Optane P4800x is an absolute SLOG monster, putting everything else to shame.

In either case, these are moot for my purposes: the devices are huge (and thus wasteful here; I don't want a several-hundred-GB to over-a-TB SSD when I only need 12GB), and their price puts them right out of my budget.

It would be really cool if those little Intel M.2 Optane drives came in a capacitor-backed variety, as I'd imagine they could make for a good SLOG at a more reasonable price for the small amount of storage you really need for this purpose. That being said, Intel unethically locking them to certain platforms becomes a problem as well.
 
Since you use the pool as a filer for larger files, I doubt that an SLOG makes any sense.
ZFS itself is a crash-resistant copy-on-write filesystem, which means that a crash during a write does not lead to a corrupt raid or a corrupt filesystem. A large file that is being written during a crash is damaged in any case, regardless of sync settings. The writing program, e.g. Microsoft Word, must handle this itself with a temp file.

This is different from a conventional raid with its write hole problem. Because a raid stripe is written sequentially to the disks there, a crash leads to corrupt raid stripes, and since you need two transactions for a write (modify data, then modify metadata), a crash can corrupt not only the raid but your filesystem as well. Neither of these problems exists on ZFS.

Sync write with a ZIL/SLOG is only needed for transaction-critical use cases, e.g. VMs (and older non-CoW filesystems) or databases, where you need transaction-safe behaviour, i.e. a committed single write must be on disk. Otherwise simply disable sync and let ZFS do what's needed.

If you really need sync write behaviour and the S3700 is too slow, you need something like a P3700 (NVMe with high IOPS and powerloss protection).

This is a really good answer! People should bookmark this.
 
Zarathustra[H]

Everything you say is correct, but for me the essential point with large files is that sync=always is much slower than sync=disabled even with the fastest SLOG device, as every write must be done twice: once to the log device for every committed data block, and once to the pool as a flush of the write cache contents.

You must weigh this against the danger of sync=disabled when writing large files. If there is a crash, it mostly does not matter whether you can recover the last committed data blocks or not; the whole file is most likely damaged in either case, while the ZFS filesystem is always intact. So for a filer with large files you get lower performance with no or minimal gain in security. This is why a regular filer, e.g. SMB, is left at the default sync setting, which effectively means off.

Of course you can think of situations where this is different, and you can use sync to improve security in those rare situations where it helps, but sync=always for a ZFS filer is not a common setting.
 
Zarathustra[H]

Everything you say is correct, but for me the essential point with large files is that sync=always is much slower than sync=disabled even with the fastest SLOG device, as every write must be done twice: once to the log device for every committed data block, and once to the pool as a flush of the write cache contents.

You must weigh this against the danger of sync=disabled when writing large files. If there is a crash, it mostly does not matter whether you can recover the last committed data blocks or not; the whole file is most likely damaged in either case, while the ZFS filesystem is always intact. So for a filer with large files you get lower performance with no or minimal gain in security. This is why a regular filer, e.g. SMB, is left at the default sync setting, which effectively means off.

Of course you can think of situations where this is different, and you can use sync to improve security in those rare situations where it helps, but sync=always for a ZFS filer is not a common setting.


I agree with you.

The importance of sync writes goes down in a large file static storage environment. It's not like in VM images or databases, where if you miss those last crucial writes, you might corrupt the whole damned thing.

On the other hand, I've seen my pool write at between 400 and 800MB/s, and I have a direct wired 10Gbit line to my workstation, which can theoretically handle up to 1200MB/s in writes. In ZFS the typical setting is 5 seconds between cache flushes, so with my pool write speeds that could amount to between 2 and 4GB of data between flushes that would be lost with async writes.
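(That 5-second figure is the transaction group timeout; on ZFS on Linux you can check it via the zfs_txg_timeout module parameter:)

Code:
    # transaction group commit interval in seconds (default 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout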

The way I set things up is as follows. I have one main dataset with sync=always just to err on the side of safety, as my workstation is of the single-small-dedicated-SSD variety, so all working files are opened from the NAS. This dataset contains a huge mix of files, from my many spreadsheets up to images and large videos I am editing, as well as partition dumps and my fiance's Mac's Time Machine backups via Netatalk. Everything but the kitchen sink goes in here.

When I identify areas where sync writes don't make sense, I move them to their own dedicated datasets with sync=disabled set. Examples here are my media folder, which contains mostly purchased downloaded videos and rips of my Blu-rays, because for these files I am usually copying them over from another location, so if something goes wrong I can usually just copy them again without anything being lost. My MythTV DVR recording files are also in a sync=disabled dataset.

The rest, which I have not yet specifically identified as not needing sync writes, stays in the main dataset, where sync is set to always, just to err on the side of caution. I would not want to be in the middle of saving one of my large spreadsheets and have it be lost or corrupted, and I have so many files that I don't want to be in the business of sorting sync- and async-type files on a file-by-file basis. I don't trust sync=standard to figure it out correctly on its own.
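In ZFS terms that layout is just per-dataset properties, roughly like this (the dataset names below are placeholders, not my exact ones):

Code:
    # main catch-all dataset: err on the side of safety
    zfs set sync=always file_storage/main
    # replaceable media and DVR recordings: no sync needed
    zfs set sync=disabled file_storage/media
    zfs set sync=disabled file_storage/mythtv
    # double-check what is set vs. inherited across the pool
    zfs get -r sync file_storage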

So, I have my reasons for what I do, and I'm not really looking to change it, so can we please skip the whole philosophical discussion about when sync writes do and do not make sense, and focus on the question I am asking?

Namely, is there any pair of SLOG devices that will outperform my current mirror of 100GB Intel S3700s for $1000 or less?
 
Faster than the S3700-100 as an SLOG:

- Overprovision the S3700-100, e.g. with an HPA of 80GB (= 20GB usable), on a new or securely erased SSD; see the sketch below
- The same with a faster S3700 (200-400GB)
- ZeusRAM, best with MPIO SAS
- Intel P3700 NVMe
- Newer generations of NVMe with high IOPS and powerloss protection

An SLOG mirror only helps
- when the SLOG dies together with a crash (data loss)
- to prevent a performance degradation after a failure, as you then revert to the in-pool ZIL
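A rough sketch of that overprovisioning step with hdparm on Linux; the device name is only an example, the sector count assumes 512-byte sectors, and the commands are destructive, so the drive must be new or one you are erasing anyway:

Code:
    # ATA secure erase to reset the flash (throwaway password)
    hdparm --user-master u --security-set-pass pass /dev/sdX
    hdparm --user-master u --security-erase pass /dev/sdX
    # create a host protected area so only ~20GB of LBAs stay visible
    # (20e9 / 512 = 39,062,500 sectors; the 'p' prefix makes it permanent)
    hdparm -N p39062500 /dev/sdX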
 
Lots of words in all these other replies.

For performance: max out your system RAM first. I'd say at least 32 GB of RAM.
 