ZFS Users: Post Your Scrub / Resilver Speeds

Zarathustra[H]
I was doing a scrub today after (hopefully) chasing down a very annoying SAS cable issue that was causing errors, and I was reminded how cool it is to see these arrays tear through massive amounts of data during scrubs and resilvers, so I figured I'd post mine.
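
If you want to check yours, kick off a scrub and watch the scan line in the status output (substitute your own pool name, of course):

zpool scrub <poolname>
zpool status <poolname>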

(scrub speed screenshot)


What have you got?
 
As soon as my 8x 18TB Exos arrive I'll post, as I have to build the pools. It'll have 2x 1TB NVMe SSDs to help it along once they arrive. Is SATA OK for log files, or would NVMe be better?
 
As soon as my 8x 18TB Exos arrive I'll post, as I have to build the pools. It'll have 2x 1TB NVMe SSDs to help it along once they arrive. Is SATA OK for log files, or would NVMe be better?

Log drives have very specific and unusual requirements.

I don't know how familiar you are with sync writes and async writes and what the log drives do, but I'll write a quick summary below just to clear things up, and please don't be offended if I have taken time to explain things you already know.

1.) Log is short for ZIL, or ZFS Intent Log. It's a place where sync writes are quickly stored on a temporary basis before being properly integrated into the pool. It is often oversimplified as a kind of write cache, but that isn't really accurate.

The terminology is as follows:
ZIL = ZFS Intent Log, a place where sync writes are temporarily stored
SLOG Device (sometimes just LOG device) = a dedicated fast storage device specifically for the ZIL.

2.) You always have a ZIL, even if you don't have dedicated drives for it. If you don't use dedicated drives, the ZIL is just integrated into the main pool.

3.) There are async writes and sync writes. Normally the system tries to determine on its own which write method is appropriate, but it isn't always good at doing so. You can also tell ZFS explicitly whether you want sync writes or async writes on a given storage pool or subvol.

4.) This is done with the zfs set command:

zfs set <property=value> ... <filesystem|volume|snapshot>

The options for the "sync" property are:

sync=standard | always | disabled

So, as an example:

zfs set sync=disabled <poolname>

This turns off sync writes for a given pool, and everything written to it is handled as async. (Note that the value is "disabled"; there is no "off".)

These settings are inherited by subvols unless specifically overridden for them. It is more typical to tailor these settings per subvol, to suit the types of files stored in them, than to set them for the entire pool. For instance, my pool/media subvol has sync=disabled; it doesn't add value there. The main pool is set to "standard". The subvol where I store live VM images is set to "always".
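
As a sketch of what that looks like at the command line (the dataset names are just examples mirroring my layout):

zfs set sync=disabled pool/media
zfs set sync=always pool/vmimages
zfs get -r sync pool

The last command lists the sync setting for every subvol under the pool, along with whether it was set locally or inherited.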

5.) Async writes are for writes of lower criticality. The file can be downloaded again, copied again, or otherwise recreated, so having that very last second of data before a system goes down is less critical. In this case ZFS reports back to whatever is sending it data that the data is committed to disk and it is OK to proceed with more data, even if the data is still only in memory and hasn't actually been stored on the drives yet. This speeds up writes significantly, as there is no waiting for the drives to commit anything.

6.) With sync writes, ZFS waits until the writes have been fully committed to disk before sending the confirmation that more data can be sent. This makes things slower.

7.) The LOG device (or SLOG, separate log device) exists to speed up this sync write process by moving the ZIL from the pool to faster storage. During normal operation this device is never read from, only written to. Intended writes are quickly written here, allowing ZFS to confirm that the data has been committed without waiting on the slower pool drives. The data is then written from RAM to the pool, and once that completes, the copy in the ZIL is discarded.

8.) The only time the ZIL is read from is if something goes wrong: the system goes down (power loss, crash, etc.) before the write operation from RAM to disk completes. If that happens, the next time the pool is imported ZFS reads the ZIL, reconstructs the data, and integrates it into the pool.

9.) Only about a second's worth of writes is ever stored in the ZIL at any given time, so the drive does not need to be large (but larger drives are often faster, so they usually wind up being much bigger than they need to be anyway).
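
To put a rough number on that: even with 10 gigabit Ethernet as the bottleneck (about 1.25GB/s), a second or two of in-flight sync writes is only a couple of gigabytes, so capacity is essentially irrelevant when shopping for a SLOG.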

10.) You don't NEED redundancy on the ZIL, but it is a good idea if you are cautious. Otherwise, if you ever have a log device fail AND the system go down at the same time, you will lose that last second of writes. This is not as unlikely as it may seem: you could easily fail to notice that the SLOG has started throwing errors, followed by a power outage and unclean shutdown, or a hard crash.
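
If you do want that redundancy, you just add the log as a mirror. A minimal sketch, assuming a pool named tank and two NVMe devices (substitute your own names):

zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1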

11.) Most people probably don't need a SLOG; async writes are probably just fine for them. Think about it: if you are copying a large file to, say, a media library, and the copy is interrupted midway through, that file is useless either way. Rescuing the last second of data doesn't really help you. Where it really becomes useful is if you are storing databases on the pool or running VMs with images on it. There, losing that last second can be critical, causing valuable information to be omitted and potentially corrupting a database or a VM drive image. Those are just a couple of examples, but you get the point. If async writes are fine for your application, you can save money and complexity by setting the whole pool to sync=disabled (or, if you don't care about sync write speeds, just let sync writes go to the ZIL in the pool, but this will be very slow). You just have to consider your use scenario.

12.) Finally, selecting a good SLOG drive is not really about normal SSD performance characteristics. The most important statistic for a SLOG drive is write latency; this will make or break SLOG performance. It is also a good idea to use a battery or capacitor backed SLOG, so that if power is lost, anything still in the drive's DRAM cache can be committed to flash in that final fraction of a second.

13.) So, not all SSDs are created equal for this task. Back in the SATA days I first tried using Samsung drives for my SLOG; they performed almost as badly as hard drives. I then moved to 100GB Intel S3700 SATA drives, which performed MUCH better, but were still slow by NVMe standards.

14.) Just like with SATA, not all NVMe drives are created equal for this task. BY FAR the best drives for this purpose are Intel's Optane drives. I just picked up a pair of new-old-stock Optane 900Ps to replace my aging Intel S3700s (they are technically marketed as consumer drives, but are enterprise models under the covers). With these I can easily saturate my 10gig Ethernet with sync writes. Before, I was getting about 100MB/s sync writes (which wasn't bad back in the gigabit Ethernet days).
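
If you want to see what sync writes do on your own hardware, a crude comparison is to write the same file with and without a flush on every block (the path here is just an example, and compression can skew test files, so treat the numbers as ballpark):

dd if=/dev/urandom of=/tank/testfile bs=4k count=25000 oflag=dsync
dd if=/dev/urandom of=/tank/testfile bs=4k count=25000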

This writeup over on Servethehome is a little old, but it compares some drives and tells the story pretty well.
 
As soon as my 8x 18TB Exos arrive I'll post, as I have to build the pools. It'll have 2x 1TB NVMe SSDs to help it along once they arrive. Is SATA OK for log files, or would NVMe be better?

Also, if your use case doesn't look like it will benefit from a SLOG, there are other things you can do with SSDs to speed up the pool.

1.) L2ARC or cache devices. (same thing, two different names)

The L2ARC is a drive-based read cache. Back in the day, fast hard drives were put in this role with slow hard drives as the main pool; in the SSD era, SSDs have been used instead. ZFS, being a copy-on-write file system, needs a good amount of RAM to work well. The main cache (the ARC) lives in RAM. The best thing you can do for performance is to expand RAM until you have about a GB of RAM per TB of storage in the pool; above that it doesn't help much. After that an L2ARC can help, but it really depends on workload.
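
(As a very rough worked example: by that rule of thumb, the 8x 18TB build mentioned above, 144TB raw, would suggest RAM somewhere in the 128-144GB range, though plenty of home media servers get away with far less.)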

First off, unlike the SLOG, the L2ARC does not need redundancy: everything in it is also in the fully redundant main pool, so losing an L2ARC drive will not hose the pool. In my case I striped two SSDs to maximize speed and capacity in this role. It also does not benefit from super low latency the way the SLOG does, so more ordinary drives work fine here.

The downside is that outside of some pretty specialized workloads, where the same files (too large to fit in the ARC, yet small enough to fit in the L2ARC) are read repeatedly, the L2ARC doesn't really help much. I've had an L2ARC for years, and the cache hit rate is embarrassingly low (under 0.1%). I'm currently going through a stepwise pool upgrade, and I am considering getting rid of the L2ARC altogether in the process and instead using new NVMe drives for the new "special allocation class".
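
Cache devices are at least easy to experiment with, since they can be added and removed at any time without risking the pool. A sketch, again assuming a pool named tank (device names are examples):

zpool add tank cache /dev/nvme0n1 /dev/nvme1n1
zpool remove tank /dev/nvme0n1

The arc_summary tool that ships with OpenZFS reports ARC and L2ARC hit rates, which is how you can tell whether an L2ARC is actually earning its keep.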


2.) Special allocation class, or "special device". These are relatively new, within the last couple of years. Recognizing that the L2ARC has often been disappointing, this class is intended to speed up the pool differently. It does so in two ways:

a.) Storing of file system metadata on these faster disks; and
b.) (optional) Storing of small files under a certain preset size on these faster disks.

Unlike the L2ARC, these special devices have the potential to greatly speed up the pool. Also unlike the L2ARC, if you lose your special devices you lose your entire pool, so it is recommended to keep the same level of redundancy on them as on the main pool.

In my case, I plan on adding a few NVMe drives as special devices as part of my pool modernization. Since I have two RAIDZ2 vdevs, I can lose at least two drives before losing data, so I plan on mimicking that in the special devices. The best way for me to accomplish this is a three-way mirror, so that is the plan. I am still researching the necessary disk size and the best drives for the purpose, but I am leaning towards three 2TB Inland Premium drives. They are cheap enough, perform well, and have a DRAM cache. I am not 100% settled on this yet; still reading up on it.
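
For anyone following along, that plan would look roughly like this at the command line (pool and device names are placeholders, and special_small_blocks is the optional small-file cutoff; keep it below your recordsize, or every block will land on the special vdev):

zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
zfs set special_small_blocks=32K tank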
 
No offence taken, this is all very new to me and I love learning it as I start out on my Unraid journey. I have questions over on the Unraid forum as well, but they'll wait until I get this setup running; I'm still awaiting the drives. I'm using the 8-bay NAS case from Silverstone for this, and I've already modified the case for better airflow and chosen some really nice fans to keep air moving through it.

If all I'm looking to do is host my Plex media library, serve my data files over the network, and back up say 2-3 computers and a few cell phones, do I need to run a VM or can I do this all through Docker? If you like, I can PM you about my use case, since you seem to be very well versed in this.
 
No offence taken, this is all very new to me and I love learning it as I start out on my Unraid journey. I have questions over on the Unraid forum as well, but they'll wait until I get this setup running; I'm still awaiting the drives. I'm using the 8-bay NAS case from Silverstone for this, and I've already modified the case for better airflow and chosen some really nice fans to keep air moving through it.

If all I'm looking to do is host my Plex media library, serve my data files over the network, and back up say 2-3 computers and a few cell phones, do I need to run a VM or can I do this all through Docker? If you like, I can PM you about my use case, since you seem to be very well versed in this.

Sure, feel free to PM me if you'd like, but it would be even better to start a thread with your questions; that way the answers are publicly available for other people to benefit from as well, and you can get opinions from others here, not just me.

If you want to make sure I see it, just tag me or PM me a link to the thread!
 