ZFS ZIL Question

1) The write cache is enabled on the disks.
2) There is something going on with transfers over a 32KB block size.
3) There is something else going on when you add a ZIL besides what is normally known. When I added the separate ZIL device, performance tanked at transfer sizes over 32KB.

I suspect that the slog might be being ignored for writes over 16Kb in size, with those writes going straight to the pool drives with their caches turned on. Can you monitor iostat while running ATTO to take a look?

What is the logbias attribute set to for this pool / filesystem / zvol? Latency or throughput?


Also - might have missed it - what is the block size of the zvol?
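If it's any help, here is a minimal sketch (Python shelling out to 'zfs get') for pulling those properties in one go; tank/vmstore is only a placeholder dataset name, and you would check recordsize instead of volblocksize for a plain filesystem.

Code:
import subprocess

# Placeholder dataset name; replace with the pool/zvol actually being tested.
DATASET = "tank/vmstore"

def zfs_get(prop, dataset=DATASET):
    """Return a single ZFS property value via `zfs get -H -o value`."""
    out = subprocess.run(
        ["zfs", "get", "-H", "-o", "value", prop, dataset],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# logbias=latency sends sync writes to the slog; throughput bypasses it.
print("logbias      :", zfs_get("logbias"))
# Block size of the zvol (use "recordsize" for a filesystem instead).
print("volblocksize :", zfs_get("volblocksize"))
# sync=standard/always/disabled controls whether the ZIL is used at all.
print("sync         :", zfs_get("sync"))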
 
I seem to recall nex7 discovering that multiple zil devices don't help - the system will not write to a device until the pending write to the previous device is done.
 
danswartz, what nex7 found was that a single write stream will only produce a queue depth of 1 to the slog devices, so one transaction at a time.

If you had multiple different sources writing, then ZFS could make use of multiple slog vdevs, but no single write stream would.

I wonder if the same is true for a single slog device, i.e. whether multiple writers would increase its queue depth or not. That post is hard reading.
 
ZIL and L2ARC are per-pool only, not global.


Thanks for that.

I ended up buying an OCZ RevoDrive for my boot OS,

and I will use the i-RAM drive for my HDD ZFS system.

Knocking down the amount of ZFS cache helped free up memory for tmpfs.
 
I have a test ZFS server connected to an InfiniBand DDR network; we have about ten hypervisors, KVM-based and ESXi. We use NFSv4 for KVM and iSCSI for ESXi, over the RDMA protocols SRP and iSER.

And the problem is, as all of you can guess, synchronous writes: the speed of the SLOG device.

For the dedicated SLOG device, I have an idea of using the integrated LSI 2208 RAID controller, battery backed, in my test ZFS Monster, unused until now.
With its 1GB of very fast read/write cache RAM, the RAID controller will transform the ZIL's mostly sequential 4KB, queue depth = 1 writes into RAID0 tasks.

I want to use common HDD drives, for instance the WD RE4 250GB, because their write speed holds up in the long run (around 125MB/s), and because of their durability.
With eight disks I can get a SLOG device with durability and sequential speed around 1GB/s.
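As a rough sanity check on those numbers, assuming the 125MB/s per-drive figure and eight drives from above (real RAID0 scaling will lose a little to striping and controller overhead):

Code:
# Back-of-the-envelope throughput for the proposed RAID0 SLOG of WD RE4 drives.
DRIVES = 8               # eight 250GB RE4 disks behind the LSI 2208 (assumed)
PER_DRIVE_MBPS = 125     # sustained sequential write per drive (assumed)

raw_mbps = DRIVES * PER_DRIVE_MBPS
print(f"Raw RAID0 sequential write: {raw_mbps} MB/s (~{raw_mbps / 1000:.1f} GB/s)")

# The 1GB of battery-backed controller cache only buys about one second of
# burst at that rate before the spindles have to keep up on their own.
CACHE_GB = 1
print(f"Cache absorbs roughly {CACHE_GB * 1000 / raw_mbps:.1f} s of burst")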

With such a SLOG device I can serve NFS or iSCSI synchronous writes from hypervisors, for instance ESXi or KVM-based, over InfiniBand DDR or 10GbE.

Of course, the pool must be capable of serving such speeds too.

That's my idea. What do you think about it? Have any of you tested such a setup?
 
Not sure; that 1GB RAM buffer won't last long, but RAID cards are generally limited in what they will supply.

I think you will want it to flush the RAM as fast as possible, likely to 4 60GB SSDs or so? You really have to play around with the RAID card and drives and see what gives the optimal performance and is able to keep up with the speed you want.

I also wonder at what point all this cost is enough to just warrant getting a DDRdrive. I just wish they would make it in an x4 PCIe model.
 
I investigated that path a while back, but it never gained much traction as I ended up using a 100GB Intel S3700 SSD under-provisioned to 20GB for my ZIL.

I posted a thread about it over on the STH forums.

http://forums.servethehome.com/raid-controllers-host-bus-adapters/1517-hba-cache-bbu.html

I think it would be a good option, but someone needs to test it out.... :)

Riley
 
Why 'short stroke' the SSD? The SLOG is only ever going to hold a few GB per 10 seconds as a worst case anyway. May as well give the whole drive to ZFS and let the firmware do its thing.
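For what it's worth, here are rough numbers behind that "few GB per 10 seconds" worst case, assuming the old rule of thumb of sizing for about two transaction-group commit windows and taking the ingest link speed as the ceiling:

Code:
# Worst-case SLOG occupancy: only the sync writes that arrive between
# transaction-group commits ever live on the log device.
LINK_GBPS = 10            # assumed ingest link (10GbE); use 1 for gigabit
TXG_WINDOW_S = 10         # roughly two 5-second txg commit intervals (assumed)

ingest_mb_s = LINK_GBPS * 1000 / 8                 # ~1250 MB/s at line rate
worst_case_gb = ingest_mb_s * TXG_WINDOW_S / 1000
print(f"Worst-case SLOG usage: ~{worst_case_gb:.1f} GB")   # ~12.5 GB at 10GbE

# The same math at 1GbE gives ~1.25 GB, so even a small partition is roomy.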
 
I think the concern with some drives was that if the OS didn't support TRIM, write latency could suffer. No idea how likely that was...
 
True, but background garbage collection, combined with only using 10% of the SSD even under absolutely extreme loads, should still be fast.

However, I still see lots of folks like the OP being, IMO, far too concerned with SLOGs for home installations. Just disable sync, and if you're still concerned, run the server attached to a UPS. There's really no point budgeting for a SLOG in those environments.
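For a home box that really is just one property change; tank/vms is a placeholder dataset name, and it only makes sense where losing the last few seconds of writes after a crash is acceptable.

Code:
import subprocess

DATASET = "tank/vms"   # placeholder; point it at the datastore in question

# sync=disabled acknowledges sync writes from RAM, so a SLOG is never needed.
subprocess.run(["zfs", "set", "sync=disabled", DATASET], check=True)
# Confirm the setting took effect.
subprocess.run(["zfs", "get", "sync", DATASET], check=True)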
 
Yeah, agreed. I run with sync disabled on my ESXi datastore. I do hourly replications and daily (nightly) backups, so worst case on a power failure, I roll back any failed VMs and lose an hour tops...
 
Why 'short stroke' the SSD? The SLOG is only ever going to hold a few GB per 10 seconds as a worst case anyway. May as well give the whole drive to ZFS and let the firmware do its thing.

I haven't tested the S3700, but (massively) "short stroking" the Intel 320s provided nearly double the I/Os when I set a few systems up a few years ago.
 
It shouldn't, though; technically the firmware is writing to fresh cells either way.

Hmm, I wonder why you saw those results. Note, I'm not suggesting you didn't see what you saw, just curious why that produces those results. I've never tested it myself.
 
You overprovision the SSDs because that way the garbage collection has more space to work with. When you use the full 100 GB of an SSD that has maybe 120 GB internally, the GC can only use 20 GB. If you 'short stroke' it to 20 GB, the GC always has 100 GB available, so it will find an erased block faster (it also always has less internal fragmentation), resulting in lower latency.

While ZFS may have committed the full SLOG to the pool already, the SSD does not know about that, so at some point the full partition will be allocated with already committed (old) data. This would be different if ZFS had TRIM support for the SLOG, but even that would possibly be worse than overprovisioning, because the TRIM implementation on SATA before 3.1 can induce extra latency.
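Putting numbers on that, using the illustrative 120GB-internal / 100GB-exposed figures from above:

Code:
# Spare area the garbage collector can work with once the partition is dirty.
INTERNAL_GB = 120        # raw NAND inside the drive (illustrative figure)

def spare_area(partition_gb, internal_gb=INTERNAL_GB):
    """GB the controller can keep erased after every LBA has been written."""
    return internal_gb - partition_gb

for used in (100, 20):
    spare = spare_area(used)
    print(f"{used:>3} GB partition -> {spare} GB spare "
          f"({spare / INTERNAL_GB:.0%} of the NAND free for GC)")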
 