Big ZFS SSD problem - ZIL/L2ARC max 2500 IOPS

csmcsm

hello,

I've been playing around with ZFS for several weeks (I used OpenIndiana and OmniOS to test) and I've run into a very strange issue.

If I add a ZIL device (a Hitachi 400GB SLC SSD, 4Gb FC) to my pool and set sync=always, the ZIL device gets (depending on the block size) 30-90 Mbit of throughput, BUT IOPS sit at only 2200-2500. The disks themselves can handle at least 8k write IOPS, so I decided to add a second disk to the ZIL - now both devices do just 1100-1250 IOPS each (so the total is still not over 2500 IOPS). Because that seemed strange, I added 2 more disks - and all 4 disks together are still at 2500.

I removed the ZIL disks and created a pool with 1 disk (not mirrored, just to test a standalone disk) - again only 2500 IOPS. I added 5 more disks to that pool (again as standalone vdevs) and I'm back to the same situation as above: I can't push more than 2500 IOPS if I use sync=always. Throughput ranges from 30 Mbit up to 800 Mbit (depending on block size), so I'm sure it's not related to the disks themselves.
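
For reference, this is roughly how I set the test up (pool and device names here are just placeholders):

zpool create tank c2t0d0 c2t1d0
zfs set sync=always tank
zpool add tank log c4t0d0        # the SLC SSD as a separate log device
zpool iostat -v tank 1           # watch per-vdev operations while dd/iozone runs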

I also tried the same thing with SATA-based SSDs - and I have pretty much the same issue with regard to IOPS.

If I switch to sync=standard, I get up to 40k IOPS, but I'm sure that's only because RAM is used to aggregate the writes into larger blocks.

I tried several kernel parameters, but nothing helps.

Any ideas? To me it looks like an IOPS limit per vdev, but I can't figure out where. The server hardware runs with 96GB RAM and dual hexa-core Intel Xeons (so that is certainly not the issue either).
 
Get a lower-latency ZIL/SLOG drive.

Or change the parameters of the test.

The results are expected.
 
There is pretty much nothing faster to be had than these SSDs. As a side note, I also tested a Violin Memory VMA array both as a data volume and as a ZIL device (lowest latency; 350k IOPS is what I get out of that device on a Windows storage server), so it's certainly not latency.

What do you mean by changing the parameters of the test, and why would such poor IOPS performance be expected? None of this is expected to me - at the very least, if I add a disk to a pool the IOPS should increase, but they don't.
 
How can you just randomly compare how ZFS works to something else?
If something else gets you what you want, use it.

ZFS works in a very specific way, and latency is the issue when using sync. If you want faster sync writes, you must lower latency. There is no other option (other than disabling sync).

The 400GB SSD has its own latency, plus the overhead of 4Gbit FC. That doesn't sound very speedy to me, and you want to compare it to a Violin array that is designed to have practically no latency?

Why should adding more disks increase the speed of sync writes? More disks have nothing to do with sync write speed; the speed of a SINGLE disk controls it. So your sync write speed will be that of the SLOWEST disk you have, not the combined speed of all your disks.

If you want faster sync writes, you need faster disks (lower-latency disks).
If you want to handle multiple writers at once, THEN you can worry about the quantity of disks, but since everything you have done so far has been limited to a single writer, that won't matter at all.
 
You cannot just randomly compare IOPS values; they have to be compared in relation to queue depth. For ZFS sync writes (from a single writer), QD1 is the important metric - something usually ignored by SSD vendors because they cannot differentiate their products on that value. Basically every ZIL device attached via FC, SATA, or SAS will be almost equally bad at this.
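
As a rough sanity check, at QD1 the IOPS you see are just the inverse of the per-write latency:

# QD1 sync writes are serialized: IOPS ~= 1 / per-write round-trip latency
# 2500 IOPS -> 1/2500 s ~= 400 microseconds per write (device + transport + cache flush)
# to see 10k sync IOPS, the whole round trip would have to stay under ~100 microseconds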

The best option is solid-state storage attached directly to the PCIe bus. If money were no issue I would probably try a ZeusRAM behind a write-caching RAID controller with BBU. A fast SSD on a RAID controller with BBU may also be enough if you are not after very high sustained bandwidth.
 
Well, thanks for the answer, but I think you misunderstood me.

1) I tested it with 400GB SLC SSDs - 2500 IOPS for the whole pool (whether I add one 400GB disk or 6; all 6 without RAID-Z, just to see single-disk performance).
2) If I create separate pools with one 400GB disk each, every pool still does just 2500 IOPS - but EACH pool does, so all 6 pools together give me 15,000 IOPS.
3) I have a Violin Memory SAN and connected it to Solaris to test - even on that $200k device I don't get more than 2500 IOPS out of it (I put it in a single pool, and I also tested it as a ZIL device).
4) No, I did test it with multiple writers at the same time; it does not speed up.
5) I know that if I use two devices in a mirrored ZIL it obviously still shows the speed of just one ZIL - BUT I added 4 separate ZIL vdevs, so in that case the ZIL can write to 4 disks and spread the log across them separately - that should increase speed.
6) As a side note, I also tested it with 200GB eMLC disks on a Sun (!) 6Gb SAS controller - the same IOPS limit shows up on those disks too.

So it's certainly not related to the devices - I simply have a 2500 IOPS limit per disk somewhere.
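
For point 2 above, this is roughly what I did (pool and device names are placeholders):

zpool create test1 c4t0d0
zpool create test2 c4t1d0          # ...and so on for all six SSDs
zfs set sync=always test1
zfs set sync=always test2
# each pool on its own tops out around 2500 IOPS; all six running in parallel reach ~15k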


Also, PCIe devices make no sense for me because we want to run two storage nodes in a failover cluster, so having a PCIe device for the write log in one server would mean data loss if we fail over to the second node (the second node cannot see the PCIe device and would lose the ZIL contents).
 
BTW, a funny thing I found out while testing:

If I change some parameters:

echo zfs:zfs_no_write_throttle/W0t0 | mdb -kw
echo zfs_write_limit_override/W0t1 | mdb -kw


I suddenly saw 7k write IOPS for one device, and when I added 2 more I got to 20k write IOPS (which would be more like the correct numbers), BUT it seemed almost no bytes were actually written (I saw just a few KB/s in dd and iozone).
So it showed me high IOPS and throughput with "zpool iostat pool 1", but only a few bytes of data were actually written (I saw that in iozone and dd), AND I even saw 100k IOPS on the Violin Memory IOPS monitor.
This is really weird, so in that case it looks more like a bug to me..
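
This is roughly how I cross-checked it (file names are just examples):

zpool iostat -v tank 1       # reports ~20k ops/s and ~100MB/s while the test runs
ls -l /tank/testfile         # but the file dd/iozone writes barely grows
zfs get -p used tank         # pool-side space accounting also shows almost nothing landing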
 
I would try Solaris 11.1

It's the most stable, best implementation of zfs out there.
 
Just checked with Solaris 11.1 - same problem. I can't get more than 2500 IOPS out of a pool, whether I add one (basic) disk or 6 (basic) disks - the total is 2500 IOPS...
 
Keep in mind that early SSDs sucked as far as write performance goes, IIRC. The ZFS defaults for filling L2ARC are very small too. Is there a reason you can't just bump the write throttle & limit and go with it?
 
I'm not talking about L2ARC - I already increased the write boost to 400MB/s and the steady rate to 200MB/s and it works perfectly (please see my /etc/system at the bottom of this post).

To clarify, I don't have any throughput problem. I get about 350MB/s from each of the SSDs - so with 4 disks in one pool I get 1.4GB/s of throughput if I use the SSDs as a general pool (one zpool, 4 disks, no RAID, just 4 plain disks) with sync=standard. So it's not a throughput problem; the problem is only with small IOPS, because the ZIL uses small writes when sync=always (I get about 30MB/s and 2500 IOPS - whether with one disk or 6, IOPS are the limiting factor).

With L2ARC it's not such a big problem, because the warm-up and filling of the L2ARC is done with aggregated writes (I get 200MB/s at roughly 1k IOPS writing to the L2ARC disks), but the ZIL writes are much smaller and it's a total pain to use. If I remove the separate ZIL disk and write the ZIL directly to the pool it's faster, but it hurts the pool disks: read latency goes up because the disks are busy with the extra sync writes.

I'm really pretty much out of ideas; to me it appears ZFS somehow has an IOPS limit per pool. The problem is:

If I set sync=standard and disable the write cache on the FC volumes I export to the servers, it slows the whole system down dramatically; but if I enable the write cache on the FC volumes, it's pretty dangerous with regard to data loss.


It's really strange: it happens on OpenIndiana, OmniOS and even Solaris 11.1, and I have the problem with 3 different SSD types (eMLC 6Gb SAS disks, SLC FC disks, and even a Violin Memory 3200 series), so to me this looks like an IOPS limit or a bug somewhere in ZFS itself.
The reason I think it's an IOPS limit: whether I use the SSDs as ZIL disks or as regular pool disks (testing them with small writes), no disk pushes more than 2500 IOPS for synchronous writes. If I set sync=standard, throughput increases a lot AND I also frequently see much higher IOPS - I assume that's because with sync=standard the writes are aggregated and flushed to disk together (so all of a sudden the writes become a streaming workload, and that workload performs well).


My /etc/system (I also tried an empty /etc/system and 10 other combinations - nothing fixed my problem); this is the current one:
set zfs:zfs_no_write_throttle=1
set zfs:l2arc_write_max=167772160
set zfs:l2arc_write_boost=335544320
set zfs:l2arc_headroom=4
set zfs:l2arc_feed_secs=1
set zfs:l2arc_feed_min_ms=50
set zfs:l2arc_noprefetch=0
set zfs:zfs_vdev_max_pending=100
set zfs:l2arc_feed_again=1
set sd:sd_io_time=0x14
set sd:sd_retry_count=3
set ssd:ssd_io_time=0x14
set ssd:ssd_retry_count=3
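
To double-check after a reboot that a tunable actually took effect, I read it back with mdb, e.g.:

echo zfs_vdev_max_pending/D | mdb -k     # 32-bit int tunables
echo l2arc_write_max/E | mdb -k          # 64-bit tunables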
 
"I'm not talking about L2ARC - I already increased the write boost to 400MB/s and the steady rate to 200MB/s and it works perfectly (please see my /etc/system at the bottom of this post)."

Please read more carefully. I said "The ZFS defaults for filling L2ARC are very small too". I was pointing out that ZIL isn't the only area that is impacted by (IMO) out of date tuning parameter defaults. I thought you said earlier that increasing those two settings DID help things?
 
Increasing those settings only helped with faster L2ARC filling (so the 8MB limit is gone) - not with IOPS, only with throughput. And because the ZIL doesn't need throughput but needs IOPS, this had no impact on IOPS at all.
 
I guess I am totally confused then. I thought you said you changed those two params and IOPS got better, but that there were questions as to whether it was doing actual I/O.
 
Ah, yes, now I know what you're referring to. Well, the IOPS were better, but there was actually no data being written anymore :) zpool iostat showed something like 20k IOPS and 100MB/s, but on the testing server (where I run dd and iozone) it looked as if only a few KB/s were being written, and on the storage itself the test file didn't really grow in size (so I assume it's a bug). If you reproduce it you'll see what I mean; it's really weird.
 
You should search the forum for posts by Nex7 - who is/was a Nexenta employee. He goes over why SSDs get fewer IOPS than you expect and why multiple SSDs do not help (much). He has thorough answers, but the gist is:

The ZIL is essentially a QD1 write pattern, and worse than an average QD1 benchmark at that, because it waits for every IO to be acknowledged before it moves on to the next (or sends a flush). Most SSDs don't shine without high QD.

The reason multiple SSDs don't help is again because it's essentially a single queue and only sends one IO at a time. You're waiting on the first SSD to finish before the second one gets any data, back and forth.

This is why DDR-based drives are/were popular; their latency per IO is massively lower than even an SSD.
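
If you want to see it for yourself, a QD1 sync-write test looks something like this (assuming fio, or any tool that issues O_SYNC writes one at a time, is available; the path is a placeholder):

fio --name=qd1-sync --filename=/tank/fio.test --rw=write --bs=4k --size=1g --iodepth=1 --sync=1 --numjobs=1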
 
Interesting thread, all around. This reminds me why I was so disappointed when I got a good SSD (Samsung 840 Pro) and pool sync writes still sucked (ESXi writing to an NFS share). Since I do hourly pool backups, I said 'screw it!' and put that dataset in sync=disabled mode. Absolute worst case, I lose an hour's work...
 
Thanks for the good answer!!

It also explains why the VMA3205/VMA3210 SAN (which is not an SSD; it is a SAN with extremely low latency and extremely high speed) still doesn't get more than 2500 IOPS out.

And it (maybe) also explains why the same SSDs used as regular pool disks are not faster either: I had sync=always on that pool, so basically the SSDs would be faster, but because the ZIL log lands on the data disks (this pool had no separate ZIL devices), we are back to the same issue you mentioned above.

Any idea how to set the QD for the SSD devices to something much higher?

Using sync=disabled is really not an option for me, even with hourly backups - if the SAN crashes (and I've had that in the past, not often, but 1-2 times with a kernel panic), I would have to repair and restore about 150 VMs...


.. but it still doesn't explain the very weird behavior:
if I set:
echo zfs:zfs_no_write_throttle/W0t0 | mdb -kw
echo zfs_write_limit_override/W0t1 | mdb -kw

I get a lot of IOPS - and throughput in zpool iostat - BUT basically no data written to the disks..
/ this looks to me like a weird code bug..
 
Here is a thought for you. Since you are using ZFS, instead of doing backups just take a ZFS snapshot of the share every hour. If you have a crash, your 'repair and restore' consists of 'zfs rollback ...'. Could you live with that?
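
Something along these lines (dataset names and dates are just an example):

zfs snapshot tank/vmstore@hourly-2013011514
zfs rollback -r tank/vmstore@hourly-2013011514   # -r discards any snapshots newer than the one you roll back to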
 
You should search the forum for posts by Nex7 - who is/was a Nexenta employee. He goes over why SSDs get fewer IOPS than you expect and why multiple SSDs do not help (much). He has thorough answers, but the gist is:

The ZIL is essentially a QD1 write pattern, and worse than an average QD1 benchmark at that, because it waits for every IO to be acknowledged before it moves on to the next (or sends a flush). Most SSDs don't shine without high QD.

The reason multiple SSDs don't help is again because it's essentially a single queue and only sends one IO at a time. You're waiting on the first SSD to finish before the second one gets any data, back and forth.

This is why DDR-based drives are/were popular; their latency per IO is massively lower than even an SSD.

THIS.


Unfortunately not many people know this.

This is why I really wish review sites would test QD1 IOPS on SSDs more frequently. :(
 
As others have already stated - your SSD latency at a single or very low queue depth is what is ultimately the only real value of importance in most ZIL use-cases, but it sounds like that eventually got communicated in an understandable way for you, so I'll leave it at that.

-snip-
my /etc/system (I tried it also with a empty /etc/system and with 10 other combinations - nothing fixed my problem), this is my current one:
set zfs:zfs_no_write_throttle=1
set zfs:l2arc_write_max=167772160
set zfs:l2arc_write_boost=335544320
set zfs:l2arc_headroom=4
set zfs:l2arc_feed_secs=1
set zfs:l2arc_feed_min_ms=50
set zfs:l2arc_noprefetch=0
set zfs:zfs_vdev_max_pending=100
set zfs:l2arc_feed_again=1
set sd:sd_io_time=0x14
set sd:sd_retry_count=3
set ssd:ssd_io_time=0x14
set ssd:ssd_retry_count=3
-snip-

l2arc_write_max @ 160 MB is going to cause you problems. Way too high. Also not likely doing what you want it to do. There's a lot of confusion here around what these values do and how they're actually used. I may need to write a new blog entry explaining this. If you're curious why I say that, I'll give you a hint - start here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#4079

Similarly, so will l2arc_write_boost. Also, you don't really want to 'double' your boost setting - it gets added onto the existing write_max, so the same value as l2arc_write_max would = double; you're actually tripling l2arc_write_max with this setting when boost is in effect. See: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#4072
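
In other words, with your numbers:

# l2arc_write_max   = 167772160   -> 160 MB per feed interval once the ARC is warm
# l2arc_write_boost = 335544320   -> added on top of write_max while the ARC is still warming up
# so during warmup each feed tries to push 160 MB + 320 MB = 480 MB, i.e. 3x write_max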

set zfs:l2arc_feed_secs=1 - this is already the value

set zfs:l2arc_feed_min_ms=50 - that's scary to me; again, prior comment, start looking at the code there

set zfs:l2arc_noprefetch=0 - this is rarely a good idea; it is unlikely you want this set to 0

set zfs:zfs_vdev_max_pending=100 - you're nuts - this is a per-leaf-vdev setting, e.g. per disk (vdevs are the aggregates of leaf vdevs, which are usually disks, or a partition on a disk, or a LUN from somewhere) -- you're telling ZFS it has a 100-deep queue on every leaf vdev. Not even SSDs generally want it that high. For SSDs I'd recommend somewhere between 16 & 32 probably; SAS disks 8-10, SATA I tend to say 1.
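
i.e., if I were setting it on an SSD-heavy box I'd go with something more like this (illustrative, and remember it's one global knob, not per pool):

set zfs:zfs_vdev_max_pending=32     # 16-32 for SSDs; 8-10 for SAS; ~1 for SATA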

set zfs:l2arc_feed_again=1 - I can find no evidence this is going to do anything. I don't believe this is tuneable.

set sd:sd_io_time=0x14 - Why would you want it to wait a full 20 seconds on sd timeouts? Especially on an SSD pool?

set ssd:ssd_io_time=0x14 - not that you've changed this from sd_io_time anyway, but this is probably not what you think it is - this is FC disks, not SSD's. :)

echo zfs:zfs_no_write_throttle/W0t0 | mdb -kw
echo zfs_write_limit_override/W0t1 | mdb -kw

I get a lot of IOPS - and throughput in zpool iostat - BUT basically no data written to the disks..
/ this looks to me like a weird code bug..

As memory serves, write_limit_override is not a boolean, it is a size value of, well, let's call it the txg size (in RAM) before invoking a sync or delay mechanic, as I recall.. so setting this to 1, that's likely to not be at all what you want. I'm not even sure if it would obey such a tiny value, or what it's going to do to write mechanics, I'd have to go dig through code. That you set zfs_no_write_throttle, though, may override this entirely.
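
If you were going to touch it at all (and I wouldn't), it would want a byte count rather than a flag - something like:

echo zfs_write_limit_override/Z 0t1073741824 | mdb -kw    # ~1 GB as a 64-bit value; purely illustrative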

zfs_no_write_throttle is a horrible setting - we used to use this as a quickfix for write throttle issues, but over time we found it inevitably caused more problems with the systems than it solved. The setting itself, in code, seems almost innocuous - it only gets checked in like 1 or 2 places and modifies a bit of logic if true. However, the results of those changes have a snowball effect on lots of other pieces of ZFS, culminating in very variable and weird performance.

I'd have to go dig through code more to tell you why your values for these 2 would lead to the symptom you describe, but it isn't impossible. You're going down a lot of weird paths with no_write_throttle set to 1 and throwing lots of data at the pool. We almost never set this to 1 anymore, so I don't have fresh memory here.

The good news re: write throttle is that all this crap has been fundamentally altered in latest Open-ZFS code by the guys at DelphiX, and you can look forward to a more easily metric'ed and stable method of handling incoming write I/O once whatever distro you're on is using their new code push. Settings like zfs_no_write_throttle and write_limit_override and txg_synctime_ms and txg_timeout are gone, and good riddance.

Ultimately, if you understand that the latency of ZIL (log) writes is the only metric of note, you can understand why the slog devices are only doing a total of 2500 or so IOPS, even if you add more - ZIL is round-robin in terms of how it hits log vdevs, and it isn't parallel, it's serial, so it doesn't write to one and then write to the next until it gets back an affirmative from the first -- thus, the only way additional slog devices can help you is if their latency is very, very low. Indeed, in general, the only slog devices I'm aware of that 'scale' past 2 or so with any continuing improvement in slog IOPS potential is STEC ZeusRAM's, and even they ultimately do not continue to scale - we've never actually tried to find that limit, but believe it to be around 8-10 of them.
 
BTW, a side note: read speed is perfect (I tested it; the SSDs show me 10-12k IOPS each on reads when I use them as read cache) - only write speed is still the issue.
 
Hi,

first of all, thanks a lot for your detailed answer. I really appreciate it, and I don't want to look like a smartass, but I have some notes/questions (see below ///):

As others have already stated - your SSD latency at a single or very low queue depth is what is ultimately the only real value of importance in most ZIL use-cases, but it sounds like that eventually got communicated in an understandable way for you, so I'll leave it at that.

//// Generally yes, but as far as I understood it, if I set it to a single or low queue depth, doesn't that negatively impact the rest of the pool disks?



l2arc_write_max @ 160 MB is going to cause you problems. Way too high. Also not likely doing what you want it to do. There's a lot of confusion here around what these values do and how they're actually used. I may need to write a new blog entry explaining this. If you're curious why I say that, I'll give you a hint - start here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#4079
/// Why would it cause me problems? The disks I'm using are HDS 400GB SLC dual-port disks, and the JBOD they sit in is a 4x4Gb FC JBOD connected directly to a quad-port QLogic card - they are separate from the regular FC disks and JBODs. I tested them; they can easily push much higher numbers.


Similarly, so will l2arc_write_boost. Also, you don't really want to 'double' your boost setting - it gets added onto the existing write_max, so the same value as l2arc_write_max would = double; you're actually tripling l2arc_write_max with this setting when boost is in effect. See: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#4072
/// See my answer above; I'm well aware it boosts that - and I want the fastest possible cache warmup (and the disks can easily do it).

set zfs:l2arc_feed_secs=1 - this is already the value
/// Yes, I've tested/used so many different Solaris-based systems that I just want to be sure :)


set zfs:l2arc_feed_min_ms=50 - that's scary to me; again, prior comment, start looking at the code there
/// Why scary?


set zfs:l2arc_noprefetch=0 - this is rarely a good idea; it is unlikely you want this set to 0
/// Yes, I want this intentionally, based on my workload and on what the disks can easily do.


set zfs:zfs_vdev_max_pending=100 - you're nuts - this is a per-leaf-vdev setting, e.g. per disk (vdevs are the aggregates of leaf vdevs, which are usually disks, or a partition on a disk, or a LUN from somewhere) -- you're telling ZFS it has a 100-deep queue on every leaf vdev. Not even SSDs generally want it that high. For SSDs I'd recommend somewhere between 16 & 32 probably; SAS disks 8-10, SATA I tend to say 1.
/// I agree with you; this was a testing parameter (I tried it with all kinds of different values), but yes, I will switch it back to 32. That should work for my regular 10k FC disks and for my SLC SSD FC disks.

set zfs:l2arc_feed_again=1 - I can find no evidence this is going to do anything. I don't believe this is tuneable.
/// It was a tunable, but I need to check - maybe it really isn't used anymore and should be removed.

set sd:sd_io_time=0x14 - Why would you want it to wait a full 20 seconds on sd timeouts? Especially on an SSD pool?
/// Well, it was previously 60s x 5 retries, which meant a 5-minute hang on slowly failing disks. We hit that issue 2-3 times in the past, where a disk was not marked as failed and went through the full 60s x 5 - which caused some VMs to crash due to the long hang. What number would you recommend instead? The problem here, again, is that it's not possible to set this separately for the regular disks in the pool and for the SSDs..


set ssd:ssd_io_time=0x14 - not that you've changed this from sd_io_time anyway, but this is probably not what you think it is - this is FC disks, not SSD's. :)
/// Well, that is for SATA disks as far as I know; I don't use them in this pool, but I have a separate SAS/SATA pool, so I thought I'd set it here too?



As memory serves, write_limit_override is not a boolean, it is a size value of, well, let's call it the txg size (in RAM) before invoking a sync or delay mechanic, as I recall.. so setting this to 1, that's likely to not be at all what you want. I'm not even sure if it would obey such a tiny value, or what it's going to do to write mechanics, I'd have to go dig through code. That you set zfs_no_write_throttle, though, may override this entirely.
/// Agreed, it's not set anymore - that was just a test and, as I mentioned, very strange: with it set to 1 I got 20k+ IOPS, but basically no data was written. So yes, you are right, it's the wrong setting, but that test showed me the system CAN actually push 20k+ IOPS to a log disk (at least to the VMA/Violin device I used for testing), and that's why I think there is a bug somewhere in the code. With write_limit_override set to 1 and no_write_throttle set to 0 it produced that strange behavior (high IOPS, but basically no data written); if you set either of them to a different value, IOPS are slow as usual (but more data is actually written if no_write_throttle is set to 1). It's really, really strange behavior when you test it.

zfs_no_write_throttle is a horrible setting - we used to use this as a quickfix for write throttle issues, but over time we found it inevitably caused more problems with the systems than it solved. The setting itself, in code, seems almost innocuous - it only gets checked in like 1 or 2 places and modifies a bit of logic if true. However, the results of those changes have a snowball effect on lots of other pieces of ZFS, culminating in very variable and weird performance.
/// Agreed for most systems, but this system has basically no bottleneck to the disks whatsoever, and the connected servers would never really hit any performance limits or max it out. To explain: we use 6 JBODs, each with 2x4Gb FC and 10k FC disks (each connected directly, so no daisy-chain slowness), and the SSD JBOD is a 4x4Gb unit with just 6 disks in it, also separately/directly connected (we bought an ML370 with 9 PCIe slots and quad-port 4Gb QLogic cards).

I'd have to go dig through code more to tell you why your values for these 2 would lead to the symptom you describe, but it isn't impossible. You're going down a lot of weird paths with no_write_throttle set to 1 and throwing lots of data at the pool. We almost never set this to 1 anymore, so I don't have fresh memory here.
/// please see my notes above (and yes I agree)

The good news re: write throttle is that all this crap has been fundamentally altered in latest Open-ZFS code by the guys at DelphiX, and you can look forward to a more easily metric'ed and stable method of handling incoming write I/O once whatever distro you're on is using their new code push. Settings like zfs_no_write_throttle and write_limit_override and txg_synctime_ms and txg_timeout are gone, and good riddance.
// Well, the txg settings still have an impact (at least in my testing of the commit times), but they had no effect on my IOPS write limit issue.

Ultimately, if you understand that the latency of ZIL (log) writes is the only metric of note, you can understand why the slog devices are only doing a total of 2500 or so IOPS, even if you add more - ZIL is round-robin in terms of how it hits log vdevs, and it isn't parallel, it's serial, so it doesn't write to one and then write to the next until it gets back an affirmative from the first -- thus, the only way additional slog devices can help you is if their latency is very, very low. Indeed, in general, the only slog devices I'm aware of that 'scale' past 2 or so with any continuing improvement in slog IOPS potential is STEC ZeusRAM's, and even they ultimately do not continue to scale - we've never actually tried to find that limit, but believe it to be around 8-10 of them.
//// Well, we can't use a ZeusRAM because we don't have a shared SAS JBOD, and connecting that device internally to just one node would break our failover to the second node (the ZIL would go with it). Also, we tested with cheap MLC SSDs, with the Violin Memory/VMA array, with HDS SLC SSDs, even with a RAMSAN - all had the exact same 2500 IOPS limit. And not just as ZIL: as I mentioned, I have the exact same problem if I use the device as a single pool disk. So there has to be an issue somewhere in the code or in the settings in general; otherwise the devices would not all perform equally slowly. Especially since the VMA device is a low-latency, SLC, 5-10TB array..
 
Here is a thought for you. Since you are using ZFS, instead of doing backups just take a ZFS snapshot of the share every hour. If you have a crash, your 'repair and restore' consists of 'zfs rollback ...'. Could you live with that?

No, sorry, this is 100% not possible for us. As I mentioned, we have hundreds of VMs on our SAN - we can't have any inconsistency whatsoever (repairing them would be a nightmare for us; for some of the VMs we don't even have the root password needed to fix/fsck them).
 
Hi,

first of all, thanks a lot for your detailed answer. I really appreciate it, and I don't want to look like a smartass, but I have some notes/questions (see below ///):

//// Generally yes, but as far as I understood it, if I set it to a single or low queue depth, doesn't that negatively impact the rest of the pool disks?

This perhaps means you don't understand yet. The ZIL is low queue depth. You can't affect that. That is how it works. So if your SSD doesn't perform well at low queue depth, a ZIL with a log vdev made up of those SSDs will not perform well either.

/// Why would it cause me problems? The disks I'm using are HDS 400GB SLC dual-port disks, and the JBOD they sit in is a 4x4Gb FC JBOD connected directly to a quad-port QLogic card - they are separate from the regular FC disks and JBODs. I tested them; they can easily push much higher numbers.

The problem is that this number isn't meant to represent how much data the drive can push per second (not least because in default mode it actually wants to write every 200 milliseconds or so).

/// See my answer above; I'm well aware it boosts that - and I want the fastest possible cache warmup (and the disks can easily do it).

Yes, but again, this number feeds into logic that is not simply 'try to write this much per second'. In fact, setting it too high can work against you, when what you'd really rather have is 'try to write a smaller amount, but every 200 ms'.

/// Why scary?
Because your misunderstanding of the other number means you seem to believe your L2ARC devices are capable of 160 MB per 50 ms. I assure you, they are not. :)

/// Yes, I want this intentionally, based on my workload and on what the disks can easily do.
This is only going to be useful if the prefetched data actually has a decent hit rate.

/// I agree with you; this was a testing parameter (I tried it with all kinds of different values), but yes, I will switch it back to 32. That should work for my regular 10k FC disks and for my SLC SSD FC disks.
32 is a potentially decent number that still maintains low latency for FC-connected SLC SSDs. It is unlikely to be right for 10K disks, FC-connected or not. For those I'd recommend more like 8-16, depending on your clients' sensitivity to I/O latency.

/// Well, it was previously 60s x 5 retries, which meant a 5-minute hang on slowly failing disks. We hit that issue 2-3 times in the past, where a disk was not marked as failed and went through the full 60s x 5 - which caused some VMs to crash due to the long hang. What number would you recommend instead? The problem here, again, is that it's not possible to set this separately for the regular disks in the pool and for the SSDs..
I had not understood that you had 2 types of disk in there. But yes, 20 is better than 60, and in fact what you've got it set to is much closer to what we use at work, but if you KNOW all your disks are TLER, you should probably set this down to just over the highest TLER value your drives give up at, usually 8-10 (I recommend 10-12).
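
e.g., something like this, if you know every drive in the box gives up by ~10 seconds:

set sd:sd_io_time=0xc      # 12 seconds
set sd:sd_retry_count=3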

//// Well, we can't use a ZeusRAM because we don't have a shared SAS JBOD, and connecting that device internally to just one node would break our failover to the second node (the ZIL would go with it). Also, we tested with cheap MLC SSDs, with the Violin Memory/VMA array, with HDS SLC SSDs, even with a RAMSAN - all had the exact same 2500 IOPS limit. And not just as ZIL: as I mentioned, I have the exact same problem if I use the device as a single pool disk. So there has to be an issue somewhere in the code or in the settings in general; otherwise the devices would not all perform equally slowly. Especially since the VMA device is a low-latency, SLC, 5-10TB array..
I would not have expected these to all have the 'exact same IOPS limit'. I would expect them to have not great IOPS limits - I've also seen Violin and RAMSAN's tested, and they're just not that great at single queue depth, plus you're adding in the FC or ethernet interconnect latency, so it's just no good. The Violin is a great device, but we've tested it before as a ZIL at a client site and it did not work out.

Though I do agree with you, actually, as those should have had fairly different maximum values. They shouldn't all have maxed out at something within 10% of each other. That would seem to imply some other bottleneck is involved, likely something that exists in all the test scenarios (some component that is always there, be it hardware or code), though at midnight I'm not thinking of anything. If an idea floats through my brain I'll update the thread with it.

The only thought bubbling up right now is how much RAM do you have? The 'write cache' of ZFS always exists in RAM, and if it fills up its limit I seem to recall it forces a txg. How are you testing this? Some workload that just throws writes at the thing as fast as possible? You might be causing a problem you'd never actually see in production, btw.
 
THIS.


Unfortunately not many people know this.

This is why I really wish review sites would test QD1 IOPS on SSDs more frequently. :(

How can this not be known to so many people? It is mentioned in basically every thread about slow sync writes I have come across.
And it is even the logical conclusion if you think about what the ZIL does.
 