All SSD ZFS Pools

Suprnaut

My client has purchased over 100 OCZ Talos 2 drives, Dell MD1220 JBODs, Dell R710 servers with 144GB of RAM, STEC ZeusRAM drives, and LSI 9207-8e HBA cards.

I have set up Nexenta for them and have been very underwhelmed by the benchmarks so far, specifically the writes. No matter what configuration, even a 48-drive RAID0, I have trouble getting more than 200MB/s on writes. Reads are phenomenal, topping out at 25GB/s.

Troubleshooting so far:

- We had 24 STEC ZeusIOPS drives brought in, since the OCZs were initially blamed for the poor performance. The STECs suffered from the same poor write performance.

- We added 8 ZeusRAM drives as a log and that has not improved performance.

- We have tried the latest OpenIndiana.

My question: are there any tunable ZFS settings, perhaps geared towards spinning disks rather than SSDs, that would improve performance? Does anyone else have experience with large SSD pools? Any other suggestions?
 
You need to find the part that is causing the problem:

First, start with a local dd or bonnie benchmark with sync disabled, only a few disks, and optionally only 32GB of RAM.
What result do you get?
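Something like this, as a rough sketch (pool name, device names, and file size are only placeholders):

# build a small stripe from a handful of SSDs and take sync and compression out of the equation
zpool create testpool c1t0d0 c1t1d0 c1t2d0 c1t3d0
zfs set sync=disabled testpool
zfs set compression=off testpool

# sequential write, then read back, with a file large enough that the (reduced) ARC cannot hide it
dd if=/dev/zero of=/testpool/ddfile bs=1M count=65536
dd if=/testpool/ddfile of=/dev/null bs=1M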

If possible, compare values without the expander (disks directly connected).
What result do you get? Optionally, compare against an LSI SAS2008-based controller such as a 9211 or IBM M1015 in IT mode.

Do you have another system to cross-check the same SSDs, maybe a commonly used SuperMicro system?

You may also check against OmniOS bloody, which has the newest free ZFS bits and drivers.
 
It sounds to me like you're doing O_SYNC writes and bottlenecking at your ZIL. You probably can't do better than that if you need O_SYNC with a ZIL. Remove your log device or disable sync and see how your tests turn out.

You may actually see far superior performance (in O_SYNC writes) by removing the SLOG(s) and letting ZFS keep the ZIL on the pool itself.
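A sketch of that test, assuming a pool named "tank" (pool and device names are placeholders):

# check the current sync policy
zfs get sync tank

# bypass sync semantics entirely (unsafe for a production database, but it isolates the ZIL as the bottleneck)
zfs set sync=disabled tank
# ... rerun the write benchmark, then restore the default ...
zfs set sync=standard tank

# alternatively, remove the dedicated log devices so the ZIL lives on the pool itself
zpool remove tank c2t0d0
zpool status tank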
 
Gea,

dd benchmarks max out at 2.2GB/s writes and 3GB/s reads.

We don't have any other hardware on hand, such as JBODs or other HBAs.

We could try OmniOS; I didn't realize it had the newest drivers.

We have tried direct connect, daisy-chained, LSI SAS switches, multi-path, single-path, etc... all results were pretty much the same.

Obirth,

I thought the same thing, but the ZeusRAM ZIL actually slightly improved performance. (Of course, there are 8 of them striped, though.)
 
Striping the SLOG will not help you; it'll just mean that the ZeusRAMs will be idle. ZIL write performance is bottlenecked by the latency of the SLOG, and striping only increases bandwidth; it does not lower latency.

This has been mentioned on these forums a few times within the last few weeks, so I'm sure you can find the threads, and I'm fairly certain they had links to other sites with further documentation.
 

It seems your problem is sync writes:
- Do not use or stripe more than one ZeusRAM as ZIL; only a mirror is an option
- Do you need sync writes at all?
- Compare values without sync, with sync, and with sync + one ZeusRAM ZIL

Read http://www.nex7.com/node/12
 
Gea,

I have been working with the Nexenta engineer who writes that blog, and he is the one who said I was saturating the ZeusRAM and needed about 8 to keep up.

Yes, I am doing sync writes in the benchmark because the system is for a massive Oracle DB. The numbers look a lot better with sync disabled, but I was told never to run that way, especially with a database.

Even with 8 ZeusRAMs, I can look at iostat during the benchmark and see that %b (busy percentage) is over 65% on each log device.
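For reference, the per-device busy figures come from something like this (device names will vary):

# extended per-device statistics, refreshed every second
iostat -xn 1
# watch the %b column for the log devices while the benchmark runs;
# sustained high %b on the SLOGs during sync writes points at ZIL latency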
 
Suprnaut,

Are the ZeusRAMs attached to the same LSI 9207 HBA(s) used by the pool SSDs? Have you tried a dedicated HBA/chassis for the log devices (ZIL)?

Log device/pool IO contention can be a serious performance bottleneck.

Christopher George
www.ddrdrive.com
 
ddrdrive,

I have tried all different configurations, such as having the ZeusRAMs on their own dedicated 9207. Currently everything goes through 2 LSI SAS switches, so every drive has 4 paths to the 2 HBAs.
 

One ZeusRAM may be enough if you use ZFS over Ethernet up to 10GbE.
8GB is big enough to hold the last 5s of writes and commit them to disk synchronously while accepting the next data in the meantime. Sync performance is then mainly a question of I/O and latency. A second ZeusRAM would not help.

It seems that you run your DB locally, so you need better values locally. In that case an 8GB ZeusRAM may not be large enough, and its sequential write values may also not be high enough, even though its I/O is plenty for an SSD. You may need enough ZeusRAMs to hold 10s of writes with synchronous write performance similar to your pool.

Example:
If your async write performance is 3GB/s, you have about 60GB in 10s, which means at least 8 ZeusRAMs. But mostly, DB performance should not be transfer-limited but I/O-limited. Do you really have transactions that produce 3GB/s?
 
What benchmark(s) are you running? 128K+? 4K? Random or sequential?

I suspect it's a sync-write problem as well, which, judging from others' performance figures, limits you to the speed of your ZIL at a queue depth of one.

Have you tested without the log device(s)?

As others have mentioned, striping the log doesn't seem to do anything, according to Nex7's blog.
 
I have used all block sizes from 4K to 128K. Oracle uses 8K blocks. Also, as I've stated, I've tried with and without a ZIL.

I read Nex7's blog post (and have spoken with him by phone and email), and his conclusion was that you won't see a performance increase for 1 thread. My tests run from 1 to 128 threads. The blog post also concluded that the SSD used in that scenario had poor latency and suggested using a low-latency RAM drive, which is what I'm using. The tool I'm using is vdbench.
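For what it's worth, a rough sketch of the sort of vdbench parameter file I mean (device path and run lengths are placeholders, and the exact openflags value may need adjusting for your setup):

# 8K, 100% random writes, opened with O_SYNC, stepping through thread counts
sd=sd1,lun=/dev/rdsk/c1t0d0s0,openflags=o_sync
wd=wd1,sd=sd1,xfersize=8k,rdpct=0,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=60,interval=1,forthreads=(1,8,32,128)
# run with: ./vdbench -f sync8k.parm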

Gea, no, I will be using 8Gb FC connected to an Oracle cluster. Each cluster node will have redundant links, so 16Gb per node. I'm sure the performance will be adequate for the application, but given what was spent on SSDs I would like to deliver performance higher than 2 spinning drives.
 
Suprnaut - Does the Solaris 11 zfs_unmap_ignore_size bug affect Nexenta?


"Bug 15826358 - SUNBT7185015 Massive write slowdown on random write workloads due to SCSI unmap"

https://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=15826358 (note: link won't work without OSN creds)


Very simple workaround, added to /etc/system:

set zfs:zfs_unmap_ignore_size=0


Don't know if this has anything to do with your issues (random writes instead of sync), but it is a "Solaris" bug that causes some SSD vdevs to fall on their faces.

Good luck. I imagine there's some significant consternation over this problem... :(
 
paret0,

I tried to tune that setting, but it doesn't seem to exist in Nexenta. It must be Solaris-specific.
 

It doesn't show in /etc/system in default Oracle Solaris. You have to append that line to it.

And it may well be a Solaris 11+ specific bug and workaround...

Are you working with DTrace on this?
 
paret0,

I tried using this command, which is what I use to tune other settings on the live system (they get reset after a reboot):

"echo zfs_unmap_ignore_size /W0t0 |mdb -kw"
Normally it would confirm that it has changed the setting; instead, for this setting it says:
"mdb: failed to dereference symbol: unknown symbol name"
 
Space between | and mdb?? I think I had that problem with some commands.
 
msitpro,

Not sure; I just hit the up arrow to get to previous commands I had run and modified it with the new setting, so I'm 99% sure the command was correct.
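One way to rule out a typo is to try reading the symbol first; if the read also fails, the tunable simply isn't in the running kernel (a sketch, run as root):

# read the tunable; if the symbol exists this prints its current value
echo "zfs_unmap_ignore_size/D" | mdb -k

# only if the read works, write decimal 0 to the live kernel
echo "zfs_unmap_ignore_size/W 0t0" | mdb -kw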
 
BE VERY CAREFUL WITH THESE!

In /etc/system add these and test:

set zfs:zfs_no_write_throttle=1
set zfs:zfs_write_limit_override=1

You can also play around with zfs_vdev_max_pending; I would start at 12 and ramp up from there.

You can change this on the fly if you know the hex values; read about it here: http://www.c0t0d0s0.org/archives/7370-A-little-change-of-queues.html
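For reference, the on-the-fly change looks something like this (the 0t prefix makes the value decimal, so no hex is needed; it reverts at reboot):

# show the current per-vdev queue depth
echo "zfs_vdev_max_pending/D" | mdb -k

# set it to 12 on the running kernel
echo "zfs_vdev_max_pending/W 0t12" | mdb -kw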

If you have hybrid pools (or any spinning drives for that matter) in this system, the above settings WILL cause problems, because spinning disks can't keep up. They will significantly increase memory usage by the ZIL and can cause NMS to bog way down at times. There are also possible memory fragmentation issues; if you run into these you'll likely core dump. There is a fix for this if you hit it: if you have paid support they will provide the fix, but if you don't have support yet and you run into the kernel panic, send me a PM.

Also make sure compression is disabled. While it isn't taxing in a general sense, it does add latency, which decreases performance, so disable it until you hit the numbers you think you should get, then re-enable it and see what happens.

Also, the larger your block size, the better your throughput is going to be. The default is 128K; I would run some tests at 1024K, or do a milder step up the block-size ladder, maybe 256K or so.

BTW, don't let your client go cheap with the OCZ Talos drives; they're garbage. If your client insists on going cheap, look at the Kingston SSDNow 100E drives. I have personally tested these and they perform well: I had 11 of them in a RAID0 stripe and wrote to them for 48 hours straight with 100% random 4K blocks and saturated a gig link with no ZIL. Unfortunately I never got around to testing them behind 10 gig. There was no noticeable drop in write performance over that 48-hour period either.

You can get HA out of these SATA disks one of two ways: interposers or iSCSI. Nexenta won't support HA without interposers for SATA, but they will support iSCSI-based HA.

So fill two boxes and run OI, or work out licensing with Nexenta so you aren't paying twice. Use RAID10 on each box and then export a zvol that is 95% of all usable space. Use the 'head' Nexenta system to import the iSCSI from the other two systems. I personally haven't done this yet (just haven't had time to lab it out), but I know for a fact Nexenta has systems in the field that import FC/AoE/iSCSI block storage from other systems like this in HA configs.
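Roughly, the export side on each downstream box would look like this (pool name, zvol size, and the LU GUID are placeholders; the head node then imports the target with the usual iscsiadm discovery):

# zvol sized to ~95% of usable space
zfs create -V 10T tank/export0

# expose it over COMSTAR iSCSI
svcadm enable stmf
sbdadm create-lu /dev/zvol/rdsk/tank/export0
stmfadm add-view <guid printed by sbdadm>
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target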

With a config like that you could also do RAID10 on the 'head' Nexenta system, giving you fairly good availability, but you're losing 75% of your available space: half from the downstream RAID10, and then half of that again if you do RAID10 on the head. Typically, though, folks just import the iSCSI block devices and run one on one head node and one on the other. If one head crashes, the RSF plugin mounts the iSCSI block on the other head node without a single SATA bus getting in the way. You're still vulnerable to a single downstream box dying, though, so be aware of your risk profile; if the client is OK with that, get it in writing and roll with it.
 
Madrebel,

I know OCZ has a crap reputation. I didn't have input into their procurement, and I'm 99% sure we are stuck with them. But to be fair, the OCZs have so far benchmarked slightly better than the high-end STEC ZeusIOPS demo drives we got in.

As for the block size, we are pretty much stuck at 8K because that is Oracle's native block size. It is strange that I'm getting the same throughput no matter what block size with the SSDs; with spinning disks I get a huge bump from larger block sizes.
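For the 8K case, the matching ZFS-side setting (assuming the datafiles live on a filesystem dataset rather than a raw zvol; the dataset name below is just a placeholder) would be something like:

# match recordsize to Oracle's 8K block size; only affects files written after the change
zfs set recordsize=8k tank/oracle
zfs get recordsize tank/oracle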

I have played with those max-pending, write-throttle, and write-limit settings. Maybe I just need to find the sweet spot with them.
 
Has anyone got hands-on experience with the IBM 6Gb Performance Optimized HBA in an all-SSD ZFS pool?
Is there any tangible proof of the optimization's benefits compared to other LSI SAS2008 implementations like the ServeRAID M1015 and M1115?
 