OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

Anyone using a ZeusRAM for ZIL? What is your write/read performance for 4k random writes?

I have a RaidZ2 pool with 10x 4TB SAS drives and a ZeusRAM for ZIL, and when testing with fio I only get around 2,500 IOPS, which I think is way too little.

I also tried creating a mirror pool with 2x ZeusRAM, but the system somehow gets limited to 2.5k IOPS again as well.

My drives are in a Supermicro JBOD with a SAS expander, connected to an LSI 9207 HBA. I will try with Linux tomorrow; maybe it's an OS problem?
Regards, Matej
 

I use two ZeusRAMs myself in my pool. How do you have them configured? Are they flashed with the C023 firmware?

Can you post your pool config?
 
vektor777:
I have 2 pools. My firmware revision is C025.

Pool1:
Code:
        NAME                        STATE     READ WRITE CKSUM
        data                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            c6t5000C500836B2889d0   ONLINE       0     0     0
            c6t5000C500836B3259d0   ONLINE       0     0     0
            c6t5000C500836B5255d0   ONLINE       0     0     0
            c8t5000C50083756635d0   ONLINE       0     0     0
            c8t5000C5008375B075d0   ONLINE       0     0     0
            c8t5000C50083756A51d0   ONLINE       0     0     0
            c10t5000C50083756535d0  ONLINE       0     0     0
            c10t5000C50083756C0Dd0  ONLINE       0     0     0
            c9t5000C50083759D85d0   ONLINE       0     0     0
            c9t5000C50083756751d0   ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            c6t5000A72A300B3D5Fd0   ONLINE       0     0     0
            c8t5000A72A300B3D80d0   ONLINE       0     0     0

Pool2 (2x ZeusRAM as disks):
Code:
        NAME                        STATE     READ WRITE CKSUM
        zeusram                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            c10t5000A72A300B3D7Ed0  ONLINE       0     0     0
            c9t5000A72A300B3D9Dd0   ONLINE       0     0     0

At least the "zeusram" pool should run at 45kIOPS without problem. But it can't go over 2,5kIOPS, neither with sync=always or sync=disabled.

gigatexal: no, this is a dual-CPU, 12-core machine with HT.

Matej
 
It could have many threads, but I think ZFS benefits most from IPC and clock speed, no?
 
Could be, but I still think an Intel Xeon E5-2640 should handle that amount of traffic without a problem.

Matej
 
What are your fio settings? What is the output of top while the run is happening?
 
My fio settings are the following:
fio --filename=/zeusram/fiotest02 --size=3g --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest

Output of top:
Code:
last pid:  2807;  load avg:  4.41,  1.02,  0.38;  up 5+19:12:57                                                   08:55:47
65 processes: 48 sleeping, 17 on cpu
CPU states: 31.7% idle, 63.6% user,  4.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 33736 ctxsw, 1943 trap, 12347 intr, 27134 syscall
Memory: 256G phys mem, 123G free mem, 4096M total swap, 4096M free swap
Under that there are a few fio processes, probably 16 of them. :)

I just got to work and will boot the server into Linux and try there, just to see if it's a driver/OS problem.

Matej
 
Nope, the CPU is not the bottleneck.

I switched to Linux today.

I tested 4k random sync writes directly to a ZeusRAM: 48k IOPS -- OK
I tested 4k random sync writes to an mdadm RAID1 built from 2 ZeusRAMs: 48k IOPS -- OK

Then I created a ZFS raidz2 pool with 10 drives and the 2 ZeusRAMs in a mirror as ZIL. I set sync=always and started the fio test:
Code:
fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=linuxaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
The result is 15k IOPS. Is this the best I can get out of the system, or should it be able to push 48k IOPS at 4k, just like when the ZeusRAM is used as a single drive?

If I understand the ZFS write cache right, it stores small sync writes in a RAM cache and flushes them every 5s to the hard drives as a large sequential write, right? Since I'm using a ZeusRAM as ZIL, the system also writes the transaction log to the ZeusRAM, and with sync=always the write cache is only as fast as the ZIL device.
If the above is right, how come I'm only seeing 14k IOPS when using the ZIL? Is it a latency problem?
All my drives (including the ZeusRAM) are in an external JBOD, so that probably introduces a little latency. But could it be that much?

Matej
 
If you write small random data to a pool without sync, the writes are collected in RAM and flushed as a single large, fast sequential write. This gives you a certain level of write performance.

If you use sync, the same happens, but additionally every write request is also logged to your ZIL device. This means you have two write actions, and the effective performance is the fast sequential write to the pool plus a sync random write to the ZIL on every commit.
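
To see the difference in practice, the two paths can be compared on a small test dataset, for example like this (a rough sketch; the pool name "data", the dataset and the file names are just placeholders):
Code:
# Baseline: sync disabled, writes only go through the RAM write cache / TXG flush
zfs create data/synctest
zfs set sync=disabled data/synctest
fio --filename=/data/synctest/f1 --size=1g --rw=randwrite --bs=4k \
    --iodepth=16 --numjobs=4 --runtime=30 --ioengine=solarisaio \
    --group_reporting --name=nosync

# Sync path: every write request is additionally logged to the ZIL (the ZeusRAM mirror)
zfs set sync=always data/synctest
fio --filename=/data/synctest/f2 --size=1g --rw=randwrite --bs=4k \
    --iodepth=16 --numjobs=4 --runtime=30 --ioengine=solarisaio \
    --group_reporting --name=sync

# In a second shell, watch the log vdev to confirm the ZIL is actually being hit
zpool iostat -v data 1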
 
So I should see 48k IOPS on small random writes, since this is what the ZeusRAM can handle?

Matej
 
I seem to recall creating a couple of RAM disks and using one as an SLOG and the other as the pool disk. I created a sync=always filesystem, did a bulk write of several GB from /dev/zero to it, and the speed was much less than I expected. I will try to repeat that tonight...
 
So I should see 48k IOPS on small random writes, since this is what the ZeusRAM can handle?

Matej


No, you will always be slower than your pool, as you must write all data to the slow pool sequentially AND you must add the delay of the small random writes to the fast ZIL.

A sync write is always slower than an unsync write (which needs no ZIL at all), as sync is not a performance option but a security option.
 
Gea:
For a test, I added another 10-drive raidz2 vdev, so now I have the following config:
pool0 pool:
* 10 drives raidz2
* 10 drives raidz2
* log mirror ZeusRAM

If I'm writing at 48k IOPS @ 4k, that is around 180MB/s.

Both vdevs combined should be able to sustain at least 180MB/s of sequential writes.

If I'm looking at this the right way:
- I can write data to the pool at 48k IOPS @ 4k (this is the max the ZIL can handle). This translates to around 180MB/s of bandwidth.
- The TXG is flushed every 5s.
- For the slow pool to keep up with the traffic, it should be able to write sequentially at AT LEAST 180MB/s, and the above configuration can probably do more.

If I understand this correctly, the "slow pool" is not the limit in my case...
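
As a quick sanity check of that arithmetic (4 KiB blocks assumed):
Code:
# 48,000 IOPS x 4 KiB = 192,000 KiB/s, i.e. roughly 187 MiB/s (~196 MB/s),
# so "around 180MB/s" is in the right ballpark.
echo "$((48000 * 4)) KiB/s"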

I got some ideas on the OmniOS mailing list. I tried creating more folders and running 7 fio processes, each in a different folder. I got up to 35k IOPS with iodepth=4 and 16 threads for each fio command.

There was an interesting reply from Chip that prompted the test above:
The ZIL on log devices suffers a bit from not filling queues well. In order to get the queues to fill more, try running your test against several ZFS folders on the pool simultaneously and measure your total I/O.
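
Roughly what that multi-folder run looked like (a sketch; the dataset names are placeholders, and the ioengine should match the OS, e.g. solarisaio on OmniOS or libaio on Linux):
Code:
# Create one dataset per fio worker
for i in 1 2 3 4 5 6 7; do
  zfs create pool0/test$i
done

# One fio process per dataset, smaller iodepth per job, all in parallel
for i in 1 2 3 4 5 6 7; do
  fio --filename=/pool0/test$i/fiotest --size=2g --rw=randwrite --bs=4k \
      --iodepth=4 --numjobs=16 --runtime=60 --ioengine=solarisaio \
      --group_reporting --name=4ktest$i &
done
wait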
 
danswartz: I will try this ramdrive solution today. I tried using RAM as a ZIL device yesterday in Linux, but after creating a RAM block device, I did not see any traffic on it in iostat. I will try it today in OmniOS and report back.
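
For the OmniOS attempt, something along these lines should work (just a sketch; the ramdisk name and size are made up, and the log device can be removed again afterwards):
Code:
# Create a 4 GB ramdisk and attach it to the pool as a log device (OmniOS/illumos)
ramdiskadm -a rdzil 4g
zpool add pool0 log /dev/ramdisk/rdzil

# ... run the sync-write test, watching the ramdisk with "zpool iostat -v pool0 1" ...

# Clean up: log vdevs can be removed from a pool
zpool remove pool0 /dev/ramdisk/rdzil
ramdiskadm -d rdzil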

HammerSandwich: I could do that, and I will when I get back to work in a few hours.

Matej
 

Your single- and dual-vdev RaidZ2 pools can't handle the 500MB/s+ transaction offload performance the ZeusRAM can put out, so you're seemingly capped at the pool's lower throughput and IOPS.

Have you tried running the ZeusRAM in a dual-port configuration?
What about testing with a pool of mirrors instead of RaidZ2? From the drive count you have, it sounds like you could even do five 2-way mirror vdevs if you want to see whether this is it.
 
I will try different configurations:

* 10x mirrors with ZIL
* ZeusRAM pool with ZIL
* Ramdisk pool with ZIL
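
For the mirror test above, the layout would be something like this (a sketch using the disk names from the earlier zpool status output; the pool name is a placeholder, and with all 20 data disks it would simply be ten mirror vdevs instead of five):
Code:
zpool create tank \
  mirror c6t5000C500836B2889d0 c6t5000C500836B3259d0 \
  mirror c6t5000C500836B5255d0 c8t5000C50083756635d0 \
  mirror c8t5000C5008375B075d0 c8t5000C50083756A51d0 \
  mirror c10t5000C50083756535d0 c10t5000C50083756C0Dd0 \
  mirror c9t5000C50083759D85d0 c9t5000C50083756751d0 \
  log mirror c6t5000A72A300B3D5Fd0 c8t5000A72A300B3D80d0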

Matej
 
Here are my puzzling results:

3x 10-drive RaidZ2 pool with ZIL:
7x dd if=/dev/zero of=/pool0/folderX/test bs=4k count=2000000

Iostat output:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 44986.00     0.00 179928.00     8.00     3.54    0.08    0.00    0.08   0.02  98.90
sdac              0.00     0.00    0.00 44994.00     0.00 179952.00     8.00     3.51    0.08    0.00    0.08   0.02  99.40
sdad              0.00     0.00    0.00  115.00     0.00  3496.00    60.80     0.16    1.42    0.00    1.42   1.35  15.50
sdai              0.00     0.00    0.00  111.00     0.00  2916.00    52.54     0.24    2.18    0.00    2.18   1.89  21.00
sdam              0.00     0.00    0.00  117.00     0.00  2936.00    50.19     0.20    1.67    0.00    1.67   1.60  18.70

With FIO, I'm getting weird results. I'm running 4x fio with:
fio --filename=/pool0/testX/fiotest02 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=linuxaio --bs=4k --iodepth=16 --numjobs=16 --runtime=600 --group_reporting --name=4kwrite

iostat output:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 1523.00     0.00  6304.00     8.28     0.09    0.06    0.00    0.06   0.05   8.10
sdac              0.00     0.00    0.00 1522.00     0.00  6304.00     8.28     0.09    0.06    0.00    0.06   0.05   8.00
sdad              0.00     0.00  426.00  154.00  6816.00  9324.00    55.66     5.56    9.62   12.08    2.81   1.58  91.90
sdai              0.00     0.00  444.00  150.00  7104.00  7764.00    50.06     4.90    8.51   10.61    2.31   1.54  91.60
sdam              0.00     0.00  427.00  135.00  6832.00  7808.00    52.10     5.04    9.05   10.96    2.98   1.60  90.00

For some reason, fio is doing a lot of reading?!

10x mirror pool with ZIL:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 44751.00     0.00 178992.00     8.00     3.56    0.08    0.00    0.08   0.02  99.70
sdac              0.00     0.00    0.00 44750.00     0.00 178992.00     8.00     3.57    0.08    0.00    0.08   0.02 100.00
sdad              0.00     0.00    0.00   83.00     0.00  7348.00   177.06     0.08    0.94    0.00    0.94   0.94   7.80
sdai              0.00     0.00    0.00   70.00     0.00  6292.00   179.77     0.10    1.40    0.00    1.40   1.31   9.20
sdam              0.00     0.00    0.00   74.00     0.00  6540.00   176.76     0.06    0.77    0.00    0.77   0.77   5.70

fio again, giving me weird results with a high read rate:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 10666.00     0.00 42872.00     8.04     0.61    0.06    0.00    0.06   0.04  44.20
sdac              0.00     0.00    0.00 10668.00     0.00 42880.00     8.04     0.62    0.06    0.00    0.06   0.04  45.70
sdad              0.00     0.00  415.00   46.00 53120.00  3100.00   243.90     5.36   10.45   11.27    3.09   2.16  99.50
sdai              0.00     0.00  402.00   84.00 51456.00  6696.00   239.31     4.23    8.76   10.32    1.32   2.05  99.60
sdam              0.00     0.00  410.00  115.00 52480.00 10180.00   238.70     5.91   15.31   13.20   22.82   1.90  99.90

1x ZeusRAM pool with 1x ZeusRAM for ZIL:
7x dd if=/dev/zero of=/pool0/folderX/test bs=4k count=2000000

Iostat output:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 50252.00     0.00 201000.00     8.00     3.92    0.08    0.00    0.08   0.02 100.00
sdac              0.00     0.00    0.00  635.00     0.00 65511.50   206.34     0.17    0.26    0.00    0.26   0.26  16.60
sdad              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdai              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdam              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

fio again, giving me weird results with a high read rate:
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00 1280.00     0.00  5044.00     7.88     0.06    0.05    0.00    0.05   0.05   6.00
sdac              0.00     0.00 1263.00 5931.00 161664.00 124435.00    79.54     2.49    0.35    1.33    0.14   0.11  81.70
sdad              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdai              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdam              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I'm stumped. :)
 
The thing with fio is that I'm also writing 4k blocks, but for an unknown reason it is still doing RMW (I guess).

I will turn the ARC cache back on so the reads are eliminated, and check again...
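
For completeness, this is how the caching behaviour of the test dataset can be checked and set back to the default (the dataset name is a placeholder):
Code:
# "all" is the default and caches both data and metadata in the ARC;
# "metadata" or "none" would explain reads hitting the disks.
zfs get primarycache pool0/test1
zfs set primarycache=all pool0/test1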

Matej
 
Maybe I wasn't clear? My point was: if you write 4K and the block is not in the ARC (or you have the ARC disabled), it will need to read the 128K record (I think that is the default?) and then write it back out. So it may still read even with the ARC enabled, if the data is not in the ARC already...
 
I did enable the ARC for the tests, but it might be that the data was not in the ARC. I should run the same test two or more times to eliminate the reads (I have enough memory to cache everything).

On the other hand, I had recordsize set to 4k, so there shouldn't be any RMW. It could be that the blocks weren't aligned properly, so RMW could still happen. I will do some more testing.

Matej
 
I'm not sure we're on the same page yet. My understanding was that datasets use 128KB records by default, so if you do 4KB writes, it will have to do RMW unless it hits in ARC?
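
If it helps, checking and pinning the record size looks like this (the dataset name is a placeholder; note that a changed recordsize only applies to files written after the change):
Code:
zfs get recordsize pool0/test1      # 128K is the default
zfs set recordsize=4k pool0/test1   # matches the 4k fio block size, avoids RMW for aligned writes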
 
Would any of that change your recent recommendations that we use OmniOS?

If you compare OmniOS vs Solaris, the differences are:

ZFS encryption:
Solaris only

Fast sequential resilvering:
Solaris only

SMB version:
Solaris is SMB 2.1
OmniOS is currently SMB 1, but SMB 2.1 is nearly ready
https://www.illumos.org/issues/6399 (by Gordon Ross from Nexenta)

Price:
Solaris is free only for demo and development.
For commercial use, count on $1000 per server/year.
OmniOS is free, with a commercial support option.

Other aspects are similar
 
Ubuntu 14.04 + ZoL is SMB 3.1 ... with 10GbE / 40GbE you feel it!

This is not ZoL related.
It is SAMBA that offers SMB 3. You can use SAMBA on Solaris as well, but without the same ease of use regarding AD, ACLs, Windows SIDs and Previous Versions.
 
Yes & No

Enabling ACLs:
Code:
# zfs set acltype=posixacl storage/tank

If I wanted to set the acltype back to the stock (default) configuration, I would do the following (thanks to DeHackEd from the #zfsonlinux freenode channel for letting me know about this):

Code:
# zfs inherit acltype storage/tank
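Either way, the current value and where it comes from can be verified with:
Code:
# SOURCE shows "local" after the set and "default"/"inherited" after the inherit
zfs get acltype storage/tank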
On the one hand, I would use Solaris 11.3 + napp-it in a production environment with an enterprise license if it supports faster 10GbE / 40GbE transfers than Ubuntu 14.04 + ZoL.
On the other hand, some clients don't want to hear about Oracle / Sun anymore, and therefore, in their minds, Solaris.
And thirdly, Indiestor only works on Debian or Ubuntu.
 
Something strange on my newly rolled out Napp-IT appliance:
https://forums.servethehome.com/ind...le-consistent-apd-on-vm-power-off.6099/page-7



I checked the SMART values and one of the disks has failed:
[screenshot: pvp2yKX.png]

[screenshot: 6bJT9at.png]



I created some jobs to see if they report the same thing, but when I run the alert job nothing happens! (The status-logs job works fine and ends up in my mailbox, but it doesn't mention the failed disk.)

[screenshot: DwYqV9O.png]


Any ideas..?
 
Alert mails are generated for degraded pools (ZFS disk errors), nearly full pools, and job errors, not for SMART or other warnings.

An extendable alert mechanism to monitor more system states is on the todo list.
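
Until then, a possible interim workaround is a small cron script around smartctl and mailx (just a sketch, not a napp-it feature; the device paths and the mail address are placeholders, and it assumes smartmontools plus a working mail setup on the box):
Code:
#!/bin/sh
# Mail a warning when smartctl no longer reports a healthy status.
MAILTO=admin@example.com
for d in /dev/rdsk/c6t5000C500836B2889d0s0 /dev/rdsk/c8t5000C50083756635d0s0; do
  if smartctl -H "$d" | egrep "PASSED|OK" > /dev/null; then
    :   # drive reports healthy, nothing to do
  else
    echo "SMART health check failed on $d" | mailx -s "SMART alert" "$MAILTO"
  fi
done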
 

Nice to hear, thanks!

How do I get the monitor extensions to do their thing? Do I have to enable a certain service?

[screenshot: 4vauAPP.png]
 