OpenSolaris derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

Discussion in 'SSDs & Data Storage' started by _Gea, Dec 30, 2010.

  1. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    Anyone using a ZeusRAM for ZIL? What read/write performance do you get for 4k random writes?

    I have a RaidZ2 pool with 10x 4TB SAS drives and a ZeusRAM for ZIL, and when testing with fio I only get around 2,500 IOPS, which I think is way too little.

    I also tried creating a mirror pool with 2x ZeusRAM, but the system somehow gets limited to 2.5k IOPS there as well.

    My drives are in a Supermicro JBOD with a SAS expander, connected to an LSI 9207 HBA. I will try with Linux tomorrow, maybe it's an OS problem?
    lp, Matej
     
  2. vektor777

    vektor777 n00b

    Messages:
    15
    Joined:
    Jan 28, 2009
    I use 2x ZeusRAMs myself in my pool. How do you have them configured? Are they flashed with the C023 firmware?

    Can you post your pool config?
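    If it helps, something like this shows the pool layout and the drive firmware revisions on OmniOS (adjust the pool name to yours; the ZeusRAMs should show up in the iostat output):
    Code:
    # pool layout
    zpool status -v data
    # vendor / product / firmware revision per drive
    iostat -En | grep -i zeus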
     
  3. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,210
    Joined:
    Jun 22, 2004
    Are you CPU limited?
     
  4. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    vektor777:
    I have 2 pools. My firmware revision is C025.

    Pool1:
    Code:
            NAME                        STATE     READ WRITE CKSUM
            data                        ONLINE       0     0     0
              raidz2-0                  ONLINE       0     0     0
                c6t5000C500836B2889d0   ONLINE       0     0     0
                c6t5000C500836B3259d0   ONLINE       0     0     0
                c6t5000C500836B5255d0   ONLINE       0     0     0
                c8t5000C50083756635d0   ONLINE       0     0     0
                c8t5000C5008375B075d0   ONLINE       0     0     0
                c8t5000C50083756A51d0   ONLINE       0     0     0
                c10t5000C50083756535d0  ONLINE       0     0     0
                c10t5000C50083756C0Dd0  ONLINE       0     0     0
                c9t5000C50083759D85d0   ONLINE       0     0     0
                c9t5000C50083756751d0   ONLINE       0     0     0
            logs
              mirror-1                  ONLINE       0     0     0
                c6t5000A72A300B3D5Fd0   ONLINE       0     0     0
                c8t5000A72A300B3D80d0   ONLINE       0     0     0
    
    Pool2 (2x ZeusRAM as disks):
    Code:
            NAME                        STATE     READ WRITE CKSUM
            zeusram                     ONLINE       0     0     0
              mirror-0                  ONLINE       0     0     0
                c10t5000A72A300B3D7Ed0  ONLINE       0     0     0
                c9t5000A72A300B3D9Dd0   ONLINE       0     0     0
    
    At least the "zeusram" pool should run at 45k IOPS without a problem, but it can't get over 2.5k IOPS with either sync=always or sync=disabled.
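    (For the record, I'm switching the sync setting per pool between runs, roughly like this:)
    Code:
    # force every write through the ZIL/SLOG path
    zfs set sync=always zeusram
    # ignore sync requests entirely (no ZIL writes at all)
    zfs set sync=disabled zeusram
    # check the current setting
    zfs get sync zeusram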

    gigatexal: no, this is a dual-CPU, 12-core machine with HT.

    Matej
     
  5. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,210
    Joined:
    Jun 22, 2004
    It could have many threads, but I think ZFS benefits most from IPC and clock speed, no?
     
  6. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    Could be, but I still think an Intel Xeon E5-2640 should handle that amount of traffic without a problem.

    Matej
     
  7. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,210
    Joined:
    Jun 22, 2004
    What are your fio run's settings? What does the output of top look like while the run is happening?
     
  8. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    My fio settings are the following:
    Code:
    fio --filename=/zeusram/fiotest02 --size=3g --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=solarisaio --bs=4k --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest

    Output of top:
    Code:
    last pid:  2807;  load avg:  4.41,  1.02,  0.38;  up 5+19:12:57                                                   08:55:47
    65 processes: 48 sleeping, 17 on cpu
    CPU states: 31.7% idle, 63.6% user,  4.7% kernel,  0.0% iowait,  0.0% swap
    Kernel: 33736 ctxsw, 1943 trap, 12347 intr, 27134 syscall
    Memory: 256G phys mem, 123G free mem, 4096M total swap, 4096M free swap
    
    Under that, there are a few fio processes. Probably 16 of them :)

    I just got to work and will boot the server into Linux and try there, just to see if it's a driver/OS problem.

    Matej
     
  9. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,210
    Joined:
    Jun 22, 2004
    Doesn't look like a CPU-constrained issue. Hmm.
     
  10. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    Nope, the CPU is not the bottleneck.

    I switched to Linux today.

    I tested 4k random sync writes directly to a ZeusRAM: 48k IOPS -- OK
    I tested 4k random sync writes to an mdadm raid1 built from 2 ZeusRAMs: 48k IOPS -- OK
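    (For reference, the raid1 was built roughly like this; the device names are placeholders for the two ZeusRAMs:)
    Code:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY
    # then run the same 4k random sync-write fio test against /dev/md0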

    Then I created a ZFS raidz2 pool with 10 drives and 2 ZeusRAMs in a mirror as ZIL. I set sync=always and started the fio test:
    Code:
    fio --filename=/pool0/test01 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=linuxaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
    
    The result is 15k IOPS. Is this the best I can get out of the system, or should it be able to push 48k IOPS at 4k, just like when the ZeusRAM is used as a single drive?

    If I understand the ZFS write cache right, it stores small sync writes in a RAM cache and flushes them every 5s to the hard drives as one large sequential write, right? Since I'm using a ZeusRAM as ZIL, the system also writes the transaction log to the ZeusRAM, and when used with sync=always, the write cache is only as fast as the ZIL device.
    If the above is right, how come I'm only seeing 14k IOPS when using the ZIL? Is it a latency problem?
    All my drives (including the ZeusRAM) are in an external JBOD, so that probably introduces a little latency. But could it be that much?
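    (Side note: that flush interval is the TXG timeout; on ZoL it can be checked via a module parameter, e.g.:)
    Code:
    # transaction group timeout in seconds (default 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout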

    Matej
     
  11. danswartz

    danswartz 2[H]4U

    Messages:
    3,614
    Joined:
    Feb 25, 2011
    I think this may be some kind of ZIL throttle?
     
  12. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    ZIL throttle? That exists?

    I will google a bit around that...

    Matej
     
  13. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
    If you write small random data to a pool without sync, it is collected in RAM and written as a single large and fast sequential write. This gives you a certain level of write performance.

    If you use sync, the same happens, but additionally every write request is logged to your ZIL device. That means two write actions, and the effective performance is the fast sequential write to the pool plus a small sync random write to the ZIL on every commit.
     
  14. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    So I should see 48k IOPS on small random writes, since this is what the ZeusRAM can handle?

    Matej
     
  15. danswartz

    danswartz 2[H]4U

    Messages:
    3,614
    Joined:
    Feb 25, 2011
    I seem to recall creating a couple of RAM disks and using one as an SLOG and the other as the pool disk. I created a sync=always filesystem, did a bulk write of several GB from /dev/zero to it, and the speed was much lower than I expected. I will try to repeat that tonight...
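    (Something along these lines, with illumos ramdiskadm; the sizes and names are just examples:)
    Code:
    # create two RAM disks
    ramdiskadm -a rdpool 4g
    ramdiskadm -a rdslog 2g
    # pool on one, SLOG on the other, then force sync writes
    zpool create ramtest /dev/ramdisk/rdpool log /dev/ramdisk/rdslog
    zfs set sync=always ramtest
    # bulk write from /dev/zero
    dd if=/dev/zero of=/ramtest/testfile bs=1M count=2048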
     
  16. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010

    No, you are always slower than your pool, as you must write all data sequentially to the slow pool AND you must add the delay of writing the small random writes to the fast ZIL.

    Sync write, even with a ZIL device, is always slower than unsync write, as the ZIL is not a performance option but a security option.
     
  17. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    Gea:
    For test, I added another 10 drives raidz2 vdev, so now I have the following config:
    pool0 pool:
    * 10 drives raidz2
    * 10 drives raidz2
    * log mirror ZeusRAM

    If I'm writing at 48k IOPS@4k, that is around 180MB/s.
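    (Quick check: 48,000 IOPS x 4 KiB = 192,000 KiB/s, which is about 187 MiB/s, so ~180 MB/s is in the right ballpark.)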

    Both vdevs combined should be able to write sequentially with at least 180MB/s.

    If I'm looking at this the right way:
    - I can write data to the storage at 48k IOPS@4k (this is the max. that the ZIL can handle). This translates to 180MB/s of bandwidth.
    - The TXG is flushed every 5s.
    - For the slow pool to keep up with the traffic, it should be able to sequentially write AT LEAST 180MB/s, and the above configuration can probably do more.

    If I understand this correctly, the "slow pool" is not the limit in my case...

    I got some ideas on the OmniOS mailing list. I tried creating more folders and running 7 fio processes, each in a different folder. I got up to 35k IOPS when using iodepth=4 and threads=16 for each fio command.
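    (Roughly like this; the paths are the test folders mentioned above and need to exist, and the ioengine depends on which OS I'm booted into:)
    Code:
    # 7 fio processes, one per folder, each with iodepth=4 and 16 jobs
    for i in 1 2 3 4 5 6 7; do
      fio --filename=/pool0/test$i/fiotest --size=3g --rw=randwrite --refill_buffers \
          --norandommap --randrepeat=0 --ioengine=linuxaio --bs=4k --iodepth=4 \
          --numjobs=16 --runtime=60 --group_reporting --name=4ktest$i &
    done
    wait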

    There was an interesting reply from Chip that got me thinking about the above test:
     
  18. HammerSandwich

    HammerSandwich [H]ard|Gawd

    Messages:
    1,112
    Joined:
    Nov 18, 2004
    You'd have to run a small test, but what about a pool with Zeus #1 as data & #2 as SLOG?
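    Something like this, reusing the two device IDs from the zeusram pool (just a sketch, you'd destroy and recreate as needed):
    Code:
    zpool create zeustest c10t5000A72A300B3D7Ed0 log c9t5000A72A300B3D9Dd0
    zfs set sync=always zeustest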
     
  19. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    danswartz: I will try the ramdrive approach today. I tried using RAM as a ZIL device yesterday in Linux, but after creating a RAM block device I did not see any traffic over it via iostat. I will try it today in OmniOS and report back.
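    (On Linux I did roughly this; the brd module parameters are taken from its documentation, so treat it as a sketch:)
    Code:
    # 4 GiB RAM block device at /dev/ram0 (rd_size is in KiB)
    modprobe brd rd_nr=1 rd_size=4194304
    # attach it as a log device and watch it
    zpool add pool0 log /dev/ram0
    iostat -x ram0 1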

    HammerSandwich: I could do that and I will, when I get back to work in a few hours.

    Matej
     
  20. ToddW2

    ToddW2 2[H]4U

    Messages:
    4,019
    Joined:
    Nov 8, 2004
    Your single and dual vdev RaidZ2 pools can't handle the 500MB/s+ of transaction off-load the ZeusRAM can put out, and you're seemingly capped at the lower throughput and IOPS.

    Have you tried running the ZeusRAM in dual-port configuration?
    What about testing with a pool of mirrors instead of RaidZ2? By the drive count you have, it sounds like you could even do 5 mirrors in 2 vdevs if you want to see if this is it.
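    A rough sketch of the mirror layout (disk names are placeholders; keep your ZeusRAM mirror as the log device as before):
    Code:
    zpool create mirrortest \
      mirror disk1 disk2  mirror disk3 disk4  mirror disk5 disk6 \
      mirror disk7 disk8  mirror disk9 disk10 \
      log mirror zeus1 zeus2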
     
  21. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    I will try different configurations:

    * 10x mirrors with ZIL
    * ZeusRAM pool with ZIL
    * Ramdisk pool with ZIL

    Matej
     
  22. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    Here are my puzzling results:

    3x 10 drives RaidZ2 pool with ZIL:
    7x dd if=/dev/zero of=/pool0/folderX/test bs=4k count=2000000

    Iostat output:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 44986.00     0.00 179928.00     8.00     3.54    0.08    0.00    0.08   0.02  98.90
    sdac              0.00     0.00    0.00 44994.00     0.00 179952.00     8.00     3.51    0.08    0.00    0.08   0.02  99.40
    sdad              0.00     0.00    0.00  115.00     0.00  3496.00    60.80     0.16    1.42    0.00    1.42   1.35  15.50
    sdai              0.00     0.00    0.00  111.00     0.00  2916.00    52.54     0.24    2.18    0.00    2.18   1.89  21.00
    sdam              0.00     0.00    0.00  117.00     0.00  2936.00    50.19     0.20    1.67    0.00    1.67   1.60  18.70
    
    With fio, I'm getting weird results. I'm running 4x fio with:
    Code:
    fio --filename=/pool0/testX/fiotest02 --size=5g --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=linuxaio --bs=4k --iodepth=16 --numjobs=16 --runtime=600 --group_reporting --name=4kwrite

    iostat output:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 1523.00     0.00  6304.00     8.28     0.09    0.06    0.00    0.06   0.05   8.10
    sdac              0.00     0.00    0.00 1522.00     0.00  6304.00     8.28     0.09    0.06    0.00    0.06   0.05   8.00
    sdad              0.00     0.00  426.00  154.00  6816.00  9324.00    55.66     5.56    9.62   12.08    2.81   1.58  91.90
    sdai              0.00     0.00  444.00  150.00  7104.00  7764.00    50.06     4.90    8.51   10.61    2.31   1.54  91.60
    sdam              0.00     0.00  427.00  135.00  6832.00  7808.00    52.10     5.04    9.05   10.96    2.98   1.60  90.00
    
    For some reason, fio is doing a lot of reading?!

    10x mirror pool with ZIL:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 44751.00     0.00 178992.00     8.00     3.56    0.08    0.00    0.08   0.02  99.70
    sdac              0.00     0.00    0.00 44750.00     0.00 178992.00     8.00     3.57    0.08    0.00    0.08   0.02 100.00
    sdad              0.00     0.00    0.00   83.00     0.00  7348.00   177.06     0.08    0.94    0.00    0.94   0.94   7.80
    sdai              0.00     0.00    0.00   70.00     0.00  6292.00   179.77     0.10    1.40    0.00    1.40   1.31   9.20
    sdam              0.00     0.00    0.00   74.00     0.00  6540.00   176.76     0.06    0.77    0.00    0.77   0.77   5.70
    
    fio again, giving me weird results with a high read rate:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 10666.00     0.00 42872.00     8.04     0.61    0.06    0.00    0.06   0.04  44.20
    sdac              0.00     0.00    0.00 10668.00     0.00 42880.00     8.04     0.62    0.06    0.00    0.06   0.04  45.70
    sdad              0.00     0.00  415.00   46.00 53120.00  3100.00   243.90     5.36   10.45   11.27    3.09   2.16  99.50
    sdai              0.00     0.00  402.00   84.00 51456.00  6696.00   239.31     4.23    8.76   10.32    1.32   2.05  99.60
    sdam              0.00     0.00  410.00  115.00 52480.00 10180.00   238.70     5.91   15.31   13.20   22.82   1.90  99.90
    
    1x ZeusRAM pool with 1x ZeusRAM for ZIL:
    7x dd if=/dev/zero of=/pool0/folderX/test bs=4k count=2000000

    Iostat output:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 50252.00     0.00 201000.00     8.00     3.92    0.08    0.00    0.08   0.02 100.00
    sdac              0.00     0.00    0.00  635.00     0.00 65511.50   206.34     0.17    0.26    0.00    0.26   0.26  16.60
    sdad              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdai              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdam              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    
    fio again, giving me weird results with a high read rate:
    Code:
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdb               0.00     0.00    0.00 1280.00     0.00  5044.00     7.88     0.06    0.05    0.00    0.05   0.05   6.00
    sdac              0.00     0.00 1263.00 5931.00 161664.00 124435.00    79.54     2.49    0.35    1.33    0.14   0.11  81.70
    sdad              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdai              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    sdam              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    
    I'm stumped :)
     
    Last edited: Oct 23, 2015
  23. danswartz

    danswartz 2[H]4U

    Messages:
    3,614
    Joined:
    Feb 25, 2011
    If the blocks you are writing are smaller than the dataset recordsize, you will get RMW (read-modify-write) behavior?
     
  24. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    The thing with fio is that I'm also writing 4k blocks, but for an unknown reason it is still doing RMW (I guess).

    I will turn on the ARC cache so reads are eliminated, and check again...
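    (If I remember right, caching is controlled per dataset with the primarycache property, so something like:)
    Code:
    zfs get primarycache pool0
    # cache both data and metadata in ARC (the default)
    zfs set primarycache=all pool0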

    Matej
     
  25. danswartz

    danswartz 2[H]4U

    Messages:
    3,614
    Joined:
    Feb 25, 2011
    I guess I wasn't clear. My point was: if you write 4K and the block is not in the ARC (or you have the ARC disabled), it will need to read the 128K record (I think that is the default recordsize?) and then write it back out. So it may still read even if the ARC is enabled (if the data is not in the ARC already...)
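    (For reference, roughly how you would check and change it on a dataset:)
    Code:
    zfs get recordsize pool0
    # only affects files written after the change
    zfs set recordsize=4k pool0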
     
  26. levak

    levak Limp Gawd

    Messages:
    386
    Joined:
    Mar 27, 2011
    I did enable the ARC for the tests, and it might be that the data was not in the ARC. I should run the same test twice or more to eliminate the reads (I have enough memory to cache everything).

    On the other hand, I had recordsize set to 4k, so there shouldn't be any RMW. It could be that the blocks weren't aligned properly, in which case RMW could still happen. Will do some more testing.
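    (To rule out alignment, checking the pool's ashift is probably worth it, e.g.:)
    Code:
    # ashift=12 means the vdevs do 4k-aligned I/O
    zdb -C pool0 | grep ashift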

    Matej
     
  27. danswartz

    danswartz 2[H]4U

    Messages:
    3,614
    Joined:
    Feb 25, 2011
    I'm not sure we're on the same page yet. My understanding was that datasets use 128KB records by default, so if you do 4KB writes, it will have to do RMW unless it hits in ARC?
     
  28. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
  29. lordsegan

    lordsegan Gawd

    Messages:
    624
    Joined:
    Jun 16, 2004
  30. lordsegan

    lordsegan Gawd

    Messages:
    624
    Joined:
    Jun 16, 2004
    n/t
     
    Last edited: Oct 28, 2015
  31. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
    If you compare OmniOS vs Solaris, the differences are

    ZFS encryption:
    Solaris only

    Fast sequential resilvering:
    Solaris only

    SMB version:
    Solaris is SMB 2.1
    OmniOS is currently SMB 1, but SMB 2.1 is nearly ready
    https://www.illumos.org/issues/6399 (by Gordon Ross from Nexenta)

    Price:
    Solaris is free only for demo and development.
    For commercial use, count on $1000 per server/year.
    OmniOS is free, with a commercial support option.

    Other aspects are similar
     
  32. ST3F

    ST3F Limp Gawd

    Messages:
    181
    Joined:
    Oct 19, 2011
    Ubuntu 14.04 + ZoL is SMB 3.1 ... with 10GbE / 40GbE you feel it!
     
  33. CopyRunStart

    CopyRunStart Limp Gawd

    Messages:
    153
    Joined:
    Apr 3, 2014
    This would be so tempting if Ubuntu + ZoL could integrate as well as Solaris does with Active Directory and ACLs.
     
    Last edited: Oct 29, 2015
  34. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
    This is not ZoL related.
    It is SAMBA that offers SMB 3. You can use SAMBA on Solaris as well, but without the
    ease of use regarding AD, ACLs, Windows SIDs and Previous Versions.
     
  35. ST3F

    ST3F Limp Gawd

    Messages:
    181
    Joined:
    Oct 19, 2011
    Yes & no.

    Enabling ACLs:
    Code:
    # zfs set acltype=posixacl storage/tank
    If I wanted to set acltype back to the stock configuration (default), I would do the following (thanks to DeHackEd from the #zfsonlinux freenode channel for letting me know about this):

    Code:
    # zfs inherit acltype storage/tank
    On the one hand, I would use Solaris 11.3 + napp-it in a production environment with an enterprise license if it supports faster 10GbE / 40GbE transfers than Ubuntu 14.04 + ZoL.
    On the other hand, some clients don't want to hear about Oracle / Sun anymore, and so, in their minds, Solaris as well.
    And thirdly, Indiestor only works on Debian or Ubuntu.
     
    Last edited: Oct 30, 2015
  36. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
  37. nostradamus99

    nostradamus99 [H]Lite

    Messages:
    88
    Joined:
    Jun 28, 2012
    Last edited: Oct 30, 2015
  38. _Gea

    _Gea 2[H]4U

    Messages:
    3,835
    Joined:
    Dec 5, 2010
    Alert mails are generated on degraded pools (ZFS disk errors), nearly full pools, and job errors, not on SMART or other warnings.

    An extendable alert mechanism to monitor more system states is on the todo list.
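    In the meantime, a simple workaround is a cron job that mails you when zpool status -x reports anything unhealthy (a sketch only; the mail command depends on your setup):
    Code:
    #!/bin/sh
    # mail the admin if any pool is not healthy
    STATUS=$(zpool status -x)
    if [ "$STATUS" != "all pools are healthy" ]; then
      echo "$STATUS" | mailx -s "zpool alert on $(hostname)" admin@example.com
    fi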
     
  39. davewolfs

    davewolfs Limp Gawd

    Messages:
    331
    Joined:
    Nov 7, 2006
    Does Solaris support VAAI?
     
  40. nostradamus99

    nostradamus99 [H]Lite

    Messages:
    88
    Joined:
    Jun 28, 2012
    Nice to hear.. thanks!

    How do I get the monitor extensions to do their thing? Do I have to enable a certain service?
