napp-it vs. FreeNAS? vSAN within a Lab ESXi Whitebox

loadit

n00b
Joined
Dec 30, 2014
Messages
19
This is my first post in this forum, but I have read tons and tons of threads about ZFS fileservers (napp-it / OpenIndiana / Nexenta / etc.).

First, the hardware I used for this ESXi whitebox build:

  • 1x Cooler Master HAF XB Evo Case
  • 2x ICY Dock MB994SP-4SB-1 HDD Cages
  • 1x Cooler Master B500W PSU
  • 1x Supermicro X9SRH-7F Board
  • 1x Intel Xeon E5-2620 CPU
  • 1x Noctua NH-D15 CPU FAN
  • 3x Enermax T.B. Silence 120mm CASE FAN
  • 4x Kingston DDR3 16GB 1333MHz ECC-Reg RAM
  • 8x Crucial MX100 512GB SSD
  • 1x Kingston SSDNow 120GB SSD

After successfully modding the NH-D15 to fit the narrow ILM design of the Supermicro X9 board, I used my Kingston SSDNow SSD as the LocalDatastore for the ESXi 5.5 U2 hypervisor installation. I created two vSwitches (one connected to the physical GBit Intel NIC, one as a loopback interface for the vSAN later). I changed the MTU on vSwitch1 (vSwitch0 is for the management network and the physical interface) to 9000 for vmk1 and the Virtual Machine Port Group. I also created a small Linux test client (Ubuntu 14.10 Server + iperf + nfs-common) on the LocalDatastore with a VMXNET3 interface attached to vSwitch1.
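
For reference, the same jumbo-frame settings can also be made from the ESXi shell; this is only a sketch using the vSwitch/vmkernel names from above, and the exact esxcli syntax may differ between ESXi versions:

Code:
# set MTU 9000 on the storage vSwitch and its vmkernel port
esxcli network vswitch standard set -v vSwitch1 -m 9000
esxcli network ip interface set -i vmk1 -m 9000
# verify
esxcli network vswitch standard list -v vSwitch1
esxcli network ip interface list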

napp-it:
I downloaded the newest napp-it 0.9f3 build from the napp-it.org page and followed the documentation to install the appliance (changed to 2 vCores, 8GB) on the LocalDatastore of my ESXi hypervisor. The E1000 NIC is connected to vSwitch0, the VMXNET3 NIC to vSwitch1. I patched the VMXNET3 driver and the TCP stack and modified the MTU and LSO according to Cyberexplorer's blog. The eight Crucial MX100s are passed directly through to the VM with DirectPath I/O. On the appliance I configured them as RAID10 and disabled sync on the array.
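
On the command line that setup boils down to something like the following; this is only a sketch, the pool name and disk IDs are placeholders for the eight passed-through MX100s (napp-it does the same through its web GUI):

Code:
# striped mirrors ("RAID10") across the 8 SSDs
zpool create Datastore \
  mirror c4t0d0 c4t1d0 \
  mirror c4t2d0 c4t3d0 \
  mirror c4t4d0 c4t5d0 \
  mirror c4t6d0 c4t7d0
# disable sync writes on the pool (faster, but unsafe for VM data on power loss)
zfs set sync=disabled Datastore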

dd gave me a result of around 1.2 GByte/sec write; read speed I have not really tested so far.

So I ran some tests between the napp-it appliance and the Linux test client and was able to achieve around 20 Gbit/sec through one iperf stream.

Code:
root@napp-it-15a:/root# /opt/csw/bin/iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  4] local 10.254.1.2 port 5001 connected with 10.254.1.3 port 47192
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  22.2 GBytes  19.1 Gbits/sec
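
For completeness, the client side of such a test is just the following (the IP is taken from the output above; -P would add parallel streams):

Code:
# single stream from the Ubuntu test client to the appliance
iperf -c 10.254.1.2 -t 10
# optionally, four parallel streams
iperf -c 10.254.1.2 -t 10 -P 4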

So I created an NFS share and mounted it with my Linux test client, and I got quite a good 1.0 - 1.2 GByte/sec write speed. Read speed I again have not tested so far.
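
The mount on the Ubuntu client is nothing special; a minimal sketch, assuming NFSv3 and an export called /Datastore (options are illustrative, not tuned):

Code:
# mount the ZFS/NFS share from the appliance over the storage network
sudo mkdir -p /mnt/storage
sudo mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 10.254.1.2:/Datastore /mnt/storage
# quick sequential write test
dd if=/dev/zero of=/mnt/storage/tempfile bs=1M count=8192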

FreeNAS:
For FreeNAS I downloaded the latest 9.3 ISO from the FreeNAS page, uploaded the ISO to the hypervisor and created a new VM (2 vCores, 8GB, 10GB HDD, LSI 2308 IT-mode controller via DirectPath I/O) with again two interfaces: one E1000 connected to the management network (vSwitch0), one VMXNET3 connected to the storage network (vSwitch1). The SSD array is again configured as RAID10.

NOW here comes the :eek:
I ran dd and it gives me the following result:
Code:
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 1.248864 secs (3439099468 bytes/sec)
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 1.244599 secs (3450884130 bytes/sec)
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 1.266117 secs (3392235449 bytes/sec)

I somehow get almost three times the dd write speed that I get from napp-it???

NFS, which I also tried with my Linux client, gives me only 550 - 650 MByte/sec... maybe some optimization is required...

So now my questions:
  • Are these FreeNAS dd write speeds for real?
  • Is it possible to achieve them over NFS as well?
  • Can I also improve napp-it to get those speeds?

Thanks a lot guys for your help (especially _Gea, I have read tons of your posts!!!).

Best
Yves
 
From tests that I have seen, the sequential write performance of a new 512 GB MX100 is a little below 500 MB/s.
The write performance with small random writes is between 100 and 350 MB/s.

A raid-10 from 8 disks can have a numeric maximum of 4x these values, so possible values are between 400 MB/s and a max of 2 GB/s. Higher values are mainly caused by a RAM caching effect. What I would try first is to increase the test file size to at least 2x the RAM.

If you want to limit ESXi effects, you can pass through a whole SAS controller and use OS drivers instead of
passing single disks via RDM.
 
Now it's getting even more weird... :confused:

Code:
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes transferred in 5.316661 secs (3231326729 bytes/sec)
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes transferred in 10.015405 secs (3430688881 bytes/sec)

This VM has only 8GB RAM, so the first command should already have been enough to avoid RAM being used as a cache. So I first thought ESXi might be giving additional RAM to the VM, since the host has 64GB. So I ran dd again with a 128GB file; see the crazy result:

Code:
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=131072
131072+0 records in
131072+0 records out
137438953472 bytes transferred in 40.288406 secs (3411377283 bytes/sec)

The LSI SAS controller is already completely passed through with DirectPath I/O as a PCI device, not as single RDMs.
 
This is RAID10, isn't it?

Code:
[root@freenas] /mnt/Storage# zpool status
  pool: Storage
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/efe99ac4-939e-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
            gptid/f0162e97-939e-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/65f301f2-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
            gptid/662007f9-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/7ec96409-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
            gptid/7ef53d09-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/96a92632-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0
            gptid/96d54789-939f-11e4-b4ab-000c2962ef44  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
 
dd is not a good benchmark by itself. You do not have complete control over the caching and syncs. I would use
Code:
zpool iostat -v Storage 1
on the storage server VM and watch what transfer rates you actually get to the disks while the dd runs on another console.
 
The FreeNAS default ZFS setting is lz4 compression. ;)


Anyway, ZFS and benchmarking is tricky. You may take a look at the vSphere community; there is a benchmark suite there.

BTW, disabling ZFS sync... wtf, you will kill your data even without the box going down, if that is your primary goal.
 
I disabled the LZ4 compression on FreeNAS and now I get these results:

Code:
[root@freenas] /mnt/Storage# dd if=/dev/zero of=tempfile bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes transferred in 13.935840 secs (1232783192 bytes/sec)

So it seems that what I got from napp-it I also get with FreeNAS if LZ4 compression is turned off.
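
For reference, toggling and checking compression is a one-liner per dataset (pool name as used above); the compressratio property also shows why a /dev/zero dd file flies with LZ4 on, since zeros compress almost perfectly:

Code:
# turn LZ4 off / back on for the pool's root dataset
zfs set compression=off Storage
zfs set compression=lz4 Storage
# check the current setting and the achieved ratio
zfs get compression,compressratio Storage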

  • Should I turn that on or off on the napp-it appliance?
  • If I have a UPS on the ESXi whitebox, do I not need to worry about disabling ZFS sync?
 
LZ4 compression causes very low CPU overhead. The general consensus is that you can keep it turned on in all cases. The only exceptions would be very slow processors or exclusively already-compressed data. But even Blu-ray ISOs often compress by a few percent.

The sync property only influences the integrity of the data inside the files or zvols. The integrity of the filesystem itself is not dependent on this setting, only on the layers below ZFS and the hardware.

The system can still crash even with a UPS. At that point the filesystems inside the zvols can be corrupted. The journaling filesystems of your VMs assume that their sync writes are on stable storage before updating their logs. If ZFS does not guarantee this, you may end up with a corrupt filesystem. You can emulate that by formatting a flash drive with NTFS, enabling the write buffer on that device, switching the write cache commits off and then pulling that drive during a copy operation.
 
Okay, so I guess I leave it turned on.

Can I optimize napp-it to get more than the 1.2 - 1.4 GByte/sec I currently get?
 
This is more a general ZFS than a napp-it question.
As a general rule, the sequential write performance of ZFS scales with the number of data disks (disks without redundancy),
while iops scale with the number of vdevs.

Example with 10 disks:

A multiple raid-10 (5 mirrors = 5 data disks) has
- 5x the sequential write performance of a single disk (10x the read performance, as the mirror halves read in parallel)
- 5x the iops of a single disk

If you compare that to a raid-z2 with 8 data disks (10 disks total), it has
- 8x the sequential write performance of a single disk (8x the read performance)
- 1x the iops of a single disk

So raid-Z2 is much faster regarding sequential write performance but much slower regarding iops.
As iops are not a problem with SSDs, try a raid-Z.
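
As a sketch, the two alternative layouts for your eight SSDs would look like this (device names are placeholders, not your actual disk IDs):

Code:
# striped mirrors (raid-10): 4 vdevs = 4 data disks -> 4x iops, best for random VM load
zpool create tank mirror d1 d2 mirror d3 d4 mirror d5 d6 mirror d7 d8
# raid-z2: 1 vdev = 6 data disks -> more raw sequential bandwidth, but only 1x iops
zpool create tank raidz2 d1 d2 d3 d4 d5 d6 d7 d8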
 
Okay, I just did some tests. I destroyed the RAID10 array and created a new RAIDZ2 array with my 8 Crucial MX100 512GB SSDs:

Code:
  pool: Datastore
 state: ONLINE
  scan: none requested
config:

	NAME                       STATE     READ WRITE CKSUM      CAP            Product /napp-it   IOstat mess
	Datastore                  ONLINE       0     0     0
	  raidz2-0                 ONLINE       0     0     0
	    c4t500A07510D6B76B6d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510D6B7DB8d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510D6BB8E2d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510DA28026d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510DAA8CB6d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510DAAA069d0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510DAAA95Cd0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0
	    c4t500A07510DAAB69Dd0  ONLINE       0     0     0      512.1 GB       Crucial_CT512MX1   S:0 H:0 T:0

errors: No known data errors

and ran the dd benchmark:
Code:
root@loadit-nappit1:/Datastore# dd if=/dev/zero of=tempfile bs=1M count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 12.8714 s, 667 MB/s
root@loadit-nappit1:/Datastore# dd if=/dev/zero of=tempfile bs=1M count=8192
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 12.8857 s, 667 MB/s
root@loadit-nappit1:/Datastore# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 5.99056 s, 717 MB/s
root@loadit-nappit1:/Datastore# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 6.244 s, 688 MB/s
root@loadit-nappit1:/Datastore# dd if=/dev/zero of=tempfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 6.08249 s, 706 MB/s

Not really what I hoped to see...
 
OK, so much for the theory about sequential and IO performance.
What I would do are benchmarks that are related to a specific workload, or
done from a client with the protocol that will actually be used.

It does not help to optimize for pure sequential performance when the IO is
mostly small random reads/writes. You must find an acceptable
balance between performance, price and capacity. If you optimize for one of them,
the other two become worse.

A massive raid-10 layout will always be the best regarding iops and read performance,
but the price/capacity ratio is bad, too bad for most use cases with SSDs.
 
So I guess practice and theory do not really fit... :)

In theory I should get double the write performance compared to RAID10... and I got half :) That does not really make sense to me, since 8 drives should be faster than 4 drives... but again, theory... I do not really care about losing that much space; as long as I get around > 1.2 GByte/sec and high IOPS I am happy. But I was hoping to get a bit more out of these SSDs than 300 MByte/sec per drive...
 
Depending on the use case, the performance of a single MX100 ranges from 100 MB/s to 500 MB/s.
This is what you see with different benchmarks.

Enterprise SSDs may show a smaller difference between the two extremes, as the gap between small
concurrent random access and large single-user sequential access is not as huge.
 
Well, I guess I get what I paid for. In Switzerland the MX100s were really, really cheap, so I got them.

Do I have some room to play? Can I optimize it to around 1.5 GByte/sec write?

There is actually no special use case for this lab box; I will test different server OSes / Linux distributions / etc. and create lab environments.
 
Besides better hardware, enterprise SSDs use overprovisioning.
For example, an Intel enterprise SSD with 512 GB of flash is sold as 400 GB.
The difference is overprovisioning.

If you create a host protected area on a new desktop disk, or after a secure erase (10-20% of space),
you can gain a few percent of performance under load, as the overprovisioning
helps the firmware reorganize the SSD in the background.

You can also use several smaller raid-z vdevs to increase iops instead of one large vdev (see the sketch below).
If you really need the performance, stay with a multiple raid-10, which scales best with the number of vdevs.
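
A sketch of the "several smaller raid-z vdevs" idea with eight SSDs (placeholder device names): two raidz vdevs still give 6 data disks but 2x the iops of one wide vdev.

Code:
# two raidz vdevs of 4 disks each: 6 data disks total, 2x the iops of a single wide raid-z2
zpool create Datastore \
  raidz d1 d2 d3 d4 \
  raidz d5 d6 d7 d8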

Usually you combine two pools: one high-performance pool of SSDs for VMs and one high-capacity pool
of disks used as a filer or for backups.
 
Is this "overflow protection (reservation, use max 90%)" the option for this? Or do I have to do that with one of these tools (hdat2, hdparm -N)?

I ran some tests again. On the napp-it appliance itself I get around 2.0 - 2.2 GByte/sec in dd benchmarks with compression, and around 1.0 - 1.2 GByte/sec with compression turned off.

On the Linux client (same non-physical vSwitch1) I get 1.0 - 1.2 GByte/sec over the NFS mount.

But I also mounted the storage via NFS as a datastore on the ESXi hypervisor, SSHed into the hypervisor and did a dd:

Code:
/vmfs/volumes/13df8453-412e2b6c # time sh -c "dd if=/dev/zero of=esxifile bs=1M
count=4096"
4096+0 records in
4096+0 records out
real    0m 37.52s
user    0m 0.00s
sys     0m 0.00s

If I calculate correctly, 4096 / 37.52 gives 109.16 MByte/sec, which is very, very slow compared to the 1.0 - 1.2 GByte/sec I get from the Linux client... Is this normal or am I missing something? The VM I set up with one 8GB test drive on that NFS datastore on the hypervisor is also unbearably slow: 10 - 15 MByte/sec write, 100 - 120 MByte/sec read...
 
I suspect that is because ESXi does sync writes while Linux does not by default.
More than 1 GB/s sounds impossible with sync writes, especially on fully virtualized systems.

One method to speed that up is to use a write-optimized SSD like the Intel S3700 as SLOG, but I doubt that you will gain much with that method.
Most of the latency probably comes from context switches.
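
For reference, adding such an SLOG to the pool is a one-liner (the device name is a placeholder; on FreeNAS you would normally do this through the GUI):

Code:
# attach a fast, power-loss-protected SSD as a separate log device (SLOG)
zpool add Storage log da9
# it then shows up under "logs"
zpool status Storage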

On the other hand, VMs usually do not need such a high sequential throughput.
With an SSD pool as backing storage you will probably have very good random throughput.
I would use a more meaningful benchmark to qualify the system as a virtualization server (like fio).
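
A minimal fio run along those lines could look like this (the parameters are only a starting point, and ioengine availability differs between FreeBSD/OmniOS and Linux):

Code:
# 4k random writes, 8 parallel jobs, queue depth 16, 60 seconds, on the pool's mountpoint
fio --name=randwrite --directory=/mnt/Storage --rw=randwrite --bs=4k \
    --size=4g --numjobs=8 --iodepth=16 --ioengine=posixaio \
    --runtime=60 --time_based --group_reporting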
 
I just tried to turn off sync completely with sync=disabled on the test pool.

The result is 2x the speed... but still far away from what I get from my VM...

Code:
/vmfs/volumes/32b6373b-96c1d255 # time sh -c "dd if=/dev/zero of=tempfile bs=1M
count=4096"
4096+0 records in
4096+0 records out
real    0m 20.23s
user    0m 0.00s
sys     0m 0.00s
/vmfs/volumes/32b6373b-96c1d255 # time sh -c "dd if=/dev/zero of=tempfile bs=1M
count=4096"
4096+0 records in
4096+0 records out
real    0m 20.34s
user    0m 0.00s
sys     0m 0.00s

If I calculate correctly, that's around 204.8 MByte/sec...
 
With that you only turn off the syncs from ZFS to the disks (for data in the files/zvols, not for ZFS metadata). It has no influence on the syncs from the NFS client to the server.
I am not totally sure about that, but as I wrote, I assume the context switch latency to be an important, if not the largest factor here.
I do not know how to switch off the NFS syncs with ESXi, I only use Linux.

Another parameter that can influence the throughput is atime. Try to switch it off.
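
That is again just a ZFS property, e.g. (pool name as above):

Code:
# stop updating access times on every read
zfs set atime=off Storage
zfs get atime Storage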
 
Is this "overflow protection (reservation, use max 90%)" the option for this? Or do I have to do that with one of these tools (hdat2, hdparm -N)?

The overflow protection sets a ZFS reservation of 10% and prevents the pool
from going to 100% usage.

A host protected area is a reservation on the SSD with the effect that the SSD reports a smaller size to the OS
(allowing the SSD firmware to use the rest for background tasks).

I use the following DOS tool to create host protected areas (you can use hdparm on Linux as well):
http://www.hdat2.com/files/cookbook_v11.pdf
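
On Linux the hdparm route looks roughly like this (device and sector count are placeholders; only do this on an empty, secure-erased drive):

Code:
# show the current and native max sector count
hdparm -N /dev/sdX
# set a host protected area so only ~90% of the native sectors stay visible
# (the "p" prefix makes the new limit permanent across power cycles;
#  recent hdparm versions may additionally require --yes-i-know-what-i-am-doing)
hdparm -N p900000000 /dev/sdX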

more
http://napp-it.org/ssd_en.html
 
Okay, I will do that. But is what I am trying even possible? Can I get the 1 GByte/sec I see over NFS back on my ESXi host? Or should I just switch the RAID controller back to RAID mode and use the array directly in the hypervisor?
 
It's quite obvious that a shared ZFS-NFS storage has many, many more options
and better performance than a local ESXi datastore, so a local raid is rarely an option.

If you look at performance, NFS and iSCSI devices on the same SAN will mostly offer
similar performance, aside from some multipath considerations. As NFS is foolproof and
offers many more options, I would prefer NFS.
 
But how do I get my ESXi host to achieve the same speed as my VM? I can't believe it can be so hard to do...
 
Your VM does not do sync writes, but sync writes are something you ultimately may want for VM backing storage.
What do you need the speed from the hypervisor for? For high-throughput file transfers you can access the NAS VM directly from your other VMs.
 
@omniscence: so what you are telling me is: since I need the sync writes... it is not possible? At least not with this hardware configuration?
 
I think so. With sync NFS, every write ends up waiting for a confirmation that the block is actually written to disk.
This is not a problem of the hardware, more like a limitation of how NFS works.

I would try iSCSI, but I know nothing about ESXi. iSCSI if correctly configured should honour syncs if the initiator requests one.
The question would be whether ESXi properly forwards syncs from the VM to its iSCSI target.
 
But how do other people solve this issue? I cannot believe I am the only person who thinks 100 MByte/sec is too slow even for an ESXi lab whitebox... especially with 8 SSDs inside...
 
But how do other people solve this issue? I cannot believe I am the only person who thinks 100 MByte/sec is too slow even for an ESXi lab whitebox... especially with 8 SSDs inside...

100 MB/s or 1 Gb/s is the physical limit of a 1 Gb/s Ethernet network.
If you use 10G Ethernet, ESXi-internal networking, port trunking, FC or IB, the limits are higher.
 
I would like to repeat that sequential 100 MB/s is an almost meaningless metric for a virtualization server. If you measured the pool's random access performance with a sufficient number of parallel transactions (which is more like the workload multiple VMs generate), you would probably find that it is comparable to a directly attached SSD. It is most likely still slower than a RAID array on a controller directly available to the hypervisor, but the entire principle of ZFS is to sacrifice performance for data integrity.
 
@_Gea: I don't use a physical LAN for this vSAN. It's all based on virtual switches with no physical adapter. I can iperf over 20 Gbit/s... in a single thread.

@omniscence: Maybe you are right. But the VMs on the mounted NFS store still feel horribly slow... Can you tell me how I can measure the random access performance?
 
With that crappy hardware, you will not get faster.
If you want speed, buy an NVMe SSD and run your vSphere datastore from this local disk/SSD.
 
@zambanini: thanks for the advice. I have been looking at the Intel DC P3700 (800GB), but I would really like to try out some stuff with the ESXi server first. Can you also explain to me why I get 1.1 GByte/sec on my NFS store through my Ubuntu client, but on the ESXi hypervisor I only get a bloody 100 - 180 MByte/sec?
 
Because you test the wrong way. Do your homework, learn about benchmarking. Understand NFS, take a look at the mount options, learn about sync... and don't be such a lazy ass (fauler Sack) who fumbles around aimlessly instead of doing research.
 
Since you know that I am benchmarking wrong, and that I have no idea about NFS, and I am here to learn and you seem to know the right way: why don't you enlighten this so-called "lazy ass" (fauler Sack)?
 
Because everything is in the man pages, documentation and so on. So instead of whining like a baby... RTFM.
 
Wow, what's with the attitude, zambanini? We don't need or want that around here.
 
But how do other people solve this issue? I cannot believe I am the only person who thinks 100 MByte/sec is too slow even for an ESXi lab whitebox... especially with 8 SSDs inside...
In my experience iSCSI does perform noticeably better than NFS for VM storage.

However, if you'd be willing to try KVM-based virtualization instead of ESXi, you could skip NFS/iSCSI completely by using OmniOS or Linux as the host with native ZFS + KVM.
 