ZFS NAS for VMware

FreakinAye

I'm looking at upgrading an old VMWare environment (a bunch of Win2k3x64 w/ VMWare Server 1.x and local storage) with something more... modern.

I'm going to start with a pair of ESXi5 boxes with shared storage, and have been going back and forth on what to do for the storage.

The new storage needs to be able to run 25-50 WinXP desktops (1GB RAM each) with moderate use. CPU and RAM should be fine with the ESXi boxes I have planned.

Initially I was looking at a QNAP or Synology 8-bay NAS for about $1000 diskless. I planned on using NFS or iSCSI with 8x 1TB WD Black drives. The more I read about these, the more I think they aren't up to the task... I'm concerned about both performance and reliability, so I'm back to looking at a homebuilt NAS.

Here is my current plan for OI/napp-it:
MB: X9SCM-F-O - $200
CPU: Xeon E3-1230 - $240
RAM: 4x4GB of Kingston unbuffered ECC - $140
HBA: ServeRaid M1015 reflashed - $100
HDD: WD Black * ?????
Case: probably a tower case with 2 IcyDock (or similar) hot-swap bays, but I may just get a Norco instead if I'm going over 8-10 disks

That's about as far as I've gotten with hardware. I'm really not sure how many drives I'm going to need to meet the performance requirements of the VMs.

*EDIT* Planning on using RaidZ2

Questions

1) Will 8x1TB WD Black drives handle this, or do I need more/faster drives?

2) Will I be able to benefit from deduplication in this scenario, and is 16GB enough RAM for it (2-3GB per TB was the recommendation I saw)?

3) Is iSCSI or NFS generally preferred with ZFS on ESXi5?

4) Do I dedicate a NIC port to storage traffic, or can 2x 1GbE ports be teamed for throughput & reliability?

Any other advice is greatly appreciated! I've been putting this off for too long so I wouldn't have to make a decision, but Gea's work has put a ZFS build over the top for me and makes it a clear winner.
 
I would go with RAID10 (striped mirrors), not RAIDZ2, if you are serving multiple guests from ESXi, as the load will be read-heavy and the round-robin reads will come in very handy.
 
Yeah, I was doing some more reading and was about to come back and edit my post. 4TB usable should be adequate, and it looks like read performance *really* benefits from mirrors.

I'll plan on Raid10 over RaidZ/2.
 
A decent ZIL will make a major difference. VERY big improvement on storage write latency. I'm pretty happy with my under provisioned Intel 320's for this task.

Same goes for L2ARC if you have a decent sized working set.

I use NFS most of the time for a variety of reasons and I've noticed Nex7 (a nexenta guy you can find around here sometimes) is very partial to NFS as well.

I've got a lot of my non-critical VMs hosted on an OI + napp-it box and it's vastly superior to our SAN in every way (including stability and speed) at a small fraction of the cost. I typically have 0ms read/write latency unless there is major IO happening, and with a ZIL I can get 6,000-7,000 write IOPS before the system flinches.
 
Some thoughts.

I would:
- use the http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCL_-F.cfm instead,
because both of its onboard NICs are supported by ESXi

- also use 2-way or 3-way mirrored vdevs for ESXi.
If you expect performance: it scales with the number of vdevs,
and read performance scales with the total number of disks in the case of mirrors
(see the pool layout sketch right after this list).

- not use dedup unless the dedup ratio is > 100.
You must add the RAM needed for dedup on top of your ARC (RAM read cache) and OS RAM needs,
so do not even think about it unless you REALLY have a lot of RAM and absolutely need the feature.

- prefer a single NFS datastore over an iSCSI datastore per VM
(to have easy and fast SMB access to snapshots and files for backups and clones)

- use hybrid storage

- if 1 Gb/s is not enough, you can try NIC teaming (I do not), use an All-In-One for its fast in-software interconnects,
or go the 10 GbE route
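
To make the mirrored-vdev ("RAID10") layout concrete, here is a rough sketch of the zpool command involved. The pool name "tank" and the c4t*d0 disk names are placeholders for whatever your own system reports, not anything from this build.

Code:
# striped mirrors: overall performance scales with the number of vdevs,
# read performance with the total number of disks
zpool create tank mirror c4t0d0 c4t1d0 \
                  mirror c4t2d0 c4t3d0 \
                  mirror c4t4d0 c4t5d0 \
                  mirror c4t6d0 c4t7d0

# verify the layout
zpool status tank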
 
I think you will definitely have to team the NICs if you're going to be serving up 25-50 XP VMs with any reasonable amount of disk IO (I know Intel has a handy quad-port GigE card as well), unless you can fork out the $$ for 10GigE.
 
A decent ZIL will make a major difference. VERY big improvement on storage write latency. I'm pretty happy with my under provisioned Intel 320's for this task.

Same goes for L2ARC if you have a decent sized working set.

I use NFS most of the time for a variety of reasons and I've noticed Nex7 (a nexenta guy you can find around here sometimes) is very partial to NFS as well.

I've got a lot of my non-critical VMs hosted on an OI + napp-it box and it's vastly superior to our SAN in every way (including stability and speed) at a small fraction of the cost. I typically have 0ms read/write latency unless there is major IO happening, and with a ZIL I can get 6,000-7,000 write IOPS before the system flinches.

I'm still new to ZFS, but essentially ZIL = write cache & L2ARC = Read cache?
Is available RAM used as L1 ARC?

Can OS and ZIL share the same mirror or do you have a mirrored pair for OS, another mirrored pair of 320's for ZIL and a third SSD for L2ARC?

What kind of space should I plan on for both ZIL and L2ARC if I have 4TB of usable storage?

If there is power loss, ZIL knows where the data should go and writes it normally when the pool is back online?

Some thoughts.

I would:
- use the http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCL_-F.cfm instead,
because both of its onboard NICs are supported by ESXi

Thanks for the mobo recommendation. I had read that, even as of last week, there was still no native ESXi support for the 2nd NIC on the other board.
- also use 2-way or 3-way mirrored vdevs for ESXi.
If you expect performance: it scales with the number of vdevs,
and read performance scales with the total number of disks in the case of mirrors.

So pair up whole disks in mirrored vdevs and stripe all of them into a single zpool, correct? In this case I'd be doing 4 mirrored vdevs in the pool. If I need to expand the zpool later, can I add individual mirrored vdevs or do I have to add 4 at a time?
- not use dedup unless the dedup ratio is > 100
(you must add the RAM needed for dedup on top of your ARC and OS RAM)
I wasn't expecting much duplication anyway, so I'll skip dedup for this project.
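
For anyone who wants to sanity-check that kind of decision later, zdb can simulate dedup on an existing pool and report the ratio it would achieve, without ever enabling it. A sketch only; "tank" is a placeholder pool name.

Code:
# simulate dedup and print the estimated dedup ratio (reads every block, so run it off-hours)
zdb -S tank

# confirm dedup is still off on the filesystems
zfs get dedup tank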

- prefer a single NFS datastore over an iSCSI datastore per VM
(to have easy and fast SMB access to snapshots and files for backups and clones)
I was planning on using multiple 300-500GB iSCSI or NFS datastores. I generally aim for fewer than 15-20 VMs per datastore.

-Use hybrid storage

Does hybrid storage mean including an SSD-based ZIL and L2ARC in the pool? Are there any problems with garbage collection or TRIM?

Does ZFS work well with entirely SSD-based pools for higher tiered storage? I've considered a home NAS for myself with WD Green drives for media and SSD for photo/video editing.

- if 1 Gb/s is not enough, you can try NIC teaming (I do not), use an All-In-One for its fast in-software interconnects,
or go the 10 GbE route

Unfortunately All-In-One won't be an option here, as I would really like VMware HA/clustering and would prefer to use multiple nodes with cheaper hardware (single proc, smaller DIMMs) instead of top-of-the-line procs & 16GB DIMMs.
I may also look at converting some of the existing boxes to ESXi.

I see some people installing their NAS as a VM on ESXi with VT-d and passing through their HBA instead of installing their NAS on bare metal.
Benefits of this?
Do you need to pass through the SSDs directly to VM for hybrid storage (trim/gc not supported by ESX)?

Thanks a lot for all the help so far. I'm really excited to get this going now!
 
I think you will definitely have to team the NICs if you're going to be serving up 25-50 XP VMs with any reasonable amount of disk IO (I know Intel has a handy quad-port GigE card as well), unless you can fork out the $$ for 10GigE.

Definitely can't afford the 10gE for this. I may add an additional 2 port gigE card and team 3 of them together for this.

Jumbo frames will obviously be used, but should I also isolate storage traffic on a separate switch/vlan?
 
The ZIL is not a write cache; a better analogy in the Linux world would be an ext3 journal on an external device.
 
The ZIL is not a write cache; a better analogy in the Linux world would be an ext3 journal on an external device.

Isn't it still technically caching writes before they go to the other disks though? Journaling is probably a better way of looking at it.

My understanding is that on a write the zpool has to wait for the writes to complete on the applicable disks, so the slowest disk will hinder write speed. The ZIL instead stores the data and its intent, and writes to the zpool disks when they are 'free'.

is this correct?

Does every write go through the ZIL?
If it's full, will a write have to wait until ZIL offloads some of its data?
If one of the SSD in a mirrored ZIL fails will it still utilize the ZIL or will it write direct to disks (after writing my next question, I suspect the answer is 'no')?
If both drives in the ZIL fail will writes continue? Will corruption happen in this case?
Is the ZIL FIFO (assuming a single destination)?
 
Q
I'm still new to ZFS, but essentially ZIL = write cache & L2ARC = Read cache?
Is available RAM used as L1 ARC?

A
Yes to both.

Q
Can OS and ZIL share the same mirror or do you have a mirrored pair for OS, another mirrored pair of 320's for ZIL and a third SSD for L2ARC?

A
You can slice (partition) disks, but that is really a "you should never".
Use separate disks.

Q
What kind of space should I plan on for both ZIL and L2ARC if I have 4TB of usable storage?

A
L2ARC read cache: depends on workload; as much as possible.
Log / write cache (data goes to RAM and then to disk, but "write cache" is easier to understand than "log"):
min 4 GB, max used is about half the RAM; mirror when possible.
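
A sketch of the underlying zpool commands for attaching those devices, in case it helps; pool and device names are placeholders, and napp-it presumably exposes the same operations in its GUI.

Code:
# mirrored ZIL (log) device - small, low-latency SSDs
zpool add tank log mirror c3t0d0 c3t1d0

# L2ARC (cache) devices - no mirror needed; losing one just shrinks the cache
zpool add tank cache c3t3d0 c3t4d0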

Q
If there is power loss, ZIL knows where the data should go and writes it normally when the pool is back online?

A
Yes

Q. So pair up whole disks in mirrored vdevs and stripe all of them into a single zpool, correct? In this case I'd be doing 4 mirrored vdevs in the pool. If I need to expand the zpool later, can I add individual mirrored vdevs or do I have to add 4 at a time?

A
Yes, and you can add vdevs one at a time - any kind of vdev (mirrored, single, RAID-Z) can be added to a pool.
Best is to keep using the same kind of vdev.
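
As a hedged example of growing the pool later (hypothetical device names again), a single extra mirror is all it takes:

Code:
# add one more mirrored vdev to an existing pool of mirrors;
# new writes are striped across all vdevs, but existing data is not rebalanced
zpool add tank mirror c4t8d0 c4t9d0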

Q
I was planning on using multiple 300-500GB iSCSI or NFS datastores. I generally aim for fewer than 15-20 VMs per datastore.

A
I would use NFS, to have file/snapshot access to individual VMs.
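
A minimal sketch of that setup, assuming an OI/napp-it box at 192.168.1.10 and a filesystem called tank/nfs_ds01 (both made up for the example); ESXi needs root access to the export, hence the root= option.

Code:
# on the storage box: one ZFS filesystem per NFS datastore, shared over NFS and SMB
zfs create tank/nfs_ds01
zfs set sharenfs='rw=@192.168.1.0/24,root=@192.168.1.0/24' tank/nfs_ds01
zfs set sharesmb=on tank/nfs_ds01    # assumes the kernel SMB service is enabled

# on each ESXi 5 host: mount it as a datastore
esxcli storage nfs add --host=192.168.1.10 --share=/tank/nfs_ds01 --volume-name=nfs_ds01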

Q
Are there any problems with garbage collection or TRIM?

A
TRIM is not yet supported.
Use an SSD that copes best on its own (good internal garbage collection).

Q
Does ZFS work well with entirely SSD-based pools for higher tiered storage?

A
I use SSD-only pools (4 x 3-way mirrored vdevs) for performance. I use "cheap" MLC SSDs, hence the 3-way mirrors
(the failure rate was up to 10% in the first year); with the Intel 320s I use now this has been much better.

Q
All-In-One

A
Always needs pass-through.

Benefit (besides power, and only one server instead of two):
up to 10 GbE in software between ESXi and the SAN, without any costly and failure-prone SAN network hardware.
 
Does every write go through the ZIL?

Writes always go to RAM first and are then flushed to disk in optimized batches every few seconds (async).
Without a ZIL, every sync write request needs a disk commit (slow) before the next request can proceed.
A ZIL logs the sync writes on fast media for faster commits.
Its size must be large enough to buffer this sync-write to async-write conversion
(it must buffer up to a few seconds of data).
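
As a hedged aside, which writes actually touch the ZIL is controlled per dataset by the sync property, so you can observe the effect yourself; the dataset name below is the placeholder from the earlier NFS sketch.

Code:
# default (standard): only sync writes - e.g. NFS commits from ESXi - go through the ZIL
zfs get sync tank/nfs_ds01

# force every write through the ZIL, or bypass it entirely
zfs set sync=always tank/nfs_ds01
zfs set sync=disabled tank/nfs_ds01   # fast but unsafe: acknowledges commits that are not yet on stable storage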
 
A
You can slice (partition) disks, but that is really a "you should never".
Use separate disks.
So 5 disks not including storage disks? :eek: Maybe I will go for that norco...

L2ARC read cache: depends on workload; as much as possible.
Log / write cache (data goes to RAM and then to disk, but "write cache" is easier to understand than "log"):
min 4 GB, max used is about half the RAM; mirror when possible.
Would something like a 120GB Intel 320 or 510 be appropriate, or too small, for L2ARC on a 4TB pool (only 3%)? Is L2ARC a true cache in the sense that it stores frequently accessed data, etc.? I assume mirroring is irrelevant here?

Wow, that is much smaller than I expected for the ZIL. Since I can only use about 8GB of ZIL in this case, is SLC NAND recommended? Is this why some people drop like $2k on those tiny SSDs for these builds?

A
I use SSD-only pools (4 x 3-way mirrored vdevs) for performance. I use "cheap" MLC SSDs, hence the 3-way mirrors
(the failure rate was up to 10% in the first year); with the Intel 320s I use now this has been much better.

What brand did you use the first year? OCZ?:D
Are you relying only on idle time garbage collection on the Intel? Have you seen any degradation in performance on the SSD pool from when you first installed?

Benefit (besides power, and only one server instead of two):
up to 10 GbE in software between ESXi and the SAN, without any costly and failure-prone SAN network hardware.

I was referring to the practice of running ESXi on bare metal and NAS as a VM even if it is not an all-in-one (used only as NAS with no other VMs running).

For a single-server setup there is no question that all-in-one is awesome, and probably what I'll be doing when I rebuild my home lab.

thanks again
 
I'm not a ZFS developer, so I can't quote exact values,
but L2ARC is a real second-level cache behind RAM - though only for small files.

The best size depends on the workload, but in every case use as much ARC (RAM) as possible - it is much faster.
If your reads are already served from RAM/ARC, or the files are too big, the L2ARC size does not matter - it simply is not used.

My first batch of SSDs were from the first available series of Sandforce-based drives (Solidata).
The replacements I got back after RMA were much better.

About speed degradation:
I have not done tests. The feeling that they are good enough and much faster than disks is enough for me.

About All-In-One:
I do not run a big "must never fail" SAN. Storage is part of my ESXi boxes.
Every one of my ESXi boxes has its own dedicated ZFS SAN, accessible to all the others in case of problems.
So I use All-In-Ones everywhere instead of dumb local ESXi datastores.
 
Definitely can't afford the 10gE for this. I may add an additional 2 port gigE card and team 3 of them together for this.

Jumbo frames will obviously be used, but should I also isolate storage traffic on a separate switch/vlan?

I would definitely isolate to a storage VLAN; I don't think you would need a separate switch as long as it's decent.

As for L2ARC, what I have been told in the past is: max out as much RAM as you can afford for ARC, then add SSDs for L2ARC one at a time. There are scripts you can download that will tell you your L2ARC usage; once it's full, add another SSD.
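
On OpenIndiana/Solaris the raw numbers behind those scripts come from kstat, so you can also eyeball it directly; a sketch using the standard arcstats counters.

Code:
# ARC size and hit/miss counters
kstat -p zfs:0:arcstats:size zfs:0:arcstats:hits zfs:0:arcstats:misses

# L2ARC fill level and hit rate - when l2_size stops growing, the cache device is full
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses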
 
I would definitely isolate to a storage VLAN; I don't think you would need a separate switch as long as it's decent.

One of the advantages of using NFS with an all-in-one is the ability to access files on the machine via SMB. Pardon my ignorance but wouldn't putting it on a storage vlan negate this? I'd really like to understand the nuts-and-bolts of this as this is what I am working on setting up here in my home lab as well...
 
Just because it's on a separate VLAN doesn't mean you can't access it through SMB from your desktop, etc. As long as you have something that can route traffic between VLANs (router, L3 switch, etc.) you can access the storage. You could also fire up a virtual machine, put it on the same VLAN as the storage, and access the NAS shares that way.
 
One of the advantages of using NFS with an all-in-one is the ability to access files on the machine via SMB. Pardon my ignorance but wouldn't putting it on a storage vlan negate this? I'd really like to understand the nuts-and-bolts of this as this is what I am working on setting up here in my home lab as well...

Depends on your setup. Since the OP is using VMware ESX, he will probably be using multiple NICs and multiple vSwitches.

In my case, I have 10GbE NICs and a 10GbE switch for storage traffic, and 1GbE NICs and a switch for management and general network traffic.

On the ZFS box I have IPs for both the 10GbE storage traffic and the 1GbE management/net traffic. The ESX datastores all use the 10GbE, but I also have VMs that connect to CIFS and even NFS shares via the 1GbE NICs when I need to pull ISOs and such.

Really depends on your use case.
 
I usually connect my ESXi machines (All-In-One or ESXi-only) via one VLAN-tagged physical network adapter (mostly 10 GbE) to my VLAN-capable HP switch (optionally a second NIC to a second switch for failover).

Physical access to my VLANs for external devices is handled at the HP switch level.
From the ESXi side, I connect my VMs and their virtual NICs to a VLAN on the (single) ESXi virtual switch.
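
For reference, on the ESXi 5 side that kind of setup boils down to a vSwitch MTU plus a VLAN ID on the storage port group and vmkernel interface. A rough sketch; the vSwitch, port group, vmk and VLAN values are made up.

Code:
# jumbo frames on the vSwitch and vmkernel port carrying storage traffic
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
esxcli network ip interface set --interface-name=vmk1 --mtu=9000

# tag the storage port group with its VLAN
esxcli network vswitch standard portgroup set --portgroup-name="Storage" --vlan-id=20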
 
I really like the idea of combining a ZFS NAS on the same machine as ESXi, to reduce power usage and simplify the UPS setup to keeping only one machine up in a power outage/emergency.

I was planning on moving my current desktop (i7 860 with PCI passthrough) to ESXi 5 and using a Q6600 with 8GB of RAM, sharing with iSCSI.

The issue with my i7 860 is that it only supports up to 16GB of RAM, which would not be enough to host both my VMs and a 10TB ZFS NAS. I might look at getting a Xeon system instead of upgrading my desktop, maybe the new E5-1600, or an older one off eBay.

How is the NFS CPU overhead / speed on an all-in-one machine, compared to iSCSI?

How is ZFS recovery in the event of a bare-metal or NAS VM failure?

Also, I was planning to go with FreeNAS, but their ZFS implementation is lagging behind, and it seems that FreeBSD ZFS is quite a bit slower than the others.

What I need from the ZFS server is: iSCSI, NFS, SMB, Active Directory domain integration (for permissions), and some sort of notification (email?) to tell me about problems.

Thanks!
 
I was planning on moving my current desktop (i7 860 with PCI passthrough) to ESXi 5 and using a Q6600 with 8GB of RAM, sharing with iSCSI.

I am in a similar position with my home storage build. I have an older WHS with about 10TB of space running virtualized on a Q9550 w/ 8GB (maxed) and also have an i7-860 w/ 8GB (16GB max) as a desktop.

I am trying to find a way to make the leap to using ZFS at home for my test lab, but I think with both systems having 8GB and 16GB limitations, the only real option is to start from scratch and build a system that can support 24GB+. RAM is so cheap it's hard to want to go
 
^^^
If you prefer the "all in one" solution, my suggestions are:
1) pick the best MB and hardware your budget allows.
2) do not skimp on ECC RAM; it will help you in the long run, especially when running "all in one". There have been some threads discussing how useful ECC RAM is given ZFS's memory hunger :).
 
I've decided to go with the Norco 4224 for this. I'm getting 2x M1015, so I have 16 bays to work with.

Would I be better off getting 16 500GB drives instead of 10-12 1TB drives?

I was looking at the WD Black 500GB (WD5002AALX) for ~$100-110, compared to the 1TB Black that is more like 140-150.

Thoughts?

Also is a 650W PSU enough for this? I was planning on getting the Seasonic 650W Gold w/ a single 12v rail.
 
Should I put the ZIL and L2ARC drives on the motherboard SATA controller or on the M1015?

I was planning on putting all the data drives on the M1015 and the other 3-5 drives on the onboard controller.

Also, I saw the 500GB RE4 drives at Amazon for like $110. Is the enterprise-class reliability the only thing I'll gain from those in a ZFS-based system? TLER isn't important unless you're using traditional RAID, correct? *EDIT* Partially answered my own question; it looks like you should *disable* TLER for ZFS.
 
Depends on what you want from TLER.

If you don't care how long it takes the drive to recover from an error, you don't need TLER.
If you do care, i.e. you have stuff that you don't want delayed and waiting forever, then you still want TLER, so the bad disk gets marked faulty quicker.

Putting the ZIL/L2ARC on the motherboard vs. the M1015 depends mainly on the type of motherboard, how it's all wired up, and the throughput of your ZIL/L2ARC drives.
 
I finally got this put together with specs pretty much as described in this thread
10x WD RE4 1TB drives
2x M1015 in IT mode
2x Intel 120GB for L2ARC
1x 20GB Intel SLC for ZIL
The dedup ratio shows 1.00x, which I assume means dedup is off.
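
For anyone wanting to check the same thing on their own box, both the ratio and the property can be read directly (pool name as in the status output below):

Code:
zpool get dedupratio datastore01
zfs get dedup datastore01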

Code:
pool: datastore01
 state: ONLINE
  scan: none requested
config:

	NAME                       STATE     READ WRITE CKSUM
	datastore01                ONLINE       0     0     0
	  mirror-0                 ONLINE       0     0     0
	    c4t50014EE2057489ADd0  ONLINE       0     0     0
	    c4t50014EE205899D7Cd0  ONLINE       0     0     0
	  mirror-1                 ONLINE       0     0     0
	    c4t50014EE2058C4169d0  ONLINE       0     0     0
	    c4t50014EE2058FE0F0d0  ONLINE       0     0     0
	  mirror-2                 ONLINE       0     0     0
	    c4t50014EE25ACA679Ed0  ONLINE       0     0     0
	    c4t50014EE25AE2719Bd0  ONLINE       0     0     0
	  mirror-3                 ONLINE       0     0     0
	    c4t50014EE2AFE8C5E4d0  ONLINE       0     0     0
	    c4t50014EE2B01B7B1Dd0  ONLINE       0     0     0
	  mirror-4                 ONLINE       0     0     0
	    c4t50014EE2B0369252d0  ONLINE       0     0     0
	    c4t50014EE2B03AFE59d0  ONLINE       0     0     0
	logs
	  c3t0d0                   ONLINE       0     0     0
	cache
	  c3t3d0                   ONLINE       0     0     0
	  c3t4d0                   ONLINE       0     0     0


I basically haven't tweaked any settings for the pool or OS. Here are the Bonnie perf test results:
Size: 4.53T
File: 32G

Seq-Wr-Chr: 143 MB/s (99% CPU)
Seq-Write: 570 MB/s (41% CPU)
Seq-Rewrite: 319 MB/s (33% CPU)
Seq-Rd-Chr: 148 MB/s (98% CPU)
Seq-Read: 812 MB/s (28% CPU)
Rnd Seeks: 4618.6/s (7% CPU)

Does that look appropriate for performance? It is better than I expected, so I certainly have nothing to complain about.
 
Sequential doesn't mean much, really. Fire up a VM and create a ZFS-backed disk. Run IOMeter with 60% random write / 40% random read at 4K blocks, then do another test at 80% random read / 20% random write.

Those numbers are interesting. Sequential, not interesting at all.
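
If you would rather drive the test from a command line than from IOMeter, fio can approximate the same mixes. A sketch only; the target disk (/dev/sdb, a spare virtual disk in a Linux guest) and the job parameters are made up for illustration, and raw writes will destroy whatever is on that disk.

Code:
# 60% random write / 40% random read, 4K blocks
fio --name=vmmix --filename=/dev/sdb --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=40 --bs=4k --iodepth=32 --runtime=120 --time_based

# second pass: 80% random read / 20% random write
fio --name=vmmix --filename=/dev/sdb --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=80 --bs=4k --iodepth=32 --runtime=120 --time_based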
 
Personally I would use SAS disks, even 7200RPM SAS. When you have lots of reads and writes at the same time, you will see SATA suffer.
 
Depends. Where you see SATA really struggle is with lots of (tunneled) SAS IDs on a single controller across multiple expanders/switches.

ZFS does a really good job with ARC/L2ARC, so good that many times reads and writes barely touch disk. I wouldn't try to push 100K IOPS worth of 50/50 4K random data over SATA, but small workloads without multiple expanders per HBA work fine.
 
Depends. Where you see SATA really struggle is with lots of (tunneled) SAS IDs on a single controller across multiple expanders/switches.

ZFS does a really good job with ARC/L2ARC, so good that many times reads and writes barely touch disk. I wouldn't try to push 100K IOPS worth of 50/50 4K random data over SATA, but small workloads without multiple expanders per HBA work fine.

I don't have multiple expanders at home, just one shelf with 16 SATA disks, 2x 120GB SSD cache, a 30GB ZIL, and 24GB of RAM. When backup kicks in, my VMs are noticeably slower even though the backup runs at only 100MB/s while the storage has no problem pushing over 800MB/s. I have another zpool with 8x SAS, no problem there.

SAS is full duplex and SATA is half duplex. For real work, SAS works a lot better.
 
ARC and L2ARC have no effect on writes.

The ZIL, which always lives first in RAM, does affect writes a great deal, and since the ZIL and ARC both live in RAM it is safe to assume that with a lot of RAM (sync issues aside) your ability to absorb writes is tied directly to your amount of RAM. Presuming your RAM is larger than whatever your network can push in over any given window of time, you can cache large amounts of writes. I'm pretty sure that under memory contention ZIL blocks will force ARC blocks out of RAM too, but I'm not positive on that.

If you play around with compression and watch your LED activity lights, you'll also notice that the compression happens before the data hits disk. Highly compressible sequential writes that can fit entirely in RAM are extremely fast, for example.

ex:

dd if=/dev/zero of=/tank/t.txt bs=1M count=10000

If you have more than 10GB of RAM and compression on, you'll create that file in a little over 2 seconds. Try that dd test with compression enabled and then disabled, and do an fsstat tank 1 while you're at it; the difference is impressive.
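
To make that comparison concrete, the toggle-and-watch sequence could look something like this, using the same tank pool from the dd example above (fsstat wants the mount point, hence /tank).

Code:
zfs set compression=on tank
dd if=/dev/zero of=/tank/t.txt bs=1M count=10000    # zeros compress away, barely touches disk

zfs set compression=off tank
dd if=/dev/zero of=/tank/t2.txt bs=1M count=10000   # now the full ~10GB has to hit the disks

# in another terminal: per-filesystem throughput, refreshed every second
fsstat /tank 1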
 
Not to nitpick, but the post I was replying to never mentioned the ZIL, just ARC/L2ARC. I didn't want people who don't know better to walk away with the wrong info.
 