Virtualization and ZFS

luckylinux

I would be interested in your opinion because I am migrating all my machines to virtualized solutions (mostly Proxmox VE; I still have one ESXi box).

For the host (or AiO using OmniOS + napp-it) filesystem I'm without doubt using ZFS.

Does the FS that the guest will be using matter? Or should it be ZFS as well?
My question is basically: is all data corruption prevented at the host level (i.e. the guest can use any FS it wants), or is it the weakest link of the two that determines the reliability of the saved data (i.e. both must be ZFS, otherwise it's almost "pointless")?

Thank you for your insights ;)
 
In a napp-it AiO setup you use ZFS as the filesystem. Due to Copy on Write, ZFS is always consistent, as an atomic write (write data and update metadata, or write a whole write stripe) is either done completely or discarded. With checksums, ZFS can always guarantee filesystem and file consistency.

From the ZFS view, a VM filesystem is a file (or zvol). ZFS cannot guarantee consistency or atomic writes for VMs per se. The most important problem is the ZFS write cache (which can be several GB): it commits writes to a VM immediately but puts small random data on disk with a delay of a few seconds (to increase performance, as writes are then not small/slow random writes but fast/sequential ones). A crash in the meantime can mean an inconsistent file state from the view of the VM. This is why you need ZFS sync write to guarantee that any committed VM write is on stable disk. The performance degradation of sync can be limited with an Slog.

If the VM does its own write caching, you must take care of this cache as well. If the VM uses ZFS you want sync there as well, or use a filesystem without a RAM write cache (slow).
 
Not sure if this is the right thread, but here goes. I have a 2-host ESXi 6.7 cluster. I'd like to do some flavor of your 'cluster in a box', but I don't care about SAS/SATA shared storage. Each host has 2x 1TB NVMe cards - I'd like to do something using that (shared nothing). Does this work with your cluster in a box?
 
From the ZFS view, a VM filesystem is a file (or zvol). ZFS cannot guarantee consistency or atomic writes for VMs per se. The most important problem is the ZFS write cache (which can be several GB): it commits writes to a VM immediately but puts small random data on disk with a delay of a few seconds (to increase performance, as writes are then not small/slow random writes but fast/sequential ones). A crash in the meantime can mean an inconsistent file state from the view of the VM. This is why you need ZFS sync write to guarantee that any committed VM write is on stable disk. The performance degradation of sync can be limited with an Slog.
Thank you for your replies as usual, _gea.

Is there a "zfs set" command to force sync at all times? I'm not so much concerned about performance as I am about consistency. I doubt a SLOG is required if the primary storage is NVMe anyway.

If the VM does its own write caching, you must take care of this cache as well. If the VM uses ZFS you want sync there as well, or use a filesystem without a RAM write cache (slow).
In Proxmox I disabled the caching options for each virtual HDD. VMs (guests) typically use ext4 (NTFS for a few Windows VMs) though, hence my question.
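
For the record, this is roughly what I mean (a sketch; the VM ID 100 and the storage/disk names are just placeholders, not my actual setup):

# set the virtual disk to cache=none so guest writes are not buffered by the host page cache
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none
# verify the disk options
qm config 100 | grep scsi0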
 
force sync:
zfs set sync=always filesystem

(sync=standard, the default, means that the writing application decides)

Slog:
An Slog is helpful if it is much faster than the pool regarding latency and write IOPS (qd1).
With an NVMe pool this is rarely the case, but you should prefer NVMe with powerloss protection.
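
As a sketch (pool, filesystem and device names are only examples):

# force sync writes on the filesystem/zvol that holds the VMs
zfs set sync=always tank/vms
# check the current setting
zfs get sync tank/vms
# optionally add a dedicated Slog device (only useful if it is faster than the pool)
zpool add tank log c2t0d0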
 
Not sure if this is the right thread, but here goes. I have a 2-host ESXi 6.7 cluster. I'd like to do some flavor of your 'cluster in a box', but I don't care about SAS/SATA shared storage. Each host has 2x 1TB NVMe cards - I'd like to do something using that (shared nothing). Does this work with your cluster in a box?

A ZFS Cluster consists of two heads where both have access to the same disks. The cluster management guarantees that only one head can access a disk at a time. This usually means dual port disks: either dualport SAS, dualport NVMe or multipath iSCSI. The intention is availability, with a failover (crash, tests or OS maintenance like adding bugfixes or updates to the inactive head, with a failover if everything is ok) of a storage pool and services like iSCSI, NFS and SMB between the two heads.

In an ESXi environment, you can use singleport disks (SATA or NVMe) and use the shared-disk options in ESXi that allow you to assign disks simultaneously to the two storage VMs of the cluster.

see chapter 3.1
http://www.napp-it.org/doc/downloads/z-raid.pdf
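
As a very rough sketch of the kind of ESXi settings involved (the pdf describes the supported way; device, datastore and SCSI slot names below are only placeholders):

# create a raw device mapping for the NVMe/SATA disk
vmkfstools -z /vmfs/devices/disks/naa.XXXXXXXX /vmfs/volumes/datastore1/shared/head-disk1.vmdk
# attach the same vmdk to both storage VMs and allow simultaneous access (.vmx entries)
scsi1:0.fileName = "/vmfs/volumes/datastore1/shared/head-disk1.vmdk"
scsi1:0.sharing = "multi-writer"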
 
I run ZFS (FreeNAS) in an ESXi virtual machine, no issues whatsoever. I do pass through the controller.
 
A ZFS Cluster consists of two heads where both have access to the same disks. The cluster management guarantees that only one head can access a disk at a time. This usually means dual port disks: either dualport SAS, dualport NVMe or multipath iSCSI. The intention is availability, with a failover (crash, tests or OS maintenance like adding bugfixes or updates to the inactive head, with a failover if everything is ok) of a storage pool and services like iSCSI, NFS and SMB between the two heads.

In an ESXi environment, you can use singleport disks (SATA or NVMe) and use the shared-disk options in ESXi that allow you to assign disks simultaneously to the two storage VMs of the cluster.

see chapter 3.1
http://www.napp-it.org/doc/downloads/z-raid.pdf

Thanks, I downloaded this and am reading it. I'm a little confused. You can have two storage VMs running on the same host, which seems the only usable way to share SATA disks? My current setup is 2 ESXi hosts using hyperconverged Starwind (Windows 2016 appliances). They have recently come out with a Linux appliance, but not with ZFS yet. In both cases, I am burdened by a 1-yr NFR license, and am leery of having to renew that or have my storage disappear :( Anyway, from your previous reply, it seems as if the 2 NVMe drives in each host cannot be shared, unless I have a 'head unit' on each of the 2 ESXi hosts? I'm having trouble understanding your single vcluster vs twin vcluster. I don't really want or need dual JBOD for storage redundancy. Could I put a 1TB ZFS mirror on the 2 NVMe drives on each host, and do some variant of twin cluster? Or what? Thanks for your thoughts!
 
I have not used Starwind but as far as I know, they use two vsan appliances, each with its own storage and a sync/mirror mechanism between them. The advantage is that you do not need dualpath storage and that they work independently. The disadvantage is the sync delay and reduced performance with near-realtime/realtime sync, or that the nodes may not be 100% identical after a crash.

A ZFS cluster works differently.
As a start you have a full-featured storage appliance with a local ZFS datapool that offers storage via iSCSI, NFS or SMB. If the storage appliance fails, you can use a second VM where you switch the disks manually, import the same pool and activate the storage services.
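
Done by hand, that is essentially (pool name is only an example):

# on the failed/old appliance, if it is still reachable: release the pool
zpool export tank
# on the second VM: take over the same disks and import the pool
zpool import -f tank
# then re-enable the iSCSI/NFS/SMB services on this VM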

A ZFS (v)cluster does this automatically, with a failover time (head down to service re-enabled on the other head) of around 20s. This requires that both storage VMs/heads have simultaneous access to the same disks to allow the fast pool failover. The advantage is that you always use the same, most current data even under high load (no sync/mirror layer). As you have only one storage pool, this is a single point of failure, but as the failure rate of a SAS Jbod is very low, this is usually not a problem.

Basically it would also be possible to use the VMs to create an iSCSI LUN from each NVMe and build a mirror on either VM over the iSCSI targets. Such a "LAN-based realtime ZFS mirror" would also make the VMs independent and allow a WAN mirror. Performance may be way below the single-pool multiport approach.

This is why you need a mechanism to allow both heads to access the same disks in a ZFS Cluster.

A dual (v)Cluster extends this with two independent Jbod cases: either two AiO setups or two barebone servers, usually with two SAS Jbod cases in a cross-connected dual expander setup. Even in this case you use a single ZFS datapool for both heads, with a pool layout that allows a full Jbod failure (as well as a full failure of either AiO).

To make this happen (two VMs, both with access to the same NVMe) you need either dualport NVMe like the Optane DC D4800X or singleport NVMe with ESXi shared access to the same NVMe (add the same raw disk to both VMs).
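
A rough sketch of that iSCSI variant with illumos/COMSTAR (pool, zvol, GUID and address values are only examples):

# on each VM: publish the local NVMe (or a zvol on it) as an iSCSI LUN
zfs create -V 900G nvmepool/lun1
sbdadm create-lu /dev/zvol/rdsk/nvmepool/lun1
stmfadm add-view <lu-guid>
itadm create-target

# on the VM that should own the pool: log in to the remote LUN and mirror it with the local one
iscsiadm add discovery-address 192.168.1.2
iscsiadm modify discovery --sendtargets enable
zpool create vmpool mirror <local-nvme-disk> <remote-iscsi-disk>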
 
Okay, this makes sense, thanks. So is it reasonable to have 2 ESXi hosts (and head VMs) but only one dual expander JBOD?
 
Yes, this is typical.

Two storage servers (barebone or VMs), each with a SAS HBA, each HBA connected to one of the expanders in the dual expander SAS Jbod (and therefore both storage servers can access any dualport SAS disk).
 
Your virtualization engine shouldn't give a crap about the underlying storage, and it shouldn't give a crap about the FS your guest OS is running. If it does... you're running the wrong hypervisor. The only things it should care about are whether it can create datastores out of it and whether it meets your acceptable level of response time. Running ZFS in your guests because your hypervisor is running it is silly. The whole point of virtualization is to abstract all of that from the guest OS.
 
Looking at the Supermicro 12-drive JBOD you listed. Looks very interesting. I'm puzzled as to why their docs show connecting *two* 8643 or 8644 cables to the same HBA. Is this to get 8 lanes instead of 4? It can't be the number of devices supported, since the JBOD has an expander that the HBA connects to. Or am I misunderstanding something?
 
In a dual expander Jbod, the first SAS port of every disk is connected to the first expander and the second SAS port to the second expander. If you connect both expanders to the same HBA, you double the access performance from 1x12G to 2x12G per disk (if the number of disks is not the limit, as a single SAS expander cable carries 4x12G). In a dual expander solution you can use up to 2 SAS cables per expander to connect the HBAs (= 16x12G overall between the expanders and the HBAs). If you use two HBAs in one storage server, you add redundancy over cabling and HBA (dualpath SAS connectivity needs an mpio-capable storage server setup, as every disk is then known under two device names, where one must be suppressed and managed only in the background).

If you use two storage servers, each with its own HBA, each server has 12G access to any disk in an HA/failover setup.

If you buy new, you want SAS3 (12G). The new generation of 12G SAS SSDs like the WD SS530 is close to NVMe, with trouble-free hotplug capability. With faster SSDs, the cumulated performance of a few SSDs is already > SAS2/6G over an expander.
 
I think I was not clear, sorry. That JBOD has two versions - single expander and dual expander. What is confusing me is that each expander has *two* inputs, not just one. So the SM manual shows (for one particular configuration) two SAS cables going from a single HBA to the two inputs on the same expander. I'm trying to understand if that is to increase BW to 2x12gb or whether a single cable can't drive all 12 drives. The latter makes no sense to me, since there is an expander, so the HBA is not wired directly to the drives. Does this make more sense?
 
Each disk is connected to the SAS3 expander via a 12G SAS link.
In a 12-bay Jbod, when each disk delivers data at its maximum speed, you can get up to 12x12G into the expander.

Between expander and HBA (with one SAS connector cable = 4x12G SAS) this is reduced to 4x12G maximum performance. As your HBA has a second external port as well, you can connect this to the second expander port, which means 8x12G between HBA and expander. With a dual expander and a 16-port or 2x 8-port HBA you would be able to connect HBA and Jbod with 4 SAS cables = 16x12G.

Usually 4x12G = around 4GB/s is more than most servers can process.
 
That makes sense, thanks. My JBOD at the moment is a 4x2 ZFS RAID10 of NL-SAS spinners, so performance isn't critical...
 
This is uncritical. Even a 6G expander has more than enough performance to the HBA with a single SAS cable (4x6G). This is around 2.4 GB/s, enough to max out what 12 mechanical SAS disks can deliver (count 200 MB/s each).

Using 6G SSDs (500 MB/s) or 12G SAS SSDs (1 GB/s) per disk is different, as then the single SAS connector between HBA and expander can become a bottleneck with many SSDs.
 
I downloaded the zip file from your site with the 2 OVA files. I am running ESXi 6.7u3. Neither one deploys successfully (both complain about checksums not matching the manifest?)
 
Trying to test your cluster in a box setup, but apparently the PRO license I already have isn't adequate? Need a PRO complete or something? Is there a way to evaluate this?
 
I have tested the current ova zfs_vsan_omni032_esxi6v1.ova.
I was able to import it directly in ESXi, but saw this message as well when I tried to import it via VMware Workstation into ESXi. It seems to be a problem with the ovftool that I used to generate the single-file ova. The problem is the .mf file in the ova. Remove it (e.g. via 7-Zip), re-download the ova, or import directly in ESXi.
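
Since an ova is just a tar archive, removing the .mf can also be done like this (a sketch; the manifest file name inside may differ):

# unpack the ova, drop the manifest with the checksums, then deploy the extracted .ovf/.vmdk in ESXi
tar xf zfs_vsan_omni032_esxi6v1.ova
rm *.mf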

About your key.
You need a new key for this machine id (menu extensions > get machine id). Swap your key at https://www.napp-it.org/extensions/swap_en.html. You need the Pro key and 19.dev on the cluster control server. For the heads use at least 19.10 home.
 
Always an adventure :) I did the key swap, but your ISP's SMTP server rejected the self-signed certificate on my SMTP proxy. I've added that host to the exempt list. Hoping it retries soon...
 
Waiting for my SC826 JBOD to arrive. I'm looking at your docs, and am not sure where the cluster controller VM is supposed to go? Storage wise, I mean. Is it required for the cluster to function under normal circumstances?
 
The cluster consists of two OmniOS servers or VMs (heads) in failover mode (About > Settings). One of them (the master) has the failover pool imported and offers services over a failover HA IP.

The second head (slave) is in standby mode. You can use it for tests or updates.

The control server is a third OmniOS server or VM in control mode (About > Settings). It is there to manage the manual or automatic failover and to display the state of the cluster. On a failover it kills the former head, imports the failover pool on the second head and enables services over the HA IP (that head becomes the new master). The former master becomes the slave and you can do tests or updates there. This control function can be enabled on an OmniOS system that you use as a backup system. A typical cluster is a three-server setup with 2x heads and a control/backup server.

The control/backup server and the second head are not needed for your services. You only need them for a failover.
 
Ah, okay, that makes sense. Thanks! I was wondering how the control VM would avoid crashing/hanging if the host it was running on got borked along with the storage appliance on that same host. Fortunately, I have a low-power sandy bridge host in a micro-atx case I can deploy as the control server.
 
I think I'm stuck. Got the physical CC node set up, as well as the two virtual storage appliances. It looks as if the CC unit needs to access the storage subnet? This is not possible for me, as it is a 50Gb Mellanox link directly between the two vSphere hosts. I had thought the CC would send commands to the master and slave to manipulate the HA IP. When I look at System => Appliance Cluster, it shows both storage nodes as up and in standby. The CC node shows all green except for the HA IP, which shows red. Am I misunderstanding how this works?
 
The cluster nodes must see each other over the management interfaces. This seems to be the case, as cluster control shows the state of the heads.

If the HA pool is not yet created, you must set one head to standalone mode to configure the zraid pool, then switch back to failover mode.
If both heads are up and in standby/slave mode with the zraid pool shown as importable, you must manually promote one to the master role (it then imports the zraid pool and enables the HA IP).

The cluster control server controls the heads remotely over the management net (all nodes must be in an appliance group; on the CC, menu Extensions > Appliance group > add). It does not need access to the disks. It should have access to the HA IP (even if that is in a different net, e.g. the LAN) to detect its state.

PS:
Via pre/post failover commands/scripts you can enable/disable a second HA IP, e.g. one for the regular LAN that is accessible by the CC and one for a dedicated direct link.
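
As a sketch of what such a pre/post script could do on OmniOS (interface name and address are only examples):

# post-failover on the new master: bring up a second HA address on the LAN interface
ipadm create-addr -T static -a 192.168.1.50/24 e1000g0/ha2
# pre-failover (or on the old master): remove it again
ipadm delete-addr e1000g0/ha2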
 