Building a (low cost) Linux-based ZFS High-Availability Clustered NAS

ewwhite

I thought I'd share this here since I've used bits and pieces of advice from the forum in the past.

I was frustrated with the limitations and cost implications of using commercial ZFS solutions in a high-availability setup. NexentaStor, QuantaStor, and Zetavault all had cost or support issues that prevented me from using them in this way.

My primary use case is for VMware clusters that require NFS-backed storage. I could make this work under iSCSI, but I tend to prefer NFS+10GbE for VMware.

I spent a few months testing a build of CentOS with Red Hat's cluster add-on, and getting the failover portion to be reliable under the most common failure scenarios.

Original Reddit posts: Is anyone interested in my ZFS on Linux HA recipe? • /r/zfs and https://redd.it/4wxs4z

Documentation on Github: Home · ewwhite/zfs-ha Wiki · GitHub
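
For a flavor of what the recipe involves, here's a heavily condensed sketch of the Pacemaker/pcs side on CentOS 7. Node names, pool name, iLO addresses, and the VIP below are placeholders, it assumes a ZFS resource agent is installed as ocf:heartbeat:ZFS (taking a "pool" parameter), and the GitHub wiki is the authoritative procedure.

    # Sketch only -- placeholder names throughout; see the wiki for the real steps.
    # Assumes CentOS 7 with pacemaker/corosync/pcs on both nodes and a ZFS
    # resource agent available as ocf:heartbeat:ZFS.

    # Build the two-node cluster (CentOS 7-era pcs syntax)
    pcs cluster auth node1 node2 -u hacluster
    pcs cluster setup --name zfs-ha node1 node2
    pcs cluster start --all

    # Two-node specifics: ignore quorum, fence through iLO/IPMI
    pcs property set no-quorum-policy=ignore
    pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 ipaddr=ilo-node1 login=admin passwd=secret
    pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 ipaddr=ilo-node2 login=admin passwd=secret

    # Pool import/export plus the floating NFS IP, kept together in one group
    pcs resource create vol1 ocf:heartbeat:ZFS pool=vol1 op start timeout=90 op stop timeout=90 --group=group-vol1
    pcs resource create vol1-ip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 --group=group-vol1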

___

Low-cost build manifest for a simple 12TB usable storage array (under $3.5k):

HP StorageWorks D2600 fully disassembled.


Front view of HP ProLiant DL360 G7 head nodes and D2600 JBOD.


Rear view of servers and JBOD with 6G SAS cabling.
 
Any way you could add a 2nd SAS enclosure for further redundancy?

What kind of performance are you seeing?
 
Sure. I have setups like this with multiple enclosures as well. The cabling setup is a factor, but performance is really dependent upon pool design and your disk/controller situation; e.g. you can fill a JBOD with SAS SSDs, 10k disks, or 7.2k SAS drives...

Performance is fine, though. Think of it as direct-attached storage where only one server gets to see the disks at a time.
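
To illustrate the "only one server at a time" point: outside of cluster control, a failover is nothing more than an export on one head and an import on the other (pool name below is a placeholder); the cluster software just automates and polices this.

    # Manual illustration of what the cluster automates (placeholder pool name).
    # On the active head: release the pool.
    zpool export vol1

    # On the standby head: the same dual-ported JBOD disks are visible there,
    # so the pool can simply be imported and service resumes.
    # (-f is needed if the other head died without exporting cleanly.)
    zpool import -f vol1
    zpool status vol1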
 
What exactly makes this a NAS and not a SAN? You're hosting VMs on the disks, yes? This isn't just file storage...
 
Very interesting ewwhite. Thank you for sharing this. I have been running into the same issues with pricing and value for my customers.
 
ZFS HA solutions like RSF-1 (ex-NexentaStor) are based on two nodes attached to a common dual-expander, multipath SAS JBOD. The cluster software handles a node failover: the pool is imported and the services fail over to the second node if the first one fails.

I am currently working on a ZFS appliance netraid based on iSCSI as a simpler and cheaper alternative that tolerates a storage-head failure and/or a storage-node failure. You use two storage appliances, each with an NVMe, SATA or SAS datapool, and create an iSCSI LUN from each pool. A storage head with an initiator (dedicated, or on one of the servers) creates a RAID-1 netraid pool over both LUNs and shares it over a failover IP via NFS or SMB. The result is a realtime network pool mirror.

Performance-wise, 10G is very good, and you have two completely independent nodes. If the master fails, you can import the pool on the second node, take over the HA IP, and you are ready. For a manual failover you only need an iSCSI target and initiator, like Comstar on Solarish (Solaris or the free OmniOS).

I am currently adding management software to make setup and management easier and to allow an active/active setup with auto failover. The current state: the howto and the basic setup management are finished; manual and auto failover are still missing. I hope to have a first working setup by the end of the month.

Concept: see http://napp-it.de/doc/downloads/napp-it.pdf, chapter 25.
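
As a rough illustration of the concept only (not the napp-it implementation itself), the Solarish building blocks might look something like the following; pool names, sizes, IP addresses, and device IDs are placeholders.

    # Conceptual sketch -- the real procedure is in the napp-it PDF above.
    # On each of the two storage nodes (OmniOS/Solaris): carve a zvol out of
    # the local datapool and publish it as an iSCSI LUN via Comstar.
    zfs create -V 2T datapool/lun0
    stmfadm create-lu /dev/zvol/rdsk/datapool/lun0
    stmfadm add-view <LU-GUID-printed-by-create-lu>
    itadm create-target

    # On the storage head: discover both targets with the iSCSI initiator,
    # mirror the two network LUNs into one pool, and share it.
    iscsiadm add discovery-address 10.0.0.11 10.0.0.12
    iscsiadm modify discovery --sendtargets enable
    zpool create netpool mirror c0t600144F0AAAAAAAAd0 c0t600144F0BBBBBBBBd0
    zfs set sharenfs=on netpool
    # If this head fails, the second head imports netpool and takes over the HA IP.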
 
You probably want Solaris and/or FreeBSD for this but oh well ;-)

No, I explicitly do NOT want Solaris variants or FreeBSD for this application. This is mainly for hardware support and monitoring reasons.
I base most systems on HP ProLiant hardware and make use of the monitoring tools and add-ons available for supported operating systems. These are definitely not available for FreeBSD, so it's a non-starter.

Also, mindshare matters. I'm deconstructing a botched 250TB FreeNAS installation at a new client, which the customer was afraid to touch because of the "foreignness" of FreeBSD. I think people can relate to Linux quite a bit more (e.g. it's very easy to sell CentOS/RHEL).
 
Many hardware vendors support FreeBSD, and I don't think anyone running ZFS would advise you to use Linux, but oh well... There's a recent presentation about how gandi.net uses FreeBSD in their infrastructure and why they chose that OS: BSDCan 2016: FreeBSD based high density filers.
Napp-it is also a candidate.
 
I have quite a bit of experience with this; hence the post sharing my approach.
ZFS on Linux isn't a problem, so I'm not sure what you're asserting.

Napp-it is a fine product, and I'm certain many are happy with their FreeBSD installations. I'm focused on Linux, though.
 
I recently benchmarked my 2TB PostgreSQL database running on ZFS, on both Linux and FreeBSD, and got consistently 50% better performance on FreeBSD.
I have also tested SQL Server on iSCSI from a FreeBSD server with 4x Intel S3500 in RAID 0, and got consistently 20% better read and write performance compared to a native SATA RAID 0 setup in Windows.

So FreeBSD's ZFS is extremely well optimized and shines for high-IOPS workloads.
 
Thanks for posting. The OCF script you link in the article appears to be written for Illumos; did you have to make any modifications to get it to work with ZoL? I gave it a quick look and it doesn't look like it's really doing anything too crazy.
 
No modifications were necessary. It works out of the box.
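
For anyone curious what such an agent boils down to, here is a simplified sketch of the general start/stop/monitor pattern. It is not the actual script linked from the wiki, which adds OCF metadata, parameter validation, and proper OCF exit codes (0 for running, 7 for not running).

    #!/bin/sh
    # Simplified sketch of the general ZFS OCF resource-agent pattern;
    # NOT the actual agent referenced in the wiki.
    POOL="$OCF_RESKEY_pool"

    case "$1" in
      start)
        # -f because the pool was last imported on the other node
        zpool import -f "$POOL"
        ;;
      stop)
        zpool export "$POOL"
        ;;
      monitor)
        # succeeds only if the pool is currently imported on this node
        zpool list -H -o health "$POOL"
        ;;
    esac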
 
OP, did you consider using GlusterFS, backed by ZFS, on the two nodes?

If so, was there a reason you went this way? It seems to me that the active-active nature of GlusterFS would allow for higher throughput than the active-passive setup you are running now.
 
Why would you think that? If you are talking about a ZFS pool on each node, with a zvol from each pool providing storage to GlusterFS, writes have to be done to both nodes, or you don't have redundancy. And since you presumably want ZFS redundancy on each node, if you go with Gluster redundancy you are losing storage on each ZFS pool due to redundancy, *and* losing storage at the GlusterFS level too (since it would be doing its own redundancy). I suppose you could go with a simple RAID 0 on each ZFS pool, but then adios self-healing. You could go with Gluster striping instead of redundancy, but then having a node go down screws you.

Easiest is to get a JBOD enclosure with SAS disks (can be relatively cheap), install CentOS (or whatever) on each node, and give each node an HBA to talk to the JBOD enclosure (make sure it has two ports for HBAs). Then install Pacemaker cluster software with a ZFS resource agent to export/import the pool on failover. This is basically what ewwhite is doing, but you can implement a cheaper and simpler (but not as robust) version of his solution...
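
Before layering Pacemaker on top of a shared JBOD like that, it's worth confirming that both heads actually see the same dual-ported SAS disks; a quick sanity check (device names will vary):

    # Run on each node; both should report the same SAS disks, because
    # dual-ported SAS drives expose the same WWN to both HBAs.
    lsscsi
    ls -l /dev/disk/by-id/ | grep wwn-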
 
I'm not worried about throughput. The performance profile of a clustered node in my setup is the same as that of a single node.

What I DO have in some setups is a dual active/passive arrangement where I pin a zpool to each node. Say I have two zpools housed in different JBOD enclosures: node1 serves pool1, node2 serves pool2, and each pool has a VIP associated with it. If node1 goes down, node2 imports the pool and VIP, and vice-versa.
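
In Pacemaker terms, that pinning is just a pair of location preferences on the two resource groups; a sketch with placeholder group and node names:

    # Each pool's group normally runs on its preferred node, but can run
    # anywhere if that node fails.
    pcs constraint location group-pool1 prefers node1=100
    pcs constraint location group-pool2 prefers node2=100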
 
From what I understood from Nexenta when we asked about active/active during a SAN build for our office, this is basically what they told us we would have to do.
 
Valid point on the storage space lost. I do in fact have mirrors in my Gluster nodes, and the whole node is mirrored as well, so it's basically 25% efficiency (50% from the in-node mirrors times 50% from the node-level replica).

I was just thinking of the availability of the VM disks from the point of view of the compute nodes. I know that with Gluster I can pull the power cord out of a Gluster node while a VM is pushing data and not lose anything; there's a 50 ms or so stutter, and then the VM continues.

How long does it take to switch over the pool in the case you propose? Do the VMs need to be rebooted, or is it transparent?
 
I'm not sure what danswartz is proposing versus the solution I outlined. But if we're talking about the solution described on my GitHub, there is no impact to a running VM during controller node failover.

See:
 
I'd have to go back and re-read your solution, but mine is much lower-end: a simple JBOD enclosure with two ports, with each of the two ZFS storage nodes connected to one port. The active node serves up VM storage via a virtual NFS IP address. If the active node dies or hangs, the backup node imports the pool and activates the virtual IP. In that scenario, under relatively light load, I typically see about 10 seconds for the backup server to be back in business.
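
One gentle way to exercise that failover without pulling cables is to put the active node into standby and watch the resource group move (node and group names below are placeholders; a hard power-pull test is still worth doing separately):

    # Graceful test: drain the active node and watch the group move.
    pcs cluster standby node1
    pcs status        # the pool group and VIP should come up on node2

    # Return the node to service as a failover target afterwards
    pcs cluster unstandby node1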
 