OpenIndiana/napp-it + OpenSource Clustering/High Availability

_Gea

OpenIndiana + OpenSource Clustering and HA/High Availability

added:
----------------------------------

There is now an OpenSource storage cluster available for OpenIndiana. Because this is a real killer feature,
I intend to add it to napp-it - my cluster knowledge is limited, but I can learn.

My first intention is to find interested users to share experiences around a common base config, to push the HA idea
for free OpenIndiana ZFS storage clusters (to have one working and tested minimal config) and to include it as a free option in
napp-it (as an open alternative to the commercially supported RSF-1 clusters for OpenIndiana from high-availability).

I will start with my current state at http://booki.cc/howto-setup-openindiana-ha-clusters-in-esxi/
This is an open book, editable by everyone, and I hope for a working config if some more users help to write such a first howto.
If you are interested in clustering (on OpenIndiana), please share ideas and help with setup suggestions.
All you need is a VMware server to create two VMs, so we have a common config to work on.

----------------------------------


Clustering is now available not only as a commercial product like the RSF-1 solution from www.high-availability.com,
but also as an OpenSource solution based on Pacemaker: http://www.clusterlabs.org/wiki/Main_Page

I was recently asked by a user to support this solution within napp-it. He also sent me patched files according to
http://www.scuba.org.uk/toptech/openindiana-ha-cluster-howto/ together with a suggestion for an entry config:

"Supermicro 6036ST-6LR. It is the base of Tegile's cluster config as well as iX's TrueNAS pro.
That system is active active hot pluggable, only $3K barebone, two motherboards with 16 dual path bays."

I also contacted Mike from scuba.org.uk, who posted the HowTo.
He added: .. The solution itself doesn't have built-in support for storage clustering, however there is no reason
why someone couldn't write an OCF (Open Cluster Framework) plugin for Pacemaker which supports ZPool failover.
..
However, the way I would do it is to go a bit old school with the plugin and go back to the way the original Sun Cluster
etc. did storage failover: check for a file on the pool and fail over if that file isn't where it should be. Of course, to do
this you need to have the storage on some sort of backend SAN; it's not like the Nexenta plugin where they do metro
replication to get the failover sorted...

but he also mentioned not having the time to offer more help on the solution.

From my side, I do not have enough experience with HA, nor the time to support such a solution on my own.
But maybe there are some users who have already tried it and can comment or answer questions about this solution.
An active user group around this feature is needed if it is to become a ready-to-use technology for all. I would be glad to integrate it
as a free add-on within napp-it. If someone adds it as an affordable extension, I will support that too.

... added from a user:
Since ZFS isn't a cluster file system, even Oracle's and NexentaStor's RSF-1 treat ZFS as a single-mount file system, meaning that it is indeed
a script that cleanly exports and then imports the pool on the other node.

(Just don't take more than ~30 seconds, or VMs' file systems would error out during the node switch)...

more info, including the installer files:
http://wiki.openindiana.org/pages/viewpage.action?pageId=23855106
 
I know Nexenta has lots of options for the health check with RSF-1 HA (IP, serial), but the one that intrigued me was the disk-based one; not sure if it was quite like a quorum drive or not. But you could kind of borrow from that approach:

Both heads are active/active. A pool that you make HA lives on node A, which writes or "touches" a file in some sort of admin folder mounted in that pool. Node B constantly monitors that file via that admin share (it could check via SSH, CIFS, etc.). If node B sees the file go stale, it assumes something bad has happened to node A. You could add some logic here: if node A is still responsive, it could first try to cleanly export the pool over to node B, otherwise node B would force the import, making it the new master for that given pool. The contents of the shared file could also tell you which node the pool is currently mounted on, to avoid "split brain syndrome", yadda yadda.
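
To make the idea concrete, here is a minimal sketch of such a check as two shell loops (the pool name, paths and node names are made-up examples, not taken from any existing plugin):

# on the active head (node A): keep refreshing a witness file on the HA pool
while true; do
    date +%s > /hapool/.ha-admin/alive
    sleep 5
done

# on the standby head (node B): read the file over the admin share and take over
# if it goes stale; assumes synced clocks and that date supports %s (GNU date)
while true; do
    sleep 15
    last=$(cat /net/nodeA/hapool/.ha-admin/alive 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last )) -gt 30 ]; then
        ssh nodeA zpool export hapool || zpool import -f hapool
    fi
done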

I wish I was more Solaris-knowledgeable; I've set up heartbeatd on CentOS numerous times, but I have a feeling I'd be a bit lost on the Solaris side (and I too lack the time).

An excellent idea though!
 
Hi All,

Mike here from scuba.org.uk. I originally implemented the solution and put my findings up on my blog; it was hard work getting it all to the state it is currently in, where it can actually be implemented. The docs and patches on my blog are all workable. I'm currently (where time allows, and it's not often that it does right now) working on a revised builder script, so you just install the script on an OI installation (151a) and run it, giving the prefix you want, and it will do the rest for you; all you have to do at the end is configure the heartbeat (ha.cf) and the cluster setup (crm).

Many thanks to the people who helped move the project along with the scripts, and to _Gea, who added the information to the OI wiki page and also to this forum. Where time allows I'll update the components as I go, but for the moment this is all I have time to send out.

Regards,

HellDesk
 
Update: OpenIndiana + an easy-to-set-up, free HA solution

Some work was done by TheMoron to set up a free HA cluster on OpenIndiana 151a3 just by running a script.
This could be a huge step forward toward a quick and easy-to-set-up HA ZFS storage cluster.

He sent me the script. You can download it from http://www.napp-it.org/doc/clustersetup.zip

A note from Mike about a Python path problem with the installer script:
First set the following environment variable:
export PYTHONPATH=/opt/lib/python2.6/site-packages

Optionally, some symbolic links from /usr/bin -> /opt/ha/bin and /usr/sbin -> /opt/ha/sbin are needed.
An updated installer will follow in the next days once fully tested.
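
As a rough example of both steps in one go (the loops just link whatever the build put into /opt/ha; treat the exact paths as assumptions):

export PYTHONPATH=/opt/lib/python2.6/site-packages
# put the cluster tools on the default PATH, one link per installed binary
for f in /opt/ha/bin/*;  do ln -s "$f" /usr/bin/;  done
for f in /opt/ha/sbin/*; do ln -s "$f" /usr/sbin/; done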

Example and goal (redundant storage cluster of two independent nodes):
Two independent OI server nodes, each with its own storage, published as an iSCSI target.
These two targets are used as a mirrored pool on the active node and shared via CIFS and NFS.
If this node fails, the second one should take over and continue serving CIFS and NFS with its half of the mirror.
When the failed server comes back, the mirror is resilvered.
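
On the COMSTAR/initiator side, the building blocks for this layout would look roughly like the following (a sketch only: pool names, sizes, IP addresses and the GUID/LUN placeholders are examples, not a tested recipe):

# on each node: publish a local zvol as an iSCSI target (COMSTAR)
zfs create -V 200G data/halun
stmfadm create-lu /dev/zvol/rdsk/data/halun     # prints the LU GUID
stmfadm add-view <lu-guid>                      # make the LU visible
itadm create-target

# on the currently active node: log in to both targets and mirror them
iscsiadm add discovery-address 192.168.10.11
iscsiadm add discovery-address 192.168.10.12
iscsiadm modify discovery --sendtargets enable
zpool create hapool mirror <lun-from-node-a> <lun-from-node-b>
zfs create hapool/share
zfs set sharenfs=on hapool/share
zfs set sharesmb=on hapool/share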

If you would suggest another default config, please discuss (I am not a clustering expert).

My intention is to collect info about such a basic entry-level config built from two OI nodes,
with the option to evaluate the cluster virtualized and then build a hardware cluster with the same config.
I will publish this config and integrate/manage it as a free add-on in napp-it.

I would like to ask everyone who is interested in a free HA ZFS storage cluster to try this solution, discuss experiences
here, and/or offer an easy-to-follow cookbook for such a basic setup to help others with their first steps.

basic info:
http://wiki.openindiana.org/pages/viewpage.action?pageId=23855106
 
Is it not possible to use this software with SAS JBOD storage, with the cluster software just importing/exporting the pool on the correct node?
 
Is it not possible to use this software with SAS JBOD storage, with the cluster software just importing/exporting the pool on the correct node?

A dual-path SAS drive solution is possible, as is a solution with mirrored drives on each node.
I would prefer the mirrored solution for first steps/evaluation, and an easy-to-follow cookbook based on it,
because you do not need dedicated hardware and everyone who is interested can try it virtualized.

My first intention is to find interested users to share experiences around a common base config, to push the HA idea
for OpenIndiana ZFS storage clusters (to have one working and tested minimal config) and to include it as a free option in
napp-it (as an open alternative to the commercially supported RSF-1 clusters for OpenIndiana from high-availability).

This is my first cluster with this solution, so I need help to build a working default config.
I will start with my current state at http://booki.cc/howto-setup-openindiana-ha-clusters-in-esxi/
This is an open book, editable by all, and I hope for a working config if some more users help to write such a first howto.


Question:
Is this Comstar failover idea the best (easiest) config (see https://blogs.oracle.com/jayd/entry/iscsi_failover_with_comstar),
or should I prefer a solution based on two nodes with shared dual-path SAS storage?

Main disadvantage of the shared SAS option: you need hardware. You cannot evaluate it by just installing two VMs.
 
We should be able to test the shared SAS JBOD config virtually as well, using shared VM disks on ESXi.
 
Any progress on this? I wouldn't mind trying to contribute to Pacemaker+ZFS, but I thought I would see where things are at.
 
current state:
- Pacemaker was ported to Solaris
- documentation and a ready-to-use evaluation/installation howto are missing

problem:
- people with Pacemaker experience are needed
- those who started the project lack time

but everyone who is interested is invited to add insights to
http://booki.cc/howto-setup-openindiana-ha-clusters-in-esxi

(shared document, open to edit for all)
 
Thanks for the update. I had a play around and built a Pacemaker/Heartbeat cluster on OI 151a5, which is working fine.

The problem I have is trying to find a decent fencing solution. I found SBD (http://www.linux-ha.org/wiki/SBD_Fencing), but the SBD daemon does not appear to build on OI; the code for the daemon seems to use Linux-specific includes.

Then there is SCSI-3 persistent reservation, however I cannot find any tools to implement this on OI. Linux provides sg_persist... and I just googled sg_persist for OpenSolaris and found the SUNWsg3utils package, which seems to work on OI; I will test this now :)

EDIT: it turns out you can build sg3_utils for Solaris from source (http://sg.danny.cz/sg/); it includes a good example of how it works in the examples dir, sg_persist_tst.sh
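
For anyone following along, the basic sg_persist calls look roughly like this (the key values and the disk path are placeholders; sg_persist_tst.sh in the source tarball is the authoritative example):

DISK=/dev/rdsk/c0t...d0s2   # the dual-ported disk you want to fence
sg_persist --out --register --param-sark=0x1 $DISK
sg_persist --out --reserve --param-rk=0x1 --prout-type=5 $DISK   # type 5 = write exclusive, registrants only
sg_persist --in --read-keys $DISK
sg_persist --in --read-reservation $DISK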
 
Has anyone yet successfully configured OCF resources in their cluster? I now have a cluster running (thanks to Mike for the packages!), but no resources configured yet; I suppose that's where we're all stuck now, right?

If so, I'll see if I can get something started, but it would be nice if someone already had something working to start from.


BTW, I did see some messages about starting heartbeat as an svc (SMF) service, but that does not seem to be included in the current package or the scripts. Does anyone know of instructions for this, or have an XML file for it?
 
BTW, I did see some messages about starting heartbeat as an svc (SMF) service, but that does not seem to be included in the current package or the scripts. Does anyone know of instructions for this, or have an XML file for it?

Clustersetup.zip/clusterbuilder.sh (at the bottom of the script) does the svc service setup for you (I had to change the group in the XML to a group that exists).

I have only got the "IPaddr" OCF resource going so far; it required some small changes to the script: change the shell to bash and change the root env var. This provided an IP address that could fail over between nodes, and it seemed to work well.
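
For reference, the kind of IPaddr primitive described here would look roughly like this in the crm shell (the address and NIC name are examples; the two properties are the usual settings for a two-node test cluster without fencing):

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore
crm configure primitive failover_ip ocf:heartbeat:IPaddr \
    params ip=192.168.1.200 nic=e1000g0 \
    op monitor interval=10s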

I'm currently in the research phase of a ZFS/NFS HA OCF agent, but I want to do it with SCSI-3 persistent reservations to remove the chance of a split brain with both nodes trying a forced zpool import. Once that is sorted I can work on iSCSI/FC resources, if I get that far ;)
 
@GEA:

On your booki page you wrote:
What is the suggested way to set up and share storage?
- share each RAW disk as a target and build the pool on the node,
to have similar settings as with shared SAS storage?

I think that is the right way to approach this. It would basically mean that the only difference from a dual-path SAS setup is that you would need to attach the iSCSI disks first.
From there, the pool import and the NFS/CIFS/iSCSI sharing would be the same configuration.

Typically, you would create a resource group consisting of: a shared IP resource, a pool resource, and an NFS/CIFS/iSCSI-target resource (for the sharing).
With the disks on iSCSI you would also need an iscsi-initiator resource, and the pool resource would depend on it.
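
A sketch of what such a group could look like in the crm shell - note that only IPaddr exists today; "zfspool" and "iscsiinit" stand for OCF agents that would still have to be written or ported, and all names and addresses are examples:

crm configure primitive ha_init ocf:heartbeat:iscsiinit params portal=192.168.10.11
crm configure primitive ha_pool ocf:heartbeat:zfspool params pool=hapool
crm configure primitive ha_ip ocf:heartbeat:IPaddr params ip=192.168.1.200 op monitor interval=10s
crm configure group ha_storage ha_init ha_pool ha_ip
# a group starts its members in the listed order and keeps them on one node:
# initiator login first, then the pool import, then the shared IP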
 
Clustersetup.zip/clusterbuilder.sh (at the bottom of the script) does the svc service setup for you (I had to change the group in the XML to a group that exists).
Ah! I used Mike's packages, so I have not used the clusterbuilder script to build from source. I'll have a look at the script for the svc service bits.

I have only got the "IPaddr" OCF resource going so far; it required some small changes to the script: change the shell to bash and change the root env var. This provided an IP address that could fail over between nodes, and it seemed to work well.
I would have thought it would be harder. Did you also need to change the scripts to use ipadm? I'm still trying to figure out which script actually does what :(

I'm currently in the research phase of a ZFS/NFS HA OCF agent, but I want to do it with SCSI-3 persistent reservations to remove the chance of a split brain with both nodes trying a forced zpool import. Once that is sorted I can work on iSCSI/FC resources, if I get that far ;)
That's good to hear. I think proper fencing is indeed good to have before doing a pool import. :D

Thanks so far !



Edit: Got the heartbeat svc set up, using the XML file from GEA's booki site (as the script does). Just a minor thingy: it assumes that a group "wheel" exists, which was not the case on my box (oi-151a-3 with latest updates). Maybe that group is created when you install napp-it, which I haven't done yet.
I changed the group in the XML file to adm, and it runs fine (I think) :)

Edit2: never mind that, I completely missed your hint about it :eek:
 
I would have thought it would be harder. Did you also need to change the scripts to use ipadm? I'm still trying to figure out which script actually does what :(

The script uses ifconfig to add an IP alias to an existing interface. I did have static IPs set up on both nodes, though. The script has quite a few SUNOS cases you can take a look at.
 
I still have a problem where crm does not see the OCF resources. I suspect it has to do with setting up the environment with respect to OCF. I run crm as root, I have OCF_ROOT=/opt/ha/lib/ocf and OCF_AGENTS=/opt/ha/lib/ocf/resource.d/heartbeat set, PATH has /opt/ha/sbin added, and PYTHONPATH is set...

What am I missing ?
 
I used clusterbuilder.sh to build my version, and it seems to work fine. How are you checking for resources? I use

"crm ra list ocf heartbeat"
 
I try to do the same thing:
crm(live)ra# list ocf heartbeat
Traceback (most recent call last):
  File "/opt/ha/sbin/crm", line 42, in <module>
    crm.main.run()
  File "/opt/ha/lib/python2.6/site-packages/crm/main.py", line 283, in run
    if not parse_line(levels,shlex.split(inp)):
  File "/opt/ha/lib/python2.6/site-packages/crm/main.py", line 144, in parse_line
    rv = d() # execute the command
  File "/opt/ha/lib/python2.6/site-packages/crm/main.py", line 143, in <lambda>
    d = lambda: cmd[0](*args)
  File "/opt/ha/lib/python2.6/site-packages/crm/ui.py", line 1162, in list
    if p and not p in ra_providers_all(c):
  File "/opt/ha/lib/python2.6/site-packages/crm/ra.py", line 147, in ra_providers_all
    for s in os.listdir(dir):
OSError: [Errno 2] No such file or directory: '/usr/lib/ocf/resource.d'
I can't figure out which files get sourced, and where that "/usr/lib/ocf/resource.d" comes from. I could try creating symlinks from /usr/lib, but that isn't a pretty solution IMHO.

Hmm, just tried it with a symlink and it works, sort of; I get "not installed" when trying to start a resource:
Failed actions:
fo0_start_0 (node=oi-sn01, call=3, rc=5, status=complete): not installed
fo0_monitor_0 (node=oi-sn02, call=2, rc=5, status=complete): not installed
But at least it now knows where to find the OCF RAs.
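
For reference, the symlink workaround boils down to pointing the path compiled into crm at the real prefix (assuming the /opt/ha prefix used above):

mkdir -p /usr/lib/ocf
ln -s /opt/ha/lib/ocf/resource.d /usr/lib/ocf/resource.d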
 
Hmm, just tried it with a symlink and it works, sort of; I get "not installed" when trying to start a resource:

You get that when the script doesn't run correctly; that's where the fun of making them work begins ;)
 
I just found a VMware KB article about how to set up shared storage on ESXi 5 and was able to get two OpenIndiana nodes to see the same disk. On one node I had the pool imported read-only and was able to see the files that had been saved at the time of the import. I'm going to try more tests with Pacemaker next weekend.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165

zpool import -o readonly=on -f sharedpool

Hope that helps someone more knowledgeable in shared storage clusters progress down this rabbit hole.
 
on that booki link, one of the phrases is multipath...
I wanted to ask if anyone knows how I could replicate the functionality of StorMagic with OpenSolaris-based distros... StorMagic uses a Linux-based virtual appliance; you can tell by watching it boot.

StorMagic integrates with VMware and shares out an iSCSI target, which VMware mounts and uses. What is unique is that the iSCSI target is the same across the 2 nodes.

I'll describe the setup and see if anyone has any ideas how they do it.
vhost1 has StorMagic san1. san1 has a pool with an iSCSI target created and shared.
vhost2 has StorMagic san2. san2 has a pool set up the same way as san1, and this pool is set as a replication partner. san2 then has an iSCSI target with the same target name, yet it is shared on its own IP address, so VMware sees 2 paths to the same iSCSI target.

This is an active-active setup, with a third node as a neutral storage host (a service installed on the vCenter server). If split brain is detected, the neutral storage host decides the winner, which is then kept online while the other node repairs.

How does this compare with what you are doing?
Would it be easier to just use ZFS mirroring, have each vSAN share out an iSCSI target, have a third virtual server mount the 2 iSCSI targets as devices in a ZFS mirror, and then share that back out to VMware as an iSCSI target?
 
Hey guys, first post here... Awesome thread. I have been watching this forum for a while, but just got to the point where I could attempt this cluster configuration on real hardware. The last time I used Solaris was right as Sun was migrating from SunOS, so it's been a while...

I'm running oi_151a5 with napp-it and the packages configured with the clusterbuilder script in this thread (2 nodes). Communication is good and crm_mon -1 confirms heartbeat.

Now I am ready to configure resource agents, and I'm perplexed as to how to do this with ZFS. What I would really like to do is set up a shared storage pool with active/active failover. As an alternative I could resort to separate mirrored pools, but for this configuration I am not as worried about performance as I am about total storage space. I'll only do mirrors if shared storage is not doable.

So has anyone been able to get resource agents set up for storage? I'm trying to spin up, but I'm somewhat of a novice with this config.

NOTE: As I took the services up and down with svcadm, at one point the heartbeat service went into maintenance mode and I couldn't get it recovered. After digging through the logs I found that it was looking for the wheel group (as noted by other users). Rather than recompiling after updating the XML, I temporarily created the wheel group on this machine... Will fix that later.

Thanks to everyone who has contributed to this thread. It really helped me get to where I am. I'm in the red zone. Need to punch it in for the score.

Thanks,
Dan
 
HA on ZFS is pure enterprise and high-end use that usually requires a lot of money paid to High-Availability (directly on OI/Solaris or via Nexenta).

This thread was created in the hope of bundling community efforts.
The original contributors are currently not active (if you have the know-how, you quite often have a time problem), so any contribution to this thread or to the booki file, which is open to everyone, is welcome. I am, like many others, very interested in HA but lack the time to do my own investigations.

thanks
 
Just an FYI that I am still working on this; finding time is just hard (as Gea mentioned). I have abandoned the use of OI for the project though, as OI doesn't seem to be going anywhere. So I'm currently testing on FreeBSD with Pacemaker+Corosync; I've managed to get them built on FreeBSD, now it's just a matter of getting it working correctly and then building the cluster scripts etc.
 
Hello,
I successfully deployed an OpenIndiana storage cluster, starting from the work already done here.
The cluster is based on Supermicro hardware, and I manage two arrays with one ZFS volume each
(it is easy to change for different needs), shared over iSCSI in a multipathing configuration, and it works well.
Is anyone here still interested?

I can write a how-to and publish the scripts.
 
fmosti: that would be very interesting, and educational...

I have been fiddling around with Gea's and Mike's solution, but I just can't seem to get it to work.
No sense in updating anything if I can't get it to work :)
 
Hello,
I successfully deployed an OpenIndiana storage cluster, starting from the work already done here.
The cluster is based on Supermicro hardware, and I manage two arrays with one ZFS volume each
(it is easy to change for different needs), shared over iSCSI in a multipathing configuration, and it works well.
Is anyone here still interested?

I can write a how-to and publish the scripts.

fmosti, I too would be most interested in a how-to, plus the scripts. I just installed OI this past week and started investigating how to configure two HA head nodes sharing a common SAS storage shelf. Your insight would be most appreciated, and I would be more than happy to report back my findings to help validate the repeatability of your config. Thanks in advance for your efforts!
 
Same here, and it's been way more than a few days. If you don't have time, how about just posting the scripts? Anything would help.
 
@_Gea

I would like to contribute to this effort. I am not terribly well-versed in clustering software, but do have a functional cluster set up with CMAN and Corosync.

I actually have a SuperSBB SYS-6036ST-6LR and need to get it up and running very soon. I am not interested in Nexenta's offerings. I gave FreeBSD (NAS4Free) a spin but at the end of the day, I think a Solaris derivative + napp-it is the best option.

Preferably I would like to get this going on OmniOS, as it seems to be the most appropriate and active distribution.

There are a lot of issues with the current clustersetup.zip at http://www.napp-it.org/doc/clustersetup.zip. There is very little error checking, and there are broken links (to Pacemaker, for example, which has moved to GitHub, and to the patches on scuba.org.uk).

Is there a more recent version available? If not, I will begin cleaning up the script and porting it to OmniOS.

Thanks for napp-it, by the way. Incredible piece of software. I am planning on having my company subscribe to the Pro version.
 
hello nitsujr

Currently there is only the commercial offer from highavailability.com (Nexenta uses this as well).
The initial idea of this thread was to bundle efforts around a free solution, where my part was to collect info
and add some base support to napp-it.

Currently there is no activity from the former contributors. Any insights and improvements are welcome -
either in this thread or in the OI wiki (there is no comparable wiki for OmniOS, but I think they would welcome contributions too).
 
Fully agreed, Gea.

I have been twiddling and fiddling around with the Scuba howto and the wiki, but so far I have not been able to get it to work properly.
I was really happy when I saw fmosti's post... but then it just went silent.
 
FYI, RSF-1 is $4,900 for a perpetual license for a two-node cluster, with no raw storage limits. Definitely better than Nexenta's pricing model.

I'm busy this week, but I will see what I can come up with next week. Maybe we should throw something up on GitHub so that we can easily collaborate?

My biggest concern is the handling of pools which haven't been cleanly exported in a failover situation. I need to understand how much risk is involved in forcing an import on the slave node.
 