OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

Two days ago my pool got flagged as degraded; one disk was faulted.
Today another faulted, so my RAIDZ2 pool is really vulnerable now and I'm scared of data loss.
The disks are Hitachi 5K3000s bought in 5/2011. Thanks to napp-it for warning me about the faulted disks.
What should I do until the new drives arrive? Shut down the NAS?

Code:
pool overview:

      pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0 in 4h19m with 0 errors on Sun Feb 26 07:20:20 2012
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        c1t0d0  FAULTED      1   597     0  too many errors
        c1t1d0  ONLINE       0     0     0
        c1t2d0  FAULTED      0 1.65K     0  too many errors
        c1t3d0  ONLINE       0     0     0
        c1t4d0  ONLINE       0     0     0
        c1t5d0  ONLINE       0     0     0

I would never power off. Most failures happen at power-on.

ZFS is the best you can do for availability of data. But if more disks fail than your selected redundancy level allows,
you need a disaster backup on a second machine.

I learned that again today with my vm-webserver. It was a 3 x 2-way mirrored pool + hotfix. One disk failed.
During resilvering the next disk got checksum errors - from the same vdev. No chance to recover the data, and
it was the first time in four years of ZFS that I needed a backup.

I cloned the replicated dataset from yesterday on my backup machine and started the VM remotely via an NFS share.
I will rebuild this VM dataset as a 3-way mirror now for better availability (and speed). If data is important, choose the highest
availability/redundancy level and do backups anyway. Add a hotfix or buy a cold spare. The days it takes to RMA a disk or
receive a new one can be too long to sleep well.
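
To the OP: once the replacements arrive, the swap itself is quick. Here is a minimal sketch of the replace step, assuming the new drive shows up as c1t6d0 (a made-up device name - check yours with format first):

Code:
# list disks so you can spot the device name of the newly inserted drive
format < /dev/null

# replace the first faulted disk with the new one and watch the resilver
zpool replace tank c1t0d0 c1t6d0
zpool status -v tank

# if the new disk sits in the same slot and gets the same device name,
# a plain "zpool replace tank c1t0d0" is usually enough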
 
http://www.stec-inc.com/product/mach16.php


Based purely on the part numbers - somebody is not telling the truth...

But it does say industrial temp on the eBay auction.


Still way better than a few spinning-rust drives, at least until you get into tens of drives.

Something interesting:

The file I downloaded earlier has now been updated/changed.

The one I downloaded the other day was dated 7/11; the new one is dated 2/12.

I wonder whether, with the number of people looking up info, they quickly updated the PDF file.

Looks good... IOPS are now stated as
40,000 read
35,000 write << not 1,500
big jump

4k reads are very impressive:
a 120GB Vertex EX3 MI benches @ 3,992 IOPS read;
these come in @ 24,000 IOPS according to the new PDF file :eek:

http://www.stec-inc.com/downloads/ssd_brochures/MACH16_brochure.pdf
 
I've yet to take delivery of the drives, but I did a little more research, and according to the part number on eBay and the picture used it is NOT an industrial-temp drive.

According to STEC's part-code decoder it is an IOPS drive (and it says so on the photo), which means it can't be industrial temp, as those are branded just Mach16.

Also, the first letter in the last section, 'UCU', indicates operating temp - U means enterprise, not industrial...

Edit: according to their site it looks like there may be new firmware for the industrial drives which brings performance in line with the others.
 
I'm running napp-it on Solaris 11 and loving it.

I have a question about bonnie++.

When I go to Pools -> Benchmarks and click "start" next to a pool under the Bonnie column, nothing appears to happen.

I did a "find / -iname *bonnie* -print", but nothing was returned.

Then I did a "pkg install bonnieplus" and that returned "No updates necessary for this image."

I came back the next day after clicking "start" for the bonnie benchmark, but there is still no data populated in the test results.

So I then opened an ssh session and ran "iostat -xcnCXTdz 5".

Then I clicked "start" again for the bonnie benchmark on a single-disk pool; however, after 15 minutes of monitoring the output of iostat I saw no activity on that disk.

I would run bonnie manually from ssh, but I can't figure out how to do that either.

Any help would be appreciated!

bonnie++ does not currently run under Solaris 11.
Use the dd bench instead (it gives quite similar results).
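
If you want to run it by hand over ssh, a rough sequential dd test looks like this (a sketch, assuming a pool named tank mounted at /tank with compression off, otherwise the zeros compress away and the numbers mean nothing):

Code:
# sequential write, roughly 10 GB of zeros
dd if=/dev/zero of=/tank/dd.tst bs=1024k count=10000

# sequential read of the same file
dd if=/tank/dd.tst of=/dev/null bs=1024k

# clean up
rm /tank/dd.tst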
 
With the OpenIndiana + napp-it combo, is there anything I need to be aware of when migrating from FreeNAS?

I'll use the command "zpool export arrayname" to drop the array before migrating to the new OS. Can I simply import the array through the napp-it web interface without deleting the data on it?

Besides upgrading the version of ZFS, is there anything else I should do once the array has been migrated?

Thanks
 
Problem:
If the disks were formatted in FreeBSD with GPT partitions (which FreeBSD recognizes but Solaris doesn't),
you cannot import the pool into Solaris.

Only pools whose disks were formatted with GEOM can be exported from FreeBSD/FreeNAS/ZFSGuru and reimported into Solaris, followed by a zpool upgrade.
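
A quick way to check this on the FreeBSD/FreeNAS side before exporting (a sketch, assuming the disk is da0):

Code:
# shows the partition scheme per disk; "GPT" entries mean the pool sits on GPT partitions
gpart show da0

# the device names in the pool are another hint:
# members like gptid/... or da0p2 point to GPT partitions, plain da0 means whole-disk
zpool status arrayname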
 
Snip.

Edit: according to their site it looks like there may be new firmware for the industrial drives which brings performance in line with the others.

New firmware?
Where?

Results of a test here:
http://www.overclockers.com.au/pix/index.php?page=image&id=n3s5q
 
What are the options with Solaris/NFS/Napp-It as far as NIC failover or connecting on multiple network addresses?

I'm concerned about saturation on one port and I don't want to go the route of link aggregation. The one thing I do like about our current SAN (MD3000i) is the failover we get and the round-robin connectivity in vSphere.

Before I can really sell this approach internally I need to sort out hardware failover and load balancing similar to, but not as complex as, our MD3000i.

I'm just not getting a lot of good search hits on this, which leads me to believe I'm missing some keywords for this configuration.

Suggestions?
 
Brian, the failover just isn't there. There was the Sun Cluster suite, but it only works with older versions of Solaris. There was OHAC, which is now IHAC, but apparently it hasn't had any momentum.

FreeBSD has HAST, which works pretty well, but FreeBSD has an awful iSCSI target (for my needs; for yours it may work OK). HAST is not as mature as DRBD though, and I really wish zfsonlinux were much further along than it currently is.

Nexenta has some sort of HA clustering, but I have no experience with it. I presume it works; it had better, for how much they're charging for it.

You can also use scheduled snapshot jobs. The first of these will take a while, but presuming you have a fast enough interconnect or your percentage of change is relatively low, you can be in sync within a few minutes. Gea/napp-it offers (for a very reasonable price, something I'll be buying soon) an extension that does this for you. However, this is not high availability; this is replication.

You could go up the stack a bit and make use of something like Gluster to handle availability. I haven't researched this route much yet, but apparently Solaris does have a Gluster client.

If someone has another option I would love to hear about it. I really, really wish OpenIndiana had a DRBD alternative to make this simple.

As for your other points, COMSTAR is a really solid iSCSI target service. Anything you can do in a commercial iSCSI target you can do with COMSTAR: portal groups, security, etc. As for ESXi, setting up round-robin MPIO works the same with a COMSTAR target as it does with any other target.

*edit* Specifically for NFS, you just export each share over a different NIC.
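
For reference, the basic COMSTAR setup is only a handful of commands. A rough sketch, assuming a zvol tank/lu0 already exists (the GUID printed by sbdadm goes into the add-view step):

Code:
# enable the STMF framework and the iSCSI target service
svcadm enable stmf
svcadm enable -r svc:/network/iscsi/target:default

# register the zvol as a logical unit; note the GUID it prints
sbdadm create-lu /dev/zvol/rdsk/tank/lu0

# make the LU visible to initiators (no host/target group = visible to all)
stmfadm add-view <GUID-from-sbdadm>

# create an iSCSI target; portal groups, CHAP etc. are also handled by itadm
itadm create-target

ESXi then sees the target after a storage rescan, and round-robin MPIO is configured on the ESXi side as with any other target.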
 
What are the options with Solaris/NFS/Napp-It as far as NIC failover or connecting on multiple network addresses?

Suggestions?

One idea is to install VMware ESXi first and then run your NAS on top of it. This is the same as doing an all-in-one, which this thread has lots of info on. But you don't have to do all-in-one with this method; you can have just the one NAS VM using VT-d direct I/O and no other local VMs running. You can later add more RAM to this machine and run VMs off it as well. The extra cost of doing this is the setup time, 1-2 extra ESXi boot disks, an extra disk controller (as ESXi and the NAS VM can't share the same RAID/HBA card) and a modern VT-d-capable motherboard/CPU.

And for this extra effort you get ESXi virtual networking. You can have two or more physical network cards on the machine assigned to the NAS VM's virtual network, and they become fault tolerant. Note that there are a couple of options in ESXi for the load-balancing and failover method. They are not perfect load-balancing solutions, but the right config and setup should work fine. I would Google the VMware help guides for this. Also note it works best with two separate switch networks to make it more redundant. If your network is only 1G this is quite cheap, but for 10G cards, cables and switches the cost will be higher.

See www.vmware.com/files/pdf/virtual_networking_concepts.pdf, pages 8-9.

Note that with this method the client machines connecting to the NFS share will see just one NAS IP address with no apparent redundancy/multipathing, but the physical/virtual network fabric is redundant. Also make sure your client machines have two network cards with some form of redundancy as well, to make all this effort worth it.
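
If you prefer doing this from the ESXi shell rather than the vSphere client, the uplinks and failover policy of a standard vSwitch can be set with esxcli. This is a sketch against ESXi 5.x syntax, with vSwitch0 and the vmnic names as examples:

Code:
# add a second physical NIC as an uplink to the vSwitch the NAS VM uses
esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0

# make both uplinks active so traffic fails over (and balances) between them
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --active-uplinks=vmnic0,vmnic1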
 
Brian, the failover just isn't there. There was the Sun Cluster suite, but it only works with older versions of Solaris. There was OHAC, which is now IHAC, but apparently it hasn't had any momentum.

FreeBSD has HAST, which works pretty well, but FreeBSD has an awful iSCSI target (for my needs; for yours it may work OK). HAST is not as mature as DRBD though, and I really wish zfsonlinux were much further along than it currently is.

Snip.

Not really failover...

But I was thinking of trying the following:

Box 1: ZFS, sharing out its pool via an FC LU
Box 2: ZFS, sharing out its pool via an FC LU

Box 3: connects to box 1's and 2's FC LUs and mirrors them.
Share box 3 out via another FC port or Ethernet iSCSI/NFS as you see fit.

Box 3 is still a single point of failure, but maybe the data stored is more highly available than on a single ZFS box, since whole servers are mirrored and not just drives/pools?

:confused:
 
Right, but that is where Gluster is, IMO, a better solution. With Gluster you can mirror data around very flexibly and use the namespaces as your shared-out data.

I'm still reading up on Gluster, so I may be completely off.
 
I'm having trouble getting a Crucial M4 to show up in OI.

All other disks show up fine; the disk is needed temporarily to host a VM.

I've tested the bay with another disk and it works.

I can use the drive on a Windows box fine.

Any ideas?

It shows up in the controller's BIOS.
 
Box 1: ZFS, sharing out its pool via an FC LU
Box 2: ZFS, sharing out its pool via an FC LU

Box 3: connects to box 1's and 2's FC LUs and mirrors them.

Snip.

This may work, but it seems like quite an expensive solution. If short-term downtime is OK, a simpler way to do it (without the high availability) is to have two boxes with, say, 6TB of usable storage each. Each node shares 2TB of usable NFS/iSCSI storage and you store half your VMs on one and half on the other, which doubles your total performance (though a single VM can only use the performance of one node). Then you use replication every 15 minutes or so to sync the 2TB of data from node A->B and from node B->A.

If a node fails, then all the VMs running off it crash, as there is no automatic HA failover with this solution. You then need to try to get the failed node to boot again and restart your VMs (note: if your server comes back quickly enough, your VMs may not need to be crashed and restarted, as they will just hang on I/O for a while). If the node is stuffed and you need your VMs up right now, you go to the other node, change the backup pool on it from read-only to read/write mode and share it again. Link your VM servers to this new share manually and you are away, with up to 15 minutes of lost data (less if your replication happens more often).

As you can see, this solution is not ideal for downtime, but if your replication is kept up to date then at least you keep your data mostly intact.

I don't think you can easily get more redundant than this without going to a much more complex system or a commercial solution like Nexenta. One thing you can do to make it better is to convert the above solution to an all-in-one and have the VMs stored on the local node executed from that same node. This reduces the number of devices that can fail, but means that when a node goes down those VMs crash. You have to remember that even if you had a proper shared redundant storage system, the VM host machines have the same chance of failing and crashing all running VMs; though in that case it is easier to set up simple HA systems to reboot the VMs on a second VM host machine from the same shared storage. So the only thing that would be a lot nicer is making this kind of recovery easier with a napp-it-based solution.

There are higher-end HA options that cover VM host machines crashing as well, but these I think are slower and more expensive, as they have to keep the VMs' memory state replicated between hosts. Other than that, you can look at getting HA at the application level by running two separate VMs that serve the same application/service, so that client machines will not be affected by a node going down.
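
The replication piece is just incremental zfs send/receive in a scheduled job (which is what the napp-it extension wraps). A minimal sketch of one sync cycle and the manual failover step, with made-up dataset names tank/vm on node A and backup/vm on node B:

Code:
# on node A: snapshot and send only the changes since the last sync to node B
zfs snapshot tank/vm@rep2
zfs send -i tank/vm@rep1 tank/vm@rep2 | ssh nodeB zfs receive -F backup/vm

# on node B, only if node A is dead: make the copy writable and share it out
zfs set readonly=off backup/vm
zfs set sharenfs=on backup/vm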
 
Hmm, I didn't think of using vSphere itself as the main way to handle networking fault tolerance. That makes perfect sense now that I think about it a little more.

The only downside to this approach that I can see is having to spec out a VT-d-capable server, which kills my thoughts of getting a cheap Dell 2960 server as the base OI/napp-it box. I'll have to look at other options from that angle, but it does seem to solve my problem of the networking being a point of failure.

I could export the NFS share on a NIC, but if that NIC dies then I would lose the connection.
 
I've always heard that you can mount a hard drive in any orientation, but once you choose an orientation, it should stay that way for the rest of its life. This is because the motor bearings may wear slightly over time, settle in and stay functioning, but if you change the orientation, the bearings will shift and you could end up with crashed heads, etc.

That being said, I have 2x36GB original WD Raptors that have been mounted in every possible orientation over the course of their 8-year life, and they're still kicking. Probably not for much longer, though... :)

Horizontal/vertical is fine, just NOT upside down.

Edit: And if you choose one, it should stay that way, yes.
 
I learned that again today with my vm-webserver. It was a 3 x 2-way mirrored pool + hotfix.
[...]
Add a hotfix or buy a cold spare.

Since you wrote "cold spare", is "hotfix" some deliberate inside joke?

The term is hot spare.
 
I did a "find / -iname *bonnie* -print". But nothing returned.
If you don't quote your pattern, your shell expands it for you. If there's nothing in the current directory matching *bonnie*, you essentially typed "find / -iname -print" which I think tries to find an empty file name? I'm not even sure.

Try "find / -iname '*bonnie*'". -print is redundant.
 
Problem:
If the disks were formatted in FreeBSD with GPT partitions (which FreeBSD recognizes but Solaris doesn't),
you cannot import the pool into Solaris.

Only pools whose disks were formatted with GEOM can be exported from FreeBSD/FreeNAS/ZFSGuru and reimported into Solaris, followed by a zpool upgrade.

Thanks for the info. What's the easiest way to figure out if they are GPT partitions?
 
Gea, is LSI2008 still not recommended for pool data disks due to the problem with locating the disks in the JBOD?

The head unit uses the onboard LSI2008 for the syspool, but my vendor sent me the wrong card, which is also LSI2008-based, for the external JBOD.

I'm debating sending it back in favor of an LSI 9205-8e.

Thoughts?
 
Gea, is LSI2008 still not recommended for pool data disks due to the problem with locating the disks in the JBOD?

Snip.

I would say the LSI 9205-8e would have similar disk-tracking problems as the LSI2008 cards, but I have no experience with it myself. The advantages of the LSI2008 cards, I think, outweigh their disadvantages. They are some of the fastest cards for ZFS use, they support 3TB+ drives, and they work well with external JBODs, expanders, etc. All you have to do is get a label machine and label each hot-swap caddy with the serial number of its drive. You can then match this with the drives in the napp-it GUI to know which drive to remove when needed. Gea is also working on an extra extension that helps make it easier to identify drives, but I don't know how it works as I haven't used it myself.
 
Re NexentaStor: as they're working on a new release based on Illumos, does anyone know if a community edition of this release will exist, and if so, whether it will increase the total allowable capacity from 18TB?
 
I'm sure they will continue a community edition, but they may not increase the limit. But you are on a forum about napp-it here, and you will be able to use napp-it with the new Illumian open-source distribution that the new Nexenta is going to be based on. With that version there are no size limitations. Illumian has no web interface to manage it, so this is where napp-it comes in. If you want Nexenta's web GUI then you will need the limited community edition or Nexenta's commercial product.
 
Gea, is LSI2008 still not recommended for pool data disks due to the problem with locating the disks in the JBOD?

Snip.

In the meantime, I would always prefer the LSI SAS2 cards with IT firmware, using disk WWN numbers
instead of controller/port-based IDs.

Pros of controller/port IDs:
It is easy to find a disk in a working pool when the ID is controller 0, disk 2.

Cons of controller/port IDs:
ZFS recognizes disks based on this ID. If you hot-replace disks, it can be a problem
recognizing disks when names become id_old or are still listed after removal. Sometimes
you need a reboot or a pool export/import for proper listings.

Real disk IDs are technically the better way. ZFS recognizes disks by this number: no problem if
you need to (hot) move or replace disks. I prefer these numbers for this reason.

Beside this, you have no real choice. If you like the features of the newer LSI cards, like performance or
3 TB+ disks, you need to go that way.

You must:
Plug in a new disk. Write down the WWN and the serial and mark the disk tray to find it based on these numbers.

or
Try to identify a disk with dd (napp-it menu disk-details). This may work well with some configs.

or
If you own a supported SES backplane, you can try the napp-it monitor extension.
This extension can detect the controller/enclosure/slot of a disk and switch on an alert LED.
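
From an ssh session, the dd method and the serial/WWN matching look roughly like this (the device names here are just examples):

Code:
# keep a suspect disk busy for a while so its activity LED stays lit
dd if=/dev/rdsk/c0t5000CCA369C81234d0s0 of=/dev/null bs=1024k count=10000

# list vendor, product and serial number per disk to match against the tray labels
iostat -En

# list all disks with their WWN-based device names
format < /dev/null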
 
As far as identifying disks goes: if you don't have a case/enclosure with activity lights (for the dd method), the latest versions of napp-it do a good job of showing the serial number of each disk (you have to run the SMART monitoring part at least once; after that it seems to cache it). This serial number is printed on the front (and top) of all my disks, so it was pretty easy to match them up.
 
In the meantime, I would always prefer the LSI SAS2 cards with IT firmware, using disk WWN numbers instead of controller/port-based IDs.

Snip.

Thanks Gea. Argh, not sure what to do. I understand your pros and cons. I can either keep the Supermicro LSI2008-based card or swap it for the LSI 9205-8e card. Anyone else want to throw in their two cents on which way would be best?
 
Hi Gea,

Does the current version of napp-it work with EON Storage 1.0 beta, which is based on OI 151a?

Thanks.
 
When you use dd to identify disks, are you identifying the dd'd disk by sound?

Also, how do I know if a disk goes bad? Is there a way to email error reports?
 
The dd identification method works if you have individual disk activity lights (Norco cases, etc.); if you don't, just go by the serial number on the disk (at least that was the easiest way for me).

Yep, napp-it supports sending email alerts on errors - look under the jobs tab.
 
Ahh! I missed the part about where to install SSLeay. Once that was installed, things worked perfectly!
 
I made the same mistake when trying to get TLS working... I wasn't aware SSLeay needs to be installed using the package manager. It went super smoothly afterwards :)
 