Napp-it/OmniOS: help recovering a RAIDZ2 pool after a single disk fault

kumasan

I have an all-in-one setup using napp-it/OmniOS (configuration below). I just had a single disk crash in an 8-disk RAIDZ2 configuration, but the file system is down/not mounted despite only a single disk failure. I'm behind on backups and hoping to recover the system if at all possible.

If I check the pool status, it shows the pool as degraded but not mounted, with a single disk failure. I keep getting messages about the failed disk not responding, and I think the consumer-grade SATA disk is to blame. I know I should not be using these disks (I have some 4TB nearline SAS drives I need to upgrade to), but I would be ever grateful if somebody could help me get the data off the degraded pool.

I have to admit I'm not that knowledgeable about managing RAIDZ, but from searching it seems I might need to detach/remove the bad disk and get the degraded pool mounted somehow.

I had thought about shutting down and detaching the faulty disk, but I don't know how to map the device names in Solaris to the serial numbers of the disks, and I'm afraid that if I power up the system with the wrong disk disconnected, the pool might go from degraded to failed. I'm thinking I might be able to take the pool offline somehow (though if it is not responding I might not even be able to do this).

Would greatly appreciate any tips on how to get started.

Kuma



Configuration:

Supermicro X10SL7-F with onboard LSI-SAS flashed to non raid
Xeon E3-1232v3
32GB ECC ram
ESXi 6.0
2x Intel GBE
8xSeagate 2TB drives (on LSI SAS passed through to OMNIOS)
250GB Samsung 850 SSD (esxi local datastore)

OMNIOS/ZFS VM
I have a primary VM for OmniOS: 6GB memory, 4 cores, onboard LSI SAS passed through.
VMware tools installed
LSI SAS, E1000, 30GB vdisk on local SSD store
I created a ZFS pool with RAIDZ2 and have it shared via smb and nfs

Windows10 VM
2 cores, 6GB memory, 120GB disk on OMNIOS/NFS share with thin provisioning
E1000, LSI-SAS, Paravirtual controller
 
Wanted to follow up/clarify on the questions.

In hopes of getting the degraded pool mounted, I'm thinking I might try to locate the problem drive, shut down the system, disconnect the drive and reboot.
The hope is that the LSI SAS controller will respond better with no drive than with the crippled SATA drive.

Another possibility would be to detach the drive and replace it with a new one. I'm afraid that, since the degraded pool still cannot be mounted because of the timeouts, it may not be possible to detach the problem drive. I don't know if there is a way to force the drive offline so that it takes effect on the next reboot?

It looks like my motherboard does not have activity lights, so I can't easily find out which disk failed via LED.
I'm wondering if there is any way to query the drive serial numbers from Solaris?

What I'm really nervous about is rebooting my system: if I guess the wrong cable to pull, the pool might go from degraded to fully failed. I'm not sure whether, if I shut down and fixed the connections, I would be able to get it back from failed to degraded?

I was hoping there might be a safe way to tell OmniOS not to try to detect/resilver the pool, so I don't corrupt things further.

Would greatly appreciate any tips/pointers or suggestions.
 
I also suspect a bad SATA disk is blocking the HBA, so first look at the logs (menu System > Logs) for information about the bad disk. If you remove this disk, your pool should be available in a degraded state.

Napp-it shows the WWN and serial number of each disk. Even if you have not written them down, you can shut down the system and check: they are printed on the disk label.
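If the web UI is not reachable, the same mapping can usually be read on the console. A minimal Python sketch, assuming that `iostat -En` on OmniOS prints one record per disk with a header line carrying the device name and error counters, followed by a line containing `Serial No:`. The function names `parse_iostat_en` and `read_disks` are made up for illustration, and the command itself may hang while the bad disk is still blocking the HBA.

```python
import re
import subprocess

# Header line of each `iostat -En` record: device name plus error counters.
_DEVICE_RE = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)
# The serial number appears on a following line of the same record.
_SERIAL_RE = re.compile(r"Serial No:\s*(\S+)")


def parse_iostat_en(text):
    """Map device name -> serial number and error counters."""
    disks = {}
    current = None
    for line in text.splitlines():
        header = _DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            disks[current] = {
                "serial": None,  # stays None if no serial is reported
                "soft": int(header.group(2)),
                "hard": int(header.group(3)),
                "transport": int(header.group(4)),
            }
            continue
        serial = _SERIAL_RE.search(line)
        if serial and current is not None:
            disks[current]["serial"] = serial.group(1)
    return disks


def read_disks():
    """Run `iostat -En` locally and parse its output (assumes illumos)."""
    out = subprocess.run(
        ["iostat", "-En"], capture_output=True, text=True, check=True
    ).stdout
    return parse_iostat_en(out)
```

Calling `read_disks()` on the storage VM should give a dictionary keyed by device name. A disk with high hard or transport error counts is a likely candidate, and its serial can then be matched against the label on the physical drive.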

About your concerns:
If you pull the wrong disk, a Z2 remains available but degraded, as you have two disks for redundancy.
If you pull more than two disks, the pool switches to unavailable. If you bring back enough disks for a Z2, the pool automatically goes back to degraded or online, at least after a clear error (menu Pools). This is far less critical than with hardware RAID.
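The redundancy rule above can be written down as a toy model. This is only a sketch of the counting argument for a single RAIDZ vdev, not how ZFS computes pool state internally; the function name `raidz_pool_state` is made up for illustration.

```python
def raidz_pool_state(missing_disks, parity=2):
    """Rough pool state for one RAIDZ vdev with `parity` redundant disks.

    Simplified model: only counts absent disks, ignores checksum errors,
    resilvering, and pools built from several vdevs.
    """
    if missing_disks < 0:
        raise ValueError("missing_disks must not be negative")
    if missing_disks == 0:
        return "ONLINE"
    if missing_disks <= parity:
        # Data is still readable from the remaining disks plus parity.
        return "DEGRADED"
    # More disks are gone than parity can cover.
    return "UNAVAIL"


# One failed disk plus one wrongly pulled disk on a Z2 is still within parity.
print(raidz_pool_state(2))
```

In this model, pulling the wrong disk in addition to the failed one still leaves a Z2 degraded, and reconnecting disks moves the count back toward degraded or online.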

If you cannot identify the bad disk, pull all disks, boot up, then insert the disks one by one and check whether each is detected or whether it blocks (then you have found the bad one).
 