Help with a hanging OI/Solaris ZFS System

Hi All,

I have an OI 151a system that has been flawless for almost a year now. Recently it's been hard locking without leaving anything in the logs (so far as I can tell). A few changes were made to the system around the time the issue started:

- Some ZFS tuning (zfs_vdev_max_pending = 4 and zfs_write_limit_override = 805306368; see the sketch after this list).

- An install of a Brocade NIC software package which has since been removed (both software and the NIC itself).

- I tried to install the LSI MSM for more control over the LSI card, but it needed the Solaris SMA and a bunch of SNMP packages which OI doesn't appear to have. So, that was also uninstalled. However, two of the packages are still there (sassnmp and sasirsnmp) as they won't uninstall because the script looks for the SMA service (which doesn't exist).

- Added an Acard 9010 as a log device.

- Added an SSD identical to the existing boot drive. Mirrored, copied the bootloader, etc.
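
For reference, those changes look roughly like the following. The pool and device names here are placeholders, not my actual ones:

Code:
# /etc/system additions (take effect after a reboot)
set zfs:zfs_vdev_max_pending = 4
set zfs:zfs_write_limit_override = 805306368

# Acard 9010 added as a log device
zpool add tank log c4t0d0

# second SSD attached as a mirror of the boot drive, then bootloader copied
zpool attach rpool c3t0d0s0 c3t1d0s0
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c3t1d0s0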

I'm not seeing any drive errors or anything in the logs for that matter. The logs look normal right up to the point it hangs. When it does hang the system becomes completely unresponsive. The console is also unresponsive and just shows logging from the regularly scheduled napp-it commands. It stops responding to pings too. The only way to get it back is a hard reset.

Thoughts? Is there any more logging I can enable?

Thanks!!
Riley
 
So I want to ask about your zfs_vdev_max_pending and zfs_write_limit_override choices, but before getting to that, I'd rule out hardware issue. Have you looked at 'fmdump'? The two most useful ways to view its data are 'fmdump -e' and 'fmdump -eV'. If you'd like to specify a time range, for problems that just occurred I tend to do something like "fmdump -eVt 7day" to snag the last 7 days, or "fmdump -eVt 1hour" if there's tons of entries and I know it just occurred, etc.

Pipe that to a file and start running through it. Please note that this is a very low-level 'log', if you will, and some amount of entries within is no cause for concern. It's large numbers of entries, patterns, or entries immediately preceding issues that may be of use to you.
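
For example, something along these lines:

Code:
# dump the last week of FMA error telemetry to a file and page through it
fmdump -eVt 7day > /tmp/fmdump.out
less /tmp/fmdump.out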

Also, when the system hangs completely, you mention that it's actually still sending log entries to the console? I assume, however, that all I/O has stopped? Can't SSH in? How long have you waited when this occurs?

I'm a big fan of the 'what changed' and 'eliminate variables' methods of troubleshooting -- if it was working, and then things were added, start removing those things until the problem disappears. Whatever the last thing you remove before the problem clears up is likely to be your culprit.
 
It's probably a hardware glitch.

First thing I would do is remove the RAM and all the cards and reseat them.
 
Agreed. The problem didn't start happening until after those changes were made, so hopefully that's where the issue lies.

Using fmdump -eVt 1day or 2day gives me output, but nothing more recent than early this morning. The hang occurred around midnight last night, so I would assume that window should include it.

When the hang occurs the entire system locks up - console and all. I admit I was confusing when I talked about the console output: the console is frozen as well, but the last output on it is just regular, normal system activity - nothing to indicate any issues.

Those values for max_pending and write_limit_override are the ones that have given me the best performance. Without them I hit a write ceiling of 50MB/s at 32-64KB block sizes. Here is my original thread on that issue: http://hardforum.com/showthread.php?t=1577141

Yesterday I noticed that napp-it stopped running all my jobs back in December, including my weekly scrub. So I initiated a scrub last night before bed, and it was around 25% done. This morning, after discovering the frozen system and the subsequent reboot, the scrub continued and repaired 320KB of data on one drive. In addition, the same 2TB Hitachi drive now shows a pending reallocated sector. So it does appear I have a drive with some issues, though I don't think that should hang the whole system.
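
(For anyone checking the same thing: the scrub progress and repair totals show up in zpool status; 'tank' here is a stand-in for the pool name.)

Code:
zpool status -v tank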

The only drive showing errors was that single drive. Using the SMART tools built into napp-it I checked the status of all the other drives. After checking a couple of drives I noticed that ALL my drives on EVERY controller (onboard SATA and the LSI SAS) were now showing soft and hard errors. Figuring this was quite strange, I decided to use fmdump. Lo and behold, I found a long string of timeouts and errors relating to the drives. Then I issued a reboot and everything was normal - no errors shown on the drives.

Again I used napp-it's smartctl to look into the drives' SMART status, and again all drives started showing hard and soft errors. Growing suspicious, I Googled and found that others are also having issues with smartctl causing drives to momentarily drop off. So that gives me some relief, but I do still seem to have one drive with a few bad blocks.

Riley
 
Yep, I can confirm that using smartctl will show up as soft (S/W) errors under iostat (typically one per invocation) - I think current versions of napp-it warn about this in the web dialog.
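
You can watch those counters directly, outside of napp-it, with something like:

Code:
# cumulative soft/hard/transport error counts per device since boot
iostat -En
# or just the one-line error summary per device
iostat -e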
 
Well, I take it back - there is something in the logs. Specifically:

Code:
Feb 13 00:16:09 SAN01 savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff000fcc87f0 addr=8 occurred in module "fp" due to a NULL pointer dereference
Feb 13 00:16:09 SAN01 savecore: [ID 232024 auth.error] Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID 21e62b72-9e23-6d13-e3ee-8ac014475bd6.
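
As the second message says, savecore has to be run by hand to pull the dump off the dump device. The rough sequence should be something like this (file names depend on the dumpadm settings):

Code:
# extract the pending crash dump (default target is /var/crash/SAN01)
savecore
# if it leaves a compressed vmdump.0, expand it first:
#   savecore -f /var/crash/SAN01/vmdump.0
# then open the saved dump and look at the panic stack
cd /var/crash/SAN01
mdb unix.0 vmcore.0
> ::status
> $C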

It crashed again today at what I'm assuming is midnight. So, something is happening at midnight that is causing this issue. The only job I have running in napp-it at that time is a snapshot, but that runs fine when I do it manually. I don't have any at jobs, and from what I can see no cron jobs run at midnight either.
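
For the record, this is roughly how I checked for scheduled jobs (stock Solaris locations):

Code:
# per-user crontabs
ls /var/spool/cron/crontabs
crontab -l
crontab -l root
# pending at(1) jobs
at -l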

I'm wondering if this has something to do with the failed NIC install. The NIC was a Brocade 1020 CNA FCoE/iSCSI adapter. The software and card have been removed, but is it possible that it did something to the OS fibre channel config that is now causing issues?

Riley
 
Check the man page for fp (fp(7D)). Near the end it lists the files involved with the module. Check those and see if any have been modified recently. Or, use diff to compare them with an older version in your snapshots from a time you know everything was working.
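
Something along these lines; the snapshot path is just an example and depends on where your snapshots live:

Code:
# files from the FILES section of fp(7D)
ls -l /kernel/drv/fp /kernel/drv/amd64/fp /kernel/drv/fp.conf
# diff the config against a copy from a known-good snapshot
diff /kernel/drv/fp.conf /.zfs/snapshot/known-good/kernel/drv/fp.conf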

Second, check to make sure the driver / module for the Brocade NIC isn't still loaded. Even if it is, it shouldn't be used so I doubt this is giving you any grief. I only bring it up because I used to get hard locking in Linux with the default Realtek driver when combined with NFS.
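
A quick way to check, matching on the module description since I don't know the exact driver name offhand:

Code:
# any Brocade module still resident?
modinfo | grep -i brocade
# if one turns up, it can be unloaded by its id:
#   modunload -i <id>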
 
I checked the man pages, and none of the files referenced there have been modified since early-to-mid 2011. Also, running modinfo doesn't even show the fp module as loaded, so I'm not sure what's going on there or why it would be causing crashes.

I'm thinking about just blowing away my OI system and starting from scratch, since I need this to work reliably. If I export my pool and then reimport it, do all settings come back? What will need to be reconfigured with regard to CIFS and COMSTAR?
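
From what I've read, dataset-level properties (sharesmb and friends) travel with the pool, but COMSTAR's views and host groups live in the SMF repository, so I'd plan on something like this ('tank' and the paths are placeholders):

Code:
zpool export tank
zpool import tank
# save the COMSTAR config separately before the rebuild
svccfg export stmf > /backup/stmf.cfg
# LUs can be re-registered from their backing store if needed:
#   stmfadm import-lu /dev/zvol/rdsk/tank/lu0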

Riley
 
Before you completely blow away the system you could try booting off of a live CD (OI or Solaris) and seeing if the issue repeats itself.

Also, if you have any older boot environment snapshots (beadm) you could try rolling back to one of those.
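
Rolling back is just this (the BE name is whatever beadm list shows):

Code:
beadm list
beadm activate <older-BE>
init 6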
 
The only BE snapshots I have are pre-OI_151a, so I might as well reinstall.

It hard locked again last night around 9pm, so I decided to start over and take a different approach. I wanted to do a semi-AIO implementation where OI/napp-it was the only guest. That would give me a more reliable system, in that if something does happen I can just reboot the VM. It would also let me use my Brocade 10Gb NIC, since it has ESX support. I had assumed the board I had in there had IOMMU support for device passthrough; however, upon doing more digging and attempting to actually do it, I found out that wasn't the case.

Regardless, I do think I have a hardware issue. Before I started I exported the data pool and broke the system pool mirror so that I could go back if I needed to (which turned out to be a good idea). I then did the following:

- Tried to install ESXi 5.0 to a USB drive. The first attempt locked up at 16%. Second install went fine, but I found out I didn't have proper IOMMU support.
- Tried a fresh install of OI_151a. Install hung during the file copy.
- Suspected memory issues, so I ran memtest overnight. No problems.

After doing more research I found out it may be the motherboard. A brief perusal of the Asus support forums shows scads of people having "freezes" where the system just hard locks or blue screens with no warning.

So, I ordered an Asus 990FX Sabertooth board which does seem to have proper IOMMU support. We'll see how that goes in a few days.

In the meantime, I've booted off the mirrored drive I took out and re-imported the pool. Everything is back up and running. The only thing I had to really do was restart COMSTAR after importing the pool.
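
(For anyone repeating this, restarting COMSTAR is typically just the stmf service, plus the target service for your transport:)

Code:
svcadm restart stmf
# e.g. for iSCSI targets:
#   svcadm restart svc:/network/iscsi/target:default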

Riley
 