Xiotech Emprise 5000 - Can't mount volumes over a certain size

Adam12176

I've recently encountered a pretty debilitating problem with our SAN/servers.

The hardware:
(2) Dell PowerEdge R805s - 32GB RAM per server
(4) Qlogic FC HBA adapters
(1) Xiotech Emprise 5000
(2) Brocade FC Switches

A little background:
The solution tested fine in our lab before being transferred to the data center. Our setup is two Dell rack-mount servers with two FC HBAs per server. The two boxes are cabled into two independent fiber switches and zoned appropriately (according to Brocade). The Xiotech apparently uses an active/active configuration for MPIO as well.

The problem:
After transport to the data center, the SAN will still happily create and present volumes over 350GB; but when you try to bring such a volume online on 'Server1', it fails, claiming "The requested resource is in use". Volumes under 350GB can be advertised and mounted fine, without a single hitch. Server2 can mount and use volumes of any size without a problem.

The troubleshooting:
We have spent a fair amount of time with all our vendors on this problem, which has turned into a massive finger-pointing party.

To cover the easy bases, all vendors have verified the configuration and firmware versions of all this hardware; all of the Fibre Channel hardware was provided as a package from the same reseller. To rule out MPIO or HBA contention, we have also advertised a LUN to only a single FC HBA. The HBAs were swapped as well, with no real success: according to my colleague, the swap seemed to point to a bad card for about a day, and then the problem reverted back to the original machine (Server1).

Our reseller has tried to blame Microsoft and Hyper-V as the root of all our problems. For reference, I personally worked with LUNs over 1TB before this solution was transferred to the data center. All hardware and fiber were transported in their original packaging, and we have tried swapping fiber to isolate a broken strand.

After contacting the reseller, we installed ESX 4 and encountered a similar (albeit more cryptic) error. ESX 'sees' the LUN as available and even determines that it is blank. I choose a block size that more than covers the LUN size (I've tried 1024GB/4MB and 2048GB/8MB), but after completing the Add Storage wizard I get the error "Error during the configuration of the host: Failed to update disk partition information". Obviously this is less helpful than Hyper-V's error message, but just as puzzling.
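(For reference - and I'm going from memory on the VMFS-3 limits, so double-check me - the block size picked at format time caps the largest single file on the datastore: roughly 1MB -> 256GB, 2MB -> 512GB, 4MB -> 1TB, 8MB -> 2TB. So either combination above should be plenty for the LUN sizes we're testing.)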

Obviously Hyper-V is not the problem, since ESX displays essentially the same error. We have been in contact with Microsoft as well, and we have tried Server 2003, 2008, and 2008 R2 (our goal), all without success.

So, if all the hardware vendors say we're configured correctly, and we're getting errors across multiple operating systems and virtualization platforms, where do we go? Any ideas?

TL;DR: 2 servers, 1 can use a LUN of any size, the other can't use anything above 350GB. Swapped cards, no change on the problem server. Setup worked fine for ~3 months of testing, now it doesn't. Help?
 
log into the esxi console - tell me if the esxcfg-module -l command works.
 
Works.

Edit: Headed home, I'll plan on checking back here in the next hour or so. I'll have access to the data center machine(s) if there's anything you want to pull.
 
esxcfg-module -q


Let me know what qlogic driver it's using (should be qlaXXXX_XXX something or another)
 
I'm using the RAC to get into that server, so output is a little skewed - it looks like the only one I can see is qla2xxx.o - that's the verbatim file listed.
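
If it helps, this is roughly what I ran over the RAC to pull that - esxcfg-module -l to list the loaded modules (grep'd for qla), and esxcfg-scsidevs -a to see which vmhba maps to which driver. Going from memory on the exact output, so treat it as approximate:

esxcfg-module -l | grep qla
esxcfg-scsidevs -a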
 
esxcfg-module -s "ql2xmaxsgs=32 ql2xmaxqdepth=255" qla2xxx.o
reboot.

give that a try.
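
After the reboot you can sanity-check that the options stuck - something like the below should echo the options string back (going from memory on the exact flag, so don't hold me to the output format):

esxcfg-module -g qla2xxx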
 
So, the update - tried a 700GB partition with 4MB and 8MB block sizes - no dice.

Error message is now: "Error during the configuration of the host: Failed to update disk partition information."
 
Any other theories?

The reseller wants 1,200 bucks a day to troubleshoot, citing a 'Hyper-V related problem'. Any ideas are welcome.
 
So, another update - I can mount a 50GB volume, but 300GB/700GB/1TB volumes all fail.

This is getting wacky.
 
See if the reseller's tone changes when you threaten to just return the equipment and go with a more supportive vendor.
 
I know the equipment tested fine in the lab, but have you considered that the HBAs in Server1 are not functioning correctly? It is a "cheap" test to replace those HBAs, and maybe even the cables for good measure.

Also, are the LUNs you are trying to mount owned by a certain controller on the SAN, or have you tried mounting LUNs owned by both controllers on Server1 and Server2 to make sure they are still good?

Sorry if all of this sounds kind of obvious, but oftentimes when people troubleshoot issues they forget to go back to the basics.
 
We did swap HBAs, and the problem did move with the HBA. After reconfiguring the switches per Brocade's best practices, the problem returned on the original problematic server.

So your problem goes Server1 -> Server2 (after the HBA swap) -> back to Server1, with no transfer of hardware, after re-zoning?

We did swap pieces of fiber to try to eliminate that possibility. The hardware tested good according to QLogic and Brocade - 'good' as in no transmission errors, which we witnessed ourselves.

We have tried different approaches to advertising LUNs - presenting them exclusively to one HBA and offering them to any and all HBAs - with no luck either way.
 
So each zone is one HBA port plus the SAN ports, and nothing else?

Grab the vmkernel log from the problem host, post it somewhere. I'll take a look.

Edit: on ESXi, grab the messages file instead.
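
The usual spots, from memory (double-check the paths on your build):

cat /var/log/vmkernel    # classic ESX - vmkernel log
cat /var/log/messages    # ESXi 4 - vmkernel output lands here
vm-support               # bundles everything into a tarball you can post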
 
Ok - so every time I try to post a part of the log, the board is telling me my message is less than 3 characters. Trying again.
 
Ok, this is getting frustrating.

Sep 29 09:16:09 Hostd: (vim.fault.PlatformConfigFault) { dynamicType = <unset>, faultCause = (vmodl.MethodFault) null, text = "Unable to create Filesystem, please see VMkernel log for more details.", msg = "", }
Sep 29 09:16:55 vmkernel: 0:15:35:02.732 cpu4:165867)ScsiScan: 839: Path 'vmhba3:C0:T0:L0': Vendor: 'XIOTECH ' Model: 'ISE1400 ' Rev: 'A '
Sep 29 09:16:55 vmkernel: 0:15:35:02.732 cpu4:165867)ScsiScan: 842: Path 'vmhba3:C0:T0:L0': Type: 0x1f, ANSI rev: 4, TPGS: 0 (none)
Sep 29 09:16:55 vmkernel: 0:15:35:02.732 cpu4:165867)ScsiScan: 105: Path 'vmhba3:C0:T0:L0': Peripheral qualifier 0x3 not supported

Now that it would actually let me post that, I see a lot of that when trying to work with a LUN. Let me know if there's an easy way to post the full log somewhere, or PM it, or whatever.
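
For what it's worth, from reading up on SCSI INQUIRY data: the peripheral qualifier is the top three bits of byte 0 of the response, and 0x3 means 'no device is connected at this LUN'; device type 0x1f is 'unknown'. So if I'm reading it right, the host isn't actually seeing a usable device behind that path. Quick check on the bit math (the 0x7f is just my reconstruction of byte 0 from those two fields):

printf 'qualifier=%d type=0x%02x\n' $(( 0x7f >> 5 )) $(( 0x7f & 0x1f ))
# prints: qualifier=3 type=0x1f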
 
As an update to this thread:

It appears our problem lies in the switch zoning. We eliminated all zoning on the switches, and suddenly I could mount any LUN, any size.
 
You definitely don't want to do that though - a LIP storm will knock everything offline. :-/
 
Agreed. It's not a permanent fix, more 'proof of error' than anything else.

I think I'll take a crack at zoning the switches; the other guy did it last time. More incentive to snag a SNIA cert down the road. I'll point the finger at Brocade a bit too, considering they hopped on and said everything was perfect in our config. Definitely not.
 
lol.

Each zone: one port on the HBA, plus both SAN ports that it should see. Rinse, repeat.
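
Something along these lines on the Brocade, repeated for each HBA port - all of the WWNs and names below are made up, and I'm going from memory on the FOS syntax, so sanity-check against your switch docs:

zonecreate "srv1_hba1_ise", "21:00:00:e0:8b:00:00:01; 20:00:00:d0:b2:00:00:0a; 20:00:00:d0:b2:00:00:0b"
cfgcreate "prod_cfg", "srv1_hba1_ise"
cfgsave
cfgenable "prod_cfg"
zoneshow    # confirm the zone members show up as raw WWNs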
 
I wanted to end this thread on a happy note - as I didn't do the switch config, I don't feel bad about posting the blunder, heh.

Brocade called back and verified the switch config and, seconds before issuing an RMA, noticed that the zoning wasn't done by WWN - it was done by some sort of Brocade alias to the WWN. Needless to say, this didn't work. Why you can add the alias to the zoning in the config, I do not know, but... well, you can.
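
For anyone who hits this later, the quick way to spot it on the switch (exact output format from memory):

zoneshow    # lists each zone and its members - ours showed alias names instead of raw WWNs
alishow     # lists the defined aliases and the WWNs they map to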

This was fixed and the entire setup is functioning correctly again. Special thanks to lopoetve for the extra mile in helping to troubleshoot.
 
Ebbeh? You should be able to do that - it'll just perform like crap and take loads of CPU overhead. Interesting. Cool that it works!
 