I've recently encountered a pretty debilitating problem with our SAN/servers.
The hardware:
(2) Dell PowerEdge R805s - 32 GB RAM per server
(4) QLogic FC HBAs (2 per server)
(1) Xiotech Emprise 5000
(2) Brocade FC Switches
A little background:
The solution tested fine in our lab before being transferred to the data center. Our setup is two Dell rack-mount servers with two FC HBAs per server. The two boxes are cabled into two independent fiber switches and zoned appropriately (according to Brocade). The Xiotech apparently uses an active/active configuration for MPIO as well.
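For anyone who wants to sanity-check the fabric side, this is roughly what gets reviewed on the Brocade switches (a sketch of standard Fabric OS commands, not an exact transcript of our sessions):

    switchshow      # port states and which WWNs have logged in
    cfgshow         # the effective zoning configuration
    zoneshow        # zone definitions and their members
    porterrshow     # per-port CRC/encoding error counters

porterrshow is worth watching across retries; a climbing crc_err counter would point at a cable or SFP rather than at the host.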
The problem:
After transport to the data center, the SAN will let us create and present volumes over 350 GB with no complaints; but when you try to bring the volume online on 'Server1', it fails, claiming "The requested resource is in use". Volumes under 350 GB can be advertised and mounted fine, without a single hitch. Server2 can mount and use volumes of any size without a problem.
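When the failure happens, the Windows-side check looks roughly like this in diskpart (disk 1 below is a placeholder for whatever number the LUN enumerates as, and the san and online disk verbs only exist on 2008/R2):

    DISKPART> san                 # show the SAN policy; Offline Shared can block onlining
    DISKPART> list disk           # find the LUN's disk number
    DISKPART> select disk 1       # placeholder disk number
    DISKPART> attributes disk     # look for a stuck Read-only or Clustered flag
    DISKPART> online disk         # the step that throws "resource is in use"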
The troubleshooting:
We have spent a fair amount of time with all our vendors on this problem, which has resulted in a massive finger-pointing party.
To cover the easy bases, all vendors have verified the configuration and firmware versions of all the hardware. All the fiber channel hardware was provided as a package from the same reseller. To rule out MPIO or HBA contention, we have also advertised a LUN to only a single FC HBA. The HBAs were also swapped between servers, without any real success: according to my colleague, the swap seemed to point to a bad card for a day, and then the problem reverted to the original machine (Server1).
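On the 2008 R2 build, mpclaim (it ships with the MPIO feature) is one way to confirm what MPIO actually claims; a minimal sketch, with the disk number as a placeholder:

    mpclaim -s -d        # list every MPIO-claimed disk and its load-balance policy
    mpclaim -s -d 1      # per-path detail for disk 1: path IDs and states

With the LUN advertised to a single HBA, the second command should show exactly one path; if it doesn't, something upstream is still presenting the LUN twice.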
Our reseller has tried to blame Microsoft and Hyper-V as the root of all our problems. For reference, I personally worked with LUNs over 1 terabyte before this solution was transferred to the data center. All hardware and fiber were transported in the original packaging, and we have tried swapping fiber cables to rule out a broken strand.
After contacting the reseller, we installed ESX 4 and encountered a similar (albeit more cryptic) error message. ESX 'sees' the LUN as available, and even determines that it is blank. I choose a block size that more than covers the LUN size (I've tried 1024 GB/4 MB and 2048 GB/8 MB), but on completing the Add Storage wizard I get the error "Error during the configuration of the host: Failed to update disk partition information". Obviously this is less helpful than Hyper-V's error message, but just as puzzling.
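For the curious, these are the kinds of checks available from the ESX 4 service console after the wizard fails (a sketch; device names will vary by host):

    esxcfg-scsidevs -c           # compact list of the SCSI devices/LUNs the host sees
    esxcfg-mpath -l              # every path to each LUN and its state
    tail -50 /var/log/vmkernel   # SCSI sense data logged during the failed operation

The vmkernel log is the interesting one; it records the actual SCSI sense codes hiding behind the generic "Failed to update disk partition information" message.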
So Hyper-V is clearly not the problem; ESX exhibits the exact same behavior. We have been in contact with Microsoft as well, and we have tried Server 2003, 2008, and 2008 R2 (our goal), all without success.
So, if all the hardware vendors say we're configured correctly, and we're getting errors across multiple operating systems and virtualization platforms, where do we go? Any ideas?
TL;DR: Two servers; one can use a LUN of any size, the other can't use anything above 350 GB. Swapped cards, no change on the problem server. The setup worked fine for ~3 months of testing; now it doesn't. Help?