Windows Server with iSCSI Mounts and 100% Uptime - NetApp

KapsZ28

Some clients come with unrealistic expectations. In this specific case a customer has three Windows servers. Two are running Windows Server 2003 R2 Enterprise x64; the other is Windows Server 2008 Enterprise x64. These are all physical servers.

For storage they were using Openfiler because, let's just say it, they were being cheap. In the past couple of months the Openfiler has become unstable and we offered to let them use our NetApp for their iSCSI drives. By the way, they are using the Microsoft iSCSI Initiator for all the iSCSI targets.

After migrating one of them they were happy and happened to say, "we need to make sure these drives stay connected 100% of the time." I almost laughed, because nothing about this setup is highly available, but I want to focus on just the iSCSI storage at the moment since that is what we are responsible for.

These iSCSI drives contain large Oracle databases used for legal eDiscovery software. In the past, on the Openfiler SAN, there has been some data corruption due to a SAN reboot or possible iSCSI disconnect while a batch job was in the middle of being processed. So they are very adamant about 100% uptime.

I know that Microsoft's iSCSI initiator does support multipathing, although I don't know whether it was supported on Server 2003 or how good it is. Plus, being software-based and running on Windows, I would think it is more susceptible to crashing than, say, adding two physical iSCSI HBAs to each server. But I am just guessing here.
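For what it's worth, on Server 2008 the built-in MPIO can be enabled from the command line. A rough sketch (the device string is the standard identifier Microsoft uses for iSCSI-attached devices, but verify the exact commands against your version; Server 2003 instead requires a vendor DSM such as NetApp's):

```
:: Install the MPIO feature (Server 2008; R2 can also use
:: "dism /online /enable-feature /featurename:MultipathIo")
ServerManagerCmd -install Multipath-IO

:: Claim all iSCSI-attached devices for MPIO
:: (-r schedules the reboot that mpclaim requires)
mpclaim.exe -r -i -d "MSFT2005iSCSIBusType_0x9"

:: After reboot, list claimed devices and their load-balance policy
mpclaim.exe -s -d
```

With that in place you log the initiator into the target once per path (each NIC/subnet) and let MPIO aggregate the sessions.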

Obviously we would need redundant switches and paths to each Windows server with multipathing enabled, but more importantly a mirrored NetApp disk shelf in case one fails. I am also not 100% sure how that works. If you have two mirrored NetApp shelves and one fails, does the other immediately take over the workload, or is there still a certain amount of downtime?

So what kind of recommendations do you have?
 
Eh, I guess everyone is enjoying their Labor Day weekend. Hopefully someone will have an answer tomorrow. :)
 
Had a similar client do a setup around StarWind SAN software; it seems to be working OK.

NetApp is expensive... and that's about all it has going for it. We have another client that purchased a NetApp with installation, and NetApp was out here for DAYS trying to get their crap to work right... was pretty funny, actually.
 
I would ask the client whether they drive their cars 24/7 and never allow downtime to service them or put fuel in.
 
Not familiar with NetApp, but some googling shows it does have a synchronous replication mode. Assuming NetApp has failover capabilities, and it's fast enough to bring the LUNs up on the secondary node before the iSCSI timeouts kick in, this should only stall pending writes during the failover.

I've cooked up exactly this kind of solution with DRBD for an off the shelf hardware solution.
 
If you have a NetApp system with the SyncMirror license (note: that is not the same thing as SnapMirror), you can synchronously mirror disk pools (aggregates in NetApp terminology). By default it will mirror the aggregate between two "plexes" consisting of disks in different disk shelves, for obvious reasons, although this can be overridden. You should also have an HA controller pair so that the controllers aren't themselves a source of downtime.
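On a 7-Mode system, setting up a mirrored aggregate looks roughly like this (a sketch; aggregate names and disk counts are placeholders, and ONTAP selects disks from separate spare pools for each plex):

```
filer> aggr create aggr1 -m 24       # new aggregate, -m = mirrored (two plexes)

filer> aggr mirror aggr0             # or mirror an existing unmirrored aggregate

filer> aggr status -r aggr1          # verify both plexes are online and in sync
```

If one plex's shelf fails, the aggregate keeps serving I/O from the surviving plex with no failover event at all; controller failover is a separate mechanism handled by the HA pair.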

With iSCSI, if a controller fails over, there will be a couple of seconds before its IPs are brought up on the partner controller. You should install the NetApp Host Utilities to configure the host-side timeouts so that no writes are lost during that window. For multipathing on Windows Server 2003 you would need the NetApp DSM, which may or may not require a license; Windows Server 2008+ has built-in MPIO support. Obviously both paths should be on different physical networks to ensure the highest level of uptime.
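Concretely, what the Host Utilities adjust are the Microsoft iSCSI initiator timeout values in the registry. A schematic fragment (the instance key under the GUID varies per system, so the path below is illustrative, and the exact values set depend on the Host Utilities version; 0x78 = 120 seconds):

```
Windows Registry Editor Version 5.00

; MaxRequestHoldTime: seconds outstanding I/O is held while a path is down
; LinkDownTime: used instead of MaxRequestHoldTime when MPIO is installed
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<instance>\Parameters]
"MaxRequestHoldTime"=dword:00000078
"LinkDownTime"=dword:00000078
```

The point is simply that the host must hold and retry I/O longer than the worst-case controller failover takes; if the timeout is shorter, Windows surfaces the disconnect to Oracle and you get exactly the mid-batch corruption the OP described.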

No solution will provide 100% uptime, but with a properly configured NetApp system you can get pretty close. They're expensive, but they're tanks. We have disk shelves that have been online for 1000+ days and controllers online 750+; usually they're only taken down for updates (and with HA, updates are done non-disruptively). The weak link here would be Windows.

Source: I've been running NetApp systems for the last 7 years. Our organization has FAS3020, FAS2040, FAS2240, FAS3170 and a couple brand new FAS8060 arrays in production.
 