ESX disk write caching

oakfan52 · Jul 17, 2009

Not to thread jack but we use Hitachi, currently migrating from a USP1100 to a USP-V... and to say we have had issues is a giant understatement. It just seems like the HDS arrays just aren't built to handle VMware loads very well. Anyone else have any HDS experience?

AreEss · Jul 17, 2009

defuseme2k said:
Forgive my ignorance, but you mean the ntfs cluster size should be 64k, correct? I don't want to make any assumptions.

Correct. You should match your file system cluster sizes to your array segment sizes.

Also, it doesn't appear I can set the segment size while creating an array from the GUI. You were specific that it was an array setting, not a LUN setting. Will I need to do this from the command line? Also, my LUN segment size choices are 128 or 256 from the GUI, are either of these better than the other?

It depends on the application, and what exactly you're doing. If you're doing VCB to disk, and then disk to tape, I would not go over 128k. You may need to go to the CLI. This is an array level setting, and should be set at the start. You can only migrate an array segment size one step up or down, and it can take days to complete on large arrays.

oakfan52 said:
Not to thread jack but we use Hitachi, currently migrating from a USP1100 to a USP-V... and to say we have had issues is a giant understatement. It just seems like the HDS arrays just aren't built to handle VMware loads very well. Anyone else have any HDS experience?

The USP1100 is one of those "too smart" arrays in some respect. The HDS "RAID-V" or "VRAID" (whichever you wanna call it, depending on who's rebranding it,) is actually very well suited to VMWare in most situations. The chances are much more likely that there were fundamental configuration issues either on the part of VMWare or the HDS. I'm glad you didn't say AMS2000, because VMWare and the AMS2000 is a very bad idea - Active-Active arrays should NEVER be used with VMWare ESX or ESXi under any circumstances, or should be forced to operate in Active-Passive mode. (VMWare doesn't multipath correctly, and A/A requires your host do MPIO to balance controller load. In other words, ESX/ESXi will just trash one controller without doing a ton of additional work. Not worth it.)

To tell you more on why the HDS didn't work though, I honestly would have to come on site and look at everything in person. There's a whole mess of things with the USP's that I'm just not comfortable doing remotely, because it's very easy to make things far worse than they were.

lopoetve · Jul 18, 2009

oakfan52 said:
Not to thread jack but we use Hitachi, currently migrating from a USP1100 to a USP-V... and to say we have had issues is a giant understatement. It just seems like the HDS arrays just aren't built to handle VMware loads very well. Anyone else have any HDS experience?

1. Cut the SCSI queue to 2 on the ESX side. Yes, you heard me - TWO.
2. MAX hosts per lun = 4.
3. MAX VMs per lun = 4.
4. FC / SAS 15k disks ~only~
5. Avoid LUSE luns if possible
6. Host storage domains have a queue of exactly 2048 and process 32 iops per cycle, no matter how many hosts are in them. This means you need to have as little per HSD as you can. Cut them down. Small clusters, small numbers of vms per lun, fast disks. (The non -V gets 1024 queue and 16 iops).

They're fast, but VERY, VERY picky.
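To see why point 6 pushes toward small host storage domains, here is a rough Python sketch. It assumes the fixed queue and per-cycle budget quoted above are split evenly among the hosts in an HSD, which is a simplification for illustration only and not documented array behavior. `hsd_share` is a made-up helper name.

```python
def hsd_share(hosts, queue_slots=2048, ios_per_cycle=32):
    """Even split of a fixed host storage domain budget across hosts.

    Defaults are the USP-V figures quoted in the post. The even split
    is an assumption made purely for illustration.
    """
    if hosts < 1:
        raise ValueError("need at least one host in the domain")
    return queue_slots / hosts, ios_per_cycle / hosts


# Compare a small domain with progressively larger ones.
for hosts in (4, 16, 64):
    slots, ios = hsd_share(hosts)
    print(f"{hosts:>2} hosts: {slots:g} queue slots, {ios:g} I/Os per cycle each")
```

Passing `queue_slots=1024, ios_per_cycle=16` gives the same picture for the non-V figures mentioned in the post.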

lopoetve · Jul 18, 2009

AreEss said:
The USP1100 is one of those "too smart" arrays in some respect. The HDS "RAID-V" or "VRAID" (whichever you wanna call it, depending on who's rebranding it,) is actually very well suited to VMWare in most situations. The chances are much more likely that there were fundamental configuration issues either on the part of VMWare or the HDS. I'm glad you didn't say AMS2000, because VMWare and the AMS2000 is a very bad idea - Active-Active arrays should NEVER be used with VMWare ESX or ESXi under any circumstances, or should be forced to operate in Active-Passive mode. (VMWare doesn't multipath correctly, and A/A requires your host do MPIO to balance controller load. In other words, ESX/ESXi will just trash one controller without doing a ton of additional work. Not worth it.)

To tell you more on why the HDS didn't work though, I honestly would have to come on site and look at everything in person. There's a whole mess of things with the USP's that I'm just not comfortable doing remotely, because it's very easy to make things far worse than they were.

Under no circumstances should you EVER set an AA array to MRU, unless you want to miss a failover and have your environment go down, or lose your manually configured path balancing that you should have done. Stupid? Probably. True? Definitely :-/

There are plenty of AA arrays that work just fine with ESX - you configure pathing from the host then, instead of the array - that's all, and define different paths for each host per lun. You don't get MPIO unless you're on 4, but in 3.5 you can certainly balance the load, and it works spectacularly. ESX will only trash a single controller if you haven't done any work configuring things.

E.g., for a DMX, have host one hit FA1 for luns 1/3/5, and FA2 for 2/4/6. Reverse for host 2. Mix ports too - we restore pathing information on reboot, after all.
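The alternating layout described above can be sketched in a few lines of Python. This only prints a planning table; it does not touch ESX or the array, and `assign_paths`, the host names, and the port labels are placeholders chosen for the example.

```python
def assign_paths(hosts, luns, ports):
    """Rotate the preferred port per LUN, offset by the host's index.

    With two ports, the first host gets port 0 for the 1st/3rd/5th LUN
    and port 1 for the others; the second host gets the reverse.
    """
    plan = {}
    for h, host in enumerate(hosts):
        plan[host] = {
            lun: ports[(i + h) % len(ports)] for i, lun in enumerate(luns)
        }
    return plan


plan = assign_paths(["host1", "host2"], [1, 2, 3, 4, 5, 6], ["FA1", "FA2"])
for host, paths in plan.items():
    print(host, paths)
```

The same rotation works with more than two ports, which is one way to "mix ports" without tracking every assignment by hand.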

jessejames · Jul 18, 2009

DS3400 with 2x 512M controllers
cacheBlockSize=16 and segment size of 128k
Raid6 11 300GB 15k SAS disks + 1 spare
ESX-3.5U4 and a windows 2003 SP2 VM
iometer 32k 100% writes and 100% random
With mirrorEnabled=true we have 2900 I/O per second
With mirrorEnabled=false we have 9300 I/O per second !!!!
How do you explain that ?
We also tried Raid5 5+1 and 11+1 - same phenomenon (by the way, raid6 performance is almost the same as raid5)
We created a 32k-aligned partition - same phenomenon

9300 vs 2900 I/O is something !
Is it possible that when mirrorEnabled=true, both controllers lose half their cache, which would explain the huge drop in performance ?
In fact, each controller has 310MB of cache (202 reserved for the processor)

lopoetve · Jul 18, 2009

That one is AreEss's specialty

I'll watch the master at work for it.

AreEss · Jul 18, 2009

lopoetve said:
That one is AreEss's specialty. I'll watch the master at work for it.

Heh. Actually, it's an easy one.

Cache Mirroring on the DS3400 incurs an extreme penalty to performance for three reasons. One, it halves available cache from 1GB (512MB per controller) to 512MB. Two, the connection between the controllers for the cache mirror is slow. Three, the memory and processors are slow, which makes it even slower. (There's no hardware ASIC for cache mirror function like in the DS4800/DS5100/DS5300.)

So yes, those are the expected numbers with cache mirroring enabled at 2x512MB. One workaround would be to upgrade to 1GB, which will net you a ~20% increase in performance, best case. The obvious answer is to turn off cache mirroring. Here's the catch - if you have a controller fault, data in cache on the failed controller is lost. So systems with 1GB cache should not run with cache mirroring disabled. With 512MB you can do so, provided you adjust cache flush to be more aggressive.

Also, DO NOT RUN THE ARRAY SIZES YOU MENTIONED, EVER. The DS3400 is very array sensitive, even more so than the DS4300. NEVER run RAID5 with an even number of disks, NEVER run RAID10 past 10 disks, NEVER run RAID6 with an odd number of disks. While you may see reasonable IOPS, throughput will go to absolute hell.

AreEss · Jul 18, 2009

lopoetve said:
1. Cut the SCSI queue to 2 on the ESX side. Yes, you heard me - TWO.
2. MAX hosts per lun = 4.
3. MAX VMs per lun = 4.
4. FC / SAS 15k disks ~only~
5. Avoid LUSE luns if possible
6. Host storage domains have a queue of exactly 2048 and process 32 iops per cycle, no matter how many hosts are in them. This means you need to have as little per HSD as you can. Cut them down. Small clusters, small numbers of vms per lun, fast disks. (The non -V gets 1024 queue and 16 iops).

They're fast, but VERY, VERY picky.

That's way too tweaky and specific. Far more than it needs to be. You don't need to restrict disks, or max hosts or VMs like that at all. Also it's extremely arbitrary for a USP 1100. 1100's I insist on on-site because they can vary so widely. *bonk*
Firstly, SCSI queue on the ESX side should actually be 2 or 4, depending. 2 for a FT environment, 4 for VMotion only. FC will behave badly with a queue of 1, and you'll lose significant IOPS below 4. The problem is that 4 can cause issues with failure detection on the USP's.
Breaking down LUNs on a USP is an interesting little thing, because the USP operates based on packs of 4 disks each. e.g. Upgrading disk is not "add 1x146GB" it is "add 1 pack 4x146GB." A LUN can span 4 to 24 spindles, IIRC. (Depends on model, configuration, etcetera.) You should evaluate your VM capacity on a per LUN basis in conjunction with HDS service. Do not size by spindle count; only by IOPS on the USP 1100. (You'll have to use Windows tools to determine what a guest wants in terms of IOPS.)
Hosts should be done per cache board per control board per director board per director port versus back-end boards. This is why I hate the USP 1100 the most; it's up to 32 cache boards with up to 8 control boards with up to 6 director boards with up to 192 ports handled by up to 4 back-end boards. It is absolutely impossible to reasonably give a general recommendation without evaluating the total processing capacity of the USP1100, which requires knowing the board and port configurations. Depending on the ESX host, putting 48 guests on it could be completely reasonable with a USP1100 - we'll say 6 FC 4Gbit ports, 8 hosts per port. Or it could be complete insanity. The USP-V is even WORSE in this regard, too.
LUSE is another issue which comes down to USP configuration, but expands into LUSE count, attached hosts, average IOPS on the USP, etcetera. LUSE is acceptable only depending on the specific situation and configuration, and can be used to decrease seek and/or increase IOPS capability. But LUSE can also cripple an LU's performance. So it's something that should only be used with great care and extreme attention to detail. It is very easy for a well performing LUSE to turn into a catastrophic bottleneck. If you aren't comfortable spending a great deal of time monitoring performance in detail, I don't recommend using LUSE for anything.
Storage domain queueing, lopoetve and I are going to be completely at odds on, I suspect. Queue depth should be adjusted depending on the actual array configuration, based against disks used. IOPS per cycle should be adjusted purely based on application demand, and balanced against other applications on the USP. Yes, this means you're in for a great deal of performance analysis.

Like I said; I'm not comfortable making recommendations on these remotely for a reason. And above is why. You cannot make sound recommendations on USP performance tuning without knowing a great deal of information about the configuration and other hosts attached to it. You also need to do a lot of performance analysis and trending, or you end up in fire-fighting mode on a very regular basis. I don't like them at all, because they all but require you have a dedicated SAN administrator who does nothing but babysit it.

Have you perhaps considered an IBM DS8100 or DS8300 Turbo instead of the USP-V? Or instead of direct connection, using multiple systems behind an IBM SVC? The administration overhead on either solution is much lower. The other advantage is that the SVC is storage neutral and very good at improving performance for things like ESX. So not only will you see a net performance gain regardless, you have the option to use a greater number of lower cost arrays behind the SVC to reduce total costs and increase performance. Even just an SVC in front of the USP-V will make your life a lot easier, simply because you no longer are bound by USP-V restrictions. You tune the USP-V for the SVC (which is very easy,) and tune ESX for the SVC - which requires much less monitoring and active administration.
EDIT: As an example, you could use multiple well tuned IBM DS5100 or HDS AMS2100/AMS2300 systems behind an SVC cluster. If we presume each controller behind the SVC can do 2GB/sec sequential throughput and 100,000 IOPS, this means that 4 controllers would give you peak of 8GB/sec and 400,000 IOPS before SVC caching. (The SVC can increase IOPS significantly in most situations, as well.) This also eliminates the Active-Active restriction on ESX, because you attach to the SVC instead of the controllers. There are SVC restrictions in ESX of course, but those are another topic.
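The arithmetic in that example is plain linear scaling, which a short Python sketch makes explicit. The per-controller figures are the presumed numbers from the post, not measurements, and `aggregate_peak` is a hypothetical helper. The sketch ignores SVC caching and any fabric or SVC node limits.

```python
def aggregate_peak(controllers, gb_per_sec=2, iops=100_000):
    """Best-case totals if every controller contributes its full rate."""
    return controllers * gb_per_sec, controllers * iops


throughput, total_iops = aggregate_peak(4)
print(f"{throughput} GB/sec sequential, {total_iops:,} IOPS before SVC caching")
```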

lopoetve · Jul 18, 2009

AreEss said:
That's way too tweaky and specific. Far more than it needs to be. You don't need to restrict disks, or max hosts or VMs like that at all. Also it's extremely arbitrary for a USP 1100. 1100's I insist on on-site because they can vary so widely. *bonk*

It's the standard starting place that we've found. Especially if you have more than 4-5 hosts per lun. You tweak from there, obviously, but that'll get you running and you can go from there. VMware KB

Firstly, SCSI queue on the ESX side should actually be 2 or 4, depending. 2 for a FT environment, 4 for VMotion only. FC will behave badly with a queue of 1, and you'll lose significant IOPS below 4. The problem is that 4 can cause issues with failure detection on the USP's.

I'm assuming you're only talking about the USP, right?

Hosts should be done per cache board per control board per director board per director port versus back-end boards. This is why I hate the USP 1100 the most; it's up to 32 cache boards with up to 8 control boards with up to 6 director boards with up to 192 ports handled by up to 4 back-end boards. It is absolutely impossible to reasonably give a general recommendation without evaluating the total processing capacity of the USP1100, which requires knowing the board and port configurations. Depending on the ESX host, putting 48 guests on it could be completely reasonable with a USP1100 - we'll say 6 FC 4Gbit ports, 8 hosts per port. Or it could be complete insanity. The USP-V is even WORSE in this regard, too.

True, but again, this was just a starting point.

LUSE is another issue which comes down to USP configuration, but expands into LUSE count, attached hosts, average IOPS on the USP, etcetera. LUSE is acceptable only depending on the specific situation and configuration, and can be used to decrease seek and/or increase IOPS capability. But LUSE can also cripple an LU's performance. So it's something that should only be used with great care and extreme attention to detail. It is very easy for a well performing LUSE to turn into a catastrophic bottleneck. If you aren't comfortable spending a great deal of time monitoring performance in detail, I don't recommend using LUSE for anything.

Amen

Storage domain queueing, lopoetve and I are going to be completely at odds on, I suspect. Queue depth should be adjusted depending on the actual array configuration, based against disks used. IOPS per cycle should be adjusted purely based on application demand, and balanced against other applications on the USP. Yes, this means you're in for a great deal of performance analysis.

According to my contacts, IOPS per HSD are set and cannot be adjusted, hence my recommendations. We've found that limiting the number of hosts per domain greatly helps with these issues, since they're limited per domain (and so many places simply put everything into a single HSD).

Like I said; I'm not comfortable making recommendations on these remotely for a reason. And above is why. You cannot make sound recommendations on USP performance tuning without knowing a great deal of information about the configuration and other hosts attached to it. You also need to do a lot of performance analysis and trending, or you end up in fire-fighting mode on a very regular basis. I don't like them at all, because they all but require you have a dedicated SAN administrator who does nothing but babysit it.

When AREN'T we fire fighting?

jessejames · Jul 18, 2009

Thanks

In fact, communication between the controllers seems to be the culprit.
with mirrorEnabled=false and writeCacheEnabled=true : 11000 I/O
with mirrorEnabled=false and writeCacheEnabled=false : 2400 I/O

Would it be that the system doesn't even try to mirror the data ?
Or maybe it is waiting to validate the writes to the disk before flushing both caches (in fact same perf as writeCacheEnabled=false because iometer saturates the disk I/O)
In fact, I/O per disk is probably around 300, so 11 raid6 (9+2) disks would give around 9x300 = 2700 I/O ?
So, the 11000 (or 9000) I/O are the ones which are "wrong" ?

For the Raid6 setup, is it correct to use 10 disks and 2 spares or 12 disks and 0 spares? I will create 2 equal vmfs luns

AreEss · Jul 19, 2009

lopoetve said:
It's the standard starting place that we've found. Especially if you have more than 4-5 hosts per lun. You tweak from there, obviously, but that'll get you running and you can go from there. VMware KB

Okay, at least we agree on that part. I'd have to look, but I haven't had enough test and tune time with USPs to really adjust those numbers.

I'm assuming you're only talking about the USP, right?

USP, but that recommendation covers several other arrays as well. I'm not naming names, because somebody's gonna change things wrong and blow things up. So for those of you reading, don't even think about it. Ask for proper queue depth and ye shall receive.

Amen

According to my contacts, IOPS per HSD are set and cannot be adjusted, hence my recommendations. We've found that limiting the number of hosts per domain greatly helps with these issues, since they're limited per domain (and so many places simply put everything into a single HSD).

Sorry, might have been unclear on that. The procedure is complicated as all hell. Basically you have to establish your baseline IOPS against your HDS advised limits. From THERE you slice it up based on what your hosts want in order to balance it. Never load your IOPS past 70% of tested maximum; you may need that overhead. You will need that overhead at some points. If you need more IOPS, you need another domain.
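As a rough illustration of the 70% rule above, the Python sketch below checks whether a set of per-host IOPS demands fits under the ceiling for one domain. The tested maximum and the host demands are invented example numbers, and `domain_headroom` is a hypothetical helper, not part of any HDS or VMware tooling.

```python
def domain_headroom(tested_max_iops, host_demand, ceiling_pct=70):
    """Return (budget, planned load, fits) for one storage domain.

    budget is ceiling_pct percent of the tested maximum; fits is True
    when the summed host demand stays at or below that budget.
    """
    budget = tested_max_iops * ceiling_pct / 100
    planned = sum(host_demand.values())
    return budget, planned, planned <= budget


# Hypothetical baseline and host demands, for illustration only.
demand = {"esx1": 3000, "esx2": 3500}
budget, planned, fits = domain_headroom(10_000, demand)
print(f"budget {budget:g} IOPS, planned {planned} IOPS, fits: {fits}")
```

When `fits` comes back `False`, the post's advice applies: the extra load belongs in another domain instead of eating into the overhead.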

When AREN'T we fire fighting?

Hey, I was almost never fire fighting on the DS4k+SVC setup. Granted, that's partly because I completely gave up because reconfiguring it correctly was absolutely impossible, but it was stable at least!

AreEss · Jul 19, 2009

jessejames said:
Thanks

In fact, communication between the controllers seems to be the culprit.
with mirrorEnabled=false and writeCacheEnabled=true : 11000 I/O
with mirrorEnabled=false and writeCacheEnabled=false : 2400 I/O

Would it be that the system doesn't even try to mirror the data ? Or maybe it is waiting to validate the writes to the disk before flushing both caches (in fact same perf as writeCacheEnabled=false because iometer saturates the disk I/O)

No, the problem is that the controller mirrors the data over the only path between the two controllers. So you're contending not only with cache mirroring tasks (4k blocks going between the two controllers at a high rate) but disk load as well, including arrays. If you're using the proper configuration to utilize both ESMs (arrays across even -AND- odd slots,) then yeah. You are pretty much boned.
It doesn't validate cache flush though, no. What it has to do is validate cache copy before it returns. The cache flushing increase is actually a result of less cache being available, forcing it to flush blocks as quickly as possible, after mirror copy validation.

In fact, I/O per disk is probably around 300, so 11 raid6 (9+2) disks would give around 9x300 = 2700 I/O ?
So, the 11000 (or 9000) I/O are the ones which are "wrong" ?

The DS3400 has a hard cap of roughly 10,000 IOPS in single controller configuration. That's basically the limit. I could probably do slightly better with one and a few weeks to play with it, but not a great deal better. The 11,000's a result of caching at both host and controller - a net gain of ~1,000 from caching is about normal.
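For reference, the back-of-envelope spindle estimate from the quoted question can be written out as below. The 300 IOPS per disk figure is jessejames's guess, the 9 data disks come from his 9+2 RAID6 layout, and `spindle_estimate` is a hypothetical name. The sketch ignores RAID write penalties and caching, which is exactly why it lands so far below the measured number.

```python
def spindle_estimate(data_disks, iops_per_disk):
    """Naive uncached estimate: data spindles times per-disk IOPS."""
    return data_disks * iops_per_disk


estimate = spindle_estimate(9, 300)  # 9 data disks at a guessed 300 IOPS each
measured = 11_000                    # iometer result with write cache enabled
print(f"estimate {estimate} IOPS, measured/estimate = {measured / estimate:.1f}x")
```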

For the Raid6 setup, is it correct to use 10 disks and 2 spares or 12 disks and 0 spares? I will create 2 equal vmfs luns

Let's say you have 12 disks, which sounds to be the case. This makes things a little tricky. You do NOT run without hotspares, but you also don't want to buy an EXP3000 if you can avoid it.
So, let's toss RAID6 out the window for now. We don't have the disks. Let's presume we're going for maximum performance in both IOPS and throughput. Our optimal configuration would be something along the lines of:
RAID5 (4+1), ESX_A, Disks 1, 2, 3, 4, 5 - Segment size 256k
RAID5 (4+1), ESX_B, Disks 6, 7, 8, 9, 10 - Segment size 256k
Hotspares: Slots 11, 12
This gives us three advantages: one, we're increasing total IOPS by utilizing both controllers. Per controller IOPS is lower, but aggregate is higher. Two, we're increasing utilization of the controllers and ESMs. Three, expansion becomes as simple as attaching an EXP3000 with 6 or 12 disks, and copying the existing arrays.
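To put a number on what this layout yields, here is a small Python sketch of raw usable capacity. It assumes the 300GB disks mentioned earlier in the thread and ignores formatting and VMFS overhead; `raid5_usable_gb` and the `layout` table are illustrative names only.

```python
def raid5_usable_gb(data_disks, disk_gb):
    """Raw usable capacity of one RAID5 array (parity disk excluded)."""
    return data_disks * disk_gb


# Two 4+1 arrays, one per ESX host, as in the layout above.
layout = {"ESX_A": (4, 300), "ESX_B": (4, 300)}
usable = {name: raid5_usable_gb(d, gb) for name, (d, gb) in layout.items()}
print(usable, "total:", sum(usable.values()), "GB")
```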

defuseme2k · Jul 20, 2009

Thanks so far for your help, apparently the settings we changed didn't make any difference on our throughput. We had the array configured with 64k segments and then an ntfs format with 64k clusters. Didn't seem to make any difference. Maybe these 8 sata disks just can't go fast enough.

I have another question though. If you look in disk management on the windows box and right click on a LUN's properties, on the policies tab you can normally enable write behind caching. I haven't checked the IBM yet, but I cannot seem to turn this on for LUNs I have presented from the HP EVA we have. I can check the box but it doesn't stick after I hit OK and the message changes to: This device does not allow its write cache setting to be modified.

Maybe it's not supposed to work and I don't know it.

jessejames · Jul 20, 2009

Thanks AreEss

So here's what I did:
set logicaldrive["DS3400L01"] cacheFlushModifier=immediate;
set logicaldrive["DS3400L01"] mirrorEnabled=false;
set storagesubsystem cacheFlushStart=0;
set storagesubsystem cacheFlushStop=0;
Can it be more aggressive than that, apart from completely disabling cache ??

With those settings, I have my 11000 I/O, but can it really be used with ESX3.5 in production without risking corruption ?
I'm still wondering about those numbers though. 1500 I/O with writecache disabled and 11000 I/O with what appears to be theoretically no cache..

AreEss · Jul 21, 2009

jessejames said:
Thanks AreEss

So here's what I did:
set logicaldrive["DS3400L01"] cacheFlushModifier=immediate;
set logicaldrive["DS3400L01"] mirrorEnabled=false;
set storagesubsystem cacheFlushStart=0;
set storagesubsystem cacheFlushStop=0;
Can it be more aggressive than that, apart from completely disabling cache ??

With those settings, I have my 11000 I/O, but can it really be used with ESX3.5 in production without risking corruption ?
I'm still wondering about those numbers though. 1500 I/O with writecache disabled and 11000 I/O with what appears to be theoretically no cache..

Ouch. No. That's definitely wrong, and your benchmark's giving false numbers. Watermark of 0 is effective disable. Never set it to 0, that will cause corruption.
Set cache flush to .. 2s I think it is, watermarks at 60 low 80 high. If 2s won't go, 5s is fine. No matter what, you're going to have to accept some risk of data corruption, but presume an effective cache of 310MB * 80% = ~250MB. ESX should have ~250MB of I/O buffer/cache internally, minimizing your risk. (Plus guest cache, etcetera, so you have a fairly wide safety net.) That's unfortunately the best case you're going to get while maintaining solid performance.
The risk of actual data loss due to controller failure, it should be noted, is still really low. The DS3400 is a remarkably reliable product, and I've never heard of any controller failures in the wild. That doesn't mean it won't happen, or that you shouldn't consider the possibility that it will. Just that I'd consider it very unlikely.
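The exposure estimate above is simple arithmetic, sketched here in Python. It treats the high watermark as an upper bound on unflushed write data in a single controller's cache, which is a simplification of how the controller actually behaves; `dirty_cache_ceiling_mb` is a hypothetical helper name.

```python
def dirty_cache_ceiling_mb(cache_mb, high_watermark_pct):
    """Approximate upper bound on unflushed write data, in MB."""
    return cache_mb * high_watermark_pct / 100


# 310MB usable cache per controller, 80% high watermark (figures from the thread).
at_risk = dirty_cache_ceiling_mb(310, 80)
print(f"up to ~{at_risk:g} MB of cached writes could be lost on a controller fault")
```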

AreEss · Jul 21, 2009

defuseme2k said:
Thanks so far for your help, apparently the settings we changed didn't make any difference on our throughput. We had the array configured with 64k segments and then an ntfs format with 64k clusters. Didn't seem to make any difference. Maybe these 8 sata disks just can't go fast enough.

What's your number on sequential? The key here is that on random throughput, you will see maybe 100MB/sec if you're lucky. On sequential, you should scream along at 300MB/s+. But again, only on sequential. SATA is not acceptable for high random disk loads.

I have another question though. If you look in disk management on the windows box and right click on a LUN's properties, on the policies tab you can normally enable write behind caching. I haven't checked the IBM yet, but I cannot seem to turn this on for LUNs I have presented from the HP EVA we have. I can check the box but it doesn't stick after I hit OK and the message changes to: This device does not allow its write cache setting to be modified.

Maybe it's not supposed to work and I don't know it.

This is a restriction of the EVA arrays. Write cache is exclusively controlled by the EVA heads. Don't expect performance out of them, either. I evaluated the EVAs, and found that short of the 8800, I'd get better performance out of an LSI 8480 or a ServeRAID. They really and truly suck hard. I don't know if it's the processing hardware in the heads or what, but they are just AWFUL.

lopoetve · Jul 21, 2009

AreEss said:
What's your number on sequential? The key here is that on random throughput, you will see maybe 100MB/sec if you're lucky. On sequential, you should scream along at 300MB/s+. But again, only on sequential. SATA is not acceptable for high random disk loads.

This is a restriction of the EVA arrays. Write cache is exclusively controlled by the EVA heads. Don't expect performance out of them, either. I evaluated the EVAs, and found that short of the 8800, I'd get better performance out of an LSI 8480 or a ServeRAID. They really and truly suck hard. I don't know if it's the processing hardware in the heads or what, but they are just AWFUL.

You have to path balance to the owning head on an EVA to get any kind of oomph out of them - they look like an A/A array because of the mirrorview port, but it's only 1gb, so if it has to use it, it's slow as shit.

AreEss · Jul 21, 2009

lopoetve said:
You have to path balance to the owning head on an EVA to get any kind of oomph out of them - they look like an A/A array because of the mirrorview port, but it's only 1gb, so if it has to use it, it's slow as shit.

Oh no, even with the heads balanced, the thing is a steaming pile of crap until you hit 8800 or whatever it is. (8400 maybe? I forget now, it's late anyways.) I had some loser HP reps trying to sell me a combination of XP20K and EVA saying the EVA4400 was plenty fast for Tier 3 as compared to 300GB 10K FC.
Yeah, no. It wasn't even close.

defuseme2k · Jul 21, 2009

The EVA performs OK for what it is doing. It is an EVA4400, but I wouldn't call it amazing or anything by any means. It is populated with 300GB 15K FC disks (about half way) and then we have a single 8 disk RSS with 450GB 15K FC disks. The interlink performance IS crap and its "active/active" but in a worthless way due to the interlink. That is where LUN ownership and ALB for MPIO come in to make sure you're not reading from the wrong controller. I assume writes go to both controllers because of the cache mirroring. We have it load balanced manually from ESX where each host uses the SAME path for a particular LUN. We tried to evenly balance across both controllers and all 4 ports basically 'rotating' every new presentation. We also have the owning controller specified in cmdview. On the VCB proxy we're using MPIO -- SQST with ALB for reading the data. We would never expect amazing performance (or even the same as the below test) moving data from the EVA to the IBM.

It is VCB and it is doing VCB to disk, then we're replicating somewhere else. We tested doing large file copies from the windows perspective from C: -> the LUN to test write speeds using a stopwatch. I'd prefer real data like that compared to doing it with a hdd app that shows best case.

Part of the problem, too, is that the backup software we're using does a VCB copy to a 'temporary' holding container, then it has to copy it to the final destination doing inline deduplication. That just adds to the time frame, not to mention the temporary holding container is another LUN on the IBM SAN.

lopoetve · Jul 21, 2009

correct, writes don't matter as that's cached either way - only the reads. ALUA deals with it, if your os is ALUA aware (ESX4).

Sounds like you did it right.

oakfan52 · Jul 23, 2009

Unfortunately we have a dedicated team to handle our storage. They have little to no interest in performance tuning and don't understand anything about vmware.

Believe me the headaches I have had with my physical file servers and the vmware hosts have given me a crash course in SAN technology.

We have the full range of hosts types from HPUX, Solaris, Windows, and VMWare connecting and sharing storage ports on the USP1100.

Our USP1100 is pretty much 100% LUSE volumes. For pre-production we have two AMS 1000 (A/P arrays) externalized via the USP1100.

The new setup is the USP-V with 64 disk storage pools for tier1 storage. We have a new AMS 2500 virtualized behind the USP for tier2. vmware has been the first to move and so far performance is much better. But we really have to wait until everyone is moved to see what it's like. We did manage to get dedicated storage ports for vmware finally.

One of the big issues we seem to have with the USP1100 is oversubscription on the storage ports. The storage ports would start suppressing I/O, causing the servers (including vmware hosts) to get latency spikes over 3000ms.

I don't feel like our vmware deployment is that large. We have a 6 host cluster in prod with ~140 VMs. In preprod we have 3 clusters with a total of ~550 VMs on 15 hosts.

I think it's time to start considering a dedicated SAN as we have over 40TB for VMware currently. Growing every day.

We have already moved Exchange off to an EMC Clariion with a dedicated FC network because they could not keep the avg response time below 50ms on the USP1100.

AreEss · Jul 23, 2009

That's not even remotely a large setup, and you should absolutely NEVER be at 3000ms EVER UNDER ANY CIRCUMSTANCES. You need to bring up to management IMMEDIATELY that 3000ms disk waits WILL result in MULTIPLE production outages AND data corruption.

You need to get your storage away from your grossly incompetent SAN team. They have absolutely no business claiming to be professionals. None whatsoever. Storage is the absolute most critical component of any business operations, and they clearly are not qualified to administer so much as a ServeRAID. And I hope these idiots read this. The complete and utter disregard they are showing for the most mission critical component in your environment is disgusting beyond belief. They should be terminated immediately for gross incompetence and negligence which puts the business itself at risk.

All that said, letting these complete MORONS front-end an AMS with a USP-V will just make things even WORSE for you. You need to talk to management now. Here's what you need to tell them:

Our storage group has made it quite clear that they have no interest in performing the necessary work to ensure that our production VMWare environments are stable and reliable. There is absolutely no question based on their track record and current production issues that a USP-V is the incorrect solution to this problem. It is also without doubt that the migration is likely to cause us a great deal of trouble and offer absolutely no relief from any of the problems we currently suffer from. Indeed, these problems quite clearly stem from a lack of responsiveness and support from the SAN team, who has been unwilling to provide adequate support to ensure reliability of the environments we are responsible for.
To this end, it is necessary that we have private storage for our VMWare environment which is owned in entirety by the VMWare Administrative team. We are willing to take the time to train ourselves to properly maintain and administer our storage hardware, as it is apparent that this is the only way in which the business will be able to achieve a reliable, stable, and supported VMWare environment.

Feel free to adjust to suit, but you get the gist of it.

This is way into management territory, and I have absolutely no doubt at all that a USP-V is going to buy you more problems than solutions. It's quite clear the SAN team is too incompetent, negligent, or disinterested in doing their jobs to do them right. The USP-V is even more sensitive to tuning than the USP1100 - it's obvious their answer to getting chewed out is to blame the hardware and demand bigger hardware. Meaning you've got zero chance of anything actually getting fixed. Once they start loading it up, it's just going to end up like the USP1100. It never should have been allowed to ever get that bad.

EDIT: Yes. I get very pissed off about shit like this. Justifiably so. If you can't be assed to do your job, or do it right, you should be terminated and blacklisted. There are enough incompetents who think "reboot randomly in the middle of the day" is acceptable for mission critical systems, or any production system or system actively used for development. They need to be thrown out of IT permanently, and blacklisted to prevent them from ever holding an IT position above junior helpdesk again.

lopoetve · Jul 23, 2009

For once, I have absolutely NO disagreement with AreEss.

It's about medium sized for a VMWare environment.

oakfan52 · Jul 23, 2009

I completely agree. Unfortunately most of our issues are political, not technical.

That team is under a different Director and VP. It's a political nightmare where the engineers are just stuck in the middle trying to make it work as best as we can.

I think at best the USP-V will buy us some time, but ultimately we will be in the same situation we have been for the past 2 years because there has been no fundamental change in how the storage is managed. And this is what I have stated over and over to my management.

The interesting part to note (while trying not to bash the vendor): HDS has been the one to do the initial configuration of all of our subsystems. They have been on site during a lot of the "issues" and it doesn't seem to get fixed. So the solution is just to bring in a new SAN from HDS? It boggles my mind what we pay for these systems and the level of support we get. Now whether that's poor on HDS's part or whether we are not engaging them... that I don't know.

We engaged vmware support, as we have the enterprise level, multiple times. It seemed as though they did not want to point fingers. I think maybe the support engineers felt the political aspect. In the end, the issues we keep having are politically seen and propagated as "vmware issues". The last time we opened a case, in March, all of our vmware environments were down, as well as all servers sharing storage ports with them. While I don't support the vmware servers as a primary job, I was in the meetings as my production servers were impacted. I am like AreEss: I don't play the politics. I don't sugar coat it and I'll tell you straight up how I feel. Let's just say that a director left the meeting and I was asked to leave, but in the end it got HDS on site while vmware support was on the phone. The issue was mitigated, but like I said above, it's a ticking time bomb waiting to go off again because there has been no fundamental change in how we manage storage.

AreEss · Jul 23, 2009

oakfan52 said:
I completely agree. Unfortunately most of our issues are political, not technical.

That team is under a different Director and VP. It's a political nightmare where the engineers are just stuck in the middle trying to make it work as best as we can.

Then let me change my advice.
Get your resume out there and start looking for work. Somebody's going to get hung out to dry when everything comes crashing down.

I think at best the USP-V will buy us some time, but ultimately we will be in the same situation we have been for the past 2 years because there has been no fundamental change in how the storage is managed. And this is what I have stated over and over to my management.

Once they start bringing over more stuff? It'll crash harder. Everything as LUSE on a USP-V is worse than on a USP-1100. That's what I was told by an HDS guy, and I don't see any reason to disbelieve it. Especially when you're mixing ports by front-ending AMS2000s; the AMS2000s would be better off standing alone.

The interesting part to note (while trying not to bash the vendor): HDS has been the one to do the initial configuration of all of our subsystems. They have been on site during a lot of the "issues" and it doesn't seem to get fixed. So the solution is just to bring in a new SAN from HDS? It boggles my mind what we pay for these systems and the level of support we get. Now whether that's poor on HDS's part or whether we are not engaging them... that I don't know.

I can't comment on your specific experiences other than to say, that's DEFINITELY not my experience with Hitachi, ever. I've always found HDS support to at least be correct and effective, if not always the fastest turnaround on issues. If the SAN twits opened it as a Prod Down issue and HDS came on site, I think you can bet money that the SAN twits were stonewalling HDS and ignoring their recommendations. (HDS is much more deferential to customer demands than others.) Maybe you should demand they replace it with a Sun 9900V? Then the SAN twits aren't allowed to touch it, ever, for anything.

We engaged vmware support, as we have the enterprise level, multiple times. It seemed as though they did not want to point fingers. I think maybe the support engineers felt the political aspect. In the end, the issues we keep having are politically seen and propagated as "vmware issues". The last time we opened a case, in March, all of our vmware environments were down, as well as all servers sharing storage ports with them. While I don't support the vmware servers as a primary job, I was in the meetings as my production servers were impacted. I am like AreEss: I don't play the politics. I don't sugar coat it and I'll tell you straight up how I feel. Let's just say that a director left the meeting and I was asked to leave, but in the end it got HDS on site while vmware support was on the phone. The issue was mitigated, but like I said above, it's a ticking time bomb waiting to go off again because there has been no fundamental change in how we manage storage.

Heh. Folks like us are few and far between. I guess management still doesn't like being told the truth. Like, "I can't support this environment because the vendor's tech support is now five guys in India who have never even touched the product, and can barely read from a script."
So, yeah. It's time to look for new work. The USP-V right now is buying you time, but that's going to blow up even harder than the USP-1100. The sad thing is that your environment? (lopoetve and I will disagree here, but I like this game.) We'll presume that you can take some performance penalty in dev/test, but keep the penalty reasonable. So, I can do that in... two AMS2500s or three IBM DS5300s at... we'll say an average wait of <10ms. (Just a guesstimate; I'd have to look at applications on the guests and run some capacity exercises too.) Or one USP-VM with lots of room to spare. Oh, and that's with rolling Exchange back in there too.

lopoetve · Jul 23, 2009

Hey, I have ~zero~ preference for particulars for devices - I simply know how to set most of them up and make them work as well as they can. I can't have a preference from where I am and where I work!

I just make what you have work, but I'm on the virtual hardware side atm, not the actual iron side.

oakfan52 · Jul 23, 2009

The USP-V won't have any LUSE LUNs. From what I remember they stated you can't create them on the USP-V. I believe the USP-V and USP-VM are different subsystems.

Psyco · Jul 26, 2009

I have similar issues with our storage team at my workplace as well. We have 9 hosts in two clusters. We bought two cx480's, each with 150 400GB 10K FC drives, one in each datacenter. We had EMC do the layout, telling them what we needed. I argued for quite some time that having multiple 8 disk raid5 groups, then meta lunning across all of them, is more or less making one big disk that is then partitioned. They would not separate the disks. I got told that is the old way of thinking. Makes no sense to me. What they are doing is gambling on our behalf. When one of those raid groups drops a drive, it will affect BOTH clusters and EVERY vm in our environment. This seems like the wrong thing to do just from an HA/DR perspective. From a performance perspective it seems totally wrong as well. The writes I guess can be masked with the 16GB of write cache, but when the two SQL clusters get their 500 databases rocking through the same Clariions, albeit on separate disks, that write cache is going to be at a premium. I really wanted to make smaller raid groups and carve them up for specific purposes: OS loads, Web, File, APP, SQL. All we have now are a shit ton of 400GB chunks. Seems ok as long as we do not expand rapidly on the number of vm's, and depending on whether they are high performance or not. What can I do to provide documentation to prove my theory is in fact valid? I'm looking for an outside opinion.
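
To make the fault-domain argument in the post above concrete, here is a minimal Python sketch. The LUN and RAID group names and counts are hypothetical, not taken from the poster's arrays, and it only models which LUNs have extents on a degraded group, not the actual performance hit.

```python
# Hypothetical layout: four RAID groups and four LUNs, names are illustrative only.
RAID_GROUPS = [f"rg{i}" for i in range(4)]

# Layout A: every LUN is a metaLUN striped across all RAID groups.
striped = {f"lun{i}": set(RAID_GROUPS) for i in range(4)}

# Layout B: each LUN lives on exactly one RAID group.
dedicated = {f"lun{i}": {f"rg{i}"} for i in range(4)}


def affected_luns(layout, degraded_group):
    """Return the LUNs that have any extent on the degraded RAID group."""
    return sorted(lun for lun, groups in layout.items() if degraded_group in groups)


print(affected_luns(striped, "rg2"))    # every LUN touches rg2
print(affected_luns(dedicated, "rg2"))  # only the LUN built on rg2
```

With the striped layout, a drive failure or rebuild in any one group touches every LUN; with dedicated groups it touches only the LUNs built on that group.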

lopoetve · Jul 26, 2009

That's REALLY not a good idea. On ANY array.

Rebuild, scrub, etc - all have the potential to take the ~entire~ environment offline. And by offline, I mean offline forcibly. As in vms crashing, hosts disconnecting, everything dropping, etc.

Hell, the array will report if a rebuild is in progress - ESX won't mount the drive while it is, if there are any issues - you'll get a transient storage error. Good luck with that. Especially since you now can't balance load per actual lun / spindle for serious apps.

It's just not a good idea to tie everything together into one thing. This should be blatantly obvious.

Zbo · Oct 20, 2010

Hi guys,

I have read this thread twice, learned plenty, and improved the performance of our storage a great deal as a consequence.

I have a couple of questions that would be great to clarify if you could.

1. When you refer to a 128KB segment size configured on the LUN, does this mean that each segment on each disk is 128KB, or that one complete stripe is 128KB?
2. Why should you only have uneven disk numbers for RAID 5? Is this related to the question above, in that you need an even number of spindles to split the segment over equally?
3. We use almost exclusively 8MB VMFS block size, as most of our disk is SATA 1TB drives, and we are storing large disk based backup files inside Windows VMs on 2TB volumes (we have about 80TB so far, 4 x DS3400s and 8 x EXP3000s). What would be the best segment size? We use RAID 5 predominantly.

Thanks

Zbo
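
For questions 1 and 2 above, a rough sketch of the arithmetic in Python. It assumes "segment size" means the per-disk chunk (the amount written to one drive before moving to the next), so a full stripe is the segment size times the number of data disks. The function names are made up for illustration, and the power-of-two check is one plausible reading of the "uneven disk count" advice, not a confirmed answer from the thread.

```python
def full_stripe_kb(segment_kb, total_disks, parity_disks=1):
    """Full stripe width in KB, assuming segment_kb is the per-disk chunk.

    RAID 5 spends one disk's worth of capacity per stripe on parity,
    so only (total_disks - parity_disks) segments carry data.
    """
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    return segment_kb * data_disks


def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


VMFS_BLOCK_KB = 8 * 1024  # the 8MB VMFS block size mentioned in question 3

for disks in (5, 8, 9):
    stripe = full_stripe_kb(128, disks)
    aligned = VMFS_BLOCK_KB % stripe == 0
    print(f"{disks} disks: stripe={stripe}KB power_of_two={is_power_of_two(stripe)} "
          f"divides_8MB_block={aligned}")
```

Under that assumption, a 4+1 or 8+1 RAID 5 group yields a stripe that divides evenly into power-of-two I/O sizes, while an 8-disk (7+1) group does not. This says nothing on its own about the best segment size for question 3, which depends on the workload.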
