ESX disk write caching

Oi.
edit: Removed for correction below.

http://www.vmware.com/pdf/esx3_partition_align.pdf - follow this on a test VM. Align the partition with the VMFS volume and let's see what performance is then, since we'll minimize split writes. This normally gains about 3-5%, but if it's a major issue on this SAN, we might get a lot more out of it.
 
OK - correcting my prior post.

Block size is an allocation/deallocation unit, not a read size. Reads will not pull in the entire block, only what actually needs to be read. If the read straddles a file block, VMFS will split the I/O, since the file block may be discontiguous.
 
I did this on a test Win2K3 VM - and writes went from 160MB/s to 260MB/s (with a 16K cache block size on the SAN). Using a 4K cache block size, it went from ~80MB/s to 160MB/s.
 
Holy Jeebus.

BINGBINGBING!

I KNEW THERE WAS A REASON! Hot diggity DAMN. Excepting split reads, there are some things they're leaving out for you there, lopoetve. It's hard for me to explain (because I'm just not good at it), but there's a multi-level disconnect here that nailed us HARD.

There are six layers we have to look at, going from the VM down:
VM Guest
VMDK File
VMFS Allocation
VMFS Segment Allocation
HBA vs. VMFS Segment Allocation
Array Block/Segment Size

The VM Guest resides in the VMDK File, which is ruled by VMFS Allocation rules. However, VMFS has an underlying segment (which is DEFINITELY over 8K but LESS than 512K) which is its minimum read/write operation size. Meaning a 4K read in a VM Guest is actually a 32K physical read, which aligns with any FC HBA. But obviously there's a huge VMware-side performance penalty at 4K. (Which is why alignment is done at 32K.)
What this means is that a VMDK read encompasses a number of VMFS segments, which, going by the document lopoetve provided, are divisible by 32K. This means 64K, 128K, 256K, or 512K. Here's where we get into the fun part. The splits mean you're going to multiple differing blocks, which probably results in ESX internal cache misses and a lot of random seeking.
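To put rough numbers on the split behavior, here's a quick sketch. The 32K segment size and the old 63-sector MBR partition start are illustrative assumptions, not values read off this particular array.
Code:
# Why a misaligned guest partition turns one guest I/O into two physical I/Os.
# The 32K segment size and the 63-sector (31.5K) MBR offset are assumptions
# for illustration only.

SEGMENT = 32 * 1024  # assumed underlying segment size, in bytes

def segments_touched(guest_offset, io_size, partition_offset):
    """Count how many segments a single guest I/O lands on."""
    start = partition_offset + guest_offset  # absolute byte offset
    end = start + io_size - 1
    return end // SEGMENT - start // SEGMENT + 1

io = 32 * 1024  # a 32K write, like the IOMeter test later in the thread

print(segments_touched(0, io, 63 * 512))   # unaligned MBR start -> 2 (split)
print(segments_touched(0, io, 64 * 1024))  # aligned (align=64)  -> 1 (clean)

Every I/O that crosses a boundary becomes two operations on potentially discontiguous blocks, which is exactly the split-write penalty the alignment doc is trying to avoid.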

Let's take our 260MB/s beast here. I think we can get you ~300MB/s by getting our RAID segment right. What's the segment size on the array that did 260MB/s?
 
Oi.
edit: Removed for correction below.

http://www.vmware.com/pdf/esx3_partition_align.pdf - follow this on a test VM. Align the partition with the VMFS volume and let's see what performance is then, since we'll minimize split writes. This normally gains about 3-5%, but if it's a major issue on this SAN, we might get a lot more out of it.

Can you explain why the boot volume does not need to be aligned? Also, per Microsoft, Server 2008 automatically aligns its partitions to 1MB. I assume that's OK because it can be divided by 32?
 
Let's take our 260MB/s beast here. I think we can get you ~300MB/s by getting our RAID segment right. What's the segment size on the array that did 260MB/s?

That's the funny part - 64K, 128K, 256K, and 512K all yielded 260MB/s!!
 
Oi.
edit: Removed for correction below.

http://www.vmware.com/pdf/esx3_partition_align.pdf - follow this on a test VM. Align the partition with the VMFS volume and let's see what performance is then, since we'll minimize split writes. This normally gains about 3-5%, but if it's a major issue on this SAN, we might get a lot more out of it.

Saved for later review this week or next. We're deploying new hosting servers with latency-sensitive services in mind. I'll be looking forward to testing this.
 
Can you explain why the boot volume does not need to be aligned? Also, per Microsoft, Server 2008 automatically aligns its partitions to 1MB. I assume that's OK because it can be divided by 32?

It's destructive.

Technically, you should see at most about a 5% increase in performance from aligning volumes, and it doesn't get you that benefit on a boot volume. I'll look into why.

Correct @ Server 08.
 
Any change with different RAID levels?

Nope. There will be no change except in RAID0. We're gonna need to retest the direct-attach Windows box with the same alignment. I wanna see if it does what I think it will - namely, the exact same thing. I think we're hitting a cache limit, but I don't see cache mirroring turned on.
 
Nope. There will be no change except in RAID0. We're gonna need to retest the direct-attach Windows box with the same alignment. I wanna see if it does what I think it will - namely, the exact same thing. I think we're hitting a cache limit, but I don't see cache mirroring turned on.

Given the overhead of the differing RAID calcs, there should be some change, from what I'm reading :) Not major, but some.
 
Given the overhead of the differing RAID calcs, there should be some change, from what I'm reading :) Not major, but some.

On other arrays, sure. Part of it is understanding that FC is NOT 4Gbit! It is 2x2Gbit. Presuming utilization of both paths, with the DS3400 having two internal loops, that gives us ~500MB/s with cache enabled. The DS3400 is a different beast, much like the DS4200, except it contains less failure than the DS4200, typically. The controller CPU has enough headroom in the design that there's far less variance between RAID types. We definitely should not be hitting CPU limits on the DS3400 at 260MB/sec in RAID5.

The other reason I know it's not the CPU is that it's 260MB/s at multiple segment sizes. Segment size has a significant effect on CPU when you're doing less-than-full-stripe writes. The read-calc-write penalty would be hammering the numbers down very hard. The same goes for smaller segment sizes - there should be at least 20MB/s of variance from 64K to 256K. I may be misremembering the DS3400's controllers, but with dual controllers it should have 4 host ports total (2 per controller), so we shouldn't be hitting a limit at one path plus cache. Just looking at it and ignoring all our other information, 260MB/s says "single path limit" to me. I think it may be that ESX is only using one path, so I want to see the numbers out of Windows direct-attach, where I know it'll use both paths.
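For reference, the full-stripe arithmetic behind the read-calc-write point is just segment size times data disks. A rough sketch (the 4+1 RAID5 layout matches the arrays described later in the thread; the rest is generic RAID math):
Code:
# Full-stripe vs. partial-stripe writes. A write that covers a whole stripe
# (and starts on a stripe boundary) can compute parity from the new data
# alone; anything smaller forces the read-calc-write cycle mentioned above.

def full_stripe_kb(segment_kb, data_disks):
    return segment_kb * data_disks

def is_full_stripe_write(io_kb, segment_kb, data_disks):
    stripe = full_stripe_kb(segment_kb, data_disks)
    return io_kb % stripe == 0  # assumes the write starts on a stripe boundary

for seg in (64, 128, 256):
    stripe = full_stripe_kb(seg, data_disks=4)  # 4+1 RAID5
    print(f"{seg}K segments -> {stripe}K full stripe; "
          f"single 32K write is full-stripe? {is_full_stripe_write(32, seg, 4)}")

With write-back cache the controller can coalesce sequential writes into full stripes before destaging, which is one plausible reason segment size barely shows up in these numbers.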
 
On other arrays, sure. Part of it is understanding that FC is NOT 4Gbit! It is 2x2Gbit. Presuming utilization of both paths, with the DS3400 having two internal loops, that gives us ~500MB/s with cache enabled. The DS3400 is a different beast, much like the DS4200, except it contains less failure than the DS4200, typically. The controller CPU has enough headroom in the design that there's far less variance between RAID types. We definitely should not be hitting CPU limits on the DS3400 at 260MB/sec in RAID5.

The other reason I know it's not the CPU is that it's 260MB/s at multiple segment sizes. Segment size has a significant effect on CPU when you're doing less-than-full-stripe writes. The read-calc-write penalty would be hammering the numbers down very hard. The same goes for smaller segment sizes - there should be at least 20MB/s of variance from 64K to 256K. I may be misremembering the DS3400's controllers, but with dual controllers it should have 4 host ports total (2 per controller), so we shouldn't be hitting a limit at one path plus cache. Just looking at it and ignoring all our other information, 260MB/s says "single path limit" to me. I think it may be that ESX is only using one path, so I want to see the numbers out of Windows direct-attach, where I know it'll use both paths.

Oh, we're definitely using one path. We ~only~ use one path. ESX multipathing in 3.5 is failover only, with load balancing per LUN on the SP (path balancing, really, vs. load balancing) - but only path balancing. We do not offer MPIO or any kind of bandwidth-improving multipathing :)

yet. It's coming in 4.
 
Oh, we're definitely using one path. We ~only~ use one path. ESX multipathing in 3.5 is failover only, with load balancing per LUN on the SP (path balancing, really, vs. load balancing) - but only path balancing. We do not offer MPIO or any kind of bandwidth-improving multipathing :)

yet. It's coming in 4.

Okay, this makes no sense to me. On the one hand, you're saying you do no multipathing, but you're also saying that you do per-LUN load balancing (round robin, I'm guessing.) The DS3400 should be showing 4 paths per LUN, 2 Active, 2 Failover, 1 of each per HBA. That means that if you're doing per-LUN load balancing, it should be using two paths - one per HBA. If that's the case, we should see 400-450MB/s on ESX.

Also, we may want to kick Engineering with a steel toe on this one, or mark the Hitachi AMS 2000-series and SVC as unsupported. Single-port, single-path on either of those will cripple anything else that tries to use that port under ESX load. They're also both true Active/Active systems, meaning you need to multipath to actually run correctly. (Above and beyond preferred path on the SVC. The AMS 2000 does not have preferred paths, according to what I was told by HDS.)
 
No, we do not do load balancing.
We do path balancing. We allow you to not hammer all requests to the same SP. It cannot issue requests to the same LUN down multiple paths - only a single path.

E.g.: on an A/A array, host 1 hits LUN 1 on SP1, host 2 hits LUN 1 on SP2, etc.

We do that. Host 1 won't rotate between SP1 and SP2 for sending commands - ESX 3.5 doesn't have that capability stable yet.

Round Robin will be in the next version of ESX.

We use a single, preferred path for active/active arrays, and a single, most-recently-used path for MRU (A/P) arrays. You can set up preferred paths so that all commands are not being issued to the same SP, or the same HBA, but that's the current extent of it. There's no round robin - once set, it will use the same HBA/SP till told otherwise.
 
Okay. Then what we are seeing here is an absolute path limitation.

260MB/s is the absolute maximum single-direction throughput possible for a single ESX host in all environments. To get optimal performance, alternate port selection between hosts (because you have one HBA).
Why is 260MB/s the absolute maximum throughput possible? That is equivalent to >2Gbit/s, which is the maximum you can put down a single fiber at 4Gbit. A 4Gbit link is a 2Gbit inbound and a 2Gbit outbound link. This means that yes, ~130MB/s is the maximum for a 2Gbit FC path, and ~420MB/s for an 8Gbit FC path.
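Just to make the unit conversion explicit - this only reproduces the 2Gbit-in/2Gbit-out framing above and ignores FC encoding and protocol overhead, so treat the results as ballpark figures:
Code:
# Ballpark conversion behind the path-limit argument. Follows the post's
# assumption that an N-Gbit link carries roughly N/2 Gbit each direction;
# encoding and protocol overhead are ignored.

def mb_s_to_gbit_s(mb_s):
    return mb_s * 8 / 1000

def one_way_ceiling_mb_s(link_gbit):
    return (link_gbit / 2) * 1000 / 8

print(mb_s_to_gbit_s(260))       # ~2.08 Gbit/s - right at a 2Gbit one-way ceiling
print(one_way_ceiling_mb_s(2))   # ~125 MB/s for a 2Gbit FC path
print(one_way_ceiling_mb_s(4))   # ~250 MB/s one-way for a 4Gbit link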

If we can get two VMs on different ESX servers using different datastores on different controllers doing a combined 400-440MB/sec, then I'll be totally satisfied with our work here. :)
 
WOOT! Excuse me while I do my happy dance!
330MB/s out of a single array is quite faboo, considering we're talking ~500-550MB/s out of the entire DS3400. Having them both on the same LUN will obviously cripple performance compared to having them on different LUNs on different controllers.
SO!

Let's talk implementation strategy! Okay, so I just like to abuse management speak. All you need to do is match up all of your ESX arrays to the fast one's configuration, then alternate VMs between the two arrays. (But not VCB, that's a different beast.) That should give you a total combined throughput peak of ~550MB/sec with four VMs. Even 165MB/s is way more than enough for anything you're going to be doing. (If you need >165MB/sec sustained outside of backups, you need IBM POWER.)

For those fine folks following along at home, can you post the exact configuration of the array that we're rocking out with? Disk count, type, segment size are the big ones.

Woo! Thanks for being patient with us, SpaceHonkey. Knew we'd get this running right. :)
 
Hardware - IBM HS21 blades (2 x Xeon E5440) with 4Gb Emulex dual-port HBAs, IBM DS3400 with dual storage processors (512MB cache) and 12 300GB 15K SAS drives, 2 Brocade 10/20-port fiber switches, H series BladeCenter chassis.

DS3400 config - 2 RAID 5 arrays containing 5 drives each, with 2 drives as hot spares. LUNs configured with a 256KB segment size (for what it's worth, testing with segment sizes from 64KB to 512KB yielded similar results in my config). Max throughput came with the 16KB cacheBlockSize setting (the 4KB default yields 160MB/s versus 260MB/s).

VMs - Win2K3, ALIGNED partitions (align=64), formatted with 32KB clusters. 2MB VMFS block size.

Results using IOMeter with 32KB sequential writes = 260MB/s (with 16KB cacheBlockSize)

Note - for those not intimately familiar with the IBM DS series - SMcli is your friend!
 
Now - another question?! This one directly to lopoetve (or anyone else!) - how can you get Converter to align partitions during P2V? Or will we have to wait for ESX4 for this?
 
You can't currently. Can't say anything about the future either way.

512MB cache? We're screaming for that :p Not too shabby!

IMHO, Converter is spectacular for minor apps, or for getting a machine quickly into the environment. If it's a big app, then make a VM better customized to it and migrate to a built-for-VMware-from-the-start VM. :)
 
AreEss - another question. I've removed the LUNs that I created and it has left me with discontiguous free space that I need to be one big glob. My once-500GB free space has become 150GB and 350GB. Now I need to recreate the original 500GB LUN it once held - any ideas how to do this without having to redo all my LUNs and shuffle data around?

I suspect changing segment sizes is what caused it all to shift around.
 
AreEss - another question. I've removed the LUNs that I created and it has left me with discontiguous free space that I need to be one big glob. My once-500GB free space has become 150GB and 350GB. Now I need to recreate the original 500GB LUN it once held - any ideas how to do this without having to redo all my LUNs and shuffle data around?

I suspect changing segment sizes is what caused it all to shift around.

Woo! I'm educational, if you noticed that. ;)
Okay, here's the fun thing about the DS3400: you can actually "defragment" an array to consolidate its free capacity. That's what we want to do here, because honestly? Anything else is going to require completely clearing out the array and recreating from scratch. Not fun, that.

Here's the problem: you can't just do that from a menu in the DS3400. They didn't bother to provide it. The DS4k and DS5k have it. So, here's what we need to do.

Open Storage Manager.
Select the DS3400 on the left window.
Tools->Execute Script
Put this in the script window, where N is the number of the array.
Code:
Start Array[N] Defragment ;
Verify and Execute.

You can put multiple Defragment statements in the window as well, I believe. If not, Verify will abort it.
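If you'd rather not paste into the GUI script window every time, the same script command can probably be pushed through SMcli non-interactively. A sketch - the controller IP is a placeholder, SMcli is assumed to be on the PATH, and the exact invocation may vary by Storage Manager version:
Code:
# Sketch: run the defragment script command through SMcli instead of the GUI.
# 192.168.1.10 is a placeholder controller address; adjust for your array.
import subprocess

ARRAY_NUMBER = 1  # substitute your array number for N

result = subprocess.run(
    ["SMcli", "192.168.1.10", "-c", f"Start Array[{ARRAY_NUMBER}] Defragment;"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)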
 
Ok, will try.

Luckily only the RemoteCLI VM was on this array, so I svmotion'd it off.

I'll let you guys know what happens...

BTW - this finally goes into production this weekend!

edit:
Ok - kicked it off. Why is defragmenting used instead of something simple-sounding like "move" ;)?

Open Storage Manager.
Select the DS3400 on the left window.
Tools->Execute Script

Yeah, I've been doing that for a while now, as using SMcli from the command prompt blows - interactive mode isn't exactly user-friendly!
 
Note - a defrag will generally knock any ESX hosts offline.

Aaaaaaaactually, this is why I love the DS3k/DS4k/DS5k defrag. It's a VERY bad thing to have, don't get me wrong. It's stupid. HOWEVER, it's nonintrusive. Meaning, yes, you can do it with things running on the array. You'll have reduced performance, but you won't have any inaccessible blocks. You should never, ever need to defragment an array in reality. However, cheap design intrudes rather rudely.

Ok - kicked it off. Why is defragmenting used instead of something simple-sounding like "move"?

Because a Move is different from a Defragment. Move is relocating; Defragment is consolidating by reallocating blocks. It may seem like it's just semantics, but it's a really important difference - you're not moving anything on the controller, in actuality. You're just rearranging blocks in place. You won't see any progress or change until the Defragment is complete, either, so don't create any new LUNs on the array till it's done.

Don't get me started on SMcli. I'm convinced that's only there for RSM on the DS4k/DS5k and support scripts, like the DS4k/DS5k firmware check/fix garbage. You could sell me lots of DS3400's, but you'll never get me to pay money for another DS4k or DS5k again.
 
It's not the blocks being gone that matters - it's the performance :) We tend to time out a lot and drop connections as a result of the load, especially at higher-priority rebuilds. ~shrug~ :)
 
Beautiful - worked like a charm!

FYI - apparently AreEss is the ONLY freaking source of good info related to these DS series enclosures on the internets. I've searched high and low and it just doesn't exist out there. Thanks again!

And obviously a HUGE thanks to lopoetve for all the support provided not just to this thread but the entire VM forum. We VMware n00bs would be lost here without you and several other excellent guys here. It's really nice to know that you can come here and get excellent info/support that is often better than what we pay for! VMware should be giving you overtime for what you do here.

Thanks again guys!
 
It's not the blocks being gone that matters - it's the performance :) We tend to time out a lot and drop connections as a result of the load, especially at higher-priority rebuilds. ~shrug~ :)

Yeah, sorry, forgot to mention that it does prioritize active I/O over the defrag - at least on the DS4k's.

SpaceHonkey said:
Beautiful - worked like a charm!

Done already? Man. I am glad to hear that. That's damn fast. Then again, last time I did it was on 10k FC2 disks in large 8+1's. ;)

FYI - apparently AreEss is the ONLY freaking source of good info related to these DS series enclosures on the internets. I've searched high and low and it just doesn't exist out there. Thanks again!

Wow, thanks! But I can't claim all the credit. I have a good bit of experience, but really wouldn't know as much as I do without the help of quite a few IBMers and some staff at LSI. There are a few (okay, one) Redbooks, but to tell the truth, they're high level rather than detailed analysis. Unless you upgrade to a DS8000. ;)
http://www.redbooks.ibm.com/portals/storage
 
Wow, this thread is really interesting, and we have one of these same IBM SANs. We're getting the same ~160MB/s to it with a LUN presented directly to Windows. After reading this and looking at the 4K cluster size - what if we created an NTFS partition with a 32K cluster size on the LUN directly? Would that potentially create the same speed improvement? It sure seems to at the VM level, and from that perspective many things come into play, but I believe I'm reading it correctly. Just changing the cacheBlockSize to 16K didn't help us right away, but I'm curious whether I need to format the LUN with 32K clusters as well.

The Windows server I'm talking about happens to be our VCB proxy, and the write performance to the IBM blows.

I guess I should qualify this by saying we have 8 SATA disks. I'm not sure how that will or will not affect speeds. I'm sure it's slower, but by half?
 
No, you would create an NTFS partition at a 64K segment size and put it on an array with a 64K segment size. This would give you the best performance. Segment size is an array-level setting, not a LUN-level setting. Presuming RAID6 6+2, you should see ~500MB/s theoretical peak throughput, but because it's a single LUN you won't get much benefit from multipathing, so it will be lower.
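For what it's worth, one way to land in the ballpark of that ~500MB/s figure is plain spindle math. A sketch, assuming roughly 80MB/s sequential per SATA spindle (a generic figure, not a measurement from this array):
Code:
# Rough RAID6 sequential ceiling: data spindles times per-spindle rate.
# The ~80MB/s per-disk figure is a generic SATA assumption, not measured here.

def raid6_seq_ceiling_mb_s(total_disks, per_disk_mb_s=80):
    data_disks = total_disks - 2  # RAID6 spends two disks' worth on parity
    return data_disks * per_disk_mb_s

print(raid6_seq_ceiling_mb_s(8))  # 6 * 80 = 480 MB/s, in the ~500MB/s ballpark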
 
Forgive my ignorance, but you mean the NTFS cluster size should be 64K, correct? I don't want to make any assumptions.

Also, it doesn't appear I can set the segment size while creating an array from the GUI. You were specific that it was an array setting, not a LUN setting. Will I need to do this from the command line? Also, my LUN segment size choices are 128KB or 256KB in the GUI - is either of these better than the other?
 
It may also be a RAID group setting (on EqualLogic / CLARiiON it is, at least). See if the RAID group has that setting.
 