VMFS5: SAN block size vs IO


So VMFS5 uses a 1MB block size and an 8k sub-block. On most SANs the default stripe size is 64k. One might be inclined to set the SAN stripe size to 8k to match the VMFS sub-block and get one backend IO per sub-block read, but that would generate quite a bit more IOPS for reads/writes larger than 8k. Likewise, one might be inclined to set the stripe size to 4k, generating yet more IOPS but distributing reads/writes among more spindles.
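To make the tradeoff concrete, here's a rough back-of-the-envelope sketch (my own illustration, assuming an IO that starts on a stripe boundary and ignoring any coalescing the controller might do) of how many backend IOs a single host IO fans out to at different stripe sizes:

```python
import math

def backend_ios(io_bytes: int, stripe_bytes: int) -> int:
    """Number of stripe-sized backend IOs needed to service one host IO,
    assuming the IO starts on a stripe boundary and the controller does
    no coalescing."""
    return math.ceil(io_bytes / stripe_bytes)

# A single 64k guest read against different SAN stripe sizes:
for stripe in (4096, 8192, 65536):
    print(f"{stripe:>6}-byte stripe -> {backend_ios(65536, stripe)} backend IO(s)")
# 4k stripes turn one 64k read into 16 backend IOs; 64k stripes keep it at 1.
```

This is exactly the "more IOPS generation" effect: shrinking the stripe multiplies backend IOs for anything larger than the stripe, while gaining nothing for IOs at or below it.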

Is there a way to measure IO sizes on a per-VM basis over time? Is it worth monkeying with this at all, or should I just leave the whole thing at 64k and call it good?

Also, where does the SAN block size fit in? I just did a test run setting the SAN block size (not stripe size) to 1024 bytes, and VMFS wouldn't format at all, throwing errors left and right. 512 bytes seems to be the default for my array, but I can crank it all the way up to 65536 bytes.
Just to start with: the VMFS block size has no relation to guest IO size or to the IO sizes issued to the array on behalf of guests; it matters only for VMFS's own operations. You don't really need to improve those - they go quickly enough as they are.

In general, there are some performance gains to be had by optimizing block sizes for guest ops, but only to a point (depending on the guest OS being used and the interconnect being used) before you start causing more edge problems than you solve. Analyzing that takes looking at what I've come to call "IO efficiency" and is a very involved process (I've done it for Nimble, Dot Hill, X-IO, and a couple of others so far). You can easily figure out with vscsiStats what size IOs your guests issue, and then you need to measure the IO performance and efficiency of your platform across that spectrum to see if there is something it does ~so~ well that intentionally increasing IOPS would improve performance (that really only happens for large-block IO).
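For reference, a typical vscsiStats session on the ESXi shell looks something like this (the world group ID 12345 below is just a placeholder; use the IDs that the list command reports for your own VMs):

```shell
# List running VMs and their world group IDs
vscsiStats -l

# Start collecting stats for one VM (world group ID from the list above)
vscsiStats -s -w 12345

# ...let a representative workload run for a while, then print the
# IO-length histogram, which shows what size IOs the guest is issuing
vscsiStats -p ioLength -w 12345

# Stop collection when done
vscsiStats -x
```

The ioLength histogram is the one that answers the per-VM IO-size question; vscsiStats can also report latency and seek-distance histograms the same way.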

Hope that helps - at the moment, I'm a bit discombobulated by a networking stupidity issue.
Ok, so from the VMware side though, the smallest piece of VMFS data that can be read is the 8k sub-block?

Say I have a 1k file on a Windows VM: it will be read from a 4k NTFS cluster, which will then result in an 8k read from VMFS, which in turn will require 64k to be read from the physical storage (if it's striped at 64k). Is this all correct thus far?

If the above is correct, then would it not make sense, as a general practice, to create a VD/LUN on the physical storage that is formatted to 8k thus having a 1:1 relationship with VMFS sub-blocks?
8k for VMFS - effectively, yes. There are smaller commands, but you have no way of interacting with them or triggering them (special metadata updates, ATS commands).

Windows 2008/2012 will issue a 1k command to fetch that file - it's smart enough for that (according to our kernel at least, where I've probed). Earlier versions will issue the 4k (or appropriate block) sized command. From there, we pass the command natively now - so if it issued a 1k command, we'll pass it on to the physical layer as a 1k command. As for the array - depends on the platform at that point. Some will process the 1k command, others will do differently.

You're thinking about 4-5 years ago, before we passed everything natively. Right now, we pass what the guest asks for - no matter ~what~ it is. The block/subblock is only for our ~own~ VMFS operations. Guests get translated to the appropriate region on disk, and sent as-is.
I've always found in my testing that 64K has ended up being the happy medium no matter the I/O type. Few gains, if any, found in either direction from there. The only other thing I do specifically now is to use 64K clusters when formatting volumes for MSSQL.
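For completeness, formatting a Windows volume with 64K clusters for SQL Server data files looks like this (run from an elevated prompt; the drive letter and label here are just examples):

```shell
:: Format E: as NTFS with a 64K allocation unit size (quick format)
format E: /FS:NTFS /A:64K /Q /V:SQLDATA
```

The cluster size is fixed at format time, so this has to be done before the data files land on the volume; it can't be changed afterwards without reformatting.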