ESX disk write caching

SpaceHonkey · Mar 12, 2009

How can I verify in 3.5 if disk write caching is enabled? The lun on the SAN is set to enable but I'm having some bad performance issues - Read = 349MB/s, Write = 90MB/s. Any clues?

lopoetve · Mar 12, 2009

run iometer - what kind of IOPS are you getting?

What SAN? In general, we don't touch cache, that's all controller based.

SpaceHonkey · Mar 12, 2009

In this case using 2MB blocks:

Write IOps=50.133625
Write MBps=100.26725

That's the highest throughput I see. I ran the test with blocks from 2KB - 4MB. IOps increases as block size decreases. Do you have a preffered .icf I could test with?

IBM DS3400 SAN.

lopoetve · Mar 12, 2009

http://lopoetve.com/vmware-iometer-testconfig.icf

SpaceHonkey · Mar 12, 2009

IOps = 14619.75685
Read IOps = 7310.851672
Write IOps = 7308.905181
MBps = 57.108425
Read MBps = 28.558014
Write MBps = 28.550411
Average Read Response Time = 15.894052
Average Write Response Time = 19.124338

lopoetve · Mar 12, 2009

Iops are great from what I expect these days. How many vms per raid group, how big is the lun, and how big is the load? And how fast are the disks?

lopoetve · Mar 13, 2009

ok. checked in at work.

IOPS are actually slightly low for that array, but not bad. What else is running on there? You should be able to get ~120MB/s with nothing else on there at all, but if this is a prod san there will be other load.

SpaceHonkey · Mar 13, 2009

Well, 120 MB/s is about the max I get (not in production yet). These tests where with no load on the SAN. For an example, with a Win2K3 VM, I'm getting (4MB blocks) 283MB/s read, 105MB/s write. This is with a 5 disk 15K SAS RAID5 array. No optimizations on sector size during format of that drive - didn't know what best to use at the time.

SpaceHonkey · Mar 13, 2009

For comparison, I get 230MB/s read and 129MB/s write on a 2 disk 15K SAS RAID0 array.

SpaceHonkey · Mar 13, 2009

PM sent with ticket # if you're interested...

sabregen · Mar 13, 2009

wait...what? your running disk performance metric tools inside of a vm, using vmfs3 on a ds3400? if this is the case, no wonder your numbers are shit. you can't do accurate performance tests on SAN from a inside of a guest VM...there's two file systems for each VM, in this case, NTFS on top of VMFS3, and the host is artibtrating all of the IO and other resources.

Did I completely misread this, or is this test completly FUBAR?

SpaceHonkey · Mar 13, 2009

I have done tests from both inside VMs and a 2K3 server attached to the san. Here are the results of your .icf from the server directly attached (no VM):

IOps = 15125.35621
Read IOps = 7571.91308
Write IOps = 7553.443129
MBps = 59.083423
Read MBps = 29.577785
Write MBps = 29.505637

SpaceHonkey · Mar 13, 2009

Let me recap what it going on. SAN write performance blows. In a VM or directly attached. RAID5 5 disks or RAID0 2 disks. Heck at one point in the past I tried 5 disk RAID0, still sucked.

The problem is that READ performance is more or less excellent, and write speeds are at BEST 1/2 (usually 1/3) read speeds even w/RAID0. For the fun of it, I'm going to try just a single drive too.

On a RAID5 5 disk lun, using a server (no VM), while DISABLING write caching on the lun, I only see a drop from 73MB/s to 67MB/s when testing sequential writes using 2MB blocks. That to me is awful.

My concern is that I can't put this thing into production if p2v'd servers are going to have to deal with a massive decrease in storage performance. I'm trying to troubleshoot what could be causing the problem, since IBM refuses to provide performance based support without a "software" contract. When I asked them why RAID0 write speeds where 1/2 read speeds, they asked me if that was a bad thing?!

I'm open to all suggestions and testing methodologies. I've got very little hair left to pull out at this point.

I really think it must be related to write caching, but I can't figure out how. One other thing, disabling write caching kills IOPS, so I'm certain it's on.

Surely all sans don't perform like this!

lopoetve · Mar 13, 2009

sabregen said:
wait...what? your running disk performance metric tools inside of a vm, using vmfs3 on a ds3400? if this is the case, no wonder your numbers are shit. you can't do accurate performance tests on SAN from a inside of a guest VM...there's two file systems for each VM, in this case, NTFS on top of VMFS3, and the host is artibtrating all of the IO and other resources.

Did I completely misread this, or is this test completly FUBAR?

yes you can, yes we do and yes, I told him to. This is the standard VMware method for measuring performance of a VM and of the SAN, with a specific ICF for iometer.

The settings we use bypass most of hte filesystem limitations and give us pretty pure performance, especially when paired with a perf. snapshot from the host.

Space, run the test again, and run a vm-support -s from the ESX host at the same time. Upload it to the ticket, I'll take a look at the SCSI queues.

lopoetve · Mar 13, 2009

SpaceHonkey said:
Let me recap what it going on. SAN write performance blows. In a VM or directly attached. RAID5 5 disks or RAID0 2 disks. Heck at one point in the past I tried 5 disk RAID0, still sucked.

The problem is that READ performance is more or less excellent, and write speeds are at BEST 1/2 (usually 1/3) read speeds even w/RAID0. For the fun of it, I'm going to try just a single drive too.

On a RAID5 5 disk lun, using a server (no VM), while DISABLING write caching on the lun, I only see a drop from 73MB/s to 67MB/s when testing sequential writes using 2MB blocks. That to me is awful.

My concern is that I can't put this thing into production if p2v'd servers are going to have to deal with a massive decrease in storage performance. I'm trying to troubleshoot what could be causing the problem, since IBM refuses to provide performance based support without a "software" contract. When I asked them why RAID0 write speeds where 1/2 read speeds, they asked me if that was a bad thing?!

I'm open to all suggestions and testing methodologies. I've got very little hair left to pull out at this point.

I really think it must be related to write caching, but I can't figure out how. One other thing, disabling write caching kills IOPS, so I'm certain it's on.

Surely all sans don't perform like this!

From looking at the logs, no - you definitely have an issue. I want to look at the scsi queues, latency, and errors when the test is being run. I'm seeing lots of path thrashing that shouldn't be going on from the host side.

I confirmed this with the escelation engineer on site too.

SpaceHonkey · Mar 13, 2009

I'll run the tests again - but to be sure, double check the path thrashing isn't recent. I've over time tested out all kinds of stuff, and I'm sure I caused the thrashing at some point in the past. Part of what I'm testing is how to break it to make sure we know what NOT to do. This thing is going to be the most tested piece of infrastructure we have!

ON THE BRIGHT SIDE - I just did a "real world test" on the 2K3 server using robocopy and I'm pretty damn impressed, ~2.4GB in 11 seconds to the RAID5 (5 disk), and 16 seconds to a single disk.

I'll get your test results up as soon as I can!

lopoetve · Mar 13, 2009

There was stuff right when you opened the ticket, which looked like the latest set of logs. I also just want a recent set, with load, so I can look at it with the right folk and know its recent and what was going on then. If we can have it just that one VM on that lun too, that'd be awesome

Some of the thrashing seemed a bit odd, but it's hard to tell.

SpaceHonkey · Mar 13, 2009

Attachment added. This was the only VM powered on at the time.

SpaceHonkey · Mar 13, 2009

When I did a robocopy test in the VM similar to the physical server, I noticed it would copy fast for a half second then halt, then continue, then halt at a regular interval. I less'd /var/log/vmkernel and didn't see it say anything that I recognized to be bad...

lopoetve · Mar 13, 2009

I'll take a look - you're now the 4th performance case I'm working on

SpaceHonkey · Mar 13, 2009

lopoetve said:
I'll take a look - you're now the 4th performance case I'm working on

Thanks man - we really appreciate the help from VMware, especially considering the treatment we got from the storage vendor.

lopoetve · Mar 13, 2009

I know how that goes.

quick look says "all good!" from the host side - latency is in the 2-3ms per command range, which is great. We're dumping the commands out there, the array is just not taking data fast.

what speed are the disks? SCSI or fibre disks?

I'll keep digging.

lopoetve · Mar 13, 2009

ah ha! found stuff in the logs.

Sec, pulling the sense codes.

lopoetve · Mar 13, 2009

the 10GB lun - what's on it?

sabregen · Mar 13, 2009

the ibm ds3400 is FC attached 12 bay encosure, with dual raid controllers, and dual PSUs and redundant fans support up to 12 internal disks in the controller, of either SAS or SATA, an from 1tb 7200RPM to 74GB 15k SAS, all 3.5" The drive shelf has a maximum capacity (with 3 expansion cabinets) of supporting up to 48 total disks, or 48TB of SATA, as a maximum configuration.

We saw LUN thrashing on a DS3400 for a client, as well. Did the OP make sure to set the host type for the logical drive to VMWAREESX? What firmware level are you running on it, op? I would highly recommend moving up to a 7.x firmware, there's known performance issues with 6.x +ESX

SpaceHonkey · Mar 14, 2009

The hardware is as follows:

IBM DS3400, dual controllers, dual channels per controller
12 300GB 15K SAS disks, no expansion units
IBM H series blade chassis
Dual Brocade 4Gb switches
2 HS21 blades with dual port 4Gb Emulex HBAs

The DS3400 is patched up - to 07.35.41.00. Host type for the blades are set to LNXCLVMWARE. I know that host type prevents the possibility of thrashing (it disables AVT).

Fiber is wired one channel per controller to each switch. So 4 paths for each blade, though only two are available since the array is active/passive. FWIW, VIC shows all paths active (I believe).

The errors you saw for the 10GB lun were caused by me deleting the lun before first deleting the datastore (I assume that's what you're asking about). I had to destroy that lun because I had to destroy the two disk raid0 array in order to test to a single disk. For what it's worth the 10GB lun was just used for IOMeter testing. Right now, it's going to just a single disk, the datastore name is VMDatastoreOnedisk (vmhba2:0:6). I've created and deleted several 10GB luns to test over time, but that's what it is now and was when I sent the last attachment.

The test I ran for you was to a RAID5 5 disk lun.

Would it be worth it to do another vm-support -s while doing a large robocopy job? Well I just did it anyway

Have a looksie... Robocopy was from 5 disk RAID5 lun to 1 disk RAID0 lun (and I balance them to different controllers). Also processor utilization was nearly pegged on the VM (> 90%).

Copied 28.616GB in 22 min 1 second for a grand total of 22MB/s. Booo!!

lopoetve · Mar 14, 2009

We detect the 3400 as an active/active array.

K, deleting the lun would explain the massive failover failures for the thing, and why the array was sending an error instead of not_ready.

I'll look when I get back in on monday

That's definitely slow

SpaceHonkey · Mar 14, 2009

My mistake, I misunderstood active/active vs. active/passive.

Yes both SPs are active, however a lun can only be owned by a single SP. I was thinking that active/active meant both SPs could process I/O for a given lun simultaneously. My mistake!

lopoetve · Mar 15, 2009

You understood it correctly - we just work with the array differently than most active/passive sans. (and hence tag it A/A/)

SpaceHonkey · Mar 18, 2009

Any updates?

lopoetve · Mar 18, 2009

Not yet - been slammed on a major ELA case all week. You're on my list for this afternoon.

SpaceHonkey · Mar 18, 2009

Muchas gracias!

SpaceHonkey · Mar 23, 2009

Bump.

SpaceHonkey · Mar 31, 2009

lopoetve - have another look at this ticket (not sure if you're still involved). Thought you might find it interesting...

While running IOMeter -
15k CMD/s total 56 MB/s read+write total 2.03 DAVG 2.14 KAVG. Tech said whole numbers for either DAVG or KAVG are really bad, and my understanding is that KAVG must be far less than DAVG normally.

lopoetve · Mar 31, 2009

Just haven't had time - I've been so totally slammed it's not even funny. I may take ownership of it though.

Whole numbers? what's he smoking? Those are in ms - if you're seeing less than whole numbers, then you're breaking the space time continuum.

Anything over 80 for DAVG is bad. KAVG is kernel time - you shouldn't see much there unless the host is horked. DAVG is fabric time. GAVG is KAVG+DAVG.

SpaceHonkey · Apr 1, 2009

Ok, well I'm stuck again. I just got done speaking with the currently assigned tech, and he said that after talking to the escalation engineers, it seems my numbers are actually about average. I re-ran the same IOmeter test on the physical server, and it saw nearly identical numbers.

OK - so IOPS are good. But I've still got problems with real world file copies. Physical = ~250MB/s, VM = ~75MB/s.

I feel like I'm at square one again. Basically I was told that there was nothing else he can do. So no more love from VMware or IBM

The vmdks are aligned on the luns, I've tried 32K block size in VM (changed to 88MB/s), no other san activity, no pathing issues. dd in the service console writes at about 42MB/s or so. I'm at a complete loss. The problem only exists on ESX yet we're out of solutions.

SpaceHonkey · Apr 4, 2009

Well this appears to be fixed now. Robocopy reports throughput based on job time, not file copy speeds...which was a bit confusing along the way. And I think ESX 3.5U4 helped too. I think there were several causes along the way, but they all seem to be resolved now. One other thing I've learned is that while the DS3400 is a good value, a 40K+ SAN it ain't. We may end up upgrading cache to 1GB from 512MB in the future...

Lopoetve - thanks for all the help and I'm sure you'll here from me here again in the future if it becomes an issue again.

lopoetve · Apr 4, 2009

hrm. Yeah. I was about to pm you for the sr again - my pms were all lost

And non -backupable.

lopoetve · Apr 4, 2009

SpaceHonkey said:
Ok, well I'm stuck again. I just got done speaking with the currently assigned tech, and he said that after talking to the escalation engineers, it seems my numbers are actually about average. I re-ran the same IOmeter test on the physical server, and it saw nearly identical numbers.

OK - so IOPS are good. But I've still got problems with real world file copies. Physical = ~250MB/s, VM = ~75MB/s.

I feel like I'm at square one again. Basically I was told that there was nothing else he can do. So no more love from VMware or IBM

The vmdks are aligned on the luns, I've tried 32K block size in VM (changed to 88MB/s), no other san activity, no pathing issues. dd in the service console writes at about 42MB/s or so. I'm at a complete loss. The problem only exists on ESX yet we're out of solutions.

FWIW, teh EE I talked to disagreed on those numbers for a 3400.

Anyway, glad it's working out for you

AreEss · Apr 5, 2009

Disagreed in what way, lopoetve? I'd agree that they're low, definitely. Way low. Realistic numbers for the DS3400 are much higher - ~500MB/s read, ~400MB/s write, presuming you're doing full stripe reads and writes. The DS3400 is absolutely a >$50K SAN in terms of raw performance - it replaced the much more expensive DS4200 on the basis of being a better performer. (And trust me, the DS4200 is trash. I have one.)

Space, I think the issue is actually in your DS3400 configuration. Can you do me a favor? Under Storage Manager, there's an option (I forget where on the DS3k's) to "Collect All Support Data."

Can you please do that, and send it to me in a PM? The output should be a ZIP file, which contains a raft of data. Disclosure, I am not an IBM or LSI employee - but I've sifted through CASDs so much, I can pick out your configuration without too much trouble.

The TL;DR version is that I think you have an issue in your cache settings on the DS3400. But I don't remember where to find those. If you can find 'em, I need to know your exact watermark settings, your cache mode switches (write-back/write-through, mirroring, etc,) and which cache modules you have installed.

EDIT: I checked my notes. It's definitely something in the DS3400. What firmware version are you running? Forgot to ask that.

ESX disk write caching

Gawd

Extremely [H]

Gawd

Extremely [H]

Gawd

Extremely [H]

Extremely [H]

Gawd

Gawd

Gawd

Fully [H]

Gawd

Gawd

Extremely [H]

Extremely [H]

Gawd

Extremely [H]

Gawd

Gawd

Extremely [H]

Gawd

Extremely [H]

Extremely [H]

Extremely [H]

Fully [H]

Gawd

Extremely [H]

Gawd

Extremely [H]

Gawd

Extremely [H]

Gawd

Gawd

Gawd

Extremely [H]

Gawd

Gawd

Extremely [H]

Extremely [H]

2[H]4U