ESX disk write caching

Way way low for that array. We get that out of a NetApp running NFS.
 
AreEss, I don't see how to attach a file to a PM, and I don't see an email address for you.

Lopoetve - PM sent.

Quick summary of controllers -

CONTROLLERS------------------------------
Number of controllers: 2

Controller in Enclosure 85, Slot A


Status: Online

Current configuration
Firmware version: 07.35.41.00
Appware version: 07.35.41.00
Bootware version: 07.35.41.00
NVSRAM version: N1726D340R335V05
Pending configuration
Firmware version: None
Appware version: None
Bootware version: None
NVSRAM version: None
Transferred on: None
Replacement part number: 39R6502
Model name: 1932
Board ID: 1932
Submodel ID: 40
Product ID: 1726-4xx FAStT
Revision: 0617
Replacement part number: 39R6502
Part number: 34477-01
Serial number: <XXXXXXXXXXX>
Vendor: IBM
Date of manufacture: March 20, 2008
Data Cache
Total present: 310 MB
Total used: 0 MB
Processor cache:
Total present: 202 MB
Host Interface Board
Status: Optimal
Location: Slot 1
Type: Fibre channel
Number of ports: 2
Board ID: 0901
Replacement part number: Not Available
Part number: Not Available
Serial number: Not Available
Vendor: Not Available
Date of manufacture: Not available
Date/Time: Sun Apr 05 22:08:33 CDT 2009
 
Sorry for the LONG delay, SpaceHonkey. I've been swarmed at work.

Unfortunately, the data I need isn't in there. :( Can you do a Collect All Support Data, then open the Zip file, and post a list of the files in there? It's a different naming layout than the DS4k/DS5k. I just need a few lines from one of those files.
 
Le bump, ain't gotten nothing yet. First real chance I've had to sit down - this week has been a madhouse. If you PM me, I can pop in to check it out inside of a few minutes usually though. Presuming something isn't in the middle of blowing up again. ;(
 
Sorry - ran out of time yesterday (and forgot!)

Here's a dir fer ya -

04/22/2009 08:54 AM 37 Connections.txt
04/22/2009 08:54 AM 8,648 driveDiagnosticData.bin
04/22/2009 08:54 AM 0 ESMStateCapture.zip
04/22/2009 08:54 AM 2,260 featureBundle.txt
04/22/2009 08:54 AM 2,507,975 majorEventLog.txt
04/22/2009 08:54 AM 23,103 NVSRAMdata.txt
04/22/2009 08:54 AM 97,837 objectBundle
04/22/2009 08:54 AM 1,486 performanceStatistics.csv
04/22/2009 08:54 AM 36 persistentReservations.txt
04/22/2009 08:54 AM 62 recoveryGuruProcedures.html
04/22/2009 08:54 AM 2,703 recoveryProfile.csv
04/22/2009 08:54 AM 6,386 sasPhyErrorLogs.csv
04/22/2009 08:54 AM 1,574,918 stateCaptureData.txt
04/22/2009 08:54 AM 11,890 storageArrayConfiguration.cfg
04/22/2009 08:53 AM 65,135 storageArrayProfile.txt
04/22/2009 08:54 AM 108 unreadableSectors.txt
 
Ahha. It is a different layout than the DS4k.

Here's the fun part; I need to see sasPhyErrorLogs.csv and recoveryProfile.csv - I think. I'm positive on PhyError, not positive on recoveryProfile. That should be the one with the cache configuration in it.

sasPhyErrorLogs.csv should probably not be 6k, so I think there's something going on there. And I can tell you for absolute certain that majorEventLog.txt should NOT be 2.5MB.

Sorry for the delay, work has been INSANE.
 
Here they are - now keep in mind, over time I had been testing failovers, primarily by disconnecting fiber and occasionally downing an SP. But I haven't done that for a while - and considering that this guy's about to go into production, most of the test VMs have been removed and there isn't much activity on it now.

sasPhyErrorLogs.csv said:
"SAS statistics gathered on: 4/25/09 1:03:32 PM - Sampling interval: 11 days, 19:35:09"




"SAS Device Summary:"

"SAS Device #1 = IOC located in Controller Enclosure 85, slot A "
"SAS Device #2 = IOC located in Controller Enclosure 85, slot B "
"SAS Device #3 = Expander located in Controller Enclosure 85, slot A "
"SAS Device #4 = Expander located in Controller Enclosure 85, slot B "
"SAS Device #5 = Drive [85,1]"
"SAS Device #6 = Drive [85,2]"
"SAS Device #7 = Drive [85,3]"
"SAS Device #8 = Drive [85,4]"
"SAS Device #9 = Drive [85,5]"
"SAS Device #10 = Drive [85,6]"
"SAS Device #11 = Drive [85,7]"
"SAS Device #12 = Drive [85,8]"
"SAS Device #13 = Drive [85,9]"
"SAS Device #14 = Drive [85,10]"
"SAS Device #15 = Drive [85,11]"
"SAS Device #16 = Drive [85,12]"

"SAS Statistics Details:"

"Devices (Attached To)","SASDev","PHYID ","CURSP ","MAXSP","IDWC","RDEC ","LDWSC","RPC"

"","","","","","","","",""
"Drive Channel 0, Controller/Drive Enclosure","","","","","","","",""
"","","","","","","","",""
"Controller in slot A - (Expander)","1","0","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","1","1","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","2","2","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","2","3","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (IOC)","3","16","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (IOC)","3","17","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","5","0","3 Gbps","3 Gbps","26","0","0","0"
"Controller in slot A - (Expander)","6","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","7","0","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","8","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","9","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","10","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","11","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","12","0","3 Gbps","3 Gbps","32","0","0","0"
"Controller in slot A - (Expander)","13","0","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (Expander)","14","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot A - (Expander)","15","0","3 Gbps","3 Gbps","17","0","0","0"
"Controller in slot A - (Expander)","16","0","3 Gbps","3 Gbps","8","0","0","0"
"Controller in slot B - (IOC)","3","18","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (IOC)","3","19","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,1]","3","0","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,2]","3","1","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,3]","3","2","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,4]","3","3","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,5]","3","4","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,6]","3","5","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,7]","3","6","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,8]","3","7","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,9]","3","8","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,10]","3","9","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,11]","3","10","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,12]","3","11","3 Gbps","3 Gbps","0","0","0","0"
"","","","","","","","",""
"Drive Channel 1, Controller/Drive Enclosure","","","","","","","",""
"","","","","","","","",""
"Controller in slot A - (IOC)","4","16","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot A - (IOC)","4","17","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (Expander)","1","2","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (Expander)","1","3","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (Expander)","2","0","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (Expander)","2","1","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (IOC)","4","18","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (IOC)","4","19","3 Gbps","3 Gbps","0","0","0","0"
"Controller in slot B - (Expander)","5","1","3 Gbps","3 Gbps","11","0","0","0"
"Controller in slot B - (Expander)","6","1","3 Gbps","3 Gbps","12","0","0","0"
"Controller in slot B - (Expander)","7","1","3 Gbps","3 Gbps","2","0","0","0"
"Controller in slot B - (Expander)","8","1","3 Gbps","3 Gbps","11","0","0","0"
"Controller in slot B - (Expander)","9","1","3 Gbps","3 Gbps","11","0","0","0"
"Controller in slot B - (Expander)","10","1","3 Gbps","3 Gbps","16","0","0","0"
"Controller in slot B - (Expander)","11","1","3 Gbps","3 Gbps","18","0","0","0"
"Controller in slot B - (Expander)","12","1","3 Gbps","3 Gbps","11","0","0","0"
"Controller in slot B - (Expander)","13","1","3 Gbps","3 Gbps","2","0","0","0"
"Controller in slot B - (Expander)","14","1","3 Gbps","3 Gbps","11","0","0","0"
"Controller in slot B - (Expander)","15","1","3 Gbps","3 Gbps","13","0","0","0"
"Controller in slot B - (Expander)","16","1","3 Gbps","3 Gbps","13","0","0","0"
"Drive [85,1]","4","0","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,2]","4","1","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,3]","4","2","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,4]","4","3","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,5]","4","4","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,6]","4","5","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,7]","4","6","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,8]","4","7","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,9]","4","8","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,10]","4","9","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,11]","4","10","3 Gbps","3 Gbps","0","0","0","0"
"Drive [85,12]","4","11","3 Gbps","3 Gbps","0","0","0","0"


"Legend:"

"PHYID =A number in the range 0-127 that is the PHY identifier"
"CURSP =The current negotiated physical link rate for the PHY "
"MAXSP=The hardware maximum physical link rate for the PHY"
"IDWC=Number of invalid dwords that have been received outside of PHY reset sequences"
"RDEC =Number of dwords containing running disparity errors that have been received"
"LDWSC=Times the PHY has restarted the link reset seq. because it lost dword synchronization"
"RPC=Times the PHY did not receive dword synchronization during the final SAS speed negotiation window"
"DATA INVALID=Statistics could not be retrieved or device is a SATA drive"

recoveryProfile.csv said:
Recovery Profile data from C:\Program Files\IBM_DS4000\client\data\recovery\RECOV.600A0B8000494E800000000048E5EB70.csv

TimeStamp,1,"Apr 2, 2009 4:06:06 PM",1238706366375
Header,600A0B8000494E800000000048E5EB70,BC1SAN1,1
CtlFirmware,SX81212715 ,07.35.41.00
CtlFirmware,SX81212705 ,07.35.41.00
VolumeGroup,600A0B8000494E8A000003C049372552,Array1-Primary,5,false,5000CCA0055620F40000000000000000,5000CCA005594B440000000000000000,5000C5000A4253FF0000000000000000,5000CCA00559F5000000000000000000,5000CCA005577CB80000000000000000
Volume,600A0B8000494E8A0000045F4977525F,VMDatastoreNetware51,600A0B8000494E8A000003C049372552,512,536870912000,0,131072,1,1
Volume,600A0B8000494E800000030849775291,VMDatastore1,600A0B8000494E8A000003C049372552,512,660980367360,262144000,131072,1,1
Drive,5000CCA00559F5000000000000000000,"85,4", J8WLGNSC,600A0B8000494E8A000003C049372552,1,false,""
Drive,5000C5000A4253FF0000000000000000,"85,3",3LM3YGLH00009837SPKH,600A0B8000494E8A000003C049372552,1,false,""
Drive,5000CCA005577CB80000000000000000,"85,5", J8WK3KDC,600A0B8000494E8A000003C049372552,1,false,""
Drive,5000CCA0055620F40000000000000000,"85,1", J8WJBDEC,600A0B8000494E8A000003C049372552,1,false,""
Drive,5000CCA005594B440000000000000000,"85,2", J8WL3BAC,600A0B8000494E8A000003C049372552,1,false,""
VolumeGroup,600A0B8000494E8A000004514970AA42,Array2-Secondary,5,false,5000CCA00D6EC9580000000000000000,5000CCA0055864600000000000000000,5000CCA0055A4B700000000000000000,5000C50008F198B30000000000000000,5000CCA0053DEA840000000000000000
Volume,600A0B8000494E8A0000046C4986E4E6,VMWVC1BackupsVolume,600A0B8000494E8A000004514970AA42,512,536870912000,262144000,131072,1,1
Volume,600A0B8000494E80000002FB4970AB45,VMDatastoreNetware51-Failover,600A0B8000494E8A000004514970AA42,512,536870912000,0,131072,1,1
Volume,600A0B8000494E8A0000046E4986E518,VMDatastore2,600A0B8000494E8A000004514970AA42,512,124109455360,524288000,131072,1,1
Drive,5000CCA0055864600000000000000000,"85,7", J8WKLZNC,600A0B8000494E8A000004514970AA42,1,false,""
Drive,5000CCA0055A4B700000000000000000,"85,8", J8WLNE6C,600A0B8000494E8A000004514970AA42,1,false,""
Drive,5000CCA0053DEA840000000000000000,"85,10", J8W31LNC,600A0B8000494E8A000004514970AA42,1,false,""
Drive,5000C50008F198B30000000000000000,"85,9",3LM3BX9J00009830ZW8F,600A0B8000494E8A000004514970AA42,1,false,""
Drive,5000CCA00D6EC9580000000000000000,"85,6", JHWYXTGC,600A0B8000494E8A000004514970AA42,1,false,""

TimeStamp,1,"Apr 13, 2009 5:30:41 PM",1239661841281
CtlFirmware,SX81212715 ,07.35.41.02

TimeStamp,1,"Apr 13, 2009 5:32:09 PM",1239661929203
CtlFirmware,SX81212705 ,07.35.41.02
 
Sorry again about the delay. Things are hell here.

So, the layout is slightly different. However, there's definitely something wrong there. You've had invalid dwords (IDWC) to 10 of 12 drives. That's highly abnormal, but the counts are low. Still, I think we want to rule out potential issues there.
For this, we need some more data. Iopoetve, are you aware of any known FC signalling problems with the DS3400? Is there somewhere in the logs to look for FC errors or get the SCSI sense from them? SpaceHonkey, what FC switch are you using with this? Are you restricting broadcast within zones?

Unfortunately, what I'm really looking for on the cache isn't in either of those... looks like I'm gonna need NVSRAMdata.txt - be sure to sanitize the IPs. Also, can you please post a diagram of your array-to-physical layout?

Thanks!
 
For the 10/12 drives, 2 are hotspares.

NVSRAMdata.txt - file is too big to post completely - what part(s) do you care about?
Here's a guess -

Code:
Slot A
  
  NVSRAM Region 3   Region Id = (234) Drive Fault Data

     0000: 1020 3040 5060 0000 0000 0000 0000 0000    ..0.P...........
     0010: 1121 3141 5161 0000 0000 0000 0000 0000    ..1AQa..........
     0020: 1222 3242 5262 0000 0000 0000 0000 0000    ..2BRb..........
     0030: 1323 3343 5363 0000 0000 0000 0000 0000    ..3CSc..........
     0040: 1424 3444 5464 0000 0000 0000 0000 0000    ..4DTd..........
     0050: 0000 0000 0000 0000 0000 0000 0000 0000    ................
     0060: 0000 0000 0000 0000 0000 0000 0000 0000    ................
     0070: 0000 0000 0000 0000 0000 0000 0000 0000    ................

Slot B

   NVSRAM Region 3   Region Id = (234) Drive Fault Data

     0000: 1020 3040 5060 0000 0000 0000 0000 0000    ..0.P...........
     0010: 1121 3141 5161 0000 0000 0000 0000 0000    ..1AQa..........
     0020: 1222 3242 5262 0000 0000 0000 0000 0000    ..2BRb..........
     0030: 1323 3343 5363 0000 0000 0000 0000 0000    ..3CSc..........
     0040: 1424 3444 5464 0000 0000 0000 0000 0000    ..4DTd..........
     0050: 0000 0000 0000 0000 0000 0000 0000 0000    ................
     0060: 0000 0000 0000 0000 0000 0000 0000 0000    ................
     0070: 0000 0000 0000 0000 0000 0000 0000 0000    ................

Brocade switches - 2 of them. Restricting broadcasts within zones - not that I know of. Forgive me here - I'm a noob to this, so that's still a foreign concept to me. After googling around, I get the point though. I looked in the Brocade documentation, and it seems to state that only one broadcast zone can exist - and I can tell you right now we don't have one.

Physical connectivity is as follows -
Code:
Brocade Switch A
Port 0 = DS3400 SlotA Channel1 
Port 1 = Blade1
Port 2 = Blade2
Port 15 = VCB
Port 16 = TAPE
Port 19 = DS3400 SlotB Channel2

Brocade Switch B
Port 0 = DS3400 SlotB Channel1
Port 1 = Blade1
Port 2 = Blade2
Port 15 = VCB
Port 19 = DS3400 SlotA Channel2

Zones - 
ESX = Blade1 + Blade2 + VCB
TAPE = VCB + TAPE

To clarify - the switches are not linked in any way. The tape zone only exists on Switch A.
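
For what it's worth, here's how I checked the zoning from the switch CLI - no zone named "broadcast" shows up in the active config on either switch:

Code:
# run on each Brocade switch (telnet/ssh as admin) to dump the zoning config
cfgactvshow    # effective (active) zoning configuration
zoneshow       # all defined zones and aliases - no "broadcast" zone defined here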
 
ESX blows. I can never make the thing work appropriately

I definitely wouldn't say that - but for me it has been quite a learning curve. I literally had to learn quite a bit about SANs, networking, and of course VI itself - and doing it all at once leaves lots of room for error. HOWEVER, what it provides (and does now) is quite phenomenal, vastly reducing the need for outages, eliminating hardware dependence and providing far more redundancy than any of the physical servers ever did (in our case).

The efforts are definitely worth it, but it is quite a beast to tame for the uninitiated.
 
ESX blows. I can never make the thing work appropriately

Blame the carpenter, not the tools.

Anyways. That's actually NOT the part I was looking for in the NVSRAMdata.txt file. Would you rather try emailing it to me? That'll probably be easier (and I'll be able to get to it quicker.) There's a LOT of data in there, and to be honest, I don't remember what the handful of lines I'm looking for actually look like. And they're buried in register dumps, so it gets ugly.

The sense data you see there just says the drive is OK, from what I recall - meaning those drives are showing no faults. If the other two are hotspares, then they definitely should be showing no failures. Before we get too much further, do you have an IBM support contract on these? We're starting to get toward the limit of what I can do, and I'm starting to wonder if there's a hardware fault in play here.
 
Yep, they're under contract - but only for hardware issues. They were basically unwilling to investigate performance issues since they consider it a software issue. They said, well write-caching is enabled, so that's about all we can do. I'm not sure that they even really investigated the support dump at all.

PM your addy and I'll boot it over to you...
 
Sorry again about the delay. Things are hell here.

So, the layout is slightly different. However, there's definitely something wrong there. You've had invalid dwords (IDWC) to 10 of 12 drives. That's highly abnormal, but the counts are low. Still, I think we want to rule out potential issues there.
For this, we need some more data. Iopoetve, are you aware of any known FC signalling problems with the DS3400? Is there somewhere in the logs to look for FC errors or get the SCSI sense from them? SpaceHonkey, what FC switch are you using with this? Are you restricting broadcast within zones?

Unfortunately, what I'm really looking for on the cache isn't in either of those... looks like I'm gonna need NVSRAMdata.txt - be sure to sanitize the IPs. Also, can you please post a diagram of your array-to-physical layout?

Thanks!

Negative for known issues.

SCSI Sense is pasted to the vmkernel log.

Turn on advanced settings -> Scsi -> scsiPrintCMDErrors for more logging :)
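
If you'd rather do it from the service console, something along these lines should work - I'm going from memory on the option path, so double-check it against the name shown in the VI client advanced settings:

Code:
# ESX 3.x service console - option path assumed to mirror the VI client name
esxcfg-advcfg -g /Scsi/PrintCmdErrors   # show the current value
esxcfg-advcfg -s 1 /Scsi/PrintCmdErrors # turn on verbose SCSI command error logging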
 
Data received, and I'm reviewing it now.
Performance numbers look off; I may have some tests I need you to do. I also need that layout diagram here, so I can verify against what I see.

WOAH. PROBLEM.
85, 1 Optimal 279.397 GB SAS 3 Gbps VPBA300C3ETS11 N A496
85, 2 Optimal 279.397 GB SAS 3 Gbps VPBA300C3ETS11 N A496
85, 3 Optimal 279.397 GB SAS 3 Gbps ST3300655SS BA29

How the heck do you have 10 of one drive and 2 of another? And those are DEFINITELY different drives between the two. Controller A is also reporting significantly more errors than B - 23 versus 0, and 23 timeouts as well. There is something going on there to get 23 timeouts.

Another issue: you're hitting VMDatastore1 - or rather, the B controller - extremely heavily, and doing almost nothing on the A controller.
VMDatastore1
Total Requests Serviced 3502687
Total Blocks Requested 154688464
Cache Read Check Hits 366662
VMDatastore2
Total Requests Serviced 131795
Total Blocks Requested 19587028
Cache Read Check Hits 38815
 
Layout diagram of what?

All drives are the same FRU, although the hardware apparently differs.

Also - performanceStatistics.csv - for all luns - "Total Cache Write Requests" = 0, is that normal?
 
Space - turn on the SCSI logging. Let's see what else we're seeing from the host.
 
Layout diagram of what?

All drives are the same FRU, although the hardware apparently differs.

Also - performanceStatistics.csv - for all luns - "Total Cache Write Requests" = 0, is that normal?

I think he wants a fabric map.
 
Scsi logging is turned on. Fiber map is above (Physical Connectivity).

Datastore1 is the only datastore with VMs on it now; Datastore2 is .iso files.

Last messages from vmkernel are below:

Code:
May  1 14:08:16 BC1BLADE1 vmkernel: 1:03:49:45.420 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:08:16 BC1BLADE1 last message repeated 5 times
May  1 14:08:16 BC1BLADE1 vmkernel: 1:03:49:46.035 cpu2:1076)VSCSI: 2803: Reset request on handle 8192 (0 outstanding commands)
May  1 14:08:16 BC1BLADE1 vmkernel: 1:03:49:46.035 cpu2:1061)VSCSI: 3019: Resetting handle 8192 [0/0]
May  1 14:08:16 BC1BLADE1 vmkernel: 1:03:49:46.035 cpu2:1061)VSCSI: 2871: Completing reset on handle 8192 (0 outstanding commands)
May  1 14:08:22 BC1BLADE1 vmkernel: 1:03:49:51.974 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:08:22 BC1BLADE1 vmkernel: 1:03:49:51.974 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:08:22 BC1BLADE1 vmkernel: 1:03:49:51.975 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:08:22 BC1BLADE1 last message repeated 3 times
May  1 14:08:22 BC1BLADE1 vmkernel: 1:03:49:51.976 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:08:22 BC1BLADE1 last message repeated 5 times
May  1 14:12:53 BC1BLADE1 vmkernel: 1:03:54:23.015 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:12:53 BC1BLADE1 vmkernel: 1:03:54:23.015 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:13:04 BC1BLADE1 vmkernel: 1:03:54:33.403 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:13:04 BC1BLADE1 last message repeated 5 times
May  1 14:13:04 BC1BLADE1 vmkernel: 1:03:54:34.030 cpu2:1076)VSCSI: 2803: Reset request on handle 8192 (0 outstanding commands)
May  1 14:13:04 BC1BLADE1 vmkernel: 1:03:54:34.030 cpu2:1061)VSCSI: 3019: Resetting handle 8192 [0/0]
May  1 14:13:04 BC1BLADE1 vmkernel: 1:03:54:34.031 cpu2:1061)VSCSI: 2871: Completing reset on handle 8192 (0 outstanding commands)
May  1 14:13:11 BC1BLADE1 vmkernel: 1:03:54:40.239 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:13:11 BC1BLADE1 last message repeated 3 times
May  1 14:13:11 BC1BLADE1 vmkernel: 1:03:54:40.240 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
May  1 14:13:11 BC1BLADE1 last message repeated 3 times
May  1 14:13:11 BC1BLADE1 vmkernel: 1:03:54:40.241 cpu2:1076)Net: 4263: unicastAddr 00:50:56:92:14:7b;
Waiting for data... (interrupt to abort)
 
From an Ubuntu VM (with vmtools installed) -

Code:
dd if=/dev/zero of=bigfile count=2000000 bs=512
2000000+0 records in
2000000+0 records out
1024000000 bytes (1.0 GB) copied, 10.4108 s, 98.4 MB/s

Code:
sudo hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads:  588 MB in  3.02 seconds = 194.67 MB/sec

sudo hdparm -T /dev/sda
/dev/sda:
 Timing cached reads:   12680 MB in  1.99 seconds = 6361.59 MB/sec
 
I think he wants a fabric map.

Actually, array map. Something like this:

Code:
DS3400
Array1 | Array1 | Array1 | Array1 | Array1 | Array1
Array2 | Array2 | Array2 | Array2 | Array2 | Array2
Array3 | Array3 | Array4 | Array4 | Hotspare | Hotspare

I need to see which physical disks map to which array, and it's a pain to do without the tools that IBM has for interpreting CASDs. The total cache write requests being 0 is normal, though I don't know why - neither the DS3200 nor the DS3400 seems to log that counter, or they record it differently.

hdparm is definitely the wrong test to be using here, too. The test is arguably far too long as well, given the disks involved.

Here's the change I want you to test, Honkey:
- Change cache block size from 4k to 16k

Then do these tests in a Linux or FreeBSD guest:
- time dd if=/dev/zero of=/some/file bs=64k count=10000 - this will test physical disk writes.
- time dd if=/some/file of=/dev/null bs=64k count=10000 - this is a cache read test
- time dd if=/dev/zero of=/some/file2 bs=128k count=5000 - this is another physical test
- time dd if=/some/file2 of=/dev/null bs=128k count=5000 - this is a cache read test
- time dd if=/dev/zero of=/some/file3 bs=256k count=2500 - this is another physical test
- time dd if=/some/file of=/dev/null bs=64k count=10000 - this is a physical read test
- time dd if=/some/file2 of=/dev/null bs=128k count=5000 - this is a physical read test
- time dd if=/some/file3 of=/dev/null bs=128k count=2500 - this is a physical read test

It's very important to follow the exact order, otherwise you'll have blocks still in cache, when we're trying to avoid that and get the correct physical numbers. (Basically, we're looking for 0% cache hit on these files.)
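
If it helps keep the order straight, here's the same sequence as a quick script - just point TESTDIR at a directory on the datastore under test (TESTDIR is a placeholder, same as /some/file above):

Code:
#!/bin/bash
# Same tests as listed above, in the same order: writes first, then the
# cache reads, then the physical reads. Do not reorder them.
TESTDIR=/some/testdir
time dd if=/dev/zero of=$TESTDIR/file1 bs=64k count=10000
time dd if=$TESTDIR/file1 of=/dev/null bs=64k count=10000
time dd if=/dev/zero of=$TESTDIR/file2 bs=128k count=5000
time dd if=$TESTDIR/file2 of=/dev/null bs=128k count=5000
time dd if=/dev/zero of=$TESTDIR/file3 bs=256k count=2500
time dd if=$TESTDIR/file1 of=/dev/null bs=64k count=10000
time dd if=$TESTDIR/file2 of=/dev/null bs=128k count=5000
time dd if=$TESTDIR/file3 of=/dev/null bs=128k count=2500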
 
Here's the disk map -
Code:
Disks
-------------------------
|  1  |  2  |  3  |  4  |
|  5  |  6  |  7  |  8  |
|  9  |  10 |  11 |  12 |
-------------------------

Array Number
-------------------------
|  1  |  1  |  1  |  1  |
|  1  |  2  |  2  |  2  |
|  2  |  2  |  HS |  HS |
-------------------------

Array 1 Luns:
0
5

Array 2 Luns:
1
4
2

Here are the test results - I thought the last line had a typo (bs=128 instead of bs=256), so I did it as noted, and then again with 256.
Code:
------------------------------------------------------------
time dd if=/dev/zero of=file1 bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 4.81284 s, 136 MB/s

real	0m5.318s
user	0m0.000s
sys	0m1.656s
------------------------------------------------------------
time dd if=file1 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.79301 s, 235 MB/s

real	0m2.812s
user	0m0.000s
sys	0m0.836s
------------------------------------------------------------
time dd if=/dev/zero of=file2 bs=128k count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 5.16573 s, 127 MB/s

real	0m5.248s
user	0m0.008s
sys	0m1.492s
------------------------------------------------------------
time dd if=file2 of=/dev/null bs=128k count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 2.29132 s, 286 MB/s

real	0m2.302s
user	0m0.000s
sys	0m0.844s
------------------------------------------------------------
time dd if=/dev/zero of=file3 bs=256k count=2500
2500+0 records in
2500+0 records out
655360000 bytes (655 MB) copied, 4.14731 s, 158 MB/s

real	0m4.575s
user	0m0.000s
sys	0m1.484s
------------------------------------------------------------
time dd if=file1 of=/dev/null bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.69258 s, 243 MB/s

real	0m2.785s
user	0m0.000s
sys	0m0.816s
------------------------------------------------------------
time dd if=file2 of=/dev/null bs=128k count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 2.47529 s, 265 MB/s

real	0m2.525s
user	0m0.000s
sys	0m0.752s
------------------------------------------------------------
time dd if=file3 of=/dev/null bs=128k count=2500
2500+0 records in
2500+0 records out
327680000 bytes (328 MB) copied, 1.28373 s, 255 MB/s

real	0m1.305s
user	0m0.000s
sys	0m0.388s
------------------------------------------------------------
time dd if=file3 of=/dev/null bs=256k count=2500
2500+0 records in
2500+0 records out
655360000 bytes (655 MB) copied, 2.45159 s, 267 MB/s

real	0m2.525s
user	0m0.000s
sys	0m0.804s
------------------------------------------------------------
 
Okay! Interpretation time!

Firstly, array layout is spot on. I forgot it was 3x4 not 3x6, so that's fine. You should be perfectly balanced on the ESMs.

First of all, your disk write appears to tap out at roughly 135MB/s. That's slightly low but not outside the realm of normal. What's interesting is that your physical read and your cache read are virtually indistinguishable from each other. I had to read that twice, but that's part of the pain here - your cache is tapping out at ~260MB/sec. Will upgrading to 1GB help? No, not really. The 128k versus 256k was actually a block size relation test that I goofed - there can be significant differences in performance going between various block sizes. (64k -> 256k on TSM disks, for example, is a 250MB/s+ difference.)

Here's where it gets funky - this may actually be VMware tapping out at these numbers, and not the actual array. Iopoetve, how can we look at the VMware side of the house? Is there a way in VMware itself to do the exact same testing?
 
Okay! Interpretation time!

Firstly, array layout is spot on. I forgot it was 3x4 not 3x6, so that's fine. You should be perfectly balanced on the ESMs.

First of all, your disk write appears to tap out at roughly 135MB/s. That's slightly low but not outside the realm of normal. What's interesting is that your physical read and your cache read are virtually indistinguishable from each other. I had to read that twice, but that's part of the pain here - your cache is tapping out at ~260MB/sec. Will upgrading to 1GB help? No, not really. The 128k versus 256k was actually a block size relation test that I goofed - there can be significant differences in performance going between various block sizes. (64k -> 256k on TSM disks, for example, is a 250MB/s+ difference.)

Here's where it gets funky - this may actually be VMware tapping out at these numbers, and not the actual array. Iopoetve, how can we look at the VMware side of the house? Is there a way in VMware itself to do the exact same testing?

What exactly do you want to see? Tell me what you want to look at, and I'll tell you how to get there :)

FWIW - service console performance may not be representative of actual system performance, since it's a VM.
 
What exactly do you want to see? Tell me what you want to look at, and I'll tell you how to get there :)

FWIW - service console performance may not be representative of actual system performance, since it's a VM.

Oh, UGH, this presents a problem. I want to eliminate the VM part of the equation, so we can be certain it's not something there causing the problems. I did not know that the console itself was a VM. :(

What I'm looking to do is get the "true raw" numbers for the disk. To do that, I need a direct attached host, preferably Linux or FreeBSD 7.1-RELEASE, doing the same tests. I really, REALLY want to eliminate ESX's caching and file system as a suspect here. (Again, block mismatches CRIPPLE performance - you have to do full stripe writes on a DS3400 to get good numbers.)
 
Yeah, the SC is a VM with extremely limited resources. Your best bet is actually a slimmed-down Linux VM with full resources, or a direct-attach host.

Our block boundary will bridge the DS almost certainly - we do 1/2/4/8mb.
 
I think that you have both hit on the root cause. For the VCB backups volume (which is in use by a physical Win2K3 server), I changed the segment size to 64K and reformatted the volume with 64K blocks. I used a utility to make a big (random data) 2GB file, and the transfer rate from the C: drive to the SAN LUN was 485MB/s. With a real world copy of 25 files @ 2.34 GB, speed was 148MB/s. What IOMeter test can I run to get a real test that isn't dependent on the C: drive? If I can do that, then I'll make another Win2K3 VM and do the same test - so we can compare physical to VM.

I think block size and segment size are going to be the main issue here. The default segment size of 128K apparently isn't ideal in most cases. To further this discussion, what sizes do you typically use for your hosts? Of course I know it needs to be based on the application - but how 'bout a few examples. Let's say an Exchange server, MS file server, SQL server.

Well, after googling around - clearly I don't know how to optimize the segment size for ESX. Lopoetve - is there a formula you recommend or a document I should refer to?
 
Dingdingding. We've got a weener!

Segment size for your VM guests is irrelevant. (Oh glorious day.) Added bonus: Windows is actually a 4k filesystem. Which is why you can't use Windows, and why I should have specified you can't use Windows. It will not give accurate or appropriate results. Sigh. The random stuff probably got into cache, too. There's absolutely no sane or safe way for you to get any valid numbers out of the Windows host. It has to be a host which can do a 16k file system block at the minimum, and preferably 64k (FreeBSD UFS2 does 64k).

The key thing here, Iopoetve, isn't avoiding the bridge. It's just ensuring we're hitting full stripes, so 1MB can be done in 2x512, 4x256, etc. Obviously, we want the fewest possible operations typically - but with the DS3400's IOPS, there's a ton of headroom here. The question isn't about the blocks on the VMDK read, though, but rather the underlying filesystem that the VMDKs sit atop. Can you find out what the segment or block size is on that? That's what we want to match to.
 
Try this - try aligning a data disk to the 64k boundary. I'll look up the details and post here.
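
In the meantime, here's the usual approach for a Linux guest, as a sketch - create the partition starting at sector 128 so it lands on the 64k boundary (device name is just an example; for a Windows guest, diskpart can do the equivalent):

Code:
# Align a blank data disk's first partition to 64 KB:
# 128 sectors x 512 bytes = 65536 bytes. Run fdisk in sector mode.
fdisk -u /dev/sdb
#   n        - new partition
#   p, 1     - primary, partition 1
#   128      - first sector (the 64 KB boundary)
#   <Enter>  - accept the default last sector
#   w        - write the table and quit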
 
Dingdingding. We've got a weener!

Segment size for your VM guests is irrelevant. (Oh glorious day.) Added bonus: Windows is actually a 4k filesystem. Which is why you can't use Windows, and why I should have specified you can't use Windows. It will not give accurate or appropriate results. Sigh. The random stuff probably got into cache, too. There's absolutely no sane or safe way for you to get any valid numbers out of the Windows host. It has to be a host which can do a 16k file system block at the minimum, and preferably 64k (FreeBSD UFS2 does 64k).

The key thing here, Iopoetve, isn't avoiding the bridge. It's just ensuring we're hitting full stripes, so 1MB can be done in 2x512, 4x256, etc. Obviously, we want the fewest possible operations typically - but with the DS3400's IOPS, there's a ton of headroom here. The question isn't about the blocks on the VMDK read, though, but rather the underlying filesystem that the VMDKs sit atop. Can you find out what the segment or block size is on that? That's what we want to match to.

VMFS will be 1/2/4/8mb block, based on what spacehockey set it to. That's the actual underlying filesystem - by design, we did that big a block size. Segment size - I'll see what I can find.

We need to get him to align a guest filesystem with both the SAN and the VMFS filesystems. That'll eliminate split reads. Normally it's only a 1-2% increase, but it may be more here.
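
If you want to double-check what block size a given datastore was actually formatted with, vmkfstools will show it from the service console (the path below is just an example):

Code:
vmkfstools -P /vmfs/volumes/VMDatastore1
# the output reports capacity as a count of file blocks times the block size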
 
Ahha. Okay, that's a new thing by me. That's actually good and bad.

We'll get double digit percentage increases by going to a larger segment size on the VMFS formatted LUNs. Here's the problem - it's incremental, and takes days to complete. If VMFS is 1/2/4/8, we want our LUN segments at 256k minimum, 512k optimal. That would give a 1MB VMFS read a total of two read operations, both full stripe. Same for write. We're definitely seeing some severe penalties from split reads, no doubt, but the key problem is the array segment size. I'll have to go back to my notes and see what Honkey has it set at.
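
Quick arithmetic to make that concrete:

Code:
1MB VMFS block / 512k segment = 2 array operations, both full stripe
1MB VMFS block / 256k segment = 4 array operations, all full stripe
1MB VMFS block / 128k segment = 8 array operations (the current default)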
 
I set block size based on lun size. Luns < 256 GB get 1MB, luns > 256 and < 512 get 2MB, and so on. I set the block size so that a single file can use most of the lun (although I know to stay under 95% utilization).

Segment size for most Datastores is now set at 128K (the default).

Here's the current config -
Code:
NAME            Segment Size   Block Size    CAPACITY    RAID LEVEL  ARRAY             # Disks
VMDatastore1    128K           2MB           615.586 GB  5           Array1-Primary    5
VMDatastore2    128K           1MB           115.586 GB  5           Array2-Secondary  5
VMDatastore3    128K           2MB           500.0 GB    5           Array1-Primary    5
VMDatastore4    128K           2MB           500.0 GB    5           Array2-Secondary  5

Is there a formula for setting the segment size based on block size and # disks? I ask this because I found the following (3rd post) - but wanted to verify it with you guys first.

http://communities.vmware.com/thread/43702

FYI - I can recreate most of these datastores no problem. Again, not in production - yet.
 
dkvello is pretty close, but not quite, and completely unrealistic. (Especially for an SVC environment, but Iopoetve and I can talk SVC elsewhere, if you've got questions. Yes, I'm more than a bit of an SVC expert.) ;) You would never, ever put that many spindles into a RAID5. It just does not work. Also, let's clarify what we're going for here, versus common misconceptions.

In modern RAID, spindles don't matter as much. You want a certain minimum number, and more increases performance, but just saying X spindles is best is never right. The simplest way to express spindles is thusly:
More spindles are good, but they should come from increasing the number of arrays involved - by way of internal LVM on AIX, or SVC in all other cases - when using low- to mid-range storage or mechanically tiered storage. Spindle count should always be optimized only to the array, e.g. 4+2 and 8+2 RAID6, 4+1 and 8+1 RAID5, 4+1 or 2 RAID3, etcetera.
It SOUNDS complicated, but it's really not. First get the arrays spot on - which you have! Then increase the number of arrays involved, rather than the number of spindles in the array. We can't do that here, but single-array performance should be sufficient. (Bear in mind, nearly 100 very active VMs only moved ~20MB/s average.)

Second, segment size is the size of a block write or read that will cross all spindles. What this means is that with a 128k segment size, a 128k read will touch all disks, getting full utilization of all the spindles in the array. A 128k write will touch all disks and avoid read-modify-write. Full stripe writes trigger a parity update for all disks at once, which is faster because it does not require the array to read untouched blocks (which are typically not in cache) to recalculate parity. Any time you can do full-stripe writes, you want to. By the same token, MORE full stripe writes are not better than FEWER full stripe writes - any time you do a parity calculation, you're loading down the CPU. You want to do the minimum number of I/O operations possible at all times in most environments. (Excepting the IBM DS8k, where it really honestly does not matter.)
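
To put rough numbers on that penalty (generic RAID5 bookkeeping, not anything DS3400-specific):

Code:
Partial-stripe write: read old data + read old parity, then write new data
                      + write new parity  ->  4 disk I/Os per segment touched
Full-stripe write:    write all data segments + write parity  ->  N+1 disk I/Os,
                      no reads needed for the parity calculation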

So, with all of that said, let's do the following. I'm guessing VMDatastore3 and VMDatastore4 are both empty or pretty near it. Let's start there by converting VMDatastore3 to a 256k segment size, and we'll destroy and recreate VMDatastore4 at a 512k segment size. Also, is there any system you could set up with FreeBSD 7.2-RELEASE and direct attach? That'll make it extremely easy for us to test (and for me to interpret results.)
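
For the VMDatastore3 change, the segment size migration can be kicked off from the Storage Manager CLI while the LUN stays online. I'm going from memory on the exact syntax (and the controller IPs are placeholders), so check it against the SMcli reference for your firmware before you run it:

Code:
# syntax from memory - verify against the DS3000 SMcli guide first
SMcli <ctrlA-ip> <ctrlB-ip> -c "set logicalDrive [\"VMDatastore3\"] segmentSize=256;"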
Another key thing - when we do these tests, we will want ALL other VMs shut down or very idle. The dd method is unforgiving, to put it mildly. It will max out until something gives, which is part of what we're looking for. We want to figure out where it's giving out here. This is how I test DS4k's and SVCs too. Welcome to the fun world of performance tuning. ;)
 
Sorry guys, I've been out sick all week. I'll try to see if I can get FreeBSD going via a livecd, but that's as good as I can do on a physical host. I'm more familiar with linux, so if the tools are the same then that may be better.

To further your explanation, can you get into how a VM's block size would matter in relation to lun segment size? If I understand it correctly, it essentially doesn't matter - only the VMFS block size matters to the lun. Is this correct?
 
Sorry guys, I've been out sick all week. I'll try to see if I can get FreeBSD going via a livecd, but that's as good as I can do on a physical host. I'm more familiar with linux, so if the tools are the same then that may be better.

To further your explanation, can you get into how a VM's block size would matter in relation to lun segment size? If I understand it correctly, it essentially doesn't matter - only the VMFS block size matters to the lun. Is this correct?

No worries. I know how that can be. FreeBSD will have to be installed to disk, because LiveCD does not have the isp(4) driver working right now. (That's the Qlogic FC HBA driver.) Installation is relatively painless, since you really don't need to do much there. Disks are handled through sysinstall, but I can toss you instructions if they end up being necessary.

And that's essentially correct; a VM's block size is irrelevant until such point as it crosses the VMFS boundary. As the VMFS boundary is 1MB+, there is no file system you can run there that will cross the boundary. That means you have to key performance off the VMFS block size, because that is what size the reads from LUN will be performed at. You also have two additional layers of caching between the physical disk and application.
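
When you do get a FreeBSD box attached, the 64k UFS2 filesystem I mentioned is just a newfs option - roughly like this, with the device name as an example (yours will depend on how sysinstall labels the disk):

Code:
# UFS2 with 64 KB blocks / 8 KB fragments and soft updates on the test partition
newfs -U -b 65536 -f 8192 /dev/da0s1d
mkdir -p /mnt/test
mount /dev/da0s1d /mnt/test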
 
Well, unfortunately I can't install another OS on the one physical host I have to play with :(

I did do some playing around though. I created 4 new LUNs, with segment sizes of 64, 128, 256 and 512KB. Then I created a Win2K3 VM, and used IOMeter to test on each LUN. Each was partitioned with 32K clusters and aligned using diskpart. To my surprise, using 32K read and write tests, all LUNs performed virtually the same. NOW - if I repartitioned with an unaligned partition (again using 32K cluster size), performance blew - and went to about half.

Reads and writes were about 360MB/s and 260MB/s respectively.

So I guess I'm misunderstanding something here. The little documentation I can find seems to say to set the segment size based on the VM cluster size, not based on the VMFS block size. But this doesn't seem to matter?!

Now, I didn't do a super thorough test or record my results, but IOPS seemed to be similar, as did MB/s.

One thing that did make a difference was cache block size. Going from 4KB to 16KB made a difference of 100MB/s in large cluster size tests. But it hurt in small size tests, like 512B - IOPS went from 8K to 5K and throughput from 4MB/s to 2.xMB/s.
 
Well, that puts a serious snag in getting some adjustment numbers. All we're getting at this point is Windows numbers, which are useless for ESX. I need to get my hands on a lab with similar gear; trying to walk you through the tests I wanna do requires additional software and hardware you don't have. The only reason to set based on VM cluster size is if the VMFS cluster size is bogus.
What that means is that the VMFS cluster size is only an allocation, and read/write operations have a total disconnect from it - in other words, even if the VMFS cluster is 1MB, it never reads a full cluster - it only does partial block reads. Iopoetve, is this the case?

The good news is that at least we've found where the problem is. It's DEFINITELY in ESX. Understand, Windows is BAD for this work, and should never be used. It is NOT designed for the kind of load that needs to be forced through this, and NTFS has crippling limitations. If you're getting 360MB/s out of Windows, that should translate to around 400MB/s out of FreeBSD with UFS2. So, the array is definitely up to the task. You're leaning on only one controller in your testing, with one host. Both controllers should put you around 500-550MB/sec, which is where Iopoetve and I both think it should be.

The reason I wanted cache block at 16K is because to establish physical disk throughput, you should never, ever waste time on "small block" tests. They're worthless for establishing maximum physical throughput. Maximum physical throughput is found by doing long sequential reads of large blocks. We can't get to application tuning until we establish there isn't an array problem - which we now have done.

Now comes the really fun part where we start figuring out why it is that ESX is not getting adequate throughput. Is it too much random seek penalty? Are we seeing a two-tier mismatch? Is there a problem in ESX's caching? Yeah. Now it gets really, really ugly. Definitely going to need Iopoetve's help on this part, because this is getting into ESX internal behavior I'm not familiar with. First, we need an answer to the partial read question from Iopoetve, though - from there, I may have to revise a LOT of documents.
 