NetApp Performance - NetApp FAS3240?

KapsZ28

One of our datacenters has two FAS3240 heads and three DS4243 shelves. Head 1 has a combination of tier 1 and tier 2 storage, while head 2 is just tier 2 storage. We also have 12 ESXi hosts running about 200 VMs; of the VMs currently running, I would say about 85 are on head 1 and another 85 are on head 2.

Each head only has three 1Gb NIC connections set up for production. Performance has always seemed a bit slow, so I tested a simple file copy. I logged into a server with a 1Gb connection to tier 1 storage and copied a 7 GB file to another server, also with a 1Gb connection to tier 1 storage. The highest speed it hit was 20 MB/s, but the average was lower, around 13 MB/s. The copy took about 8.5 minutes, which works out to just barely faster than 100 Mbps.
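
For what it's worth, here's that math as a quick Python sanity check (using my own numbers from above):

[code]
# Sanity check on the copy throughput, using the numbers from my test.
file_size_gb = 7      # size of the copied file
copy_minutes = 8.5    # how long the copy took

size_mb = file_size_gb * 1024        # MB
seconds = copy_minutes * 60

mb_per_sec = size_mb / seconds
mbps = mb_per_sec * 8                # megabits per second

print(f"Average: {mb_per_sec:.1f} MB/s ({mbps:.0f} Mbps)")
# -> about 14 MB/s, i.e. just barely faster than a 100 Mbps link
[/code]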

Isn't that kind of slow for tier 1 storage? Our tier 1 is 15K SAS drives.
 
I'd guess a lot of it depends on what else those 15K drives were doing at the same time, given all the other VMs you had running.
 
Too much info missing to judge where the problem is. It is slow for sure.
What is the CPU load?
What is the disk utilization?
How is the aggregate configured?
What is the aggregate capacity utilization?
How is the network configured on NetApp and ESX and how is traffic distributed?
Are the ESX settings optimized for NFS?
What guest OS? What virtual hardware?
What do the VMs do? Type of workload? Throughput and IOPS?
Are VMs properly aligned?
What is the latency?
etc. etc. etc.
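
If it helps, here is a rough sketch of pulling a few of these numbers off a 7-mode filer over SSH. The hostname is hypothetical and sysstat's columns vary by ONTAP version, so treat it as illustrative rather than a finished tool:

[code]
import subprocess

FILER = "netapp-head1"  # hypothetical hostname, adjust for your environment

def run_on_filer(cmd: str) -> str:
    """Run a command on the filer over SSH and return its output."""
    result = subprocess.run(["ssh", FILER, cmd],
                            capture_output=True, text=True, timeout=60)
    return result.stdout

# CPU, disk utilization, network throughput, sampled 5 times at
# 1-second intervals (7-mode syntax; columns vary by ONTAP version).
print(run_on_filer("sysstat -c 5 -x 1"))

# Aggregate layout (RAID groups, spindle counts).
print(run_on_filer("aggr status -v"))

# Aggregate capacity utilization.
print(run_on_filer("df -A"))
[/code]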
 
There are a lot of different places the choke point could be.

Not enough information to make a good guess, but a shot in the dark: your spindles may be doing too much work, especially if those VMs are desktops for a VDI implementation (they tend to thrash storage). If it's the spindles, you need more spindles, more cache, or some combination thereof (or less workload).

Could be the network. How are you using those three links? In a LAG? Bandwidth-wise, each 1 Gig link is capable of at most 125 MB per second. Have you checked their utilization? My guess is they won't be saturated, but it's worth a look.
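
If you want a quick host-side look at link utilization, something like this works (Python with psutil; the switch counters are the authoritative place to check, so this is just a rough sample):

[code]
import time
import psutil  # pip install psutil

INTERVAL = 5  # seconds between samples

before = psutil.net_io_counters(pernic=True)
time.sleep(INTERVAL)
after = psutil.net_io_counters(pernic=True)

for nic, stats in after.items():
    rx = (stats.bytes_recv - before[nic].bytes_recv) * 8 / INTERVAL / 1e6
    tx = (stats.bytes_sent - before[nic].bytes_sent) * 8 / INTERVAL / 1e6
    print(f"{nic}: rx {rx:.1f} Mbps, tx {tx:.1f} Mbps")
    # A single 1 Gig link tops out at 1000 Mbps (~125 MB/s) before overhead.
[/code]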
 
Hey there. I've got quite a history of NetApp (and EMC, Hitachi, Sun, HP) enterprise storage administration under my belt, and I have some quick questions to help narrow things down.

Let me start off by saying that you have a bottleneck, plain and simple; these questions are designed to help you find it. To put it simply, the FAS3240 filer and DS4243 SAS-connected shelves are blazingly fast and will more than meet the needs of your VM environment. The FAS3240 uses Nehalem-based CPUs; its predecessor, the 3140, used AMD Socket F dual-core CPUs slower than the average i3-based laptop, and I've had 3140s running over 500 guests per filer on crappy old DS14 mk2 shelves with 10,000 RPM drives. So your hardware can totally rock this.

Are you using iSCSI storage for your ESXi hosts, or are you using Fibre Channel?

If the answer is Fibre Channel, then the perceived slowness with VMs isn't because you have three 1Gb NIC connections for production as you say; let me know and we'll dig into FC next.

If you're using iSCSI, then we need to discuss a few other things.

NIC bonding. I'm assuming you're taking advantage of NIC bonding (and if not, why not?), but since you likely are bonding NICs: are you using LACP? If not LACP, was there a reason why? If you're using Cisco switches, did you set up your port channel correctly for LACP? If yes, are you sure? If you're still sure, are you absolutely sure? I say this because LACP is different from a standard channel group, and the settings need to be spot on for LACP to work properly.
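
If you want to script that sanity check on the Cisco side, here's a rough sketch using netmiko (switch details are hypothetical; on a healthy bundle the Protocol column shows LACP and the port-channel is flagged SU with member ports flagged P):

[code]
from netmiko import ConnectHandler  # pip install netmiko

# Hypothetical switch details -- adjust for your environment.
switch = {
    "device_type": "cisco_ios",
    "host": "core-sw1",
    "username": "admin",
    "password": "secret",
}

conn = ConnectHandler(**switch)

# Protocol column should read LACP, not '-' (static channel-group).
print(conn.send_command("show etherchannel summary"))

# Per-port LACP negotiation state with the NetApp side.
print(conn.send_command("show lacp neighbor"))

conn.disconnect()
[/code]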

Are the NetApp Ethernet links going to a core switch or something further down the distribution chain? If further down, verify you have proper uplinks back to the core switches to handle the kind of bandwidth needed for iSCSI.

Are the ESXi hosts using multiple bonded NICs themselves? Are those ports set up correctly? Are you sure??? Are the ESXi hosts cabled up to the same switch as the NetApp filers, or are they on different switches? If they're on different switches, then one thing you need to understand is that communication between the filer and the ESXi host is crossing the uplinks rather than staying intra-switch; this slows things down considerably, and without proper vision and implementation you can quickly fill up uplink connections using iSCSI in a VM environment of your size.

The answer is likely no, but you're not trying to use jumbo frames for iSCSI, are you? One more thing: for iSCSI, please tell me you're using a layer 2 VLAN and not layer 3, and for that matter, you are using a dedicated VLAN for your iSCSI traffic, right? If not, do so sooner rather than later.
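
On the jumbo frame question, a quick way to verify what actually passes end to end is a don't-fragment ping at full payload size. A throwaway wrapper (Linux iputils ping flags; the target hostname is hypothetical):

[code]
import subprocess

TARGET = "netapp-head1"  # hypothetical filer interface to test against

# ICMP payload = MTU minus 28 bytes (20-byte IP header + 8-byte ICMP header).
for mtu in (1500, 9000):
    payload = mtu - 28
    # -M do : set don't-fragment, so oversized packets fail instead of fragmenting
    # -s    : payload size; -c 3 : three probes
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(payload), "-c", "3", TARGET],
        capture_output=True, text=True)
    status = "passes" if result.returncode == 0 else "does NOT pass"
    print(f"MTU {mtu}: {status} end to end")
[/code]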

We also don't know what sort of servers you're using for your VM environment; tell me more and that may help clarify things.

As for the copy speed with that 7 GB file: was that going from VM to VM, or...? Was this Windows or Linux? What protocol did you use (FTP, SCP, rsync, super_dynamic_transfer_protocol_42, etc.)? Did you try going from a physical server to an NFS or CIFS share for a speed comparison?

One final question: since you have a NetApp filer, you should have the deduplication license (SIS). Are you using it with your VM environment? If you are, make sure your SIS schedule is NOT set to auto and has a well-defined schedule that runs the SIS scan outside of production hours. If you're NOT using SIS, is there a reason? Done correctly, you can end up in a place where common OS files live in cache 24x7, resulting in a surprisingly quick-responding VM environment.
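
For reference, checking where you stand on dedupe from the 7-mode CLI looks roughly like this (hostname and volume name are hypothetical):

[code]
import subprocess

FILER = "netapp-head1"        # hypothetical hostname
VOLUME = "/vol/vm_datastore"  # hypothetical volume

def run_on_filer(cmd: str) -> str:
    result = subprocess.run(["ssh", FILER, cmd],
                            capture_output=True, text=True, timeout=60)
    return result.stdout

# Is dedupe enabled on the volume, and when did it last run?
print(run_on_filer(f"sis status {VOLUME}"))

# Current schedule -- you want a fixed off-hours window, not auto.
print(run_on_filer(f"sis config {VOLUME}"))

# Space actually saved by dedupe.
print(run_on_filer(f"df -s {VOLUME}"))
[/code]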

Once we hash this out, we can go into aggregates, volumes, LUNs, and fun stuff like that.

Do me a favor and PM me to let me know you've responded to this; I don't normally browse this sub-forum.
 
Yes, 13 megabytes/second is very slow. In a sequential copy, you'd expect four or five times that from a single directly attached consumer SATA drive.

Problem is, your test doesn't do much to isolate the cause of the poor performance. You can't even tell whether you were slow reading or slow writing. The most likely issue is that the other load on the storage was competing for bandwidth and IOPS, but there are lots of other possible contributing factors. You'll need to do some diagnosis if you're interested in learning why your performance was so poor.
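
If you want to separate the two, time a pure read and a pure write independently. A rough sketch (paths are hypothetical; use a file larger than the servers' RAM so caching doesn't flatter the numbers):

[code]
import os
import shutil
import time

# Hypothetical paths -- point these at your shares and a fast local disk.
SRC = r"\\server1\tier1share\test.pst"
DST = r"\\server2\tier1share\test.pst"
LOCAL = r"C:\temp\test.pst"

def timed_copy(src: str, dst: str, label: str) -> None:
    start = time.monotonic()
    shutil.copyfile(src, dst)
    elapsed = time.monotonic() - start
    size_mb = os.path.getsize(dst) / 1024 ** 2
    print(f"{label}: {size_mb / elapsed:.1f} MB/s")

# Read test: share -> local disk. With a fast local disk, this mostly
# measures the read path from storage.
timed_copy(SRC, LOCAL, "read from share")

# Write test: local disk -> share. This mostly measures the write path.
timed_copy(LOCAL, DST, "write to share")
[/code]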
 
I am not overly familiar with NetApp, but below are some screenshots I took. It appears the three 1Gb NICs are trunked. Obviously we are not running FC.

As far as the VM environment goes, it is a mix of Linux and Windows servers. No VDI. A bunch are web servers; others are IaaS. I know they were using vOptimizer to align the VMs, but I am not sure when it was last done. I found an expired free trial on one of the servers.

My test was a .PST file copied from one Windows share to another.


[Screenshots attached: NetApp1.jpg, NetApp2.jpg, NetApp3.jpg]
 
A) Check what MtnDuey posted.
B) It doesn't look like you are using dedupe (you should, IMO; it's one of the top 3 features of a NetApp box).
C) Are your Windows shares virtual, or is the FAS handling them? (I assume virtual, as I don't see any CIFS traffic on your NetApp.) I personally prefer the NetApp to run my CIFS, but that's just opinion...

More info would help us help you a ton.
 
FWIW, I have a 3240 cluster and have no issues pulling well over 1 GB/sec over fiber, and I'm not particularly pushing it. They saturate 1Gb over CIFS and iSCSI without issue. So, you do have a bottleneck.
 
I would think that 3 Gbit (trunked) shared between 85 hosts is going to be your bottleneck. Even just SATA disks in a 3240 should be able to deliver enough throughput to saturate a 1 Gbit link without issue.
 
Also, to the best of my knowledge, VMware forces synchronous writes when using NFS storage, and it looks like all your traffic is NFS in the screenshots. A high number of hosts all forcing every write to be synchronous is probably not going to be fantastic.
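
To illustrate why that hurts, compare buffered vs. O_SYNC writes against the same NFS mount from a Linux client (mount path is hypothetical; a rough sketch, not a proper benchmark, since the client may still cache the buffered case heavily):

[code]
import os
import time

PATH = "/mnt/nfs_datastore/synctest.bin"  # hypothetical NFS mount
BLOCK = b"\0" * 4096
COUNT = 2000  # ~8 MB total

def write_test(extra_flags: int, label: str) -> None:
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags)
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.close(fd)
    elapsed = time.monotonic() - start
    print(f"{label}: {COUNT * len(BLOCK) / 1024 ** 2 / elapsed:.1f} MB/s")

write_test(0, "buffered")        # client can coalesce and ack locally
write_test(os.O_SYNC, "O_SYNC")  # every write waits for stable storage
os.remove(PATH)
[/code]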
 