10Gb help on ESXi 5.5

nicholasfarmer

I need some extra brains here...

Hardware: ESXi server with a ton of resources... CPU/memory running below 10%.
Intel x540-T2, connected via Cat7 cable to a Netgear ProSafe XS708E.
Running ESXi 5.5 build 1331820.
One VM on the ESXi server with a local SSD array (trust me, it's not an IOPS or GB/s issue from the disk...).
The VM is running Windows Server 2012 with the VMXNET3 NIC.
Hardware version 10 and VMware Tools are up to date.
Four physical desktops with SSDs as local drives.

Test 1) I can start a file copy from the VM to each local desktop, one at a time, and get full saturation on the physical desktop's NIC (99%, a 113 MB/s flat line). All four desktops can receive the full 1Gb stream. Each desktop has its own port on the Netgear switch.

Issue: When I start multiple flows (copying data to two PCs at once), the bandwidth goes insane. Network bandwidth will spike to 224 Mbps, then flop down to 5 Mbps. Very random, very poor throughput. It's actually faster to copy one at a time than to attempt multiple flows.

Test 2) I found an updated driver for ESXi 5.5. A VIB list showed that 5.5 shipped with driver version 3.7.x.x, and I updated it to ixgbe 3.15.1.8.
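For anyone who wants to do the same thing, the update is basically just an esxcli VIB install from the ESXi shell. Roughly (the datastore path, VIB filename, and vmnic4 below are placeholders for your own setup):

esxcli software vib list | grep ixgbe    # show the ixgbe driver VIB currently installed
esxcli network nic get -n vmnic4    # show the driver version a given uplink is actually using (vmnic4 is an example)
esxcli software vib install -v /vmfs/volumes/datastore1/net-ixgbe-driver.vib    # install the downloaded async driver (full path to your VIB)
# then reboot the host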

New issue: I can still hit saturation when doing a single flow to a single physical PC, but now when I attempt multiple flows (one flow per physical PC), it perfectly balances the flows: 113 MB/s will drop down to two flows at 56 MB/s to two PCs, and four file transfers will drop down to 28 MB/s each to four PCs. So the random up-and-down speed went away, but now the NIC/driver is balancing the flows against a 1Gb limit...
No QoS, and I'm using standard vSwitches, so no vDS control issues. I want to keep adding more info to help narrow it down, but if this gets too long you will all ignore it!

Please help! I hope someone has experience with this NIC and can help with a quick config or command to get me up to full bandwidth!


..

..

Update 1: I found driver/VIB versions 3.18.7 and 3.19.1. I upgraded to and tested both drivers with the same "balance" issue, but now I can get 38.8 MB/s to each of the four PCs...
MysticRyuujin: I would LOVE to do a 10Gb-to-10Gb test, but I do not have another 10Gb adapter hanging off of this switch. I do have the x540-T2 (two-port) adapter, so maybe I can wire something up to go out one port and back in the other for a 10Gb-to-10Gb test. I checked the switch firmware and it's running the latest. As you said, it's new, so the firmware options are 1.00.06, 1.00.08, and 1.00.10, and I'm running .10. I've also reviewed the release notes, etc., and do not see anything about this issue.


Update 2: I created another VM on the host and installed Windows 7 (just a quick, different OS to test with). It has the same characteristics as the Windows 2012 server.
The VM is connected to the same vSwitch, etc.
This helps me know it's not Windows Server jacking with me.
On to the iPerf testing!


Update 3: iPerf!! (This will show you the balanced flows even on a single PC-to-PC connection.)

[ 4] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55956
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 628 MBytes 526 Mbits/sec
[ 4] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55957
[ 5] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55958
[ 4] 0.0-10.0 sec 535 MBytes 449 Mbits/sec
[ 5] 0.0-10.0 sec 532 MBytes 446 Mbits/sec
[SUM] 0.0-10.0 sec 1.04 GBytes 895 Mbits/sec
[ 4] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55959
[ 5] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55960
[ 6] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55961
[ 4] 0.0-10.0 sec 370 MBytes 309 Mbits/sec
[ 5] 0.0-10.0 sec 384 MBytes 322 Mbits/sec
[ 6] 0.0-10.0 sec 330 MBytes 277 Mbits/sec
[SUM] 0.0-10.0 sec 1.06 GBytes 905 Mbits/sec
[ 4] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55965
[ 5] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55964
[ 6] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55962
[ 7] local 1.1.1.2 port 5001 connected with 1.1.1.3 port 55963
[ 4] 0.0-10.0 sec 261 MBytes 218 Mbits/sec
[ 5] 0.0-10.1 sec 256 MBytes 213 Mbits/sec
[ 6] 0.0-10.0 sec 310 MBytes 259 Mbits/sec
[ 7] 0.0-10.0 sec 279 MBytes 233 Mbits/sec
[SUM] 0.0-10.1 sec 1.08 GBytes 922 Mbits/sec

I found with iPerf that if I force each physical PC to run three flows to the server, I can reach 300+ Mbps.
So the bandwidth is there. I just need to find out why the hypervisor or the x540 NIC is chopping up my bits...
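For reference, the test itself is plain iPerf 2; the commands were roughly these (1.1.1.2 is the VM/server from the output above, and the port and duration are arbitrary):

iperf -s -p 5001    # on the VM: listen on port 5001
iperf -c 1.1.1.2 -p 5001 -P 3 -t 10    # on each physical PC: three parallel streams (-P 3) to the VM for 10 seconds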

Update 5ish....
I think the issue is with the x540-T2 card. To check the Netgear switch, I connected physical PCs to ports 1-6. I then moved 50GB files back and forth between the PCs, all concurrently: ports 1>2, 3>4, and 5>6, and I could get full 1Gbps throughput on each transfer. When I move files to/from the PCs through the x540 card, it's limited to 2.5Gbps.

Test 3?) I placed the x540-T2 into a PC with Windows 7, installed the updated driver, installed some SSDs, and attempted to move data between that PC and the other PCs, using multiple SSDs as the source so I do not hit a storage limitation.
The card still maxed out at 2.5Gbps.

--
--
--
--
Update 8Jan2014:
I have support tickets open with VMware and Intel at the moment to help identify two things:
1 - (VMware) Why are the flows being balanced against a 1Gbps cap? (1 flow = 1 x 113 MB/s | 2 flows = 2 x 56 MB/s | 3 flows = 3 x 38 MB/s | 4 flows = 4 x 28 MB/s) Perfect balance!
2 - (Intel) Why can I not reach above 2.5-3Gbps on the adapter? (Tested both under ESXi and in a full bare-metal Windows build, using RAM drives, etc.)

The second one sounds like the adapter is only using part of the PCIe lanes in the x8 slot, OR it negotiated down to PCIe v1... but that is a guess.
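Once I'm booted into Linux for the iPerf runs, I can also sanity-check that guess by comparing the slot's capability with what the card actually negotiated. Something like this (03:00.0 is just an example bus address; use whatever lspci reports for the x540):

lspci | grep -i ethernet    # find the adapter's PCI address
sudo lspci -vvv -s 03:00.0 | grep -E 'LnkCap|LnkSta'    # LnkCap = what the slot/card can do, LnkSta = what was actually negotiated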

Running Linux-based iPerf testing now... results to follow!


--
--
--
Update 9 Jan 2014
I ran a bunch of Linux-based iPerf tests. Using my VM as the server and all six physical desktops as clients, I can "receive" over 6Gbps concurrently on the VM.
But when I made all the desktops iPerf servers on different port numbers and turned the VM into the client, I found I could not push more than 1.5Gbps of "transmit" out of the VM. I ran each client connection as a background process. The first client process hit 1Gbps instantly; the second process made the total network bandwidth spike to 1.2Gbps, then it started crashing all over the place. Adding more client instances, changing the window size, the number of parallel processes, etc. did nothing.
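The background processes were nothing fancy; the client side was basically a loop along these lines (the IPs and starting port are examples, one desktop per entry, with each desktop already running its own iperf -s on its port):

port=5001
for ip in 1.1.1.10 1.1.1.11 1.1.1.12 1.1.1.13 1.1.1.14 1.1.1.15; do
  iperf -c $ip -p $port -t 60 &    # one background client per desktop
  port=$((port + 1))
done
wait    # let all the transfers finish before reading the results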

Back to VMware to research the performance issue, now that I can reproduce it in multiple different guest OSes.


.
.
.
Update 16Jan2014
I was able to get my hands on another 10Gb network card and run more tests.
iPerf between the VM and the PC can run at full 10Gb line rate in both directions.
So this is something in the hypervisor, or in the Intel adapter and its driver.

Current issue:
VM-to-bare-metal can run at line rate if the destination has a 10Gb adapter.
If the destination has a 1Gb adapter, bandwidth for the entire ESXi host is limited to 1Gbps (transmit out of the ESXi host).
Receive traffic has always worked at line rate.


UPDATE 24Jan2014:
I think I've found the issue. It appears to be an Intel driver issue.
- If I install ESXi 5.5 from the default ISO and do not update the driver for my Intel x540-T2 adapter, everything works fine. If I update the VIB/driver to anything listed as 5.5-supported for the x540-T2 on the VMware site, it breaks.

Intel and VMware are still working on the issue.
.



Thank you for your help and ideas!
 
Do you have another 10Gb device to test throughput with? Did you check the switch for the latest firmware? That's a sweet switch but it's also brand new lol

Another thing you could try is downloading iPerf, starting 4 server instances on different ports on the server, and then running 4 client iPerf sessions at the same time (via a script or something).
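Something like this, for example (1.1.1.2 stands in for the server's address, and the ports and duration are arbitrary):

# on the 10Gb server/VM: four listeners, one per port
iperf -s -p 5001 &
iperf -s -p 5002 &
iperf -s -p 5003 &
iperf -s -p 5004 &

# on each of the four clients, each one pointed at its own port
iperf -c 1.1.1.2 -p 5001 -t 120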

It would be very helpful though to do a 10Gb to 10Gb test.
 
I'd still like you to do an iPerf test between the server and the 4 devices and see if anything pops.

iPerf will give us a purer picture of raw TCP/IP throughput.

You can also look at NIC offload settings. Try disabling or enabling different features on the NIC.
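As one example of the kind of knob I mean, on the ESXi host side you could try toggling LRO for the vmxnet3 adapters (assuming the usual vmxnet3 LRO advanced settings are present on your build; set them back to 1 if it makes no difference):

esxcli system settings advanced list -o /Net/Vmxnet3SwLRO    # show the current value
esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0    # disable software LRO for vmxnet3 vNICs
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0    # disable hardware LRO for vmxnet3 vNICs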
 
We've already established that one-to-one you're getting line rate. What I'd like is for you to start the server instance 4 times on 4 different ports and then connect the 4 clients at roughly the same time, one to each port. Use a long interval, 120 seconds or more, then tell us the results/speeds reported by the 4 clients.
 
You could also test client 1 <-> client 2 and client 3 <-> client 4 at the same time to see if there's something going on with the switch processing/switching so many flows/packets at once. That would eliminate the server/10Gb side as the issue... try to saturate the switch with data.
 
I should have clarified my last comment. I used iPerf to start three sessions on each client. With three physical PCs each running three iPerf sessions, I can reach 3Gbps on the server/VM.
So I can reach saturation on the 1Gbps switch ports concurrently. The issue at the moment is why I need so many sessions or flows to achieve that bandwidth. It feels like I'm hitting a "per flow" rate limit policy or some type of packet shaping. Everything is directly connected to the switch, so it must be a function of the hypervisor or the NIC.
PC-to-PC runs at line rate as well. It could still be the switch, but the fact that the behavior changed when I updated the ESXi driver tells me it must be the hypervisor or the NIC.

Thank you for your help!
 
I understand now. The only suggestion I have left is what I said before: look at the NIC capabilities and offload settings and see if changing them shows any noticeable performance change. If you Google:
Windows Server 2012 10Gb performance tweaks
you can find the performance tuning guidelines for Server 2012, which have a couple of sections on NIC/network performance tuning.

Another test you could perform is to boot a Linux VM on that same server and see if it suffers from the same issue.
 
I have no more useful suggestions, haha. I'm sure VMware / Intel techs will be much better suited to help you than anyone on this forum :)
 
I wish that were true! The last support email I received said, "I think the vmxnet3 adapter isn't negotiating at 10Gb in the VM OS." That was after I sent them screenshots of it linked at 10Gb inside both Linux and Windows guests. And this isn't the first level of support!
Thanks for your ideas so far. If you think of anything else while sitting on the john, let me know!

Nick.
 
As far as I know, it's completely irrelevant what the VM's OS thinks the line speed is, as everything is handled at lower levels.

For instance, if you have an ESXi server with 4x 1Gb connections in an LACP channel, your guest OS will still always show a 1Gb connection, not 4Gb. That's true for both VMware and Hyper-V.

*EDIT* Haha my Hyper-V guest OS says 10Gb connection and I only have a 2Gb LACP channel.
 
Bump: in case anyone is interested in my findings.
It looks like the Intel drivers are busted at the moment.

 
For me this was good information, as I'm planning on upgrading to 10Gbit in the next ~4-5 months. Please keep the updates coming about possible resolutions with new drivers from Intel/VMware.

Thanks!
 
Will do!
VMware support currently has no fix and is suggesting I stay with the default driver until something is "fixed." VMware support suggested that I take the issue to Intel directly if I want something fixed sooner.

The Netgear 10Gb copper switch is fast.
It runs super quiet. The 120mm fan running next to it makes more noise than the switch's internal fans. Granted, I keep my "server room" around 50 degrees.
I bought a bunch of Cat7 cables at Fry's (price matched with Amazon for a super low price) and I can get full line rate on each port.

The Intel x540-T2 adapters are fast but get a little warm if you plan on using a low-airflow desktop or ITX case. The Netgear switch does have a single fiber port that is shared with copper port 8, so you could go with a single-port fiber card if you need to run a long distance AND want to spend a ton of cash on GBICs.

If I get any updates from Intel support, I will add them to this thread.

Nick
 