Mellanox IB Card suggestion for ZFS

bao__zhe

Hi all,
I have been searching for cheap IB cards on eBay for a while and found it very confusing with all the different types of cards.

My intended use is for a Solaris ZFS VM, so I need it to be compatible with either ESXi 5.1 or at least Solaris 11.1 (so I can pass through the whole card). Currently I'm looking at Mellanox IB cards and have dug through their website. However, there isn't any way to tell which card has which feature, especially when the model number is old.

Is there any document or help page detailing what each letter/number in the model number means? A lookup/comparison table would also help.

Or maybe there are other brands that can meet my compatibility requirements too?
 
Just get a ConnectX; ConnectX-3 is the newest version.
InfiniHost III cards came before ConnectX, were mainly PCI-X cards, and only the ones with onboard memory were supported on Solaris, back when they were supported at all.
 
I'll get a PCIe card. The problem is I can't tell from the model number alone (which is all most eBay listings give) what version it is and whether it has onboard memory.
 
The following links give you a lot of information on model numbers, the chips they use, and descriptions of the cards (features/speed).

Infinihost III Dual Port
Connect-X
Connect-X2

ConnectX gets you PCIe 2.0 support over the PCIe 1.0 in the InfiniHost III cards. The second generation of ConnectX-2 cards should work with SMB3 (Windows). I have a few cards of each generation, but am running them all in Linux. If you are going with InfiniHost cards and Solaris, you need the cards with memory (*-1T/*-2T cards). I have not had luck getting any of my cards to pass through in ESXi 5.0.

I got my ConnectX cards for ~$50 each, which is also about what I got the InfiniHost III cards for. I got the ConnectX-2 (single-port 40Gbps) cards for ~$100 each.
 
I use 10Gb Ethernet and Intel NICs. Almost every *nix OS I can throw at the cards works perfectly.
 

Sure, there's always 10GigE, but it's 10x the price of IB even on the used market. We're talking $175 per 10GbE port versus $15-$17.50 per 10Gb IB port. And in my testing IB clocks closer to theoretical max than 10GbE, and with lower latency. And the 40Gb IB cards, craziness: even a bank of SSDs in host-based RAID 0 has trouble feeding them data fast enough. Then you add SMB 3.0 into the mix and your file transfers utilize multiple ports in parallel - by far my favorite new feature of Server 2012/Windows 8.

@cactus knows what he's talking about -- Windows Server 2012 has an edge right now with these cheap IB cards, since Solaris needs the versions with onboard memory. It's fun to screw around with in any case - massive, massive speed for peanuts. Some good discussions on all this at the ServeTheHome forums.
 

Thanks, that's great information, exactly what I'm looking for. One thing I'm not clear on: if the IB card cannot be passed through to the Solaris VM, how does ESXi usually utilize it? What's the common setup?

Regarding 10GbE, I chose IB due to the much better price per port for the throughput. Although I'm currently on Server 2008 R2 and not ready to move to Server 2012, IB keeps that possibility open.
 

Check out the Woven Systems Fortinet LB4 on eBay. Typically $150-300, and it has four 10GbE CX4 ports. 10GbE is not that bad anymore, unless you want a lot of 10GbE hosts. When I go to it myself in 2013 or 2014, I will probably put my server and my main desktop on 10GbE. Everything else I have is just fine at 1Gbps. BTW, SMB 3.0 sounds great. I already knew that SMB 2.0 was a huge upgrade, but I was unaware that 2012 came with 3.0.


To get the card to the Solaris VM: IOMMU / VT-d / VMDirectPath, if you can dedicate it to one VM. You could always bridge it to an ESXi connection instead if you need it to provide connectivity to other VMs as well, though that'd obviously bring the link down if that one VM went down. Make sure to use a separate NIC for the management network at least. Oh, and if you get a multi-port card, you will NOT be able to split the ports using VT-d. Whole card or nothing.
 

In this case I guess I'll have ESXi use the IB card. If I plug it in, will ESXi recognize it and create a NIC out of it? Is there any special setup needed? (The other end of the link is a Windows computer where I plan to install the subnet manager.)
 

ESXi supports InfiniBand as far as I know, but I'm not sure about IP over InfiniBand (IPoIB)... I did find some references to ESXi 5 supporting it, but not from any sites that I really trust...
 
The ESXi HCL says ConnectX and newer are supported in 5.0/5.0U1/5.1. Last I read, SRP was not supported past ESXi 4.x.
 
Post driver install, each port of my ConnectX and ConnectX-2 cards is seen as 10GbE/IB (D|Q)DR. The drivers support only IPoIB and NFSoIB; nothing is said about anything RDMA-related (iSER, SRP, NFSoRDMA, SDP).
 
With IPoIB on a QDR card, is it 40Gbps or 10Gbps? I.e., is the card operating as a 40Gbps IB HCA with IP on top of it, or as a 10Gb Ethernet NIC directly?

Currently I'm not looking into the RDMA stuff, as the support on the various OSes is not mature and also differs between them. But I will look into it in the future, eventually...

BTW, I finally found the meaning of the model numbers; for my reference, here it is:

M - Mellanox Technologies
H - Adapter Type: H = InfiniBand Host Channel Adapter, N = Ethernet Network Interface Card
T - Media: Q = QSFP QDR, R = QSFP DDR
S - Adapter Architecture: H = ConnectX® or ConnectX-2
# - Number of ports: 1 = 1, 2 = 2
I - Host Interface: X = PCI-X, 4 = PCIe x4, 8 = PCIe Gen1 x8, 9 = PCIe Gen2 x8
G - Generation: <blank> = initial product generation, B = generation B, C = generation C
- - Separator
X - Memory Size: X = MemFree, 1 = 128MB, 2 = 256MB, 3 = 512MB
B - Bracket: S = Short, T = Tall, N = None
R - RoHS: <blank> = non-RoHS, C = RoHS R-5 w/ Exemption, R = RoHS R-6 Lead-Free
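
For example, decoding MHQH29-XTC (a model number I picked just to illustrate; this is my own reading of the table, so double-check against the Mellanox product brief before buying):

Code:
# MHQH29-XTC read against the table above (my interpretation -- verify before buying)
#   M = Mellanox Technologies
#   H = InfiniBand Host Channel Adapter
#   Q = QSFP QDR
#   H = ConnectX/ConnectX-2 architecture
#   2 = 2 ports
#   9 = PCIe Gen2 x8      (no generation letter = initial generation)
#   X = MemFree (no onboard memory)
#   T = tall bracket
#   C = RoHS R-5 w/ exemption
# once a card is in hand, lspci confirms which chip it actually is:
lspci | grep -i mellanox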
 
In Solaris 11.1:
The tavor driver that supports older InfiniBand HCAs (Host Channel Adapters) will be removed and not supported. No functionality for InfiniBand on these HCAs will be supported in the future, including firmware upgrade by using the fwflash utility.

Note: The tavor driver supports older InfiniBand HCAs such as:

Mellanox InfiniHost-based for PCIx (Peripheral Component Interconnect Extended)

Mellanox InfiniHost III Ex with onboard memory for PCIe (Peripheral Component Interconnect Express)
 
So I tried passing the ConnectX-2 through after installing the VIB in ESXi. The card is seen in the Linux VM (this is farther than I had gotten when I did not install the driver), but it errors out and won't show up in ibstat in the VM. It looks to be a problem with how the card deals with a soft reset. If you follow the link, Roland Dreier presents a solution to get it to work.

@bao_zhe Because of the signaling used (8b/10b) in InfiniBand (S|D|Q)DR, QDR IB cannot carry 40GbE (4 lanes x 10Gbps raw, and 8b/10b encoding leaves only ~32Gbps of actual data), so it presents as a 10GbE adapter. FDR is needed for 40GbE.
 
In that sense, do S/D/QDR make no difference if used solely for IPoIB? Can it actually transfer faster than 10GbE even though the NIC appears as 10GbE?

I remember seeing ESXi docs say that a vNIC can transfer internally as fast as it can regardless of its claimed speed, as it is only bound by host CPU/memory. Not sure if that applies here, but the fabric does have higher capability.

Or, can two or three 10Gbps NICs be created on top of the 40Gbps IB and teamed to increase the throughput?
 
Teaming almost never increases single-stream throughput. Meaning, if you're transferring 100TB over 1 x 10Gb link, your transfer speed will not increase over a 4 x 10Gb teamed interface. You can, however, run 4 transfers which will each run at max speed, presuming perfect load balancing.

With iSCSI and multipathing, though, you can load balance reads across links, but writes are still limited to a single path.
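
A quick way to see this for yourself is to compare a single iperf stream with a few in parallel over the teamed interface (a sketch; the address is just a placeholder for the other end of your team):

Code:
# single TCP stream: this is what one big file copy looks like to the team,
# so it rides one member link at best
iperf -c 192.168.10.1
# four parallel streams (-P 4): with decent load balancing these can spread
# across the member links and each approach link speed
iperf -c 192.168.10.1 -P 4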
 

I played around last week with teaming two onboard [S5000PSL] Intel PRO/1000 EB NICs and a Netgear GSM7248 layer 2 managed switch and got the same result.
In both link aggregation scenarios [dynamic and static], a single thread is served by only one of the NICs [both active]. When I start another file copy operation, that thread goes through the second NIC. My humble observations.
 
It has to work like that, as you can't guarantee latency. If you sent writes upstream across 2+ links and one link had some interference or something happened, then your writes would arrive out of order. You don't want that.
 

iSCSI multipathing is not the same as NIC teaming. It will give you more than a single link of bandwidth to a single client/operation.

@bao_zhe I am not sure if it is actually limited to 10GbE; I will try to get something set up to test.
 
Only for reads, in my experience. Writes still use a single path.

Makes sense, to maintain order on writes.


Having problems with my mlx4 driver based cards, but using point-to-point 20Gbps InfiniHost III 128MB cards, I got a max bandwidth of 11.6Gbps using iperf on IPoIB. Using an MTU of 65520, 256k buffers (-w and -l flags), connected mode, and 32 threads, Ubuntu Server LTS with kernel 3.5.0-19 at both ends. Once I get the mlx4 stuff figured out, I will post more.
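
For reference, the tuning above roughly amounts to the following on both ends (a sketch; ib0 and the server address are placeholders for your own IPoIB interface and IP):

Code:
# switch the IPoIB interface to connected mode and raise the MTU
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
# server side
iperf -s -w 256k -l 256k
# client side: 32 threads, 256k window (-w) and buffer length (-l)
iperf -c 192.168.20.1 -P 32 -w 256k -l 256k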
 
Please do (post updates). I haven't had time yet to play with IB much. I have Mellanox queued up to send me switches and NICs... just no time to do anything with them. Really curious to see how much overhead IPoIB causes versus using just IB and SRP.
 
My only test so far was between two MHEA28-XTC (10Gbps) cards running on Linux machines (connected mode, no tweaking of packet sizes). A single iperf instance was only able to transfer 3.8Gbps.
 
I have some kind of hardware problem with my 20Gbps ConnectX card and my G34 platforms, so I haven't tested ConnectX to ConnectX. So far I have: 10Gbps => 7.85Gbps (ConnectX - InfiniHost III Ex MemFree), 20Gbps => 11.63Gbps (InfiniHost III Ex both ends), 40Gbps => 17.8Gbps (ConnectX-2 both ends), 10GbE => 9.91Gbps (Intel AT copper both ends). So the ipoib module is not limited to a strict 10GbE. I am doing a full write-up for STH and will have more details there.
 
Wow, those are very impressive tests, and they make me feel better about spending a little bit more on 40Gbps IB over 20/10Gbps IB.

Are Solaris and Windows able to use connected mode? It seems ESXi can't, so I'm planning on passing the IB card through to Solaris if both ends support it.

And thanks for the knowledge on NIC teaming and iSCSI multipathing!
 
I missed that connected mode is not supported in ESXi when I skimmed the Mellanox ESXi manual. I will run the tests again using datagram mode to compare.
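
(Switching back is just the reverse of the connected-mode tweak; note that in datagram mode the IPoIB MTU is capped much lower, around 2044 bytes. ib0 is again a placeholder interface name.)

Code:
# on both ends: back to datagram mode, with the MTU dropped to match
echo datagram > /sys/class/net/ib0/mode
ip link set ib0 mtu 2044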
 
I have already changed my setup, so this is not 100% comparable, but from my daily computer (E3-1245 V2, kernel 3.2.0-23) to Host1 (G34 6128, kernel 3.5.0-19), datagram mode gets a max of 13.9Gbps ConnectX-2 to ConnectX-2. So slower, but still better than 10GbE.
 

What commands are you using for the throughput tests?

I am currently testing ConnectX-2 using stock combinations of CentOS 6.3 and OI 151a5.

I believe that RDMA is the key to building an IB storage server that will rival local disks, as you need RDMA to achieve the full bandwidth potential of your IB adapter *and* you need it to get the latency down... I believe RDMA latency can be as low as 1/10th of IPoIB latency.

Here's where I'm at:

IB Client: CentOS 6.3
ConnectX-2 QDR (40Gbps)

IB Server: CentOS 6.3 (open-iscsi target)
ConnectX-2 QDR (40Gbps)
8x Samsung 840 Pro SSDs served up to the client using iSER (iSCSI over RDMA)

Hosts connected by Mellanox QDR switch.

IIRC we were only able to achieve about 12.8Gbps (1600MB/s) doing sequential read tests on the client using filebench. We could clearly see the iSCSI target was the bottleneck, as its CPU was pegged and swamped with kernel time.

As I understand it, the open-iscsi target is a *user-land* iSER implementation... I presume that's creating a major bottleneck, so we then switched to OI 151 as our iSCSI server... keeping CentOS as the client.

In that situation, we were able to get about 16Gbps (2000MB/s) doing the same sequential read tests on the client. The OI 151 server seemed happy as a clam: very low CPU utilization and kernel time. The CentOS client, though, was completely pegged... presumably because the iSER initiator is also user-land.

I presume using an OI 151 client would improve things drastically, but we haven't gone down that road, as our application requires a Linux client, so OI on the client isn't an option.

At this point we're trying to identify other iSER implementations for Linux that are perhaps a bit more efficient than open-iscsi... not sure if the Linux 3.0 kernel would help things at all.

I'll post our iSER config commands so those who are able to test can do so.
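
As a rough sketch in the meantime (the target IQN and portal IP are placeholders, and this is from memory, so treat it as a starting point rather than our exact config), the client side with open-iscsi boils down to switching the node record to the iSER transport before logging in:

Code:
# discover targets on the storage server's IPoIB address (placeholder IP)
iscsiadm -m discovery -t sendtargets -p 192.168.30.1
# switch this node record from TCP to the iSER transport (placeholder IQN)
iscsiadm -m node -T iqn.2012-12.example:ssd-pool -p 192.168.30.1 \
         -o update -n iface.transport_name -v iser
# log in; the LUNs then show up as ordinary SCSI disks
iscsiadm -m node -T iqn.2012-12.example:ssd-pool -p 192.168.30.1 --login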
 
Very interesting Intel article on PCI-e performance testing:

http://download.intel.com/design/intarch/papers/321071.pdf

I had already noticed Mellanox's recommendation of setting the PCIe Max Read Request Size to 4096 bytes (some motherboards default to a setting as low as 128 bytes).

You can see that with 128 bytes, PCIe overhead is going to run 16% (on systems with > 4GB of RAM) vs. less than 1% when using 4096 bytes.

What is unclear is that they also indicate you have to be careful about which devices are sharing the PCIe fabric, as if any device can't support the larger size, it all gets throttled down to the lowest common denominator.

On my motherboard (Supermicro X8DTT) there isn't even a setting to control MRRS... however, it looks like it's at 512 bytes:

Code:
# lspci -vv
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
	Subsystem: Super Micro Computer Inc Device 0048
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at fbb00000 (64-bit, non-prefetchable) [size=1M]
	Region 2: Memory at f8800000 (64-bit, prefetchable) [size=8M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
		Not readable
	Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
		Vector table: BAR=0 offset=0007c000
		PBA: BAR=0 offset=0007d000
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes

Although, I'm a bit confused how the Max Read Request can be *larger* than the Max Payload.

"MaxPayload is the MTU and MRU for PCIe packets. Each sub-tree of
devices connected to a single PCIe root port needs to have MaxPayload
set consistently. MaxReadReq is the maximum size of any DMA read
request. It is a per-device setting (or possibly per-function; I
forget). It can be much larger than MaxPayload since read completions
can be fragmented."

Note: I presume this same tunable could affect the performance of the HBA as well:

Code:
05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 01)
	Subsystem: LSI Logic / Symbios Logic Device 3000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin A routed to IRQ 30
	Region 0: I/O ports at e000 [size=256]
	Region 1: Memory at fbcb0000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at fbcc0000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at fbd00000 [disabled] [size=1M]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes

It's set the same.
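
Since there's no BIOS knob, it looks like setpci can poke the same register from Linux (a sketch I haven't tried on this board; the device address comes from the lspci output above, and the write value is a placeholder you'd recompute from your own read):

Code:
# MaxReadReq is bits 14:12 of the PCIe Device Control register
# (Express capability offset +0x8): 010b = 512 bytes, 101b = 4096 bytes
setpci -s 03:00.0 CAP_EXP+8.w            # read the current value first
# write it back with only bits 14:12 changed to 101b, e.g. a read of 0x2810
# would become 0x5810 (placeholder values -- use your own reading)
setpci -s 03:00.0 CAP_EXP+8.w=5810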
 
@packetboy have you tried SRP vs iSER?

That is exactly what we are about to do. We toyed around with SRP about nine months ago and got it working, but it seemed very unstable... however, at the time we didn't really understand that there are a bunch of iSCSI initiator and target subsystems out there... I'm not even sure what we were using.

See:
http://scst.sourceforge.net/comparison.html

So we're going to try SRP, as that seems to be the only way to get a kernel-based target that supports iSCSI/RDMA... unless you use Solaris as an iSER server. As the SCST target is the only kernel implementation of SRP, we'll obviously be trying that.

My question is this... which Linux *initiator* should one use with an SCST SRP target?

Have you gotten SRP working? If so, what client/server configuration?
 
I have used SCST before and it works great. I haven't used it with IB, but of the targets I've used, SCST was easily the best one.
 