ZFS Monster Stage 4: 40Gbps InfiniBand

packetboy

Limp Gawd
Joined
Aug 2, 2009
Messages
288
I had planned on doing this from the beginning: just ordered a QDR (40Gbps) InfiniBand adapter for my ZFS server, a new 4-blade Supermicro server with each blade outfitted with a Mellanox ConnectX-2 QDR interface, and a 16-port Mellanox QDR switch.

Anyone here played with IB on Solaris/OpenIndiana yet?

This should be interesting....my plan is to use this for a mini Apache Hadoop cluster.
 
Dear Lord, that is some serious throughput...I'll be following along on this thread...
 
Oh goodie...the Mellanox QDR ConnectX-2 InfiniBand adapters onboard the Supermicro X8DTT-IBQF blades *ARE* detected by Solaris 11 (11-11-11):


Code:
# cfgadm -al


Ap_Id                          Type         Receptacle   Occupant     Condition
hca:2590FFFF2FC81C             IB-HCA       connected    configured   ok
ib                             IB-Fabric    connected    configured   ok
ib::2590FFFF2FC81D,0,ipib      IB-PORT      connected    configured   ok
ib::iser,0                     IB-PSEUDO    connected    configured   ok

Note: I was nervous at first as I got a driver error when I booted the Live USB...I tried to tell Solaris to manually download the 'hermon' (InfiniBand) drivers...it tried to but failed, saying root was read-only...I'm confused by this, but perhaps I'm not fully understanding the limitations of the live USB.

Regardless, I installed to a hard drive, re-ran the device driver utility, and it automatically grabbed the hermon drivers from the Solaris repository.

I am still VERY far from actually moving data across this transport, but this is a huge step.
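
If anyone wants to double-check that the driver actually attached, something like this should do it on Solaris 11 (cfgadm and dladm are standard commands, but treat the exact output as illustrative for your own box):

Code:
# confirm the HCA and IB fabric attachment points are configured
cfgadm -al | grep -i ib

# Solaris 11 exposes IB ports as datalinks once hermon attaches;
# these should list the physical IB links and their port GUIDs
dladm show-phys | grep -i ib
dladm show-ib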
 
How's this?

Logical design:

[Image: hadoopi.png (logical design diagram)]


The SAS switch is my backup plan if disk access latency over IB is unacceptable, i.e. I'll create a parallel SAS network strictly for storage connectivity, and the IB will then be used only for intra-node communications (which are plentiful when running Hadoop).

[Images: supermicroibblade.jpg, bladeibmezzanineandconn.jpg, ibnetwork.jpg, ibswitchcloseup.jpg]
 
So- silly question- what do you use this Hadoop thing for?

Some big names on their user list.
 
You're a monster :eek:


This build is freaking epic. Looking forward to seeing some more pics, and even more importantly, some numbers!
 
So- silly question- what do you use this Hadoop thing for?

Hadoop lets you find stuff reasonably quickly...when "stuff" = many terabytes of highly unstructured data. In my case it's 50TB of raw packet captures (e.g. Wireshark/tcpdump).

That's about all I can say.
 
So when I saw ZFS and Infiniband in the first post..... I subscribed to this thread. And as already stated that is some serious bandwidth.....
 
@OP

The SAS switch is my backup plan if disk access latency over IB is unacceptable, i.e. I'll create a parallel SAS network strictly for storage connectivity, and the IB will then be used only for intra-node communications (which are plentiful when running Hadoop).

What will be running across the IB connection(s)? iSCSI?
 
I had planned on doing this from the beginning: just ordered a QDR (40Gbps) InfiniBand adapter for my ZFS server, a new 4-blade Supermicro server with each blade outfitted with a Mellanox ConnectX-2 QDR interface, and a 16-port Mellanox QDR switch.

Anyone here played with IB on Solaris/OpenIndiana yet?

This should be interesting....my plan is to use this for a mini Apache Hadoop cluster.

Somewhere out there, people have a large sum of bank candy they like to feed the technology monster.
 
IMO you need to make your own hadoop benchmark.

And never ask what parityboy does with this stuff :)
 
I know getting InfiniBand working under Solaris/OI is going to be more involved, so today I decided to install Ubuntu on two of the blades and see if I could get them to talk to each other. Using Ubuntu 11.10 makes it trivial, as outlined here:

http://davidhunt.ie/wp/?p=2291
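
The gist of that guide is just pulling in the IB userland and running a subnet manager somewhere on the fabric. Roughly, it boils down to the following (package names are the stock Ubuntu ones; treat this as a sketch rather than the exact steps from the link):

Code:
# on both blades: IB diagnostics, verbs utilities and the perftest tools
apt-get install infiniband-diags ibverbs-utils perftest

# one node (or a managed switch) has to run a subnet manager
apt-get install opensm
service opensm start

# load IPoIB and give the interface an address
modprobe ib_ipoib
ifconfig ib0 192.168.9.2 netmask 255.255.255.0 up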


Code:
root@ib4:~# /usr/sbin/iblinkinfo 
Switch 0x0002c90200443470 Infiniscale-IV Mellanox Technologies:
  
      3   16[  ]  ==( 4X 10.0 Gbps Active /   LinkUp)==>       2    1[  ] "MT25408 ConnectX Mellanox Technologies" (  Could be 5.0 Gbps)
      3   17[  ]  ==( 4X 10.0 Gbps Active /   LinkUp)==>       1    1[  ] "MT25408 ConnectX Mellanox Technologies" (  Could be 5.0 Gbps)



root@ib3:~# ibstat
CA 'mlx4_0'
	CA type: MT26428
	Number of ports: 1
	Firmware version: 2.9.1000
	Hardware version: b0
	Node GUID: 0x002590ffff2fc828
	System image GUID: 0x002590ffff2fc82b
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 1
		LMC: 0
		SM lid: 2
		Capability mask: 0x0251086a
		Port GUID: 0x002590ffff2fc829


# ifconfig

ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00  
          inet addr:192.168.9.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:ff2f:c829/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:6704163 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2174312 errors:0 dropped:6 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:415695395 (415.6 MB)  TX bytes:11643802311 (11.6 GB)



root@ib3:~# netperf -H 192.168.9.4
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.9.4 (192.168.9.4) port 0 AF_INET : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    4593.36

OK, so we've got some work to do given "only" 4.5Gbps across a 40Gbps link (really about 32Gbps of usable data rate after 8b/10b encoding overhead). But this is stock Ubuntu 11.10...zero tuning.
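
Before blaming TCP, it's also worth sanity-checking the raw verbs bandwidth with the perftest tools so we know the fabric itself is healthy (illustrative invocation only):

Code:
# on ib4: start the ib_send_bw server side
ib_send_bw

# on ib3: run the RDMA send bandwidth test against ib4
ib_send_bw 192.168.9.4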
 
Iperf results agree with netperf:

Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47622 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.04 GBytes  4.33 Gbits/sec
 
Let the tuning begin...first, switch the IB adapters from 'datagram' mode to 'connected' mode:

Code:
root@ib3:~# echo "connected" > /sys/class/net/ib0/mode
root@ib4:~# echo "connected" > /sys/class/net/ib0/mode

Then retest.

Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47623 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  7.67 GBytes  6.59 Gbits/sec

Whoa...guess that's an important one! That's roughly a 50% improvement over the untuned numbers right there.
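
One thing to note: the mode setting doesn't survive a reboot, so on Ubuntu the crude fix is to re-apply it from /etc/rc.local (just an assumption on my part that nothing fancier is managing the interface):

Code:
# /etc/rc.local (before the final 'exit 0'): put ib0 back into connected mode at boot
echo connected > /sys/class/net/ib0/mode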
 
Once in connected mode, higher MTUs are allowed...up to 64K, so let's try that next:

Code:
root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000


Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size:  189 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  19.6 GBytes  16.8 Gbits/sec

Now that's what I'm talking about.
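
There's probably more to squeeze out of the TCP socket buffer limits as well. These are the standard Linux sysctls; the values below are only a starting point to experiment with, not tested numbers:

Code:
# /etc/sysctl.conf -- larger socket buffers for a high-bandwidth link
# (starting-point values, not benchmarked)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# apply without a reboot:
#   sysctl -p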
 
Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size:  189 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  19.6 GBytes  16.8 Gbits/sec

Now that's what I'm talking about.

Good God, that just made my pants dance some :D
 
Once in connected mode, higher MTUs are allowed...up to 64K, so let's try that next:

Code:
root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000


Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size:  189 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  19.6 GBytes  16.8 Gbits/sec

Now that's what I'm talking about.

Very nice!

So a question regarding IB as I am a bit of a noob in that area. When you have your cluster setup, will you be using IPoIB for the communication between nodes and data storage the same as your benchmarking there? Or do you switch it to act more like a SAN where it will address disks remotely? Or does it get mixed with the nodes themselves sharing information via IPoIB but then your ZFS head server talks directly to the drive via a different mode?

Thanks..
 
Very nice!

So a question regarding IB as I am a bit of a noob in that area. When you have your cluster setup, will you be using IPoIB for the communication between nodes and data storage the same as your benchmarking there? Or do you switch it to act more like a SAN where it will address disks remotely? Or does it get mixed with the nodes themselves sharing information via IPoIB but then your ZFS head server talks directly to the drive via a different mode?

Thanks..

I'm about 3 days into my IB experience, so I'm a noob too.
As I understand it, there are a bunch of options:

NFS over IPoIB
iSCSI over IPoIB

-OR-

(Assuming the underlying operating systems support it) you eliminate TCP/IP altogether and use NFS and/or iSCSI over RDMA (Remote Direct Memory Access).

iSCSI over RDMA:
http://docs.oracle.com/cd/E23824_01/html/821-1459/fnnop.html#gfcun

NFS over RDMA:
http://www.opengridcomputing.com/nfs-rdma.html

I'm working on getting components in place to do IB testing over RDMA now.
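
From the NFS-over-RDMA docs, the Linux side should end up looking roughly like this once the transport modules are built (a sketch only; the export path and server address below are made up):

Code:
# server: load the NFS/RDMA server transport and have nfsd listen on the RDMA port
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist

# client: load the client transport and mount with the rdma option
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 192.168.9.3:/export/data /mnt/data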
 
I'd definitely be interested if you can get NFS over RDMA working - I played around with it for a while and had no luck (Mellanox 10Gb IB cards - the onboard memory ones).
 
For ease of use, vanilla NFS and iSCSI are fine. RDMA with them in my experience was difficult to implement, unstable, and not much of a performance boost. This also obviously layers on top of IPoIB, so don't expect latency equivalent to what infiniband may advertise due to overhead.

If you want the lowest latency and highest-speed storage presentation though, you want to go SRP. That gets you raw disk from storage to client. Implementation on Linux is fairly straightforward and decently documented (see SCST or the LIO target in 3.3-rc1).
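
On the initiator side, SRP discovery with the stock OFED srptools looks something like this (a sketch; the sysfs path depends on your HCA/port naming, and the target string fields come from ibsrpdm's output):

Code:
# list SRP targets visible on the fabric, one connection string per line
ibsrpdm -c

# hand the chosen connection string to the SRP initiator; the remote LUNs
# then show up locally as ordinary /dev/sd* devices
echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=..." \
    > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target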

Note that infiniband allows protocol mixing without a problem, so you can run any combination of the above with IPoIB, and what have you.
 
I tried NFS over RDMA but gave up; NFS over connected mode works well.

iSCSI works OK, but I liked using SRP or iSER instead, if possible.

Have you done any tuning on your Solaris side? I had to do a fair amount of tuning to keep it from falling over itself when getting up to speed. Didn't have to do any tuning on Ubuntu, though.
 
iSER was a no go at the time that I tried it, as the target was nowhere close to being usable. I did not try Solaris clients, only Windows, Linux, and ESX (with SRP). I don't know about your specific needs, but if you have a choice in the matter and have no preference otherwise, I'd actually steer clear of OpenSolaris, especially since it's dead. Infiniband is rather better supported on Linux, though I can't speak for the proprietary Solaris.
 
"I'd actually steer clear of OpenSolaris, especially since it's dead."

In the literal sense, I suppose. Very misleading statement for someone reading this who doesn't know any better...
 
Very cool project. I'll be interested in seeing where it goes.
 
"I'd actually steer clear of OpenSolaris, especially since it's dead."

In the literal sense, I suppose. Very misleading statement for someone reading this who doesn't know any better...

Ah, I actually misread/misunderstood what the OP was using. I should clarify further that based on my experience with repeatedly applying head to desk over figuring out infiniband, Linux was the better documented (and I use the term loosely) platform for infiniband usage because of OFED. YMMV for commercially supported stuff like Solaris 11, of course.
 
Well, not just that. There are at least two active distributions based on OpenSolaris (OpenIndiana and Nexenta). People shouldn't shy away from a ZFS-based solution thinking OpenSolaris is dead...
 
Dunno, I found InfiniBand setup on OpenIndiana was basically automatic; I didn't have to do anything, really. Just install the srp/iser stuff if I wanted to use it, but everything was automatic.

Far from that on linux for me.
 
Dunno, I found InfiniBand setup on OpenIndiana was basically automatic; I didn't have to do anything, really. Just install the srp/iser stuff if I wanted to use it, but everything was automatic.

How did you get OFED compiled on OI?
 
Once in connected mode, higher MTUs are allowed...up to 64K, so let's try that next:

Code:
root@ib3:~# ifconfig ib0 mtu 64000
root@ib4:~# ifconfig ib0 mtu 64000


Code:
root@ib3:~# iperf -P 1 -c 192.168.9.4
------------------------------------------------------------
Client connecting to 192.168.9.4, TCP port 5001
TCP window size:  189 KByte (default)
------------------------------------------------------------
[  3] local 192.168.9.2 port 47626 connected with 192.168.9.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  19.6 GBytes  16.8 Gbits/sec

Now that's what I'm talking about.

That's some of the best throughput I've seen on this forum thus far. :cool:
 
I didn't compile OFED on OI; why would I need OFED?

I may be confused, but I thought that's where the RDMA/iSER support comes from?

What packages did YOU use for this and where did you get them??

We're working on this full bore this week, so should show some major progress.
 
Hmm, nothing is needed for RDMA support.
Looks like the packages are installed by default.
Guess it was just a matter of adding them via cfgadm

On the Linux side, iSER is supported via the iSCSI stack,
but you need to hunt down the tools for SRP separately.
 
We just figured out that OFED 1.5.3/1.5.2 is included in the FreeBSD 9.0 distribution...it's NOT compiled into the kernel by default, but after setting a few kernel options and a quick recompile, FreeBSD comes up like a champ with OFED *and* support for the Mellanox ConnectX-2 adapters we have....it took less than an hour to do all of this (of course it helps when you have dual 6-core CPUs). The RDMA functions are supposed to be there too...we'll see.
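
The kernel config additions should look roughly like the following (reconstructed from the FreeBSD OFED notes rather than copied verbatim from our config, so double-check against the 9.0 documentation before trusting it):

Code:
# custom kernel config additions for OFED/InfiniBand on FreeBSD 9.0
# (reconstruction -- verify against the 9.0 OFED documentation)
options   OFED         # core OpenFabrics stack
options   IPOIB_CM     # IPoIB connected mode
options   SDP          # Sockets Direct Protocol (optional)
device    mlx4ib       # Mellanox ConnectX/ConnectX-2 InfiniBand
device    mlxen        # Mellanox ConnectX 10GbE personality (optional)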

We're building a second FreeBSD box now...hopefully we'll have them talking.

We fired up OI 151a...it does NOT see the IB interfaces...unclear if it needs a driver...playing with that now as well.
 