ZFS Monster Stage 4: 40Gbps Infiniband

Not sure, I hadn't tried the ConnectX-2 cards in OI. But I thought all the ConnectX cards used the same driver. I'll see if I can find something out, if time permits.
---
Yep, both ConnectX and ConnectX-2 use the hermon driver, so it should be detected fine. Your card, as your other posts show, should be loaded via this driver_aliases entry:
hermon "pciex15b3,6340"

cfgadm doesn't show ib at all?
 
I'll post the details of the IB detection issues we're having tomorrow on OI151a...in the meantime Syoyo has made major progress:


----
I got OpenSM running on OI151a + ConnectX (hermon).
Also confirmed the SRP target works on OI151a + ConnectX.
(But one SRP connection is limited to 750MB/s max on my HW;
multiple SRP connections will hit the QDR limit, i.e. 3.2GB/s.)

http://syoyo.wordpress.com/2012/01/23/opensm-on-illumos-hermonconnectx-works/

-----

Seems like SRP may be the ticket...I want to replicate these results!


We got the two FreeBSD 9.0 servers to talk to each other via IPoIB...all we had handy for perf testing was 'scp'. scp across the 1GbE link yielded 100MB/s as expected. Across the IB link...7MB/s...dismal.

So although FreeBSD was trivial to set up, out-of-the-box performance is horrific, plus we can't figure out how to adjust the IB driver settings...this is supposed to be done via /sys/class/net/ib0/mode, however, that path doesn't even exist on the server. Sigh, it's never easy.
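
For reference, that path is a Linux sysfs knob, which is why it isn't there on FreeBSD; on a Linux IPoIB host the usual settings look something like this:

Code:
# Linux only: switch IPoIB to connected mode and raise the MTU
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520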
 
So here's the status of the IB adapter on oi151a:

Code:
System Configuration:  Project OpenIndiana  i86pc
Memory size: 49143 Megabytes
System Peripherals (Software Nodes):

i86pc
    ib, instance #0
        srpt, instance #0
        rpcib, instance #0
        rdsib, instance #0 (driver not attached)
        eibnx, instance #0
        daplt, instance #0 (driver not attached)
        rdsv3, instance #0
        sol_uverbs, instance #0 (driver not attached)
        sol_umad, instance #0 (driver not attached)
        sdpib, instance #0 (driver not attached)
        iser, instance #0 (driver not attached)

   scsi_vhci, instance #0
    pci, instance #0
        pci8086,0 (driver not attached)
        pci8086,3408, instance #0
            pci15d9,10d3, instance #0
        pci8086,3409, instance #1
            pci15d9,10d3, instance #1
        pci8086,340a, instance #2
            pci15d9,48, instance #0
                ibport, instance #1
        pci8086,340c (driver not attached)
        pci8086,340e (driver not attached)
        pci8086,342d (driver not attached)
        pci8086,342e (driver not attached)
        pci8086,3422 (driver not attached)
        pci8086,3423, instance #0
        pci8086,3438, instance #0 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7 (driver not attached)
        pci15d9,7, instance #0
        pci15d9,7, instance #1



oi151:~# ls -ltr /dev | grep ibp
lrwxrwxrwx   1 root     root          67 Jan 26 06:34 ibp0 -> ../devices/pci@0,0/pci8086,340e@7/pci15b3,22@0/ibport@1,0,ipib:ibp0
lrwxrwxrwx   1 root     root          67 Jan 26 06:34 ibp1 -> ../devices/pci@0,0/pci8086,340a@3/pci15d9,48@0/ibport@1,0,ipib:ibp1
lrwxrwxrwx   1 root     root          29 Jan 26 06:34 ibp -> ../devices/pseudo/clone@0:ibp

I *believe* the actual adapter is /pci15b3 (ibp0); however, dladm only sees 'ibp1'...and it shows 'down' even though the IB interface has link with the switch:

Code:
oi151:~# dladm show-ib
LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
ibp1         2590FFFF2FC828  2590FFFF2FC829  1    down   FFFF

Confusing and maddening...so close!
 
You are running a subnet manager, correct? It will show down if there is none.
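
For anyone following along, the subnet manager can live on the switch or on any Linux node with OFED installed; something like this will tell you whether one is active:

Code:
ibstat          # port state should go from Initializing to Active once an SM is running
sminfo          # prints the LID/GUID of the current master subnet manager
opensm -B       # start OpenSM as a background daemon if nothing else is managing the fabric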

Odd, your second port isn't showing.

You need to use cfgadm to attach the extra 'driver not attached' ones if you want to use them.

On my systems it shows:

Code:
System Configuration:  Project OpenIndiana  i86pc
Memory size: 16376 Megabytes
System Peripherals (Software Nodes):

i86pc (driver name: rootnex)
    ib, instance #0 (driver name: ib)
        srpt, instance #0 (driver name: srpt)
        rpcib, instance #0 (driver name: rpcib)
        rdsib, instance #0 (driver name: rdsib)
        eibnx, instance #0 (driver name: eibnx)
        daplt, instance #0 (driver name: daplt)
        rdsv3, instance #0 (driver name: rdsv3)
        sol_uverbs, instance #0 (driver name: sol_uverbs)
        sol_umad, instance #0 (driver name: sol_umad)
        sdpib, instance #0 (driver name: sdpib)
        iser, instance #0 (driver name: iser)
 
> You need to use cfgadm to add the extra, driver not attached, ones, if you want to use them.

how, exactly?

Also...it's a *single* port IB card...not dual.
 
I'm really seeing why no one is crazy enough to try this...right now my goal is to just get two hosts talking to each other over IB *and* achieve throughput with a real application (NFS, iSCSI, etc.) that is close to the theoretical max for QDR IB (40Gbps * .90 ~= 36Gbps, or 4500MB/s).

I still can't get things working right on OI151 or Solaris 11, so spent all day today just trying to find some *nix distribution where not only IB works, but IB with the Mellanox ConnectX-2 works...*and* RDMA works.

Our closest thing to success is Centos 5.5 (kernel 2.6.18):

Code:
[root@cesena ~]# rdma_bw rubicon 
6708: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |
6708: Local address:  LID 0x04, QPN 0x60004b, PSN 0xfef7be RKey 0xc0041d00 VAddr 0x002b2b91d41000
6708: Remote address: LID 0x01, QPN 0x64004b, PSN 0xb6835a, RKey 0xe8042000 VAddr 0x002b0e2c1ef000


6708: Bandwidth peak (#0 to #987): 3249.94 MB/sec
6708: Bandwidth average: 3249.74 MB/sec
6708: Service Demand peak (#0 to #987): 921 cycles/KB
6708: Service Demand Avg  : 921 cycles/KB

3249 MB/s / 4200Mbps = ~72% of max

Pretty close.

Now going to try SRP (Scsi RDMA Protocol). with a RAM disk.
 
Wikipedia is normally wrong whenever I look at it. But in this case it's correct.

But that isn't the issue. 32Gbit/s == 4.096GB/s

3249MB/sec != 4096MB/sec

I assume his 4200Mbps was just a rounding/typo error, and should be MB/s.
 
My bad, I was thinking effective max throughput took a 10% haircut due to overhead...you're right, per the wiki it's 20%, i.e. 32Gbps. Which is kind of good, as it means I was getting pretty close to max throughput with the rdma_bw test: 3250MB/s out of a possible ~4000MB/s = ~81%.
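
Spelling out the arithmetic:

Code:
QDR signaling rate : 4 lanes x 10 Gbit/s = 40 Gbit/s
8b/10b encoding    : 40 Gbit/s x 0.8     = 32 Gbit/s of usable data
In bytes           : 32 Gbit/s / 8       = ~4000 MB/s
Measured           : 3250 / 4000         = ~81% of the theoretical max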


Regardless...CentOS 5.5 started proving useless once we tried to get SRP running...it seemed like CentOS 6.2 was better suited, so we upgraded. rdma_bw tests worked right out of the box...and we got exactly the same results as with 5.5.

Getting the RDMA SCSI target (SCST) compiled and loaded wasn't too hard (it required a compile from source). Creating the target was pretty easy...getting the initiator to mount the target took 5 hours of screwing around. This guide is the closest to correct:

http://davidhunt.ie/wp/?p=491
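
For anyone reproducing this, here is a rough sketch of the target-side setup with scstadmin 2.x; the device, group and target names below are the ones that show up in the logs further down, but the backing file path is made up, so treat it as an outline rather than our exact steps:

Code:
# load the SCST core, the vdisk handler and the SRP target driver
modprobe scst
modprobe scst_vdisk
modprobe ib_srpt

# export a file (or a ramdisk block device) as a SCSI disk
scstadmin -open_dev DISK01 -handler vdisk_fileio -attributes filename=/ramdisk/disk01.img

# create an initiator group and give it LUN 0
scstadmin -add_group HOST01 -driver ib_srpt -target ib_srpt_target_0
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group HOST01 -device DISK01

# enable the target; the -add_init step for the group is shown a bit further down
scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt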

What wasn't completely clear is that you basically had to guess what initiator name to allow, and then monitor /var/log/messages on the target to see what the actual initiator name was. In my case I saw:

Code:
Jan 28 23:25:44 rubicon kernel: ib_srpt: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff2fc829, t_port_id 0x2590ffff2fc820:0x2590ffff2fc820 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2590ffff2fc821)
Jan 28 23:25:44 rubicon kernel: [1784]: scst: scst_init_session:6289:Using security group "ib_srpt_target_1" for initiator "0x0000000000000000002590ffff2fc829" (target ib_srpt_target_1)
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 0 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 0 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 1 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?
Jan 28 23:25:44 rubicon kernel: [8391]: scst: scst_translate_lun:3853:tgt_dev for LUN 1 not found, command to unexisting LU (initiator 0x0000000000000000002590ffff2fc829, target ib_srpt_target_1)?

So then I went back and did:

Code:
scstadmin -add_init 0x0000000000000000002590ffff2fc829 -driver ib_srpt -target ib_srpt_target_0 -group HOST01

He provides no documentation on how to start the initiator.

What I did was:

Code:
[root@cesena ~]# srp_daemon  -vvvv -a -c
 configuration report
 ------------------------------------------------
 Current pid                		: 2991
 Device name                		: "mlx4_0"
 IB port                    		: 1
 Mad Retries                		: 3
 Number of outstanding WR   		: 10
 Mad timeout (msec)	     		: 5000
 Prints add target command  		: 1
 Executes add target command		: 0
 Print also connected targets 		: 1
 Report current targets and stop 	: 0
 Reads rules from 			: /etc/srp_daemon.conf
 Do not print initiator_ext
 No full target rescan
 Retries to connect to existing target after 20 seconds
 ------------------------------------------------
id_ext=002590ffff2fc820,ioc_guid=002590ffff2fc820,dgid=fe80000000000000002590ffff2fc821,pkey=ffff,service_id=002590ffff2fc820,max_cmd_per_lun=32,max_sect=65535


Took the fields from there and updated /etc/srp_daemon.conf as follows:

Code:
[root@cesena ~]# cat /etc/srp_daemon.conf
## This is an example rules configuration file for srp_daemon.
##
#This is a comment
## disallow the following dgid
#d       dgid=fe800000000000000002c90200402bd5
## allow target with the following ioc_guid
#a       ioc_guid=00a0b80200402bd7
## allow target with the following id_ext and ioc_guid
#a       id_ext=200500A0B81146A1,ioc_guid=00a0b80200402bef
## disallow all the rest
#
a	id_ext=002590ffff2fc820,ioc_guid=002590ffff2fc820,dgid=fe80000000000000002590ffff2fc821,max_cmd_per_lun=32,max_sect=65535


Then ran this:

Code:
# srp_daemon -e -vvvv -a -f /etc/srp_daemon.conf -R 10

After running the above, /var/log/messages on the initiator showed this!

Code:
Jan 29 19:26:38 cesena kernel: [2720]: scst: init_scst:2362:SCST version 2.2.0 loaded successfully (max mem for commands 12064MB, per device 4825MB)
Jan 29 19:26:38 cesena kernel: [2748]: scst: scst_global_mgmt_thread:6593:Management thread started, PID 2748
Jan 29 19:26:38 cesena kernel: [2720]: scst: scst_print_config:2155:Enabled features: EXTRACHECKS, DEBUG
Jan 29 19:28:02 cesena kernel: scsi4 : SRP.T10:002590FFFF2FC820
Jan 29 19:28:02 cesena kernel: scsi 4:0:0:0: Direct-Access     SCST_FIO DISK01            220 PQ: 0 ANSI: 5
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
Jan 29 19:28:02 cesena kernel: [2754]: scst: scst_register_device:964:Attached to scsi4, channel 0, id 0, lun 0, type 0
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] 10240000 512-byte logical blocks: (5.24 GB/4.88 GiB)
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] 4096-byte physical blocks
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Write Protect is off
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jan 29 19:28:02 cesena kernel: sdb: unknown partition table
Jan 29 19:28:02 cesena kernel: sd 4:0:0:0: [sdb] Attached SCSI disk
Jan 29 19:29:50 cesena kernel: sdb: sdb1


Code:
# fdisk -l

Disk /dev/sdb: 5242 MB, 5242880000 bytes
162 heads, 62 sectors/track, 1019 cylinders
Units = cylinders of 10044 * 512 = 5142528 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes
Disk identifier: 0xbbe37ce9

Did an fdisk and mkfs.ext2 on it, and then was able to mount it:

#mount /dev/sdb1 /mnt/rub58

Created a bunch of 1GB files, and then built this little test script:

Code:
test.sh

dd if=/mnt/rub58/file1.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file2.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file3.img of=/dev/null bs=1M &
dd if=/mnt/rub58/file4.img of=/dev/null bs=1M &

And here are the results:

Code:
# sh test.sh

1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s
1048576000 bytes (1.0 GB) copied, 1.27607 s, 822 MB/s

822MB/s * 4 (threads) = 3288MB/s

Wow...that is awesome...we're actually getting SRP performance that is nearly identical to the rdma_bw test. This is definitely looking promising.
 
Congrats!

> 822MB/s * 4 (threads) = 3288MB/s

BTW, I got the SRP target running on OI151a + ConnectX QDR, and get around 1.2GB/s bandwidth.
My mobo's chipset seems to limit internal bandwidth to 20Gbps, so I cannot reach the QDR peak, but I guess your setup could.

Hope to hear of your success with SRP on Solaris.
 
*subscribed*

This looks interesting guys. I'm interested in the Solaris side as well as I'm having backup time window issues with ZFS send/recv's in our production environment. Looks promising!
 
Hope to hear of your success with SRP on Solaris.

Yes...we achieved this on Sunday. It was very, very weird...we installed OI151a on one of the Supermicro blades last week...it was doing that thing where it showed an ibp1 interface but no ibp0, and ibp1 was completely unusable. Even after re-installing OI151a a second time, same issue.

On Sunday, we installed OI151a on a different blade and this time ibp0 came up right away:

Code:
# dladm show-ib
LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
ibp0         2590FFFF2FC81C  2590FFFF2FC81D  1    up     FFFF

# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
ibp0        phys      65520  up       --         --
e1000g1     phys      1500   unknown  --         --
e1000g0     phys      1500   up       --         --
pFFFF.ibp0  part      65520  up       --         ibp0

The only thing we can think of that we did differently is disabling NWAM right away (NWAM is known NOT to work with InfiniBand interfaces)...at this point I'd go even further and say that if you don't disable it IMMEDIATELY after the oi151 install, it seems to completely screw up the IB drivers.
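
For reference, disabling NWAM on OI is just a pair of SMF commands (run them right after the install, before touching the IB interfaces):

Code:
svcadm disable svc:/network/physical:nwam
svcadm enable  svc:/network/physical:default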


Once we got that working, we installed the SRP package and quickly got OI set up as an SRP target (a rough sketch of the COMSTAR commands involved is below the throughput numbers). We then mounted the SRP target on one of the existing CentOS 6.2 systems. Performance was almost identical to the CentOS target:

1 dd thread:  1200MB/s
2 dd threads: 2200MB/s
3 dd threads: 2835MB/s
4 dd threads: 3147MB/s <-- InfiniBand cables noticeably warm here ;)
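
For anyone following along, a rough sketch of what an SRP target setup looks like on OI151a with COMSTAR; the pool/volume name is made up and the LU GUID is whatever create-lu prints, so treat this as an outline rather than our exact steps:

Code:
# enable the STMF framework and the SRP target service
svcadm enable stmf
svcadm enable -r ibsrp/target

# back a logical unit with a zvol (a ramdisk-backed volume works too for pure IB testing)
zfs create -V 20G tank/srpvol
stmfadm create-lu /dev/zvol/rdsk/tank/srpvol

# expose the LU (add host/target groups instead if you want to restrict access)
stmfadm add-view <GUID printed by create-lu>
stmfadm list-lu -v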

Next we exported an NFS share on OI using NFSoRDMA and then mounted it on the Centos box:

Code:
mount -o rdma,port=20049,rsize=65535,wsize=65535,noatime,nodiratime rubicon:/mnt/ramdisk /mnt/ramd
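
The server side of that is just an ordinary NFS share on OI; the RDMA transport is what the client's rdma,port=20049 options above ask for. Roughly:

Code:
# on the OI box: share the ramdisk path and make sure the NFS server is running
share -F nfs -o rw /mnt/ramdisk
svcadm enable -r nfs/server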

Then used this test script:

Code:
[root@cesena ~]# cat test_nfs_rdma.sh
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/ramd/chunk0 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk1 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk2 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk3 of=/dev/null bs=1M &

Total throughput from 4 threads: 1340MB/s


This is still pretty darn good; however, I'm disappointed, as I'd much rather use NFSoRDMA than SRP...but given the better-than-2.3x performance with SRP, it looks like we'll be going that way.

Note the cache-flush (drop_caches) line in the script above...it made it much easier to test performance than having to do constant umounts/mounts to flush the Linux filesystem cache. We purposely let the data be cached on the server side, as our goal was to test IB throughput, NOT actual drive throughput.
 
Congrats!

> 822MB/s * 4 (threads) = 3288MB/s

BTW, I got the SRP target running on OI151a + ConnectX QDR, and get around 1.2GB/s bandwidth.
My mobo's chipset seems to limit internal bandwidth to 20Gbps, so I cannot reach the QDR peak, but I guess your setup could.

1.2GB/s with how many threads?

How can you determine how many PCI-e lanes are active on device under OI?

It's real nice to be able to do this under Centos:

Code:
# lspci -v -v 

03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
	Subsystem: Super Micro Computer Inc Device 0048
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin A routed to IRQ 24
	Region 0: Memory at fbd00000 (64-bit, non-prefetchable) [size=1M]
	Region 2: Memory at f8800000 (64-bit, prefetchable) [size=8M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
pcilib: sysfs_read_vpd: read failed: Connection timed out
		Not readable
	Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
		Vector table: BAR=0 offset=0007c000
		PBA: BAR=0 offset=0007d000
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [148] Device Serial Number 00-25-90-ff-ff-2f-c8-28
	Kernel driver in use: mlx4_core
	Kernel modules: mlx4_core

8x Lanes of PCI-e Gen2 goodness!

Are you sure you are getting 8 lanes?

Also, I haven't experimented yet, but my understanding is that BIOS MSI-X and PCI-e message size settings can have a major impact on performance.
I do have MSI-X on and PCI-e message size is set to 256B (the only other option I believe is 128B).
 
Out of curiosity, is there a reason why you could not use 10GbE (and maybe they have dual-port cards now)? I was very suspicious when I saw your initial low numbers, since I've seen people with four 10GbE cards easily exceed them - although that sure took a lot of space on the board, and seemed to be only in loopback! :O
 
10GbE is more expensive than QDR IB. The host adapters are about the same price, but the IB switch was a LOT less expensive...I got an 18-port QDR switch for $3500 (new)...that's $194/port.

10GbE switch ports are more like $500 - $1000 a pop.

So 3x the bandwidth for less than half the per-port switch cost....that's what made this enticing.
 
Yes...we achieved this on Sunday.

Congrats!

Once we got that working, we installed the SRP package and quickly got OI set up as an SRP target. We then mounted the SRP target on one of the existing CentOS 6.2 systems. Performance was almost identical to the CentOS target:

1 dd thread:  1200MB/s
2 dd threads: 2200MB/s
3 dd threads: 2835MB/s
4 dd threads: 3147MB/s <-- InfiniBand cables noticeably warm here ;)

Numbers are pretty nice!

Code:
[root@cesena ~]# cat test_nfs_rdma.sh
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/ramd/chunk0 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk1 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk2 of=/dev/null bs=1M &
time dd if=/mnt/ramd/chunk3 of=/dev/null bs=1M &

Total throughput from 4 threads: 1340MB/s


This is still pretty darn good; however, I'm disappointed, as I'd much rather use NFSoRDMA than SRP...but given the better-than-2.3x performance with SRP, it looks like we'll be going that way.

Why did you test NFSoRDMA with just 1M of data? I guess you have to send much more data; otherwise the overhead of the filesystem will limit the bandwidth.
 
1.2GB/s with how many threads?

How can you determine how many PCI-e lanes are active on device under OI?

It's real nice to be able to do this under Centos:

4 dd threads, each reading 1GB of data, as you did.


Are you sure you are getting 8 lanes?

Also, I haven't experimented yet, but my understanding is that BIOS MSI-X and PCI-e message size settings can have a major impact on performance.
I do have MSI-X on and PCI-e message size is set to 256B (the only other option I believe is 128B).

I have no idea how to check LnkSta on OI. I checked it with lspci by booting Linux (CentOS) once and confirmed a 5GT/s, x8 link.

I don't know how to set the PCI-e message size to 256B; I will try to find it if I have time.

FYI, I am running my ZFS + IB + OI151a box on this mobo:

http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-board-dh57jg.html

It's a mini-ITX (because I wanted a silent and small storage box), so it is not a problem if it can't achieve QDR peak performance ;-)
 
Ahhh, the switches...OK, that makes sense.

Yes, and you can also buy much cheaper IB switches on eBay.

I once bought an 8-port IB SDR switch for $200, and also bought 10 IB SDR cards for $200. That's $25/port and $20/HCA, mostly the same price as 1GbE. Cables were also around $10 ~ $20 per cable.
 
Why you just test NFSoRDMA with 1M of data? I guess you have to send much more data, otherwise the overhead of filesystem will limit the bandwidth.

We were using *blocksize* of 1M for the read...the test files were 2-4GB.
 
We were using *blocksize* of 1M for the read...the test files were 2-4GB.

Ah, I see. How about a much larger *blocksize*? e.g. 128MB.

FYI, NFS/RDMA seems unstable before OFED 1.5.4.
You'd be better off using OFED 1.5.4 or later, but in that case most verbs applications don't work with Solaris IB (e.g. ib_read_bw).
 
Bad news, good news.

Had to abandon the SRP (SCSI RDMA Protocol) dreams...simple dd tests looked promising, but the more we moved data and started doing larger two-way tests, it all just fell apart (filesystems on the drives actually seemed to become corrupted)...and even before that, performance would suddenly drop to 70MB/s per spindle.

We decided to punt and go for iSER (iSCSI over RDMA)...it took a full day to figure out how to get it working on OI 151a and CentOS 6.2, but when we finally did, it seemed much more stable and also a lot easier to manage.
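
Ahead of the full write-up, here is roughly the shape of it (GUIDs, IQNs and the device name are placeholders): COMSTAR's normal iSCSI target serves iSER over the IB ports, and on the CentOS side you switch the session transport to iser before logging in.

Code:
# OI151a target side: COMSTAR iSCSI target backed by the raw drives
svcadm enable stmf
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target
stmfadm create-lu /dev/rdsk/c7t2d0p0        # repeat per drive; device name is just an example
stmfadm add-view <GUID printed by create-lu>

# CentOS 6.2 initiator side: discover over the IPoIB address, flip to iSER, log in
iscsiadm -m discovery -t sendtargets -p <OI IPoIB address>
iscsiadm -m node -T <target IQN> -o update -n iface.transport_name -v iser
iscsiadm -m node -T <target IQN> -l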

This time we created iSCSI targets from the raw SAS devices we had connected to the OI server (Hitachi 2TB drives). In this configuration OI does NOT seem to do any server-side caching at all, so we could not test iSER throughput from the server cache...only throughput to the drives themselves (across the 40Gbps InfiniBand network, of course):

Seq. read Throughput looked like this:

Code:
Drives    Throughput
1              131MB/s
2              253MB/s
3              393MB/s
4              516MB/s
.
.
8              715MB/s

As each drive was good for about 130MB/s, it seemed to scale almost exactly as expected for the first 4 drives. Oddly, once we went above 4 (and thus started using the second SAS wide port on our LSI 9200-8e) we were only getting an incremental boost of 80-90MB/s per additional drive.

That's when we noticed that we'd get the same bandwidth on SAS port 2 when driving 1 or 2 drives, but with 3 or 4 drives performance was a good 20% less than port 1. We thought it might be the el-cheapo SAS cables...swapped the cable and got the same results...so not sure what's going on there.

Because all we had was an 8-drive Sans Digital SAS enclosure, this was as much testing as we could do right now. I have 4 of those Rackable Systems SAS enclosures on the way...once we have those in place we'll be able to see how far we can take iSER.

Will post full details on how we got iSER working tomorrow.
 
Curious: is there a reason why something like NGINX isn't used instead of Apache, or a mix of the two, or are you using some optimized Apache configs?

Also, I would have thought another OS than Ubuntu, but I also haven't used the server edition much; CentOS guy myself.


So- silly question- what do you use this Hadoop thing for?

Some big names on their user list.

Ya, sounds like a serious powerhouse project
 
Curious: is there a reason why something like NGINX isn't used instead of Apache, or a mix of the two, or are you using some optimized Apache configs?

Also, I would have thought another OS than Ubuntu, but I also haven't used the server edition much; CentOS guy myself.

Apache (the web server) is not the same as Apache Hadoop.

Hadoop is a data warehousing application meant to process, store, and facilitate the querying of unstructured data in the multi-terabyte range.

We are using CentOS 6.2 ... so far it seems to be the best free Linux option right now for pretty decent InfiniBand support right out of the box...with the exception of OI of course.
 
Bad news, good news.

Had to abandon the SRP (SCSI RDMA Protocol) dreams...simple dd tests looked promising, but the more we moved data and started doing larger two-way tests, it all just fell apart (filesystems on the drives actually seemed to become corrupted)...and even before that, performance would suddenly drop to 70MB/s per spindle.

We decided to punt and go for iSER (iSCSI over RDMA)...it took a full day to figure out how to get it working on OI 151a and CentOS 6.2, but when we finally did, it seemed much more stable and also a lot easier to manage.

This time we created iSCSI targets from the raw SAS devices we had connected to the OI server (Hitachi 2TB drives). In this configuration OI does NOT seem to do any server-side caching at all, so we could not test iSER throughput from the server cache...only throughput to the drives themselves (across the 40Gbps InfiniBand network, of course):

Seq. read Throughput looked like this:

Code:
Drives    Throughput
1              131MB/s
2              253MB/s
3              393MB/s
4              516MB/s
.
.
8              715MB/s

As each drive was good for about 130MB/s, it seemed to scale almost exactly as expected for the first 4 drives. Oddly, once we went above 4 (and thus started using the second SAS wide port on our LSI 9200-8e) we were only getting an incremental boost of 80-90MB/s per additional drive.

That's when we noticed that we'd get the same bandwidth on SAS port 2 when driving 1 or 2 drives, but with 3 or 4 drives performance was a good 20% less than port 1. We thought it might be the el-cheapo SAS cables...swapped the cable and got the same results...so not sure what's going on there.

Because all we had was an 8-drive Sans Digital SAS enclosure, this was as much testing as we could do right now. I have 4 of those Rackable Systems SAS enclosures on the way...once we have those in place we'll be able to see how far we can take iSER.

Will post full details on how we got iSER working tomorrow.

Very interesting. I saw something similar running the old SF-1200 SSDs on the 9211-8i back in the day, using LSI RAID 0 (just benchmarking, no data). There was a big difference between using 4 ports and 8 ports in terms of incremental speed.
 
Wonder if the 9205-8e corrects this issue.

I personally had an issue when doing multiple transfers over IB; it would fall flat on its face. Single transfers ran at 800MB/sec and higher, but when I attempted 2 or 3 at the same time, they would all drop to 30MB/sec each. Just a few quick tunings later and it was all good, mainly following the 10Gbit Ethernet tuning adjustments.
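
For what it's worth, the "10Gbit Ethernet tuning adjustments" usually boil down to larger socket buffers on the Linux side; the values below are common starting points, not gospel:

Code:
# /etc/sysctl.conf additions commonly used for 10GbE/IPoIB
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# then apply without a reboot:  sysctl -p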
 
Evolving the initial design to this:
[attached diagram: hadoopv11.png]


LSI 9205-8e controllers are so reasonable right now, it just seems to make sense to direct-SAS-connect each blade to the Rackable enclosures. Using FatWallet you get 3% cashback at Overstock.com, plus another 3% in Overstock dollars, making these about $330 net apiece. So basically, for about $1300 I can have a dedicated SAS storage network and a dedicated 40Gbps IB network for the blades to talk to each other. I think there is one left on Overstock...so hurry if you want one.

If/when I need (and can afford) more compute power, I'll simply convert the existing Twin blade server to an iSER server, stuff a full-blown blade chassis with as much compute power as I can afford, and serve it disks via iSER.

We'll see.
 
Wonder if the 9205-8e corrects this issue.

I personally had an issue when doing multiple transfers over IB; it would fall flat on its face. Single transfers ran at 800MB/sec and higher, but when I attempted 2 or 3 at the same time, they would all drop to 30MB/sec each. Just a few quick tunings later and it was all good, mainly following the 10Gbit Ethernet tuning adjustments.

I'd be interested in knowing what you did. We have a fairly large SRP deployment, and have found that distributions based on newer Solaris kernels do this. I've tried Solaris 11/11 and Illumian. So at the moment we're stuck on Nexenta Core 3.1 with the 134 kernel.

I've found that Linux can have problems as a target in fileio mode with SRP. It's relatively easy to fill the dirty memory with so much data that the platters choke and the target becomes unresponsive for many seconds. If you run Linux as a target, I'd suggest blockio, or changing the /proc/sys/vm settings so that the dirty memory flushes often. With the ZFS-based systems it's less of an issue, since you set how often you want the dirty memory to flush and how long the flush should take, and if it exceeds the set time it begins to throttle dirty writes to keep the flush times in check. Linux just makes everything block until the dirty writes complete, or with newer kernels they put the writers to sleep. That doesn't seem to work quite as well, as I frequently got the "blocked for more than 120 seconds" kernel warnings on the SCST threads in fileio mode.
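
A hedged example of the /proc/sys/vm knobs being referred to (the numbers are illustrative; tune them for your RAM size and platter count):

Code:
# keep less dirty data in RAM and flush it sooner, so the target can't
# accumulate more writeback than the platters can absorb
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_expire_centisecs=500
sysctl -w vm.dirty_writeback_centisecs=100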
 
Hello,

This is something interesting to read; good luck with it.
I wonder if there are any updates you might care to share.

Thanks.
 
I'm doing something similar here but with the new ConnectX-3 cards in OpenIndiana 151a5. Loading /drivers/network/hermon in the Device Driver Utility fails and leaves the card status at UNK.

I assumed that if ConnectX-2 cards work, then the ver. 3 cards would too, but it seems not. I'm still pretty green with Solaris so maybe I'm missing something. Has anyone gotten the ConnectX-3 cards going in OI?

UPDATE: putting this line in driver_aliases and rebooting attached the drivers: hermon "pciex15b3,1003"
 
Hate to revive an old thread but I was wondering, after 8 months, if you were ever able to get IB SRP working properly?

Did you stick with iSER after all?
Were you able to figure out why you were getting decreased drive performance after 4 drives?
 