Rackable SE-3016 JBOD - Poor man's 100TB+ server

packetboy

Limp Gawd
Joined
Aug 2, 2009
Messages
288
So for anyone that was wondering whether the Hitachi 4TB drives would work in this enclosure:

Code:
root@zulu04:/mnt/cytel/tools/lsi# ./lsi163

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 8

SAS2008's links are 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G

 B___T___L  Type       Vendor   Product          Rev      SASAddress     PhyNum
 0  10   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b200    0
 0  11   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b201    1
 0  12   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b202    2
 0  13   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b203    3
 0  14   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b204    4
 0  15   0  Disk       ATA      Hitachi HDS72404 A250  500194000076b205    5
 0  16   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b206    6
 0  17   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b207    7
 0  18   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b208    8
 0  19   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b209    9
 0  20   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20a    10
 0  21   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20b    11
 0  22   0  Disk       ATA      Hitachi HDS72404 A250  500194000076b20c    12
 0  23   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20d    13
 0  24   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20e    14
 0  25   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20f    15
 0  26   0  EnclServ   RACKABLE SE3016-SAS       0227  500194000076b23e    24
 0  28   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502200    0
 0  29   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502201    1
 0  30   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502202    2
 0  31   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502203    3
 0  32   0  Disk       ATA      Hitachi HDS72404 A250  5001940000502204    4
 0  33   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502205    5
 0  34   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502206    6
 0  35   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502207    7
 0  36   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502208    8
 0  37   0  Disk       ATA      Hitachi HDS72404 A3B0  5001940000502209    9
 0  38   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220a    10
 0  39   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220b    11
 0  40   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220c    12
 0  41   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220d    13
 0  42   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220e    14
 0  43   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000050220f    15
 0  44   0  EnclServ   RACKABLE SE3016-SAS       0227  500194000050223e    24 

root@zulu04:/mnt/cytel/tools/lsi/sas2flash# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     ONLINE       0     0     0
          c7t5000CCA22BC2838Cd0  ONLINE       0     0     0
          c7t5000CCA22BC2D72Fd0  ONLINE       0     0     0
          c7t5000CCA22BC2D6F5d0  ONLINE       0     0     0
          c7t5000CCA22BC1150Bd0  ONLINE       0     0     0
          c7t5000CCA22BC2D76Cd0  ONLINE       0     0     0
          c7t5000CCA22BC2C8C5d0  ONLINE       0     0     0
          c7t5000CCA22BC2F88Fd0  ONLINE       0     0     0
          c7t5000CCA22BC2ED3Ad0  ONLINE       0     0     0
          c7t5000CCA22BC2E1D9d0  ONLINE       0     0     0
          c7t5000CCA22BC2FB4Ed0  ONLINE       0     0     0
          c7t5000CCA22BC2B082d0  ONLINE       0     0     0
          c7t5000CCA22BC2D339d0  ONLINE       0     0     0
          c7t5000CCA22BC2D91Fd0  ONLINE       0     0     0
          c7t5000CCA22BC093E3d0  ONLINE       0     0     0
          c7t5000CCA22BC2FB75d0  ONLINE       0     0     0
          c7t5000CCA22BC2D76Fd0  ONLINE       0     0     0
          c7t5000CCA22BC2E692d0  ONLINE       0     0     0
          c7t5000CCA22BC305B7d0  ONLINE       0     0     0
          c7t5000CCA22BC1AFCCd0  ONLINE       0     0     0
          c7t5000CCA22BC2EB4Cd0  ONLINE       0     0     0
          c7t5000CCA22BC23A39d0  ONLINE       0     0     0
          c7t5000CCA22BC2D5B5d0  ONLINE       0     0     0
          c7t5000CCA22BC2E255d0  ONLINE       0     0     0
          c7t5000CCA22BC08CD9d0  ONLINE       0     0     0
          c7t5000CCA22BC2FB46d0  ONLINE       0     0     0
          c7t5000CCA22BC2F5DAd0  ONLINE       0     0     0
          c7t5000CCA22BC2C7E0d0  ONLINE       0     0     0
          c7t5000CCA22BC2C897d0  ONLINE       0     0     0
          c7t5000CCA22BC2C7F6d0  ONLINE       0     0     0
          c7t5000CCA22BC1C88Cd0  ONLINE       0     0     0
          c7t5000CCA22BC2D911d0  ONLINE       0     0     0
          c7t5000CCA22BC093FEd0  ONLINE       0     0     0
root@zulu04:/mnt/cytel/tools/lsi/sas2flash# df -h
Filesystem            Size  Used Avail Use% Mounted on
tank                  115T  144K  115T   1% /tank

#

:eek:
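For reference, the pool is just a flat stripe of all 32 drives (no redundancy), so creating it looks roughly like this -- a sketch using only the first two WWN device names from the zpool status above, not my exact command:

Code:
# flat stripe by WWN device name -- raw capacity only, no redundancy
# (only the first two devices shown here; the real pool lists all 32)
zpool create tank \
  c7t5000CCA22BC2838Cd0 \
  c7t5000CCA22BC2D72Fd0
zpool status tank
df -h /tank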
 
Code:
100GB write test:

# dd if=/dev/zero of=/tank/100G.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 105.47 s, 994 MB/s

100GB read test:

# dd if=/tank/100G.dd of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 100.966 s, 1.0 GB/s

Not bad... I was really expecting 2.0GB/s, as I have each SE3016 on its own 3Gbps wide port... although both are connected to the same LSI HBA (9200-8e).
 
I'd be more concerned about the temps of those drives in that box, since the cooling is subpar, and even more so if you slow down the fans.

Also, that 1GB/s wall is due to the expander.
 
Is this using a SAS expander solution? Wasn't it concluded a few months ago that using a SAS expander together with SATA drives carried a high risk of going off with a big bang?
 
I'd be more concerned about the temps of those drives in that box, since the cooling is subpar, and even more so if you slow down the fans.

Also, that 1GB/s wall is due to the expander.

Just replace them with two 1600 RPM 120mm fans. Nothing to worry about.
 
Is this using a SAS expander solution? Wasn't it concluded a few months ago that using a SAS expander together with SATA drives carried a high risk of going off with a big bang?

In my experience:
I wouldn't worry about that :). Mine is running smoothly with 10 SATA HDs.
It was a hot debate in the ZFS community...
 
that is kind of a problem.

Why is that a problem? A number of SAS2 6Gb/s expanders link at 3.0 when used with non-SAS (SATA) drives. I have seen this with the HP expanders as well. With spinning disks it is not a problem, as even in STR (sustained transfer rate) they won't approach the per-port limits. If it only linked at 1.5 (which I have also seen with some HP and other expanders) and you have the newest high-speed spinners, then you may hit the hard limit with STR.
 
Why is that a problem? A number of SAS2 6Gb/s expanders link at 3.0 when used with non-SAS (SATA) drives. I have seen this with the HP expanders as well. With spinning disks it is not a problem, as even in STR (sustained transfer rate) they won't approach the per-port limits. If it only linked at 1.5 (which I have also seen with some HP and other expanders) and you have the newest high-speed spinners, then you may hit the hard limit with STR.

because

Code:
100GB write test:

# dd if=/dev/zero of=/tank/100G.dd bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 105.47 s, 994 MB/s

100GB read test:

# dd if=/tank/100G.dd of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 100.966 s, 1.0 GB/s

Not bad... I was really expecting 2.0GB/s, as I have each SE3016 on its own 3Gbps wide port... although both are connected to the same LSI HBA (9200-8e).

32 drives in one large striped pool should saturate the full 8088 link. Each drive is only pushing roughly 31 MB/s? That is pretty atrocious; those drives should easily saturate even a full 6Gbps 8088 port. Worst case, at 90% capacity you should still be able to pull 24Gbps of bandwidth out of 32 drives.

I'm not saying it is unusual, but I would try to change that in the drive firmware and/or the enclosure controller.
 
That is the primary downside of the SE-3016: it's only a 3Gbps expander inside, so it's completely expected that the 6Gbps drives I'm testing with negotiate at 3Gbps AND that the expander uplinks to the HBA are also 3Gbps. But for an enclosure you can acquire for only $200, it's hard to beat.

What I don't understand is that I have TWO SE-3016s (16 drives each). In my previous testing I achieved about 1000MB/s with ONE enclosure, so I expected to get nearly double that using two. Not sure if the LSI 9200-8e is the bottleneck or what. I have a beefier 9205-8e hanging around; I may try with that as well.

I still have all-3Gbps Hitachi 2TB and 3TB drives deployed in my production servers (using SC847 JBOD enclosures, which is a waste of 6Gbps goodness). Assuming I can get decent performance from these SE-3016s, my plan is to swap out the drives in the two SC847s I have, so that I can put the 45 4TB Hitachis into the SC847 enclosures... that should scream, as I'll finally be running a setup that can take advantage of the 6Gbps backplane.
 
If they're daisy-chained, then no, you're bottlenecked at the upstream port. If you have a single port from your HBA into each JBOD and then another cable between the JBODs, you should get double the bandwidth... I think... I would have to see a cabling diagram for those JBODs, though.
 
If they're daisy-chained, then no, you're bottlenecked at the upstream port. If you have a single port from your HBA into each JBOD and then another cable between the JBODs, you should get double the bandwidth... I think... I would have to see a cabling diagram for those JBODs, though.

They're not (though they might as well be)... each SE-3016 is connected to a dedicated wide port on a *single* LSI 9200-8e. The LSI controller is in an x16 slot, so I'm pretty sure PCIe is not the problem.

I forget what the max throughput I got out of the 9200 was when I did testing a couple of years ago, but I'm pretty sure it was way more than 1000MB/s... weird.
 
At 6Gbps you should be able to bump into the PCIe 2.0 x8 cap of roughly 4GB/s with all 6Gbps drives.

Is there only a single connection from the JBOD to the HBA? If so, that is your problem. Add the 'out' cable into the other HBA port; that should double your bandwidth.
 
At 6Gbps you should be able to bump into the PCIe 2.0 x8 cap of roughly 4GB/s with all 6Gbps drives.

Is there only a single connection from the JBOD to the HBA? If so, that is your problem. Add the 'out' cable into the other HBA port; that should double your bandwidth.

Re-read what I posted: it's NOT a 6Gbps expander and backplane, it's all 3Gbps.

With a 3Gbps x4 (wide) port, the theoretical max is 300MB/s * 4 = 1200MB/s.

...then knock that down for SAS overhead, and probably knock it down further for SATA Tunneling Protocol (STP) overhead (when using SATA drives in SAS enclosures). With that, 1000MB/s makes sense... though it should be PER ENCLOSURE.
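Back-of-the-envelope, for anyone following along (rough numbers, not exact framing math):

Code:
# one 3Gbps SAS lane ~= 300MB/s usable after 8b/10b encoding
# a wide (x4) port is 4 lanes
echo $((300 * 4))    # 1200  -> theoretical MB/s per 3Gbps wide port
# subtract SAS framing + STP overhead and ~1000MB/s per enclosure is plausible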
 
Is there only a single connection from the JBOD to the HBA? If so, that is your problem. Add the 'out' cable into the other HBA port; that should double your bandwidth.

I missed this part of your post... yeah, I will try that, but that's only going to work if the HBA *and* the expander support it. I'm not even sure what the SAS "feature" is that's required to do this. Today LSI was calling it "load balancing"; I've seen others reference having the ability to negotiate an x8 link (e.g. like the double-wide I live in).
 
Re-read what I posted: it's NOT a 6Gbps expander and backplane, it's all 3Gbps.
I forget what the max throughput I got out of the 9200 was when I did testing a couple of years ago, but I'm pretty sure it was way more than 1000MB/s... weird.
At 6Gbps you should be able to bump into the PCIe 2.0 x8 cap of roughly 4GB/s with all 6Gbps drives.
With a 3Gbps x4 (wide) port, the theoretical max is 300MB/s * 4 = 1200MB/s.

...then knock that down for SAS overhead, and probably knock it down further for SATA Tunneling Protocol (STP) overhead (when using SATA drives in SAS enclosures). With that, 1000MB/s makes sense... though it should be PER ENCLOSURE.

I get that; I still think it is a cabling problem, though.
 
I missed this part of your post... yeah, I will try that, but that's only going to work if the HBA *and* the expander support it. I'm not even sure what the SAS "feature" is that's required to do this. Today LSI was calling it "load balancing"; I've seen others reference having the ability to negotiate an x8 link (e.g. like the double-wide I live in).

I'm not sure what it is called either. It's how I have all my JBODs cabled, though, either directly back to the HBA or to the SAS switch. Works great :D.

I was trying to find the manual for that and ran across SGI/Rackable's 81 x 3.5" drive JBOD. Yum.

It depends on how they break up the expander internally, though; that could either be a really slick design or a signaling nightmare for the slots furthest away. DDN, I think, has an 84-drive JBOD coming out soon, but you can't get one without buying a whole DDN system, which is really sweet but really expensive.
 
So you're talking about hacking the back of the case?

Correct.

I did on mine. It has been running 24/7 for months.
If you need to lower the PSU noise, replace the 40mm PSU fan with a 3000 RPM fan. The fan that comes in the PSU runs at 10k RPM...
 
Again, not news. 2 x 3Gbps 8088 ports = 24Gbps total bandwidth, or roughly 2GB/s. He is getting 1GB/s.
 
So for anyone that was wondering whether the Hitachi 4TB drives would work in this enclosure:
*snip: full lsi163 listing, zpool status and df -h output quoted in the first post*

I spy with my little eye, 2 drives with different firmware.

What are the server specs?

I noticed this with my MSA70s when full of 72GB SAS drives (slow compared to your 4TB drives); I can bunch them in mirror sets or a simple striped vdev...

I haven't got past an 800MB/s dd test yet.

Yet my other play box, with just 8 x 450GB SAS drives directly connected to an H200 controller, can nearly match it.

dd really isn't, in my opinion, a good test for a ZFS-based system.

ZFS likes to organize its writes in memory and then purge them out to the drives when it's ready. With dd it just doesn't seem to get the time to organize them, I believe.

I would be interested to see some way to measure the following, though I don't know how apart from watching iostat output:

Have a system with a bucket load of RAM
Also a decent log device
Write, say, a giant NFS stream to it quickly (e.g. enough to allow a 100+MB/s stream to each drive)
Let ZFS suck it into its memory and organize its writes

Then wait the default 30 seconds and see how fast it purges the data out to the drives.

Which I believe might be a better throughput test??
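Something along these lines is the closest I can think of -- a rough sketch, assuming a dataset shared at /tank/share and a stream big enough to fill RAM before the flush:

Code:
# window 1: watch the pool flush its transaction groups
zpool iostat -v tank 1

# window 2: push a big stream into the share (over NFS from a client,
# or locally as a stand-in), then stop and keep watching window 1
dd if=/dev/zero of=/tank/share/stream.dd bs=1M count=20000

# whatever the write column in window 1 shows in the ~30s after the
# stream stops is how fast ZFS purges its buffered data to the drives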

.
 
ZFS likes to organize its writes in memory and then purge them out to the drives when it's ready. With dd it just doesn't seem to get the time to organize them, I believe.
This behavior only comes into play for random writes; dd is always a sequential write/read pattern.
 
This behavior only comes into play for random writes; dd is always a sequential write/read pattern.

And therein lies the problem.

There is nothing sequential about writing the dd stream across 32 drives.
ZFS still has to decide how much data to write to each drive... within each vdev... within each zpool.

Which, when you look at the iostat output, doesn't get evenly divided amongst all the drives in the vdevs. Never mind that the throughput to each drive isn't as balanced as I would have thought.

A good example:

Open two terminal windows.
In one window, fire off:

"zpool iostat -v tank 2"

In the other window, fire off your dd write test.

Watch what happens in the iostat window (drive throughput, drive space used, pool space used, the combined throughput, etc).

Then keep watching the iostat window for another 40 secs after the dd test has finished.

.
 
Um, there is nothing about "sequential" that means every single drive in the array is doing the exact same thing. There are many variables; however, dd is writing zeros as fast as it can to the target file, which is in fact the very best possible sequential write test you can do. ZFS in turn is striping that input across its vdevs as fast and as efficiently as it can. That doesn't mean each disk is getting the exact same amount of data, although in his case they likely are, since he has 32 vdevs, which is a convenient number for breaking up a 2K/4K/8K/16K/128K etc. stripe.

dd if=/dev/random would be a random-data test that is entirely CPU bound; he used /dev/zero.
 
I tried hooking the SAS "in" and SAS "out" ports of ONE SE-3016 to the two wide ports on the LSI 9200-8e... no matter what I do, only one link will come up at a time:

Code:
Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 8

SAS2008's links are 3.0 G, 3.0 G, 3.0 G, 3.0 G, down, down, down, down

 B___T___L  Type       Vendor   Product          Rev      SASAddress     PhyNum
 0  10   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b200     0
 0  11   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b201     1
 0  12   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b202     2
 0  13   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b203     3
 0  14   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b204     4
 0  15   0  Disk       ATA      Hitachi HDS72404 A250  500194000076b205     5
 0  16   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b206     6
 0  17   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b207     7
 0  18   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b208     8
 0  19   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b209     9
 0  20   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20a    10
 0  21   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20b    11
 0  22   0  Disk       ATA      Hitachi HDS72404 A250  500194000076b20c    12
 0  23   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20d    13
 0  24   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20e    14
 0  25   0  Disk       ATA      Hitachi HDS72404 A3B0  500194000076b20f    15
 0  26   0  EnclServ   RACKABLE SE3016-SAS       0227  500194000076b23e    24
 
Try on my what? I don't have any of these JBODs. All my JBODs support dual link.

You do not have an SE3016, correct?

The SE3016 supports only a single link, as far as I know, so it can't do what you mentioned in your post.
 
Here are 16 Hitachi 4TB drives in ONE SE-3016 connected to an LSI 9200-8e.

Code:
root@zulu04:/mnt/cytel/tools/lsi# dd if=/dev/zero of=/tank/1G bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 11.5962 s, 904 MB/s

Then I did two dd's in parallel, in case it was a single threaded issue:

Code:
root@zulu04:/mnt/cytel/tools/lsi# ( dd if=/dev/zero of=/tank/10Gc.dd bs=1M count=10000 & dd if=/dev/zero of=/tank/10Gd.dd bs=1M count=10000 ) &
[1] 940
root@zulu04:/mnt/cytel/tools/lsi# zpool iostat 5 5 
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      2      2  71.7K  21.0K
tank        21.8G  58.0T      0     85  1.31K  10.1M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        25.8G  58.0T      0  6.41K      0   781M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        29.5G  58.0T      0  6.40K      0   779M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        33.2G  58.0T      0  6.38K      0   776M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        37.2G  58.0T      0  6.32K      0   776M
----------  -----  -----  -----  -----  -----  -----
root@zulu04:/mnt/cytel/tools/lsi# 10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 24.6242 s, 426 MB/s
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 24.6266 s, 426 MB/s

Nope...you use 1 enclosure and 1 wide port, you get 1000MB/s. You use 2 enclosures and 2 wide ports, you get 1000MB/s...makes no sense.
 
Especially if you're using two different controllers; that really makes no sense. Possibly a CPU bottleneck? I don't know, grasping at straws.
 
Can you try using a different dd?

E.g. on Solaris Express 11 I use:

time /usr/gnu/bin/dd if=/dev/zero of=/tank/1G bs=1024000 count=10000

Note the GNU version of dd.

Then also try not writing directly into the pool root, but into, say, a share within the pool.

E.g. make an SMB/NFS share and then point the dd output into that folder:

time /usr/gnu/bin/dd if=/dev/zero of=/tank/share/1G bs=1024000 count=10000

Wait 40 secs for the flush.

Leave that 1G output file.

Then run two more tests, say "1Ga" and "1Gb", waiting 40 secs between each run for the flush to occur.

I have noticed, when doing dd tests, that it's usually the 3rd or 4th test that runs much faster than the 1st (e.g. once you have some actual DATA on the pool).
Don't ask me why... but it just does. (Yes, weird.)

iostat even shows the increase.

And I can regularly repeat the results.

Example test runs, say using a 20GB test file:

1st test comes in around 600-650MB/s
2nd test comes in around 750MB/s
3rd test comes in around 875MB/s

Doesn't make sense.
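FWIW, the whole back-to-back sequence can be scripted roughly like this (untested sketch; assumes the GNU dd path and a /tank/share dataset as above):

Code:
#!/bin/sh
# three 10GB runs into the share, sleeping 40s between runs for the txg flush
for run in 1G 1Ga 1Gb; do
    time /usr/gnu/bin/dd if=/dev/zero of=/tank/share/$run bs=1024000 count=10000
    sleep 40
done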

.
 
If you're testing throughput from the JBOD, how about starting a separate process reading from each of the N drives? That would eliminate any pool issues.
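Something like this, say (rough sketch -- the device names are just the first two from the zpool status output above, and the slice suffix may need adjusting depending on how the disks are labeled):

Code:
# read 1GB straight off each raw device in parallel; the pool never gets involved
for disk in c7t5000CCA22BC2838Cd0 c7t5000CCA22BC2D72Fd0; do    # ...add the rest
    dd if=/dev/rdsk/${disk}s0 of=/dev/null bs=1M count=1000 &
done
wait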
 
This is getting pretty ridiculous. Maybe my test rig just doesn't have the horsepower for this kind of testing. Things DID get better using TWO 9200 HBAs, but nowhere near double. So now I have TWO SE3016s, each going to its own dedicated 9200 HBA, with nothing else running on this machine, and each SE3016 has 16 Hitachi 4TB drives.

I created two separate pools, one for each enclosure.

First, here's each enclosure being tested separately:


Code:
dd if=/dev/zero of=/tank2/tree2/10Ga.dd bs=1M count=10000 &

# zpool iostat 5 5 
rpool1      7.94G  20.6G      0      0      0      0
tank        48.9G  58.0T      0      0      0      0
tank2       5.20G  58.0T      0  6.21K      0   766M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        48.9G  58.0T      0      0      0      0
tank2       8.77G  58.0T      0  6.30K      0   775M

10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 12.7157 s, 825 MB/s

Here's the second one:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      5      3   135K  35.7K
tank        41.9G  58.0T      1     96  42.1K  11.1M
tank2        600K  58.0T      2     19  77.4K   862K
----------  -----  -----  -----  -----  -----  -----

rpool1      7.94G  20.6G      0      0      0      0
tank        45.4G  58.0T      0  6.11K      0   751M
tank2        600K  58.0T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 12.8703 s, 815 MB/s

Then here's both at the same time:

Code:
root@zulu04:/mnt/cytel/tools/lsi/sas2flash# ( dd if=/dev/zero of=/tank/tree2/10Gb.dd bs=1M count=10000 & dd if=/dev/zero of=/tank2/tree2/10Gb.dd bs=1M count=10000 ) &
[1] 1026
root@zulu04:/mnt/cytel/tools/lsi/sas2flash# !zpool
zpool iostat 5 100
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      5      3   125K  33.2K
tank        49.9G  58.0T      1    242  30.0K  28.6M
tank2       10.8G  58.0T      0    663  16.3K  78.9M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        52.9G  57.9T      0  5.56K      0   674M
tank2       13.9G  58.0T      0  5.30K      0   633M
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0      0      0
tank        55.9G  57.9T      0  5.26K      0   626M
tank2       16.7G  58.0T      0  5.07K      0   593M
----------  -----  -----  -----  -----  -----  -----
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.5535 s, 674 MB/s
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.8358 s, 662 MB/s

Around 1330MB/s... a heck of a lot better than the ~900MB/s I was getting before for two enclosures.
 
Let's look at read performance using two HBAs and two enclosures:

Code:
root@zulu04:/mnt/cytel/tools/lsi/sas2flash# dd if=/tank/tree/100ga.dd  of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 98.8859 s, 1.1 GB/s


# dd if=/tank2/tree2/100ga.dd  of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 98.9584 s, 1.1 GB/s

Now let's read from both enclosures at the same time:

Code:
# ( dd if=/tank2/tree2/100ga.dd  of=/dev/null bs=1M count=100000 & dd if=/tank/tree/100ga.dd  of=/dev/null bs=1M count=100000 ) &


----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0    409      0
tank         107G  57.9T  6.50K      0   825M      0
tank2        117G  57.9T  7.75K      0   992M      0
----------  -----  -----  -----  -----  -----  -----
rpool1      7.94G  20.6G      0      0    102      0
tank         107G  57.9T  6.48K      0   823M      0
tank2        117G  57.9T  7.77K      0   994M      0


root@zulu04:/mnt/cytel/tools/lsi/sas2flash# 100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 104.539 s, 1.0 GB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 118.175 s, 887 MB/s

OK, that's more like what I was expecting to see: ~1900MB/s. And it looks like we are swamped with interrupts at this point, so we're probably bumping up against the CPU:

Code:
last pid:  1386;  load avg:  2.14,  1.19,  0.71;  up 0+01:06:23                                                                                                                                12:05:15
63 processes: 60 sleeping, 3 on cpu
CPU states: 21.2% idle,  0.2% user, 78.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 18670 ctxsw, 37 trap, 12778 intr, 3803 syscall, 8 flt
Memory: 4013M phys mem, 496M free mem, 2007M total swap, 2007M free swap
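If I want to see which CPU is eating those interrupts, standard Solaris tools should show it per CPU while the dd runs -- haven't dug in yet, just a sketch:

Code:
# per-CPU view while the read test runs; watch the intr/ithr and sys columns
mpstat 5
# per-device interrupt attribution, if available on this build
intrstat 5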
 
I'll be the one to say it.

Two LSI controllers, 32 4TB drives, two JBODs, etc., etc., and 4013MB of RAM??? Is this an all-in-one?
 
Why would the RAM matter? He's doing raw reads from the pool to test HW throughput.

On a normal machine it wouldn't, but ZFS is a bit RAM hungry. Hooked up to two HP SAS expanders, my ARC-1880x says they link at 6Gb with SATA-600 5700 RPM Hitachi CoolSpin (3TB) drives. The controller itself maxes out at around 1.9 GB/sec read and 1 GB/sec write.
 
That doesn't sound right. All ZFS does with extra RAM is cache reads; AFAIK it doesn't do anything to speed up disk reads or writes. If you are doing big enough tests to swamp the cache, it shouldn't matter...
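One way to keep a read test honest on a RAM-heavy box is simply to size the file well past physical memory so the ARC can't serve it, e.g. (sketch):

Code:
# check physical memory, then make the test file many times larger
prtconf | grep -i 'memory size'
dd if=/dev/zero of=/tank/arc_test.dd bs=1M count=100000    # ~100GB, far beyond a few GB of ARC
dd if=/tank/arc_test.dd of=/dev/null bs=1M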
 