Project: ZFS Monster - Phase I (testbed)

packetboy
Mind you, this is just the test system for the ultimate build.

ASUS M4A79T DELUXE 790FX AM3 - $136 (new)
AMD|PH II X4 965 3.4G 6M RT - $205 (new)
2GBx2 G.Skill F3-12800CL9D-4GBNQ R - $105 (new)
2GBx2 Corsair CMX4GX3M2A1600C9 - $90 (new)
ASUS GT220 1GB PCI-Express - $47 (new)
LSI SAS3801E PCI-e - $163 (ebay)
Sans Digital TRX8 Enclosure - $306 (ebay)
Corsair TX650 Power supply - $100 (new)
Kingston SSD V-Series 64GB - $129 (new)
2x Hitachi HDS72202 2TB - $280 (new)
2x Seagate ST32000542AS 2TB - $300 (new)
4x Seagate ST31500541AS 1.5TB - $360 (new)
Supermicro SC-750 Case - bought 10 years ago - $???
HP SAS Expander - $170 (not counted, as not in use)
-------------
Total: $2221


OpenSolaris 2009.06
ZFS: 4x1.5TB Raidz + 4x2.0TB Raidz
Formatted usable: 9.3TB ($238/TB)

Note there are ZERO hard drives in this case...only a 2.5" SSD for the boot disk...hence this case is ridiculous overkill. I only used it because I had it lying around. One could build this with a far more compact case.


Photos:

Back of system:
http://img532.imageshack.us/img532/2025/frontvm.jpg

Inside:
http://img13.imageshack.us/img13/8641/motherboardb.jpg

Enclosure:
http://img51.imageshack.us/img51/4758/sansdigitalfront.jpg

Power Usage (all 8 drives at idle):
http://img9.imageshack.us/img9/2563/killawatt.jpg

This is just a test system to get familiar with OpenSolaris, ZFS, SAS and external SAS enclosures. I would never build a RAID system using the Sans Digital enclosure unless I really had no need to scale beyond 8 drives. At ~$300 for 8 drives ($38/drive), with no redundant power and real cheap, flimsy drive carriers, it's actually MUCH more expensive than enterprise solutions such as the Supermicro SC-847, which gives you 45-drive capacity for $1600 (~$35/drive) PLUS redundant 80 Plus Gold power, a much beefier chassis, and MOST importantly a built-in SAS expander and cascading ports.


Standby - Phase II starts in two weeks...that's when I'm told my TWO SC-847 JBOD samples will arrive.
 
You are going to want to use ECC RAM in there.


Also, do you have two separate raidz pools configured? You would gain more throughput by striping them (adding both vdevs to the same pool, essentially RAID 50).

Also, depending on your speed requirements and what your dataset is like, you may also benefit from a dedicated ZIL and/or L2ARC cache drive.
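
For reference, adding those later is a one-liner each. A rough sketch below; the pool and device names are just examples standing in for whatever SSD(s) you would dedicate:

Code:
# dedicated log (ZIL) device -- pool/device names are examples only
zpool add monster log c9t0d0
# L2ARC read cache on another SSD -- also an example device
zpool add monster cache c9t1d0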
 
Also, do you have two separate raidz pools configured? You would gain more throughput by striping them (adding both vdevs to the same pool, essentially RAID 50).

That's what I was trying to do...I could have just created one big raidz, but I wanted to simulate starting with a bunch of lower-capacity drives (e.g. 1.5TB) and then adding an additional set some time later when higher-capacity drives were available (e.g. 2TB):

Code:
zfs01:~$ zpool status
  pool: monster
 state: ONLINE
 scrub: none requested
config: 

        NAME         STATE     READ WRITE CKSUM
        monster      ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
            c7t13d0  ONLINE       0     0     0
            c7t14d0  ONLINE       0     0     0
            c7t15d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t9d0   ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t12d0  ONLINE       0     0     0

errors: No known data errors

What I find amazing is it took all of 10 seconds to add the second raidz...try that with any so-called "online capacity expansion" of hardware RAID schemes!
 
Keep in mind that when 'expanding' this way, any existing files on the original RAID-Z will not be copied/striped to the second array. So only newly written files will have the striping benefit. That's of course also the reason it takes only a few seconds to reorganise the metadata; no actual data is copied or moved.
 
That's what I was trying to do...I could have just created one big raidz, but I wanted to simulate starting with a bunch of lower-capacity drives (e.g. 1.5TB) and then adding an additional set some time later when higher-capacity drives were available (e.g. 2TB):

(zpool status output quoted above)

What I find amazing is it took all of 10 seconds to add the second raidz...try that with any so-called "online capacity expansion" of hardware RAID schemes!

Good, each vdev is attached to the same pool :)

Keep in mind that when 'expanding' this way, any existing files on the original RAID-Z will not be copied/striped to the second array. So only newly written files will have the striping benefit. That's of course also the reason it takes only a few seconds to reorganise the metadata; no actual data is copied or moved.

+1

After adding another vdev, only newly written data will be striped across it... this is because ZFS is a copy-on-write filesystem.

The easy way to get around it is just to copy the data off and then back onto it... leave a bit going overnight each night... it should only take a couple of days depending on how much data you have :).
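
If you have the spare space in the pool, a snapshot plus zfs send/receive into a fresh dataset does the same rewrite without leaving the box. A rough sketch, with made-up dataset names:

Code:
# rewrite existing data so it stripes across both vdevs -- dataset names are made up
zfs snapshot monster/data@restripe
zfs send monster/data@restripe | zfs receive monster/data-new
# after verifying the copy, swap the datasets
zfs destroy -r monster/data
zfs rename monster/data-new monster/data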
 
This is basically how I did my setup (different hardware); there were some real physical concerns since I was transferring an existing 6x2TB RAID6 set to the OpenSolaris server.

How are the Hitachi drives working out (iostat -En)? At some point I want to transfer my current pool into a new one using 8x2TB Hitachi drives (raidz2) and then rebuild my current pool (2x 4-drive raidz1) as an 8x2TB raidz2 vdev and add it to the new pool.

I think you'll be happy with a ZFS setup. I'm running b130, and aside from the lack of domain controller redundancy (home setup, CIFS running in domain mode), I've been really impressed with the performance and ease of use. This is from a person who has lived in the Windows world and had only rudimentary knowledge of UNIX/OpenSolaris/Linux when I decided to build it last November. ACLs were probably the longest part of it: the design and concepts, mostly learning enough to know how I want to implement them in a multi-user environment. iSCSI is also running nice and stable (unidirectional auth), though I'm just using that as the backup drive for my domain controllers.

I recently had to replace one drive in each vdev (a pre-emptive strike) and I found it to be a really simple experience. I did manually activate a hot spare, so not a true disaster situation... but still. Rebuild times seemed reasonable (6-7 hrs per vdev).
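
For anyone following along, the swap itself is a single command; the pool and device names below are placeholders, not my actual setup:

Code:
# replace a suspect disk with a new one (or a hot spare) and watch the resilver
zpool replace tank c3t2d0 c3t8d0
zpool status tank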

FYI, there are other scripts (or the Auto-Scrub service) to auto-scrub and alert on zpool status.
But I use this one (the page describes the implementation): http://www.morph3ous.net/2009/09/05...ing-of-zpool-problems-and-weekly-zpool-scrub/ . You can ignore the smarthost sections if your ISP doesn't block outbound SMTP and/or you use your ISP's SMTP servers.
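
Even without the script, a bare-bones weekly scrub is just a cron entry, something like this (pool name and schedule are whatever fits your setup):

Code:
# root crontab: scrub the pool every Sunday at 03:00 -- pool name is an example
0 3 * * 0 /usr/sbin/zpool scrub tank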

Those Supermicro cases look yummy.... just using a Norco 4220 here :(

PS: I'm running WD RE4-GPs and running near capacity. Rebuilds definitely seemed faster than on my 1680ix... not to mention I wasn't biting my nails waiting for a drop-out.
 
Throughput with 4x1.5TB in RaidZ:

Code:
zfs01:~# time dd if=/dev/zero of=/monster/filezeros.tst  bs=4096 count=48828
48828+0 records in
48828+0 records out

real    0m0.997s
user    0m0.014s
sys     0m0.316s

200MB / 0.997s = ~200MB/s

Stripe in a second raidz with 4x2TB:

Code:
zpool add  monster raidz c7t8d0 c7t9d0 c7t11d0 c7t12d0

(10 seconds)

Code:
zfs01:~# time dd if=/dev/zero of=/monster/filezeros.tst  bs=4096 count=48828
48828+0 records in
48828+0 records out

real    0m0.752s
user    0m0.014s
sys     0m0.314s


200MB / 0.752s = ~266MB/s
 
Code:
Tau@Skynet:~# time dd if=/dev/zero of=/tank1/testfile.file bs=4096 count=48828
48828+0 records in
48828+0 records out

real    0m0.603s
user    0m0.029s
sys     0m0.574s
Tau@Skynet:~#

Same test done on one of my setups :) 331MB/s-ish... Not too shabby for a 7-drive array ^_^

**EDIT** This is also being done in RAM... so not a very good drive benchmark; I will post some tests with a 10GB file here in a bit.

Interesting... I could have sworn dd would have just created the 200MB file in RAM... just did a 10GB file and came out with 336MB/s... guess not ^_^

Code:
Tau@Skynet:~# time dd if=/dev/zero of=/tank1/10gb.file bs=4096 count=20971520
20971520+0 records in
20971520+0 records out

real    5m4.657s
user    0m12.533s
sys     4m0.796s
 
Man I feel like a loser! :D
Code:
jmaster@svosolaris001:~$ time dd if=/dev/zero of=/svpool1/temp/10gb.file bs=4096 count=20971520
20971520+0 records in
20971520+0 records out
85899345920 bytes (86 GB) copied, 442.359 s, 194 MB/s

real    7m22.350s
user    0m9.954s
sys     2m56.613s
I was watching zpool iostat -v and pretty much only one vdev was being written to. *sigh*

lol, I never bothered benchmarking before; I figured if I can max out 1Gbit Ethernet, which I do, I don't care too much. I should really invest in better network equipment...
 
I didn't see this posted, Packetboy, what will you be using your OpenSolaris server for?
 
Yeah, that is still nothing to scoff at... if you are only accessing it over a single GigE line, that's loads of speed.

The box I tested on is sitting in my lab... should be getting 14 more 1TB drives and a SAS expander here this month to configure as 3x 7-drive raidz2s... Need to decide how many NICs I'm going to shove into it... I'm thinking 4-6 :)
 
Yeah, that is still nothing to scoff at... if you are only accessing it over a single GigE line, that's loads of speed.

The box I tested on is sitting in my lab... should be getting 14 more 1TB drives and a SAS expander here this month to configure as 3x 7-drive raidz2s... Need to decide how many NICs I'm going to shove into it... I'm thinking 4-6 :)
As many NICs as can fit is how I see it :D

I have dual Intel NICs on the server, but it would be pointless to team them since the server sits alone in a back room with only one network run to it (actually to a cheapo home switch... it works well enough and supports jumbo frames) :( I should invest in some Cisco GigE switches (i.e. use certification as an excuse to upgrade my home network, haha) and make a few more network runs.

That's why I'm kind of curious what application Packetboy is using this for. I will say, sometimes it sucks to max out at ~100MB/sec. Good thing the other users on my network don't suck up the backbone too hard.
 
Yeah, that is still nothing to scoff at... if you are only accessing it over a single GigE line, that's loads of speed.

The box I tested on is sitting in my lab... should be getting 14 more 1TB drives and a SAS expander here this month to configure as 3x 7-drive raidz2s... Need to decide how many NICs I'm going to shove into it... I'm thinking 4-6 :)
Tau, have you seen any benchmarks comparing raidz1 vs raidz2?

I think raidz2 is my long-term plan regardless. Even though 2x 4-drive raidz1 vs. 1x 8-drive raidz2 technically allows for the same number of drive failures, I like the distribution of those chances in a raidz2 vdev.

One more thing, packetboy: decide early on if you want to use compression/dedupe on any of the filesystems. I think the consensus is to stay away from dedupe right now (right?). For compression... anything above gzip-1 does seem to eat CPU cycles (I'm using a single E5520 right now).
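
If you do turn compression on, it's per-filesystem and only affects newly written data. A quick sketch (the filesystem name is just an example):

Code:
# light compression on one filesystem -- name is an example
zfs set compression=gzip-1 monster/captures
# see what it's actually buying you
zfs get compressratio monster/captures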
 
Tau, have you seen any benchmarks comparing raidz1 vs raidz2?

I think raidz2 is my long-term plan regardless. Even though 2x 4-drive raidz1 vs. 1x 8-drive raidz2 technically allows for the same number of drive failures, I like the distribution of those chances in a raidz2 vdev.

One more thing, packetboy: decide early on if you want to use compression/dedupe on any of the filesystems. I think the consensus is to stay away from dedupe right now (right?). For compression... anything above gzip-1 does seem to eat CPU cycles (I'm using a single E5520 right now).

The difference between raidz and raidz2 is negligible at best... CPUs are fast enough right now that the extra parity calcs take femtoseconds anyway... so at most there would be 1, MAYBE 2% difference (on BIG arrays, think Thumper big...).

IMO, I stopped looking at RAID5 ages ago due to how prone the array is to losing another disk during rebuilds... I have spent many sleepless nights sweating away watching a RAID5 rebuild... RAID6 gives me better peace of mind... lol :p

It really comes down to what kind of use the array is going to be serving... sometimes RAID10 is better, others RAID6/60; it all depends.

I have plans to build a little poor man's home SAN box here shortly (once my proof-of-concept box is finished... lol) that I will be using for my VM storage... should be interesting.

And yes, I plan to stuff as many NICs in the box as I can; might cave in and order a few 4-port Intels :) Just need to find a motherboard with the right slot configuration now >.<
 
I didn't see this posted, Packetboy, what will you be using your OpenSolaris server for?

I do a tremendous amount of packet analysis (e.g. tcpdump, Wireshark)...often handling 10-20GB of new packet captures per day.

I'm sick and tired of building servers with fixed storage expansion for data and dedicated drives for operating system boot...which then need TWO drives for redundancy. I'm sick and tired of hardware RAID solutions that don't scale and often fail to protect data (I just had the RAID array on one of my other servers fail for the third time in 5 years, and that was despite running a 5-drive RAID5 with *four* hot spares)...one second the RAID controller said every drive was fine; after a reboot it suddenly decided that 2 drives in the array had failed, and goodbye data.

I have decided to divorce storage fully from the computational nodes and seek more reliable/scalable RAID.

I also have a CPU farm in my basement (150 Celeron P3s) and am looking to cut power and cooling costs by having them all iSCSI boot from the new storage server instead of using a local hard drive (note: a VM-based approach won't work for reasons I can't go into).

My final design for my "Home NAS" looks something like this:

Compute nodes
1G 1G 1G 1G ...
| | | | ...
[ethernet switch]
|
10G
|
ZFS/NFS/CIFS/iSCSI Storage
(probably a dual 6-core AMD server)
| - SAS cascade
Supermicro SC847 - 45 2TB drives
| - SAS cascade
Supermicro SC847 - 45 2TB drives

I figure that should sufficiently smoke.
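
For the iSCSI boot piece, the COMSTAR side would be roughly one zvol per node, something like the sketch below (names, sizes and the GUID are made up, and the exact service/command details may differ by build):

Code:
# one zvol per diskless node, exported via COMSTAR -- names/sizes are examples
zfs create -V 8G monster/node001
sbdadm create-lu /dev/zvol/rdsk/monster/node001
stmfadm add-view 600144f0xxxxxxxx     # GUID as printed by sbdadm create-lu
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target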
 
Yeah, that is still nothing to scoff at... if you are only accessing it over a single GigE line, that's loads of speed.

On that note, with 1GbE so cheap, I'm looking at perhaps just using a NIC and switch setup that will support port trunking, so that the storage server could be connected via 4x 1GbE instead of spending the ridiculous sums necessary for just a single 10GbE connection (10GbE seems to run about $1000 per switch port and $500 per NIC...even more if you want an iSCSI HBA with TCP offload)...if that's even possible, as the number of 10GbE NICs supported on OpenSolaris is slim.
 
On that note, with 1GbE so cheap, I'm looking at perhaps just using a NIC and switch setup that will support port trunking, so that the storage server could be connected via 4x 1GbE instead of spending the ridiculous sums necessary for just a single 10GbE connection (10GbE seems to run about $1000 per switch port and $500 per NIC...even more if you want an iSCSI HBA with TCP offload)...if that's even possible, as the number of 10GbE NICs supported on OpenSolaris is slim.

You know it's funny, I was JUST looking at some 10GigE stuff for home the other day... I found adapters for ~$300 apiece, though a switch was a little pricey for my home lab... $14K for a 20-port switch... I don't think the other half would like that one too much...

I plan to roll out one box with ~4 NICs, and the other with 8... or until I run out of PCI-E slots... LOL.

Then I'm just going to bond them all together to form a trunk for each box. That should supply the bandwidth needed... Just need to get the spindle count right to get the speeds that I need now >.<
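
The bonding itself on the OpenSolaris side is just a dladm aggregation; roughly the sketch below, where the link names and address are placeholders and the switch ports need matching trunk/LACP config:

Code:
# bundle four links into one aggregation -- link names are placeholders
dladm create-aggr -l e1000g0 -l e1000g1 -l e1000g2 -l e1000g3 aggr0
dladm modify-aggr -L active aggr0     # LACP, if the switch supports it
ifconfig aggr0 plumb 192.168.1.10/24 up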
 
I hate to ask this, but why 150x P3 Celerons? It seems like there would be a way to cut power/space by moving to a newer CPU farm.

I'm not entirely sure that you'll need dual 6-core CPUs to run 180TB of storage. I'm guessing one would be fine if you are doing dual-parity calcs.
 
I hate to ask this, but why 150x P3 Celerons? It seems like there would be a way to cut power/space by moving to a newer CPU farm.

Space is not a problem...I have a 500 sq ft data center in my basement. I'm able to hold about 50 of the PCs (they are in mini-cases) on a single Metro shelving rack.

I have a single program that I can run per system (and can't run more than one per system)...these systems were FREE and are actually pretty power efficient (due to the Celeron)...they pull about 40W apiece...about 10W of that is hard drive, which I should be able to eliminate via iSCSI. Even at 40W, that's about $0.10/day/system, or ~$3.00/month. Even if I could cut power by 50% with new PCs costing $100 each, I'd still be looking at nearly a 6-year payback.

Trivia question: "What NOT to have at the end of your driveway when your wife gets home from work?"

Answer:

RearSide.JPG
 
Ok I hope your wife is a system admin or something :) How many children do you "employ" to tend to that "farm"?
 
Decided to blow away the raidz pool and try throughput on a pure stripe:

Code:
zfs01:~# zpool destroy tank
WARNING: it doesn't even prompt you 'are you sure'!!!

zfs01:~# zpool create monster c7t8d0 c7t9d0 c7t10d0 c7t11d0 c7t12d0 c7t13d0 c7t14d0 c7t15d0 
(10 seconds)!

zfs01:~# time dd if=/dev/zero of=/monster/filezeros.tst  bs=4096 count=48828
48828+0 records in
48828+0 records out

real    0m0.650s
user    0m0.013s
sys     0m0.319s

200MB / 0.650s = ~300MB/s
 
Same test with raidz2:

Code:
zfs01:~# zpool create monster raidz2 c7t8d0 c7t9d0 c7t11d0 c7t12d0 
zfs01:~# zpool add monster raidz2 c7t10d0 c7t13d0 c7t14d0 c7t15d0 
zfs01:~# zpool status
  pool: monster
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        monster      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t9d0   ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t12d0  ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
            c7t13d0  ONLINE       0     0     0
            c7t14d0  ONLINE       0     0     0
            c7t15d0  ONLINE       0     0     0

zfs01:~# time dd if=/dev/zero of=/monster/filezeros.tst  bs=4096 count=48828
48828+0 records in
48828+0 records out

real    0m1.151s
user    0m0.013s
sys     0m0.320s

200MB / 1.151s = 174MB/s

Hence: raidz2 is (200MB/s - 174MB/s) / 200MB/s = 13% slower than raidz
 
Same test with raidz2:

(zpool layout and dd output quoted above)

200MB / 1.151s = 174MB/s

Hence: raidz2 is (200MB/s - 174MB/s) / 200MB/s = 13% slower than raidz



Try writing a large file (10GB would be good) and watching iostat... that seems too slow, or did the extra stripe actually degrade performance?
 
It's spreading it out pretty evenly...though only 30MB/s per drive.

Note, I have NOT done any zfs tuning at all.

Code:
zfs01:~# zpool iostat -v monster 10

                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
monster      13.4G  12.7T      1  1.04K    626   122M
  raidz2     6.71G  7.24T      0    534    481  60.9M
    c7t8d0       -      -      0    284    144  30.0M
    c7t9d0       -      -      0    288    337  30.1M
    c7t11d0      -      -      0    287      0  30.4M
    c7t12d0      -      -      0    288      0  30.5M
  raidz2     6.71G  5.43T      0    531    144  60.8M
    c7t10d0      -      -      0    284      0  30.3M
    c7t13d0      -      -      0    286    144  30.4M
    c7t14d0      -      -      0    284      0  30.3M
    c7t15d0      -      -      0    284      0  30.3M
-----------  -----  -----  -----  -----  -----  -----
 
Space is not a problem...I have a 500 sq ft data center in my basement. I'm able to hold about 50 of the PCs (they are in mini-cases) on a single Metro shelving rack.

I have a single program that I can run per system (and can't run more than one per system)...these systems were FREE and are actually pretty power efficient (due to the Celeron)...they pull about 40W apiece...about 10W of that is hard drive, which I should be able to eliminate via iSCSI. Even at 40W, that's about $0.10/day/system, or ~$3.00/month. Even if I could cut power by 50% with new PCs costing $100 each, I'd still be looking at nearly a 6-year payback.

You could virtualize those systems (para-virtualization). ;)
Three cheap AMD dual quad-core systems ($1500-2000 for all three) might perform better.
You will get that money back in 2 years or less.
 
What drives/controller are you using? Your speeds seem a bit low to me....

Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1       3.76T  2.55T      0  2.19K      0   271M
  raidz2    3.76T  2.55T      0  2.19K      0   271M
    c3t0d0      -      -      0    541      0  52.5M
    c3t1d0      -      -      0    562      0  54.6M
    c3t2d0      -      -      0    567      0  55.1M
    c3t3d0      -      -      0    565      0  54.9M
    c3t4d0      -      -      0    572      0  55.6M
    c3t5d0      -      -      0    560      0  54.5M
    c3t6d0      -      -      0    552      0  53.6M
----------  -----  -----  -----  -----  -----  -----

That correlates to the speeds that I got the other day (~47MB/s per drive, based on my math of ~330MB/s write). I am pretty impressed with the performance thus far... now if I could just get that kind of speed over CIFS >.<

Try rebuilding as a single 8-disk raidz2 and see the results.

I plan on doing a HUGE comparison here some time this month when the rest of my parts come in for this box, so I will be able to do some complex analysis and tweaking.
 
What drives/controller are you using? Your speeds seem a bit low to me....

Try rebuilding as a single 8-disk raidz2 and see the results.

Drive specs and the controller are detailed on the first page of this thread.

I tried creating a single 8-disk raidz2; however, this is not possible as the drives are not all the same size (a mix of 1.5TB and 2.0TB).

...however, a large package arrived today...24x 2TB Hitachis...tomorrow I'll rip out all the current drives and repopulate with all Hitachi 2TBs...then retest. I should probably test these drives individually as well.
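
The crude way to baseline each drive on its own is a raw sequential read straight off the device, e.g. the sketch below (device name is just an example; exact slice naming can vary):

Code:
# ~2GB raw sequential read from one disk, bypassing ZFS
dd if=/dev/rdsk/c7t16d0p0 of=/dev/null bs=1048576 count=2048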
 
Drive specs and the controller are detailed on the first page of this thread.

I tried creating a single 8-disk raidz2; however, this is not possible as the drives are not all the same size (a mix of 1.5TB and 2.0TB).

...however, a large package arrived today...24x 2TB Hitachis...tomorrow I'll rip out all the current drives and repopulate with all Hitachi 2TBs...then retest. I should probably test these drives individually as well.

Nice, make sure to get some benches up on those drives for sure!
 
Nice, make sure to get some benches up on those drives for sure!
For sure, I'd be really interested in what an 8x raidz2 of Hitachi drives can do.

Out of curiosity, how's the CPU usage during the benchmarks?
 
For sure, I'd be really interested in what an 8x raidz2 of Hitachi drives can do.

Out of curiosity, how's the CPU usage during the benchmarks?

Code:
Tau@Skynet:~# sar -u 10 60

SunOS Skynet 5.11 snv_111b i86pc    04/06/2010

18:20:28    %usr    %sys    %wio   %idle
18:20:38       0      12       0      88
18:20:48       2      54       0      44
18:20:58       2      55       0      43
18:21:08       2      52       0      46
18:21:18       2      50       0      48
18:21:28       2      54       0      44
18:21:38       2      53       0      45
18:21:48       2      50       0      48
18:21:58       2      55       0      43
18:22:08       2      52       0      46
18:22:18       2      51       0      47
18:22:28       2      52       0      46
18:22:38       2      45       0      53
18:22:48       0       6       0      94
18:22:58       2      53       0      45
18:23:08       2      44       0      55
18:23:18       2      52       0      45
18:23:28       3      51       0      46
18:23:38       2      53       0      45
18:23:49       2      52       0      46
18:23:59       3      51       0      46
18:24:09       2      54       0      44
18:24:19       2      53       0      45
18:24:29       3      51       0      46
18:24:39       2      54       0      43
18:24:49       2      55       0      43
18:24:59       3      51       0      46
18:25:09       2      55       0      43
18:25:20       2      53       0      45
18:25:30       2      50       0      47


Call it 50%, which kind of makes me sad since the box is a Q6600.


On a side note, I set up iSCSI this morning to see if I could get faster speeds than I have been getting over CIFS...

Using the COMSTAR target on the Solaris box, connecting with an x64 Vista Business machine (all over GigE).

I get 60-80MB/s over CIFS, though with the new iSCSI target setup I start the transfer at 130MB/s, it drops to ~110 for about 1-2 seconds, then nose-dives to ~50, then steadily falls down to 10MB/s... I have been playing with this thing all morning and can't get it figured out... Any ideas? I was totally expecting iSCSI to max out my GigE line...

iostat during the iSCSI transfer shows I'm writing at ~2.5MB/s to each disk, and I'm using 5% CPU... I can't figure out what the issue is :(:confused:
 
Ripped out the mixed Seagate 1.5TB, Seagate 2.0TB and Hitachi drives and installed 8 x Hitachi (HDS722020ALA330).

The first thing I LOVED: I removed all 8 drives, plugged in the 8 new Hitachis, and the LSI SAS controller/OpenSolaris automatically assigned 8 new device IDs to the drives...it was awesome. I have NEVER been able to hot-swap drives in eSATA port-multiplier enclosures with any kind of reliability...this was a complete BREEZE.
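
(If a new drive ever doesn't show up on its own, the quick sanity checks are what the controller and the OS think is attached:)

Code:
# list attachment points the controller reports, and disks the OS sees
cfgadm -al
echo | format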

So on to the benchmarking...first I found this:

http://lethargy.org/~jesus/writes/disk-benchmarking-with-dd-dont

Put all 8 drives in a ZFS stripe (RAID0) and got this from FileBench doing the large-file write test:


Code:
zfs01:~# /opt/filebench/bin/amd64/go_filebench
FileBench Version 1.3.4
filebench> load multistreamwrite
 6544: 6.056: Multi Stream Write Version 2.0 personality successfully loaded
 6544: 6.056: Usage: set $dir=<dir>
 6544: 6.056:        set $filesize=<size>    defaults to 1073741824
 6544: 6.056:        set $nthreads=<value>   defaults to 1
 6544: 6.056:        set $iosize=<value> defaults to 1048576
 6544: 6.056:        set $directio=<bool> defaults to 0
 6544: 6.056:
 6544: 6.056:        run runtime (e.g. run 60)
filebench> set $dir=/monster
filebench> run 10
 6544: 18.569: Creating/pre-allocating files and filesets
 6544: 18.569: File largefile4: mbytes=1024
 6544: 18.579: Creating file largefile4...
 6544: 22.375: Preallocated 1 of 1 of file largefile4 in 4 seconds
 6544: 22.376: File largefile3: mbytes=1024
 6544: 22.383: Creating file largefile3...
 6544: 26.080: Preallocated 1 of 1 of file largefile3 in 4 seconds
 6544: 26.080: File largefile2: mbytes=1024
 6544: 26.094: Creating file largefile2...
 6544: 29.767: Preallocated 1 of 1 of file largefile2 in 4 seconds
 6544: 29.767: File largefile1: mbytes=1024
 6544: 29.776: Creating file largefile1...
 6544: 33.545: Preallocated 1 of 1 of file largefile1 in 4 seconds
 6544: 33.545: waiting for fileset pre-allocation to finish
 6544: 33.545: Starting 1 seqwrite instances
 6545: 34.555: Starting 1 seqwrite4 threads
 6545: 34.556: Starting 1 seqwrite3 threads
 6545: 34.556: Starting 1 seqwrite2 threads
 6545: 34.556: Starting 1 seqwrite1 threads
 6544: 37.565: Running...
 6544: 47.665: Run took 10 seconds...
 6544: 47.665: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite4                  55ops/s  54.8mb/s     17.8ms/op      471us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite3                  49ops/s  48.8mb/s     19.9ms/op      480us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite2                  52ops/s  51.6mb/s     18.3ms/op      448us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite1                  86ops/s  86.3mb/s     11.3ms/op      409us/op-cpu

 6544: 47.665:
IO Summary:       2443 ops 241.9 ops/s, (0/242 r/w) 241.5mb/s,   3622us cpu/op,  16.0ms latency
 6544: 47.665: Shutting down processes

*Process* CPU utilization was very low through the test...however, *Kernel* CPU utilization was about 20%:

Code:
zfs01:~$ sar -u 10 60

SunOS zfs01 5.11 snv_111b i86pc    04/06/2010

19:47:19    %usr    %sys    %wio   %idle
19:47:29       0      22       0      78
19:47:39       0      19       0      81
19:47:49       0      21       0      79
19:47:59       0      17       0      83
 
Code:
zfs01:~# zpool create monster raidz c7t16d0 c7t17d0 c7t18d0 c7t19d0 c7t20d0 c7t21d0 c7t22d0 c7t23d0

FileBench Version 1.3.4
filebench> load   multistreamwrite
 6655: 23.946: Multi Stream Write Version 2.0 personality successfully loaded
 6655: 23.946: Usage: set $dir=<dir>
 6655: 23.946:        set $filesize=<size>    defaults to 1073741824
 6655: 23.946:        set $nthreads=<value>   defaults to 1
 6655: 23.946:        set $iosize=<value> defaults to 1048576
 6655: 23.946:        set $directio=<bool> defaults to 0
 6655: 23.946:
 6655: 23.946:        run runtime (e.g. run 60)
filebench> set $dir=/monster
filebench> run 30
 6655: 35.239: Creating/pre-allocating files and filesets
 6655: 35.239: File largefile4: mbytes=1024
 6655: 35.241: Creating file largefile4...
 6655: 39.772: Preallocated 1 of 1 of file largefile4 in 5 seconds
 6655: 39.772: File largefile3: mbytes=1024
 6655: 39.789: Creating file largefile3...
 6655: 43.951: Preallocated 1 of 1 of file largefile3 in 5 seconds
 6655: 43.951: File largefile2: mbytes=1024
 6655: 43.955: Creating file largefile2...
 6655: 48.072: Preallocated 1 of 1 of file largefile2 in 5 seconds
 6655: 48.072: File largefile1: mbytes=1024
 6655: 48.076: Creating file largefile1...
 6655: 52.413: Preallocated 1 of 1 of file largefile1 in 5 seconds
 6655: 52.413: waiting for fileset pre-allocation to finish
 6655: 52.413: Starting 1 seqwrite instances
 6660: 53.418: Starting 1 seqwrite4 threads
 6660: 53.418: Starting 1 seqwrite3 threads
 6660: 53.418: Starting 1 seqwrite2 threads
 6660: 53.418: Starting 1 seqwrite1 threads
 6655: 56.428: Running...
 6655: 86.769: Run took 30 seconds...
 6655: 86.774: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite4                  46ops/s  45.7mb/s     21.4ms/op      434us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite3                  38ops/s  38.1mb/s     25.8ms/op      448us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite2                  41ops/s  41.4mb/s     23.8ms/op      462us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite1                  51ops/s  51.1mb/s     19.4ms/op      425us/op-cpu

 6655: 86.774:
IO Summary:       5354 ops 176.5 ops/s, (0/176 r/w) 176.3mb/s,   4731us cpu/op,  22.3ms latency
 6655: 86.774: Shutting down processes

Code:
SunOS zfs01 5.11 snv_111b i86pc    04/06/2010

19:50:30    %usr    %sys    %wio   %idle
19:50:40       0      20       0      80
19:50:50       0      19       0      81
19:51:00       0      21       0      79
19:51:10       0      20       0      80
 
I recall doing some reading about a week ago regarding iSCSI ZFS performance being crap. Not sure if that had a resolution. It has certainly put a halt to my plans for a ZFS iSCSI storage array.
 
Code:
zfs01:~# zpool create monster raidz2 c7t16d0 c7t17d0 c7t18d0 c7t19d0 c7t20d0 c7t21d0 c7t22d0 c7t23d0

zfs01:~# /opt/filebench/bin/amd64/go_filebench
FileBench Version 1.3.4
filebench> load   multistreamwrite
 6736: 7.259: Multi Stream Write Version 2.0 personality successfully loaded
 6736: 7.259: Usage: set $dir=<dir>
 6736: 7.259:        set $filesize=<size>    defaults to 1073741824
 6736: 7.259:        set $nthreads=<value>   defaults to 1
 6736: 7.259:        set $iosize=<value> defaults to 1048576
 6736: 7.259:        set $directio=<bool> defaults to 0
 6736: 7.259:
 6736: 7.259:        run runtime (e.g. run 60)
filebench> set $dir=/monster
filebench> run 60
 6736: 17.241: Creating/pre-allocating files and filesets
 6736: 17.241: File largefile4: mbytes=1024
 6736: 17.255: Creating file largefile4...
 6736: 23.173: Preallocated 1 of 1 of file largefile4 in 6 seconds
 6736: 23.173: File largefile3: mbytes=1024
 6736: 23.196: Creating file largefile3...
 6736: 27.966: Preallocated 1 of 1 of file largefile3 in 5 seconds
 6736: 27.966: File largefile2: mbytes=1024
 6736: 27.971: Creating file largefile2...
 6736: 32.610: Preallocated 1 of 1 of file largefile2 in 5 seconds
 6736: 32.610: File largefile1: mbytes=1024
 6736: 32.612: Creating file largefile1...
 6736: 37.324: Preallocated 1 of 1 of file largefile1 in 5 seconds
 6736: 37.324: waiting for fileset pre-allocation to finish
 6736: 37.324: Starting 1 seqwrite instances
 6739: 38.328: Starting 1 seqwrite4 threads
 6739: 38.328: Starting 1 seqwrite3 threads
 6739: 38.328: Starting 1 seqwrite2 threads
 6739: 38.328: Starting 1 seqwrite1 threads
 6736: 41.338: Running...
 6736: 102.228: Run took 60 seconds...
 6736: 102.228: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite4                  38ops/s  37.7mb/s     26.2ms/op      506us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite3                  38ops/s  38.4mb/s     25.8ms/op      495us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite2                  34ops/s  34.2mb/s     28.8ms/op      511us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqwrite1                  32ops/s  32.1mb/s     30.7ms/op      523us/op-cpu

 6736: 102.228:
IO Summary:       8677 ops 142.5 ops/s, (0/143 r/w) 142.4mb/s,   5808us cpu/op,  27.7ms latency
 6736: 102.228: Shutting down processes

Code:
zfs01:~$ sar -u 10 60

SunOS zfs01 5.11 snv_111b i86pc    04/06/2010

19:53:56    %usr    %sys    %wio   %idle
19:54:06       0      30       0      70
19:54:16       0      21       0      79
19:54:26       0      21       0      79
19:54:36       0      22       0      78
19:54:46       0      21       0      79
19:54:56       0      21       0      79
19:55:06       0      19       0      81
19:55:17       0      20       0      80
 
I recall doing some reading about a week ago regarding iSCSI ZFS performance being crap. Not sure if that had a resolution. It has certainly put a halt to my plans for a ZFS iSCSI storage array.

There are certainly major performance implications of running ZFS on Linux (e.g. ZFS-FUSE), as it is implemented in user space vs. kernel space. I cringed at having to run my storage server on a new OS I know nothing about (OpenSolaris) just to get the performance, but so far it has been easy.

See the benchmarks I've just posted. Very simple, using low-performance consumer SATA drives (Hitachi 2TB)...yielding 150MB/s to 300MB/s (depending on request size and RAID configuration). Even with raidz2 (RAID6) throughput is 150MB/s.

That is enough to saturate a 1Gbps LAN (1000Mbps / 8 bits per byte = 125MB/s)...as long as you can achieve that, it doesn't really matter.
 
There are certainly major performance implications of running ZFS on Linux (e.g. ZFS-FUSE), as it is implemented in user space vs. kernel space. I cringed at having to run my storage server on a new OS I know nothing about (OpenSolaris) just to get the performance, but so far it has been easy.

See the benchmarks I've just posted. Very simple, using low-performance consumer SATA drives (Hitachi 2TB)...yielding 150MB/s to 300MB/s (depending on request size and RAID configuration). Even with raidz2 (RAID6) throughput is 150MB/s.

That is enough to saturate a 1Gbps LAN (1000Mbps / 8 bits per byte = 125MB/s)...as long as you can achieve that, it doesn't really matter.
Well, the results I've seen are purely OpenSolaris, not ZFS-FUSE on Linux. The local performance was never in question; iSCSI only. I suppose you could make an iSCSI target on your ZFS volume and test for us, considering you have it to play with for now. :p It would go a long way in providing further information for us to work with.
 
Well, the results I've seen are purely OpenSolaris, not ZFS-FUSE on Linux. The local performance was never in question; iSCSI only.

Agreed, a whole round of iSCSI, NFS and CIFS benchmarking is necessary...I'm deferring that until I make sure I have ZFS operating as expected locally first.

The good news is I have 20 years of packet analysis experience, so if throughput suddenly dies once we introduce the LAN, I should be able to figure it out.

I suspect it hinges a lot on having the TCP stack tuned properly (e.g. TCP_RWIN) and jumbo frames enabled.
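
On the OpenSolaris side that is mostly ndd plus the link MTU; the values and link name below are just illustrative starting points, not tuned recommendations:

Code:
# larger TCP windows/buffers -- example values only
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_recv_hiwat 1048576
ndd -set /dev/tcp tcp_xmit_hiwat 1048576
# jumbo frames on the storage-facing link -- link name is a placeholder
# (the link usually has to be unplumbed before the MTU change takes)
dladm set-linkprop -p mtu=9000 e1000g0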

Here is an iSCSI test from a Fedora box to another Fedora box...it's even going through an intermediate firewall! 1Gbps LAN, jumbo frames enabled:

Code:
# dd if=/dev/zero of=/mnt/blo0001/tstfile.tst bs=1000000 count=100
100+0 records in
100+0 records out
100000000 bytes (100 MB) copied, 1.05278 s, 95.0 MB/s

Here's the killer: the target drive is a SINGLE SATA drive...***and*** it's a Blowfish 256-bit encrypted filesystem!
 
I get 60-80MB/s over CIFS, though with the new iSCSI target setup I start the transfer at 130MB/s, it drops to ~110 for about 1-2 seconds, then nose-dives to ~50, then steadily falls down to 10MB/s... I have been playing with this thing all morning and can't get it figured out... Any ideas? I was totally expecting iSCSI to max out my GigE line...

Kick off a tcpdump while throughput is sucking wind, e.g.:

tcpdump -U -s 0 -w cifs_test.pcap port *insert_iscsi_tcp_port_number*

Let the capture run for 15-30 seconds while throughput is low.

IM me and we'll figure out how to arrange time to analyze the pcap together.

Do you have a managed switch? i.e. so you can check the ports for Ethernet errors?

The most common Ethernet problem I run into is a duplex mismatch (e.g. where the host and the switch port it's connected to do NOT negotiate full duplex...or someone has forced something to half). That will create a situation where if you are only using 10% of your network capacity you won't even notice a problem, but as soon as you try to burst traffic the whole thing just collapses due to massive packet errors, *late* collisions, and packets generally being dropped all over the floor.
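
Even without a managed switch you can at least sanity-check the host side; on OpenSolaris something like this (interface names will differ):

Code:
dladm show-phys     # negotiated speed/duplex per physical link
netstat -i          # watch the Ierrs/Oerrs counters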
 