Sanity check - performance bottleneck with Xeon CPU on ZFS SSD pool

MarkL

Hi guys,

I've been trying to figure out if my CPU really is my bottleneck here... hopefully someone with some experience can weigh in or point out what I am missing.

I am setting up a new ZFS storage box.

Basic specs:

Supermicro 16x 2.5" bay chassis - CSE-213A-R740WB
Memory - 64 GB
CPU - 1x E5-2603 V2
HBA - 2x AOC-S2308L-L8i (9207-8i equivalent)
Drives - 7x 840 Pro 512GB - intended to be used as a 6-disk RAIDZ2 with 1 spare.
OS - OmniOS (latest) with Napp-It on top

I have everything installed and running. I created my RAIDZ2 array with 6 drives and ran a bunch of benchmarks, but I was only seeing just over 500 MB/s write. I then tested a bunch of different pool configurations and I seem to be hitting some kind of limit on writes as soon as the drives are striped. I am using Bonnie++ for my benchmarking. (NOTE: I have the drives split across the HBAs as well - 4 on one, 3 on the other.)

This is what I get for a single drive:

Bonnie command used for large pools:
Code:
 bonnie++ -d /Testpool/bm -u root -f -s 225000
Bonnie command used when testing multiple in parallel:
Code:
 bonnie++ -d /Testpool/bm -u root -f -s 50000 -r 10000
(The -r 10000 is just to force Bonnie to run a 50 GB test even though the system has 64 GB of memory - because I was running 3-7 instances at the same time.)


Code:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
2015.01.05  225000M           500686  76 163859  38           520021  37 12283  35
Basically - 500 MB/s write and 520 MB/s read. Great!

But as soon as I stripe two drives together (not RAIDZ, just a plain stripe - i.e. RAID0) I get:

Code:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
2015.01.06  100000M           673257  99 336536  77           998914  63 15904  44


So the read doubled, but the write only goes up to 673 MB/s.

Then I tried the same for 3x in a stripe:

Code:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
2015.01.06  100000M           675508  98 469291  93           1515632  86 +++++ +++

So read tripled, but write is still stuck in the same place. As the stripe width increases, the read also gets capped at around 1.6 GB/s:

Code:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
4X Stripe    100000M           666164  96 464853  93           1614523  91 +++++ +++
5X Stripe    100000M           664409  96 467964  93           1609386  92 +++++ +++
6X Stripe    100000M           660486  95 465797  92           1600067  91 +++++ +++


I did some more testing with multiple single-drive zpools, running Bonnie++ in parallel across them. When I do this - I see what I'd expect for performance:

For 6 drives at once:
Code:
Drive1  100000M           264298  43 158171  35           463943  33 11236  23
Drive2  100000M           267198  44 156709  35           465954  33 11554  24
Drive3  100000M           264293  43 156424  35           464482  33 10407  19
Drive4  100000M           264858  43 157579  35           458790  32 11728  23
Drive5  100000M           263174  43 156470  35           458161  32 10533  20
Drive6  100000M           266659  43 156296  35           460471  32 11214  23

Write total: ~1588 MB/s
Read total: ~2768 MB/s

For 7:
Code:
Drive1  100000M           223623  36 148292  33           407406  29 10308  22
Drive2  100000M           222744  36 149911  34           414207  29 12017  24
Drive3  100000M           224019  36 148816  33           416678  29 12855  25
Drive4  100000M           229917  37 146537  33           411769  29 11229  22
Drive5  100000M           231382  38 149004  34           408070  29 10735  23
Drive6  100000M           221885  36 149957  34           413474  29 11656  23
Drive7  100000M           226960  37 149537  34           405300  28 12302  25

Write total: ~1576 MB/s
Read total: ~2874 MB/s
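
For reference, the parallel runs were launched along these lines - one bonnie++ per single-drive pool, all in the background - and the totals above are just the per-instance numbers summed by hand (pool names and output paths are illustrative):
Code:
# one bonnie++ instance per single-drive pool, run in parallel
for p in Test1 Test2 Test3 Test4 Test5 Test6; do
    bonnie++ -d /$p/bm -u root -f -s 50000 -r 10000 > /tmp/$p.out 2>&1 &
done
wait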

So everything in the system seems to be able to handle the speed, but something to do with striping is limiting it.

I (maybe stupidly?) figured that any modern Xeon would be able to handle striping at 1 GB/s or so... but this is the cheapest E5 v2 CPU and it only clocks at 1.8 GHz.


So - does anyone have any suggestions for other things I could test to work out whether it really is the CPU or not?


Thanks
 
First, please use CODE tags if you post something that is intended for monospace fonts.
I gave up trying to decipher the bonnie outputs after 30 seconds.
If you post benchmark results please include the command/settings you used.

Please post the command you created the pool with. Are you sure that you are using multiple vdevs?
I assume you are doing something wrong - I have had pools of hard disks that were faster than 600 MB/s writing.

As a side note, a RAIDZ/Z2/Z3 vdev has roughly the same IOPS as a single disk.
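If you need the IOPS to scale, a pool of several mirror vdevs scales roughly with the number of vdevs - something like this (device names made up):
Code:
# three 2-way mirror vdevs instead of a single RAIDZ2 vdev
zpool create Testpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0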
 
Thanks for the reply, Omniscence. I've fixed up the tags - hopefully it's easier to read now.


For the zpools - for the stripe-only tests I just created them with:
Code:
zpool create Testpool <drive1> <drive2> <drive3>..
zfs set atime=off Testpool
zfs create Testpool/bm

For the single drives it was just 'zpool create Test1 <drive>' for each drive, then atime disabled and 'zfs create' as above.

For the raidz2 - the same as the stripe, but with 'raidz2' in front of the drive list.
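In other words, something along these lines (same placeholders as above):
Code:
zpool create Test1 raidz2 <drive1> <drive2> <drive3> <drive4> <drive5> <drive6>
zfs set atime=off Test1
zfs create Test1/bm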

Example of test raidz2 pool:
Code:
  pool: Test1
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        Test1                      ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c1t50025385A0258FADd0  ONLINE       0     0     0
            c1t50025385A02A6A0Bd0  ONLINE       0     0     0
            c1t50025385A0258FAEd0  ONLINE       0     0     0
            c1t50025385A025806Dd0  ONLINE       0     0     0
            c1t50025385A02A6A0Ad0  ONLINE       0     0     0
            c1t50025385A0258FA0d0  ONLINE       0     0     0
 
Just thinking out loud here: what does top report? Is there a lot of load?
 
Here is some debug output while running Bonnie++ on the 6-disk RaidZ2:

Code:
iostat (1-second intervals)
   tty       rpool          sd0          Test1          sd1            cpu
 tin tout kps tps serv  kps tps serv  kps tps serv  kps tps serv   us sy dt id
   0   12  35   3    1   19   2    0  29477 283   24   16   2    0    2  4  0 93
   0  293   0   0    0    0   0    0  836771 8270   22    0   0    0    2 49  0 49
   0  138   0   0    0    0   0    0  907010 8401   30    0   0    0    2 51  0 47
   0  138   0   0    0    0   0    0  817664 8037   24    0   0    0    1 50  0 48
   0  137   0   0    0    0   0    0  867887 8201   30    0   0    0    2 51  0 47
   0 5256   0   0    0    0   0    0  855080 8110   23    0   0    0    1 50  0 49
   0  137   0   0    0    0   0    0  869652 8307   28    0   0    0    1 51  0 47
   0  139   0   0    0    3  12    0  855753 8384   23    3  12    0    2 51  0 46
   0  139   0   0    0    0   0    0  884929 8249   33    0   0    0    1 52  0 47
   0  137   0   0    0    0   0    0  840475 8216   24    0   0    0    2 49  0 49
   0 5229   0   0    0    0   0    0  857748 8705   29    0   0    0    1 52  0 47
   0  138   0   0    0    0   0    0  864483 8337   24    0   0    0    2 50  0 48
   0  137   0   0    0    0   0    0  864123 7985   24    0   0    0    1 51  0 48
   0  139   0   0    0    0   0    0  854600 8242   24    0   0    0    2 50  0 49
   0  138   0   0    0    0   0    0  809870 7713   22    0   0    0    2 49  0 50
   0 5201   0   0    0    0   0    0  876094 8238   25    0   0    0    2 52  0 47
   0  138   0   0    0    0   0    0  846134 8049   22    0   0    0    2 50  0 49
   0  138   0   0    0    0   0    0  886374 8215   24    0   0    0    1 51  0 48
   0  137   0   0    0    0   0    0  835242 8125   23    0   0    0    2 49  0 49

Code:
zpool iostat Test1 1

Test1       50.9G  2.73T      0  5.36K      0   579M
Test1       51.8G  2.73T      0  5.06K      0   522M
Test1       52.5G  2.73T      0  5.66K      0   584M
Test1       53.3G  2.73T      0  4.96K      0   519M
Test1       54.1G  2.73T      0  5.49K      0   583M
Test1       54.9G  2.73T      0  5.22K      0   538M
Test1       55.8G  2.73T      0  5.42K      0   558M
Test1       56.5G  2.73T      0  5.34K      0   566M
Test1       57.4G  2.73T      0  5.30K      0   537M
Test1       58.1G  2.72T      0  5.51K      0   576M
Test1       59.0G  2.72T      0  5.19K      0   527M
Test1       59.7G  2.72T      0  5.40K      0   576M
Test1       60.5G  2.72T      0  5.12K      0   526M
Test1       61.3G  2.72T      0  5.04K      0   521M
Test1       62.1G  2.72T      0  5.57K      0   578M
Test1       62.9G  2.72T      0  5.43K      0   571M
Test1       63.7G  2.72T      0  5.22K      0   532M
Test1       64.5G  2.72T      0  5.52K      0   583M
Test1       65.3G  2.72T      0  4.98K      0   519M

Code:
Top lines of prstat

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7077 root        0K    0K sleep   99  -20   0:03:57  28% zpool-Test1/229
 19131 root     3164K 1636K cpu0     0    0   0:02:51  20% bonnie++/1

Bottom line of prstat after a couple of minutes (~200 GB written)

Total: 90 processes, 714 lwps, load averages: 3.89, 2.31, 1.07
 
I personally use fio to benchmark my setups; you have very fine control over the type of workload.
I like it more than bonnie, and it is much better than dd (which is not a benchmark - I cannot understand why some people use it).

For example, to test my sequential write bandwidth on my rpool (single Samsung 830) I could do:
Code:
$ fio --name=writebw --filename=test --size=60g --direct=0 --end_fsync=1 --rw=write --bs=1m --iodepth=32 --refill_buffers
writebw: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=32
fio-2.1.3
Starting 1 process
writebw: Laying out IO file(s) (1 file(s) / 61440MB)
Jobs: 1 (f=1): [F] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
writebw: (groupid=0, jobs=1): err= 0: pid=24980: Thu Jan  8 20:29:26 2015
  write: io=61440MB, bw=293113KB/s, iops=286, runt=214643msec
    clat (usec): min=104, max=7714.3K, avg=3265.97, stdev=96546.48
     lat (usec): min=104, max=7714.3K, avg=3266.17, stdev=96546.48
    clat percentiles (usec):
     |  1.00th=[  108],  5.00th=[  111], 10.00th=[  113], 20.00th=[  121],
     | 30.00th=[  131], 40.00th=[  149], 50.00th=[  169], 60.00th=[  185],
     | 70.00th=[  940], 80.00th=[ 1656], 90.00th=[ 2480], 95.00th=[ 2832],
     | 99.00th=[ 3376], 99.50th=[ 3536], 99.90th=[ 5344], 99.95th=[1777664],
     | 99.99th=[4751360]
    bw (KB  /s): min= 1138, max=3239936, per=100.00%, avg=594880.71, stdev=723362.06
    lat (usec) : 250=68.83%, 500=0.85%, 750=0.09%, 1000=0.58%
    lat (msec) : 2=13.78%, 4=15.66%, 10=0.13%, 100=0.01%, 750=0.01%
    lat (msec) : 2000=0.04%, >=2000=0.05%
  cpu          : usr=4.85%, sys=4.40%, ctx=163028, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=61440/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=61440MB, aggrb=293112KB/s, minb=293112KB/s, maxb=293112KB/s, mint=214643msec, maxt=214643msec

You should delete the test file afterwards.

This is a ZFS on Linux system; on Illumos-based systems you may have to use slightly different settings.
Benchmarking ZFS can be difficult because non-sync writes go to the ARC first.
If the working set is too small and the benchmark does not sync at the end, you may just be measuring ARC speed.
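On an Illumos/OmniOS box you can at least keep an eye on how big the ARC grows during a run, e.g. with kstat (statistic names may differ slightly between releases):
Code:
# current ARC size and its configured maximum, in bytes
kstat -p zfs:0:arcstats:size
kstat -p zfs:0:arcstats:c_max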

EDIT: You may want to consider ashift=13 for the 840 Pro; they have 8K page sizes.
Depending on what you want to store on your RAIDZ2 (e.g. zvols with small block sizes) you may, however, waste a lot of space.
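On ZFS on Linux you can force the ashift at pool creation time; on OmniOS you typically have to override the reported block size in sd.conf instead (see further down). A rough sketch, device names made up:
Code:
# ZFS on Linux: force 8K alignment when creating the pool
zpool create -o ashift=13 Testpool raidz2 sdb sdc sdd sde sdf sdg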
 
Wow - thanks for the tip, Omniscence!

Maybe bonnie++ is what's at fault here.

I just got fio installed, ran it, and got much better results:

Code:
/opt/csw/bin/fio --name=writebw --filename=/Test1/bm/writetest --size=250g --direct=0 --end_fsync=1 --rw=write --bs=1m --iodepth=32
writebw: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=32
fio-2.0.14
Starting 1 process
writebw: Laying out IO file(s) (1 file(s) / 256000MB)
Jobs: 1 (f=1): [F] [100.0% done] [0K/0K/0K /s] [0 /0 /0  iops] [eta 00m:00s]
writebw: (groupid=0, jobs=1): err= 0: pid=1906: Thu Jan  8 14:25:20 2015
  write: io=2048.0MB, bw=1322.4MB/s, iops=1322 , runt=193601msec
    clat (usec): min=346 , max=235384 , avg=564.90, stdev=1795.77
     lat (usec): min=358 , max=236774 , avg=702.29, stdev=2308.29
    clat percentiles (usec):
     |  1.00th=[  354],  5.00th=[  362], 10.00th=[  362], 20.00th=[  366],
     | 30.00th=[  370], 40.00th=[  378], 50.00th=[  386], 60.00th=[  398],
     | 70.00th=[  414], 80.00th=[  454], 90.00th=[  660], 95.00th=[  772],
     | 99.00th=[ 3408], 99.50th=[ 6304], 99.90th=[22656], 99.95th=[34560],
     | 99.99th=[69120]
    bw (MB/s)  : min=  239, max= 2603, per=100.00%, avg=1368.47, stdev=408.28
    lat (usec) : 500=84.02%, 750=10.24%, 1000=1.84%
    lat (msec) : 2=2.24%, 4=0.82%, 10=0.55%, 20=0.16%, 50=0.11%
    lat (msec) : 100=0.02%, 250=0.01%
  cpu          : usr=3.88%, sys=52.90%, ctx=224263, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=0/d=256000, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=256000MB, aggrb=1322.4MB/s, minb=1322.4MB/s, maxb=1322.4MB/s, mint=193601msec, maxt=193601msec

I am doing more testing now..
 
If you use a filesystem with compression you may have to add '--refill_buffers'; I added it to my command line above.
Otherwise you may get compressible data, depending on the I/O size.
 
Cool - Thanks!

I haven't enabled compression yet, as I wanted to test raw performance first.
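
When I do turn it on, it should just be a matter of setting the property on the dataset, e.g.:
Code:
zfs set compression=lz4 Testpool/bm
zfs get compression,compressratio Testpool/bm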
 
EDIT: You may want to consider ashift=13 for the 840 Pro; they have 8K page sizes.
Depending on what you want to store on your RAIDZ2 (e.g. zvols with small block sizes) you may, however, waste a lot of space.

I had updated sd.conf on my install so that it knows about the 840 Pro, and it is already creating the zpools with ashift=13.
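
For reference, the sd.conf entry is along these lines (the vendor/product string has to match exactly what the drive reports - check it with 'iostat -En' - and it only takes effect for pools created after the config is reloaded, e.g. after a reboot):
Code:
# /kernel/drv/sd.conf
sd-config-list = "ATA     Samsung SSD 840", "physical-block-size:8192";

# verify on a freshly created pool
zdb -C Test1 | grep ashift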

Just to clarify what you mean by wasted space, though - the ashift doesn't represent a minimum file size, does it? i.e. that every file uses up a minimum of 8K?

Part of this pool is going to be used for mail server storage, meaning millions of files, and with compression a lot of those are going to be under 8 KB.
 
Yes, it means that the smallest unit the drive stores is 8 KiB. But it gets worse: if you build a RAIDZ2 of 6 drives, the smallest unit is still 8 KiB on each drive. So if you store less than 4*8 KiB = 32 KiB (4 because of the 4 data disks), ZFS will still allocate 8 KiB on each drive and just pad the remaining space.

What I'm not totally sure about is what happens once you go below or equal to 8 KiB. ZFS could store two 8 KiB records in one stripe and still keep two copies of each, so you could lose two disks without data loss. But to be sure about that I would have to look into the code or play around with zdb. For my own setups I assume that an 8 KiB record/file still consumes a full stripe of 32 KiB of effective space - that would be easier to implement.

But what I am sure about is that every record/file larger than 8 KiB will consume at least 32 KiB of effective space with 6 disks in RAIDZ2, or 48 KiB of total space including parity.

All numbers here are for ashift=13. For ashift=12 it is half of these, and for ashift=9 it is 1/16. If you want to reduce the wasted space you could go with ashift=9 and live with the reduced throughput for random accesses and the higher write amplification. The throughput would still be high - the 840 Pros are very fast drives, after all.
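
A quick way to see the effect for yourself is to write a small file and compare its logical size with what actually gets allocated - something like this (paths illustrative; on RAIDZ the allocated figure should include the padding and parity overhead, and you may need to wait a few seconds for the txg to flush):
Code:
# write a 10 KiB file, then compare logical size vs allocated space
dd if=/dev/urandom of=/Test1/bm/padtest bs=1k count=10
sync; sleep 10
ls -l /Test1/bm/padtest   # logical size
du -h /Test1/bm/padtest   # space actually allocated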
 
You have a 4-core, 4-thread CPU that only goes up to 1.8 GHz.

It doesn't take much to hit the limit, or to notice it being "slow" when you're used to 3 GHz+. How it affects software RAID I'm not 100% sure, but it would make sense that it behaves like any other software: the lower the frequency, the slower it runs overall.

Should you be able to hit more than 550 MB/s write/read with that CPU... yes, I believe so, unless the CPU is doing other work too. Remember, it's only a 1.8 GHz 4-core/4-thread part.

One experience I had recently:
Serving mostly static websites, I noticed a night-and-day difference going from 2x quad cores (with hyper-threading) that maxed out at 2.2 GHz to a single CPU (quad core, 8 threads) that turbos to 3.7 GHz. Both were using hardware RAID1.

I could be completely off base with my experience and your application, but with 4C/4T running an OS plus software RAID I wouldn't think you have much CPU left.

What does top output look like?
 
But his CPU isn't pegged at 100% - at least in the output of top it's at 25% or so. Perhaps there's some I/O load that top isn't reporting?
 
Hey guys - thanks for all the responses btw.

So I think what may be happening is that this is just a limit on how fast Bonnie++ can generate its data - maybe because of the 1.8 GHz clock. That would also explain the ~25% CPU: only one core is maxing out, so Bonnie must be single-threaded.

I was thinking about it while driving home and realized I never tested multiple Bonnie++ instances running on the same zpool - I had only done parallel tests on different pools, not the same one.

I just did a quick test - 3x at once:

Code:
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RaidZ2-6disk-run1 150000M           316629  54 218758  51           720619  53 12871  37
RaidZ2-6disk-run2 150000M           317503  55 218792  51           707746  52 +++++ +++
RaidZ2-6disk-run3 150000M           316543  55 218862  51           709557  52 13822 148

So that's now ~950 MB/s write and ~2.1 GB/s read! That's more like it...

Anyway - now I have to work out what I am doing about the ashift... I really don't want to be wasting that much space.

Especially with LZ4 compression - there are going to be millions of tiny files. Sigh - the fun never ends! :)
 