how to get better performance out of mdraid

Red Squirrel
I have a file server with a couple of RAID arrays, mostly RAID 10, all managed by mdraid. The server is a Xeon with 8GB of RAM, fairly recently built.

I noticed that whenever I'm doing anything IO intensive, like installing several OSes at once in VMs (the VM server uses NFS to reach the RAID arrays), music streamed from the server (also over NFS) can't keep up. This is unacceptable, especially considering I paid over 3 grand for this machine; it's not like it's a single 4500rpm drive sitting in an enclosure, this is a file server with a quad core and the whole nine yards.

How do I go about determining where the bottleneck is, and fixing it?

Load average is 10, which seems ridiculously high.

Here's the iostat output, but I'm not really sure how to interpret it, so I have no idea whether those numbers are good or bad.

Code:
[root@isengard ~]# iostat -x -c 1
Linux 2.6.32-358.el6.x86_64 (isengard.loc) 	29/11/14 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.12    0.00    0.48    1.77    0.00   97.64

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdn               0.00     0.45    0.24    0.66    12.35     8.08    22.69     0.00    1.43   1.10   0.10
sdm              74.16    18.90   10.50    5.64  1103.60   177.64    79.36     0.07    4.00   1.84   2.97
sdg              74.75    19.17    9.88    5.97  1103.75   182.42    81.12     0.07    4.12   2.05   3.24
sdc              75.13    19.02   10.06    6.01  1106.55   181.63    80.14     0.11    6.88   2.24   3.60
sdl               0.00     0.00    0.00    0.00     0.00     0.00     8.42     0.00    0.51   0.51   0.00
sda              74.96    18.98    9.58    5.53  1103.00   177.42    84.74     0.07    4.66   1.98   2.99
sdb              74.84    19.16    9.75    5.83  1103.39   181.16    82.45     0.07    4.22   1.97   3.07
sdf              74.96    19.01    9.61    5.53  1103.34   177.69    84.60     0.07    4.76   2.29   3.46
sde              75.05    19.13    9.53    5.54  1103.39   178.76    85.05     0.08    5.47   2.39   3.60
sdd              75.04    19.05    9.59    5.88  1103.64   180.77    83.05     0.08    5.08   2.58   3.99
sdh               6.72    14.15   10.72   10.83  3455.13   175.17   168.51     0.26   11.96   2.17   4.66
sdk               9.99    14.08    7.34   10.47  3426.16   171.80   201.94     0.06    3.53   2.72   4.85
sdj              11.66    14.22    4.92   10.76  3154.79   175.17   212.36     0.05    2.99   2.95   4.63
sdi              12.16    14.13    4.46   10.43  3182.71   171.80   225.40     0.07    4.83   3.10   4.61
dm-0              0.00     0.00    0.24    1.01    12.34     8.08    16.28     0.02   17.22   0.79   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     1.05  5186.60     0.00    0.33   0.08   0.00
md1               0.00     0.00   35.30  144.28  3693.28  1136.45    26.89     0.00    0.00   0.00   0.00
md0               0.00     0.00    9.06   45.26  5682.07   345.47   110.97     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    0.32   0.26   0.00
md2               0.00     0.00    0.00    2.74     1.35    21.95     8.49     0.00    0.00   0.00   0.00
sdt               4.93     0.37    4.96    1.86  1266.46    10.60   187.30     0.16   23.75   2.20   1.50
sdu               1.42     0.38    8.42    1.86  1261.69    10.60   123.77     0.05    4.91   1.23   1.26
sdv               2.80     0.37    7.08    1.60  1266.39     8.42   146.88     0.11   12.25   1.44   1.25
sdw               1.79     0.37    8.05    1.59  1261.43     8.42   131.70     0.07    7.48   1.18   1.14
sdy               6.75     0.37    3.14    1.61  1266.03     8.55   268.30     0.23   49.35   2.97   1.41
sdz               1.05     0.40    8.82    1.61  1261.82     8.80   121.86     0.04    3.99   1.01   1.06
sdaa              2.49     0.37    7.39    1.78  1266.12     9.91   139.07     0.10   11.40   1.51   1.39
sdab              3.86     0.39    5.99    1.78  1261.03    10.25   163.54     0.13   17.20   1.78   1.38
md3               0.00     0.00    0.39    5.15    62.27    36.40    17.80     0.00    0.00   0.00   0.00

I don't think it's the network, but I suppose it could be. How do I check current network usage in Linux from the command line?
 
What is the layout of your RAID volumes? Are you using the same disk in multiple arrays? If your VMs are going over NFS, then the network could be your bottleneck. The NFS protocol can lock up disk access to ensure data is committed to disk. Look up NFS tuning; there are a few mount flags you can add to improve performance. Ideally, you should be running your VMs locally or use iSCSI.
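
For example, larger transfer sizes and noatime are common client-side starting points. This is a sketch only; the export path and mount point are placeholders, and the values should be tested against your own network:

Code:
# Illustrative NFS client mount options -- tune and benchmark for your environment
mount -t nfs -o rsize=65536,wsize=65536,noatime,tcp isengard.loc:/export/vmstore /mnt/vmstore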

You can use netwatch to see traffic on your network interface.
 
The arrays are as follows, all using full/individual disks. The drives aren't enterprise, but they are a combination of WD Blacks, some WD Reds, and some Hitachis (typically the same brand per array).

Code:
[root@isengard ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Sep  5 00:19:01 2013
     Raid Level : raid10
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat Nov 29 23:32:42 2014
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : isengard.loc:0  (local to host isengard.loc)
           UUID : 2e257e19:33dab86c:2e112e06:b386598e
         Events : 275

    Number   Major   Minor   RaidDevice State
       0       8      112        0      active sync   /dev/sdh
       1       8      144        1      active sync   /dev/sdj
       2       8      160        2      active sync   /dev/sdk
       3       8      128        3      active sync   /dev/sdi
[root@isengard ~]# 
[root@isengard ~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sat Sep 20 02:15:28 2008
     Raid Level : raid5
     Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 8
  Total Devices : 8
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Nov 29 23:32:47 2014
          State : clean 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 11f961e7:0e37ba39:2c8a1552:76dd72ee
         Events : 0.2093999

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8      192        1      active sync   /dev/sdm
       2       8       48        2      active sync   /dev/sdd
       3       8       16        3      active sync   /dev/sdb
       4       8       96        4      active sync   /dev/sdg
       5       8        0        5      active sync   /dev/sda
       6       8       80        6      active sync   /dev/sdf
       7       8       64        7      active sync   /dev/sde
[root@isengard ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Mon Jul 28 23:31:31 2014
     Raid Level : raid10
     Array Size : 7813531648 (7451.56 GiB 8001.06 GB)
  Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sat Nov 29 23:32:20 2014
          State : active 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : isengard.loc:3  (local to host isengard.loc)
           UUID : 99f0389f:dbf75cb3:c841340e:33f62841
         Events : 62

    Number   Major   Minor   RaidDevice State
       0      65       48        0      active sync   /dev/sdt
       1      65       64        1      active sync   /dev/sdu
       2      65       80        2      active sync   /dev/sdv
       3      65       96        3      active sync   /dev/sdw
       4      65      128        4      active sync   /dev/sdy
       5      65      144        5      active sync   /dev/sdz
       6      65      160        6      active sync   /dev/sdaa
       7      65      176        7      active sync   /dev/sdab
[root@isengard ~]#


I use NFS as it's natively designed to let multiple hosts access the same volume. iSCSI is technically supposed to be one host per volume; I know there are ways to make multi-host work, but I did not want to mess with that and risk data corruption. The whole point of this server is centralizing data, so it would be silly to make the VMs local and have to deal with individual controllers and hardware RAID (I don't think ESX can do mdraid). Right now I just have one ESX host but eventually want to build more, so this was built with future proofing in mind. I also don't want to use any ESX-specific features (such as getting iSCSI to work across multiple hosts) as I may not always use ESX. I actually wanted to use QEMU/KVM but it proved too complicated, though eventually I may write a program that makes it easier to manage.

My data (which seems to suffer a lot when there's a lot going on on the server) is on a RAID 5 array, but there are no VMs on that array. Could that still be an issue? I do want to eventually retire that array, so any new data repository/VM that I make is being put on the RAID 10s.

I tried to install netwatch through apt-get (CentOS 6.4) but it just segfaults right away when I go to run it.
 
What are your reads/writes with a program like IOMeter run locally on the NFS server against the RAID arrays?
 
Need more details on your rig, network setup and NFS export options. I had a similar issue where everything paused when my PVR fired up multiple recordings to my NFS share. No amount of tweaking could resolve the problem; I had to put the VMs with high IO on their own RAID volume, which meant separating my PVR and web VMs. It turned out that NFS sync was locking things up waiting for write confirmation from the disks.
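
For what it's worth, the sync/async behaviour is controlled per export in /etc/exports on the server. A sketch only; the path and subnet are placeholders, and note that async trades crash safety for speed (the server acknowledges writes before they reach disk):

Code:
# /etc/exports -- example entry, not a recommendation
/export/vmstore  192.168.1.0/24(rw,async,no_subtree_check)

# re-read the exports file without restarting the NFS server
exportfs -ra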
 
So you're using the full disks and not partitions for md RAID?

I would rebuild it, this time making an aligned partition first, then create the array from the partitions.

e.g. partition the disks aligned for 4K sectors:

Code:
sudo parted /dev/sdc
GNU Parted 2.3
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/sdc will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? y
(parted) unit s
(parted) mkpart primary 2048s 100%
(parted) align-check opt 1
1 aligned
(parted) set 1 raid on
(parted) quit
Information: You may need to update /etc/fstab.

Then create your mdadm array from the partitions and not the whole drives:

mdadm --create --verbose /dev/md0 --level=10 --raid-devices=8 /dev/sd[bcdefghi]1

mdadm --detail /dev/md0 should then show the partitions, not the whole /dev/sdX devices.

e.g. your /dev/md3 would then look like:

Code:
[root@isengard ~]# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Mon Jul 28 23:31:31 2014
     Raid Level : raid10
     Array Size : 7813531648 (7451.56 GiB 8001.06 GB)
  Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sat Nov 29 23:32:20 2014
          State : active 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : isengard.loc:3  (local to host isengard.loc)
           UUID : 99f0389f:dbf75cb3:c841340e:33f62841
         Events : 62

    Number   Major   Minor   RaidDevice State
       0      65       48        0      active sync   /dev/sdt1
       1      65       64        1      active sync   /dev/sdu1
       2      65       80        2      active sync   /dev/sdv1
       3      65       96        3      active sync   /dev/sdw1
       4      65      128        4      active sync   /dev/sdy1
       5      65      144        5      active sync   /dev/sdz1
       6      65      160        6      active sync   /dev/sdaa1
       7      65      176        7      active sync   /dev/sdab1

That is, each /dev/sdX would now be /dev/sdX1.

Give that a whirl and see if performance is better.

Also make sure the arrays are fully synced before testing for speed, by watching the output of /proc/mdstat.

eg

#cat /proc/mdstat

will show details of where it's at.

.
 
So you're using the full disks and not partitions for md RAID?

I would rebuild it, this time making an aligned partition first, then create the array from the partitions.
....
Give that a whirl and see if performance is better?

Why use a partition as opposed to the entire disk? Do you have a reference for the difference in performance? I thought partitioning was pointless for RAID disks these days.
 
There is no point in using partitions.

This is likely normal Linux block layer / device mapper buffering.

What I have noticed is that writes going through stuff like mdraid/dmcrypt/LVM will hang around in the cache for 30 seconds or so, then all get flushed at once, causing lots of IO and making other transactions wait (like your music).

Since you're going to be using mdraid, you will need to tune this so it doesn't cache so much. In most of my cases I just drop LVM and go direct, but if you need mdraid/dmcrypt/LVM, that isn't an option.

I can't remember exactly what I tuned to make it better; it's been a few years and those systems have been scrapped.
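
What's being described sounds like the kernel's dirty-page writeback behaviour (dirty data is held for up to ~30 seconds by default before being flushed). A sketch of the kind of tuning meant here, with purely illustrative values, would be the vm.dirty_* sysctls:

Code:
# show current values
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs

# example: start background writeback earlier and cap dirty memory lower,
# so flushes happen in smaller, more frequent bursts (values illustrative)
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_expire_centisecs=1000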
 
The stripe cache size can be tuned for RAID 5/6. Not sure if it applies to RAID 10. Here is what I have for my RAID 6 arrays:
Code:
echo 8192 > /sys/block/md2/md/stripe_cache_size
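
Note that this sysfs setting does not survive a reboot; a simple way to reapply it at boot (a sketch, assuming the RAID 5 array in this thread is md1) is from /etc/rc.local:

Code:
# /etc/rc.local -- reapply the stripe cache size for the RAID 5 array at boot
echo 8192 > /sys/block/md1/md/stripe_cache_size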
 
Hmm, mine is only 256, is that right? I probably can't change that now though, right? It only seems to apply to the RAID 5 array, though that's where my personal files are. Could I prevent these lag issues by moving the files to one of the RAID 10s? The RAID 5 has no VMs on it, so it should technically have been idle.

I also have no partitions; I just use the disks directly. I also can't rebuild, this is all production, though I've been toying with the idea of moving some data around. The less important stuff is on the RAID 5, including the music, so I could possibly move it and redo the mapping. Eventually I want to retire the RAID 5 anyway, since it's all 1TB drives and I could do a much larger RAID 10 with bigger drives instead.

As far as network config it's fairly simple: just a gigabit connection on the same VLAN as everything else. I was going to take the SAN approach and put it on its own network, but it was simpler to make it more like a NAS/file server. Would taking the SAN approach increase performance? I don't think the network is the issue, but it seems hard to measure in Linux without a GUI.
 
The default of 256 for the cache size is way too low. You can change it on the fly with the above command. Try it and see if performance improves. You can play around with different cache sizes to find the right number.

There are tons of non-GUI tools to monitor and test network performance. Check with the CentOS gurus for suggestions. You can also test disk/volume throughput with hdparm -[tT].
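
For example (sar comes from the same sysstat package as iostat, so it should already be installed; device names are just examples):

Code:
# per-interface network throughput, one-second samples
sar -n DEV 1

# rough sequential read throughput of an array (cached and uncached)
hdparm -tT /dev/md0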
 
There is no point in using partitions.

I beg to differ

A good reason to partition the disk is this

You create a raid array of say 4 x 2tb seagate drives.

Later down the track you have a drive fail, so you pull out the dead Seagate and replace it with, say, a WD 2TB drive... but the WD is slightly smaller.

So now you have to run around trying to find another drive the same.

It's a similar problem to people making zfs arrays a while back with 512b drives.... And now you can only get 4k drives. And you have to copy all the data off / blow away the array and rebuild with a different ashift...

Now, if when you initially created the array in mdadm, instead of using the whole drive you had created a partition on each drive that was, say, 200MB smaller than the whole disk's capacity, that would hopefully be enough leeway to cover the new drive being just that little bit smaller than the ones currently in use.

.
 
Now, if when you initially created the array in mdadm, instead of using the whole drive you had created a partition on each drive that was, say, 200MB smaller than the whole disk's capacity, that would hopefully be enough leeway to cover the new drive being just that little bit smaller than the ones currently in use.

200 MB ???? :eek:

I have never seen a 1+ TB drive from a major manufacturer that is even 1 sector different from the IDEMA standards.

If you have, how about some details?

I could possibly believe a few sectors difference, maybe up to 4KiB, in very unusual circumstances. But 200 MB seems crazy.
 
There are ways of solving the drive-size issue with mdadm without using partitions. Yes, partitions are an easy way to do it, but they are hardly the only way.

I don't see how partitions solve a ZFS 512-byte to 4K sector difference? There is no way around that, except to have the drive lie, or have ZFS ignore it.
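
For example, mdadm itself can be told to use less than the full capacity of each device at creation time, which leaves headroom for a slightly smaller replacement. A sketch only; /dev/md4, the member devices and the size are illustrative:

Code:
# --size is the amount of space (in KiB) to use from each member; here a bit
# under the nominal 2TB so a marginally smaller replacement drive still fits
mdadm --create /dev/md4 --level=10 --raid-devices=4 --size=1953000000 /dev/sd[b-e]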
 
It's been a few years since I set up my 8x2TB mdadm RAID 6, but one thing I do recall when adjusting the stripe cache size: too big a number can hurt performance. mdadm will also use available system memory as a write cache by default.

I've got an IBM M1015 in IT mode passed through to an Ubuntu VM with those disks on it (WD Green 2TB - yeah, I know), set up in the RAID array described above. I do use a full GPT partition on every drive instead of just telling mdadm to use drive X. I found through testing and failure that a partition seems to work much better when it comes to mdadm handling multiple RAID arrays or failed/lost configurations. GPT partition flagged as Linux RAID and you are golden. I pass this huge-ass RAID array back to the ESXi server over NFS as a datastore and have multiple VMs running off of it without issue. Using vmxnet3 at 10Gb I get great performance... I'll see if I can't get some screenshots for you.
 
Never considered that drives could differ in size; guess I've been lucky. I even rebuilt the entire RAID 5 array one drive at a time once, after a power failure that outlasted the UPS and left me not trusting the drives anymore. I had two fail at once and almost lost the whole array, so I proceeded to change one drive at a time. That was several years back; I went from Hitachis to WD Blacks. I have a 4-hour-runtime UPS setup now to prevent that scenario from happening again, and eventually want to upgrade to a 48V system instead of 12V and add another 4 batteries.


IOmeter is not in the yum repository but bonnie++ is, so I just did some tests. The output is really ugly though... not sure how helpful it is; just reading it is a challenge.

Raid md0: (raid10)

Code:
# bonnie++ -d test/ -u root
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
isengard.loc 15680M  1010  97 283481  20 145639  11  4920  92 367444  12 724.8  10
Latency             12420us     194ms     343ms   12956us     200ms     119ms
Version  1.96       ------Sequential Create------ --------Random Create--------
isengard.loc        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27113  23 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                90us     573us     293us      77us      10us      30us
1.96,1.96,isengard.loc,1,1417402311,15680M,,1010,97,283481,20,145639,11,4920,92,367444,12,724.8,10,16,,,,,27113,23,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,12420us,194ms,343ms,12956us,200ms,119ms,90us,573us,293us,77us,10us,30us


Raid md1: (raid5)
Code:
# bonnie++ -d test/ -u root
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
isengard.loc 15680M   868  84 231795  25 135095  13  3981  75 416392  16 615.0   6
Latency             20485us    2026ms    1290ms   33271us     124ms     393ms
Version  1.96       ------Sequential Create------ --------Random Create--------
isengard.loc        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4471   3 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               629ms     591us     419us     664us      10us      28us
1.96,1.96,isengard.loc,1,1417403304,15680M,,868,84,231795,25,135095,13,3981,75,416392,16,615.0,6,16,,,,,4471,3,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,20485us,2026ms,1290ms,33271us,124ms,393ms,629ms,591us,419us,664us,10us,28us


Raid md3: (raid 10) Most VMs are here
Code:
# bonnie++ -d test/ -u root
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
isengard.loc 15680M  1010  96 488516  31 166108  13  4472  82 309837   9 567.8   7
Latency             11902us     184ms     311ms   23890us     130ms     103ms
Version  1.96       ------Sequential Create------ --------Random Create--------
isengard.loc        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 28736  24 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               455us     521us     366us    4527us      10us    8198us
1.96,1.96,isengard.loc,1,1417403944,15680M,,1010,96,488516,31,166108,13,4472,82,309837,9,567.8,7,16,,,,,28736,24,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,11902us,184ms,311ms,23890us,130ms,103ms,455us,521us,366us,4527us,10us,8198us


Another thing to note is that the VMs seem slow too. I'm installing VMware Tools in one of them and the setup has been running for over 5 minutes, and when I installed Windows it sat at "expanding files" for about an hour.


Is there perhaps some kind of command-line system monitor that watches everything at once? It would be nice to get a full overview in one easy-to-read format. Tools that just spit out large numbers without much explanation or ranges don't help much. Ideally something that displays percentages, where 100% means full capacity and 0 means not used, for disk, network, etc.
 
It looks to me like you are doing multiple things at the same time there, in particular there seems to be constant write activity intermixed with the reading.

I think you are just hitting the limits of what happens when conventional hard drives are driven with simultaneous requests on a computer that is very fast but also low on RAM.

Questions:
- What are the hard drives?
- What filesystem?
- Is all of that activity via NFS? NFS is brutal, especially for writes.
 
OP, are the various block layers aligned? What filesystem? What mkfs parameters were used? Are the drives 4K? Are the partitions (if applicable) on 4K boundaries?

You can do a LOT of optimization by getting the base stuff correct first and performance can/will be MUCH MUCH better.

Once you have done that, then it really helps to pick up an ssd as a cache and use that. Your performance should increase a lot.
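
For reference, these can be checked without downtime. A sketch; the device names are just examples from earlier in the thread:

Code:
# physical vs logical sector size (4096/512 indicates a 4K "advanced format" drive)
cat /sys/block/sdt/queue/physical_block_size /sys/block/sdt/queue/logical_block_size

# chunk size of the array
mdadm --detail /dev/md3 | grep -i chunk

# ext4 block size and any RAID stride/stripe-width recorded by mkfs.ext4
tune2fs -l /dev/md3 | grep -iE 'block size|raid'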
 
Now, if when you initially created the array in mdadm, instead of using the whole drive you had created a partition on each drive that was, say, 200MB smaller than the whole disk's capacity, that would hopefully be enough leeway to cover the new drive being just that little bit smaller than the ones currently in use.

This. I use partitions to solve this problem. When you intermix drives, one may be slightly smaller or larger than the others.
 
OP, are the various block layers aligned? What filesystem? What mkfs parameters were used? Are the drives 4K? Are the partitions (if applicable) on 4K boundaries?

You can do a LOT of optimization by getting the base stuff correct first and performance can/will be MUCH MUCH better.

Once you have done that, then it really helps to pick up an ssd as a cache and use that. Your performance should increase a lot.

Honestly, no idea on most of those questions, but the file system is ext4 with whatever mkfs.ext4 does by default. No partitions on the drives. Any way I can check the other stuff?

The drives are a mix but most individual arrays use the same drives.

md0:
Code:
[root@isengard test]# smartctl -a /dev/sdh | grep -i model
Device Model:     Hitachi HDS723030BLE640
[root@isengard test]# 
[root@isengard test]# smartctl -a /dev/sdj | grep -i model
Device Model:     TOSHIBA DT01ACA300
[root@isengard test]# smartctl -a /dev/sdk | grep -i model
Device Model:     TOSHIBA DT01ACA300
[root@isengard test]# smartctl -a /dev/sdi | grep -i model
Device Model:     TOSHIBA DT01ACA300

md1:
Code:
[root@isengard test]# smartctl -a /dev/sdc | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Y9A0
[root@isengard test]# smartctl -a /dev/sdm | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Y9A0
[root@isengard test]# smartctl -a /dev/sdd | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Z3A0
[root@isengard test]# smartctl -a /dev/sdb | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Y9A0
[root@isengard test]# smartctl -a /dev/sdg | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Y9A0
[root@isengard test]# smartctl -a /dev/sda | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Y9A0
[root@isengard test]# smartctl -a /dev/sdf | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Z3A0
[root@isengard test]# smartctl -a /dev/sde | grep -i model
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Z3A0

md3:
Code:
smartctl -a /dev/sdt | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdu | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdv | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdw | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdy | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdz | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdaa | grep -i model
Device Model:     WDC WD20EFRX-68AX9N0
[root@isengard test]# smartctl -a /dev/sdab | grep -i model
Device Model:     WDC WD20EFRX-68EUZN0



Most data is accessed through NFS. Is there something better that is still designed for multi-client access without needing a special file system? I know there's Samba, but that's mostly for Windows.
 
I assume you are not using dmcrypt? That has some odd implications on performance.

My performance problems usually got solved by increasing the stripe cache size. This is possible at any time! But it increases the chance of a stripe inconsistency on a sudden power loss.

The second thing is to increase the block device read-ahead. The md layer does this for the underlying devices, but you can still increase this value.

The third is to set the scheduler to deadline if this is not the default setting for your installation.
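
For reference, a sketch of the read-ahead and scheduler changes just mentioned; the device names and the read-ahead value are illustrative, and neither setting persists across a reboot:

Code:
# read-ahead on the md device, in 512-byte sectors (8192 = 4MB)
blockdev --setra 8192 /dev/md0

# switch a member disk's IO scheduler to deadline
echo deadline > /sys/block/sdh/queue/scheduler
cat /sys/block/sdh/queue/scheduler   # the active scheduler is shown in [brackets]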

You can also use a filesystem with more aggressive write caching (XFS) or disable sync writes for the VMs.

Finally, there may not be a solution to this problem. Intermixing a lot of random IO with a constant media stream while everything has the same priority can starve the one stream that has to meet time constraints, especially when you only use 5400rpm drives, which are generally a bad choice for VM backing storage. A large pool of these drives can easily deliver a gigabyte per second of sequential data but starts to choke as soon as a single random IO thread competes. I suggest putting the VMs on a separate pool with SSDs.

I started with a setup like that with comparable issues. My setup right now is a mirror of 2 Samsung 830 SSDs for OS and VMs, a mirror of 2 WD Red 3 TB disks with an SSD cache for my documents and scratch data and 14 separate WD Red 3 TB drives for media storage with SnapRaid. I also started with XFS but now everything runs on ZFS, mainly because of the snapshots. Everything runs through dmcrypt and I could not be more satisfied with the performance at the moment.
 
Is there a stripe cache for raid10? I do not seem to have any mdraid10s here anymore to check.
 
You are right, there isn't a stripe cache for raid10. It wouldn't even make sense. I did miss the part about the raid level.
 
No dmcrypt as far as I know. Is that something I would have had to set up myself, or could it be enabled by default? I'm on 7200rpm drives as far as I know; at work we had 50+ VMs on a similar setup to mine (well, it was an IBM SAN, but it's literally Maxtor desktop drives in there with different firmware). There is no stripe cache setting for the RAID 10s.

Though, if one raid array is busy can this affect other raid arrays? If I make everything raid 10 will the overall performance be better? The raid 5 is a very old array (newer drives but array itself is old) and I could easily replace it with a new raid 10 using bigger drives. Not sure if I can transform from raid 5 to raid 10 though but the raid 5 array does not have as much live data so it would be easier to shift stuff to the other arrays.

Is ZFS also worth looking into? I don't want to use pure SSDs for RAID, as consumer ones won't last very long under heavy VM load, but with ZFS I could use some as cache, keep spares on hand, and just swap them out when they wear out every couple of years.

Another thing I noticed is when my backup jobs run things get pretty slow too. Just opening a folder on my workstation will take forever. (via NFS). The idea behind this is centralized shared storage so that's why everything is NFS. Is there perhaps a better way while still having a centralized shared storage setup?

Is there some kind of tool I can install to get real time resource utilization for everything in an easy to read format? Disk, NFS, network, cpu, etc. Half the issue here is not knowing where the bottleneck is. Tools like iostat don't really help me much as I don't really know what any of the numbers mean and what is considered good and what is considered bad.
 
Different arrays should not affect each other. If this happens you are running into a bottleneck other than IO, i.e. CPU or network. But no even somewhat modern CPU should be taxed by software RAID.

I think that ZFS is worth looking into, but not because of the SSD cache. If you run into endurance problems with an all-SSD-array/pool, your cache device probably also will, especially if the footprint of your workload does not fit on the cache. How much writes does your workload do? I doubt that this will be too much for a good drive.

I think that 'iostat -xmd 1' is good to detect IO bottlenecks. If your member drives are at close to 100% utilization you know that you are IO limited.
 
Oh, so is % utilization what I want to be looking at? I kind of guessed that but was not sure. For some reason the md devices stay at 0 as well.

This is what it's looking like right now, there's a couple backup jobs running:

Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdm              11.00  1095.00   60.00   90.00     2.31     4.59    94.19     4.72   32.31   3.32  49.80
sdg               6.00  1117.00   60.00   85.00     2.31     4.70    99.03     5.95   41.32   3.30  47.90
sdc              23.00  1105.00   53.00   78.00     2.10     4.59   104.55     7.44   50.71   4.02  52.70
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               9.00  1108.00   55.00   73.00     2.29     4.62   110.56     4.01   32.48   3.11  39.80
sdb              11.00  1109.00   65.00   63.00     2.45     4.55   111.94     4.64   37.69   2.36  30.20
sdf               9.00  1098.00   61.00   71.00     2.49     4.63   110.48     4.49   37.39   3.99  52.70
sde              17.00  1099.00   59.00   66.00     2.29     4.57   112.45     5.36   43.06   6.78  84.80
sdd              10.00  1098.00   54.00   80.00     2.30     4.57   104.96     4.77   35.66   2.79  37.40
sdh               0.00     0.00    0.00   12.00     0.00     0.02     2.92     0.11    9.50   9.25  11.10
sdk               0.00     0.00    0.00    7.00     0.00     0.00     0.43     0.04    5.71   5.71   4.00
sdj               0.00     0.00    0.00   11.00     0.00     0.02     3.18     0.10    9.45   9.45  10.40
sdi               0.00     0.00    0.00    7.00     0.00     0.00     0.43     0.04    5.29   5.29   3.70
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00  307.00 8051.00    17.51    31.42    11.99     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    8.00     0.00     0.02     4.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdt               0.00     4.00    1.00   17.00     0.00     0.04     4.61     0.15    8.50   7.61  13.70
sdu               0.00     5.00    0.00   16.00     0.00     0.04     4.69     0.13    8.06   8.06  12.90
sdv               0.00     0.00    0.00   12.00     0.00     0.00     0.25     0.06    4.83   4.83   5.80
sdw               0.00     0.00    0.00   12.00     0.00     0.00     0.25     0.06    4.58   4.58   5.50
sdy               0.00     0.00    0.00   12.00     0.00     0.00     0.25     0.07    5.83   5.83   7.00
sdz               0.00     0.00    0.00   12.00     0.00     0.00     0.25     0.08    6.33   6.33   7.60
sdaa              0.00     0.00    0.00   20.00     0.00     0.03     3.35     0.15    7.40   6.85  13.70
sdab              0.00     0.00    0.00   20.00     0.00     0.03     3.35     0.14    6.75   6.45  12.90
md3               0.00     0.00    1.00   25.00     0.00     0.07     5.54     0.00    0.00   0.00   0.00


Maybe I'm just hitting the limits of consumer-grade hardware and just have to deal with it? Though 2 VMs should not be killing the IO that much. Is there a tutorial somewhere that explains what each column means? Is there also a way to get that output in one shot? If I omit the d flag the numbers are completely different; I think it's cumulative or something. If I could add this to my monitoring software I could set alert triggers for numbers that are getting too high, so at least I can tell where the bottleneck is. If I can find a way to get network utilization as a percentage I could also put that in my monitoring software. I already have load and RAM usage. Basically, the more stuff I can monitor and graph at the same time, the easier it will be to find the bottleneck.
 
The md device itself does not do much work; it just distributes reads and writes to the member disks.
Your sde drive has a particularly high utilization while sharing the same workload as the other members.
If this is always the case there may be something wrong with that drive, or it is just a bit slower.
It is always the slowest drive that limits the pool.

With that kind of workload your network will not be bandwidth limited.
Are you using sync NFS? With that you may be latency limited, but this may not be as easy to detect by just looking at the utilization.
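
For what it's worth, whether sync is actually in effect can be checked on both ends. A sketch:

Code:
# on the server: current export options (look for sync/async)
exportfs -v

# on the client: the mount options actually negotiated for the NFS mounts
grep nfs /proc/mounts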
 
ZFS will make it worse since you have so little RAM.

Unfortunately, I am sure that your problem is that you just have too many simultaneous accesses going on through NFS. NFS is brutal in its implications for the write cache. Mixing a lot of reading and writing is also going to hurt, since the disk heads just can't move quickly enough. I don't think that switching to SMB will have a fundamental impact.

More RAM.

Or, in this specific case, a hardware RAID controller with battery-backed write cache. I normally don't like these things and prefer md and ZFS, but writes through NFS are a prime use case for a BBU.

Or you could try to split things up so that different workloads hit different physical disks. I partially do that. But my setup is mostly single user. And you run out of spindles quickly.


ETA: a really well-working cache on an SSD, replacing the BBU in a RAID controller with something that works in software in the OS, would be cool, but I don't think we are there yet.

ETA2: iSCSI for some specific clients might also help because the client-side caching changes.
 
ZFS with NFS syncs is brutal. Don't even bother.

There are literally dozens of threads on this on the forum and I've yet to see a definitive fix/workaround.
 
This is just incorrect, sorry. Get a decent SSD for the SLOG and you can get within 10% of sync=disabled performance. I'm doing it myself... I am using an Intel S3700...
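
For reference, adding a dedicated log device to an existing pool is a one-liner; the pool name and device path below are purely illustrative:

Code:
# attach an SSD (ideally overprovisioned, with power-loss protection) as the SLOG
zpool add tank log /dev/disk/by-id/ata-INTEL_SSDSC2BA100G3_EXAMPLE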
 
ZFS with 8 GB will still kill the OP, and a "decent" SSD won't cut it for the SLOG as long as it is consumer SATA crap.

Having said that, more RAM and something durable for the SLOG will help. Even a spindle drive for the SLOG will help as it adds another spindle for writes (or write batching) and prevents the heads of the regular drives from hunting too much. I imagine a small 15,000 rpm drive would actually do nicely and doesn't have the write durability concerns of a consumer SATA ssd. I didn't try it myself, though. And with 8 GB don't go ZFS, that is just not going to improve the situation at all.
 
An S3700 SSD is good enough for an SLOG device. If you look at the recent SAS SSD benchmarks at servethehome, you will notice that their performance is not really outstanding.
They may last forever, but with a 100GB S3700 overprovisioned to 10-15GB, you can still achieve nearly the same.
 
Right. As I said, I am getting virtually the same performance as with sync=disabled with my overprovisioned s3700.
 
Most data is accessed through NFS. Is there something better that is still designed for multi-client access without needing a special file system? I know there's Samba, but that's mostly for Windows.

I've done Samba + Kerberos and it works quite well for Linux clients. The big downside is the lack of symlink support (last I checked).
 