SnapRaid server - slow performance

steakman1971 · Jan 4, 2015

I'm starting to use my Ubuntu Server (12.04) which I recently added SnapRAID to.
This system is running an Intel Core2Duo CPU (~2.5ghz), 6 GB RAM, and has about 16TB of disk space allocated to SnapRAID. 2 3TB drives are assigned for parity, the rest are a combo of mostly 2 and 3 TB drives. The OS is on a 90gb SSD.
My mobo only has SATA2 as does my Vantec RAID card (which is just being used for SATA2 expansion - RAID feature is not utilized).
The system is also running SABNZB, CouchPotato, Sickbeard, Webmin, etc.

I'm having poor IO speeds with concurrent operations. When downloading and trying to copy files, it brings everything to a crawl. I'm trying to figure out how to troubleshoot the system.

Looking at webmin, it's reporting 97% IO for CPU. Uptime reports:
10:49:06 up 1 day, 10:36, 1 user, load average: 2.24, 2.40, 2.07

Running top, I see values like:
21858 nobody 20 0 126m 4560 3440 D 24 0.1 3:35.74 smbd
1087 root 20 0 1020m 13m 476 S 18 0.2 13:18.98 mhddfs
1575 chris 20 0 1447m 297m 3932 S 6 5.0 165:17.47 sabnzbdplus
1286 chris 20 0 1221m 100m 3740 S 3 1.7 26:19.83 python

The top process make sense as far as usage:
Windows shares are utilizing smbd.
mhddfs is utilized by SnapRAID.
sabnzbdplus is downloading stuff.
python is used by sab, couchpotatoes, sickbeard, etc.

RAM utilization does not seem to be an issue - I've not seen more than 2GB used when I've monitored it.

My understanding of uptime is telling me that I have processes waiting on the CPU.

Are there any better tools to help me diagnose this system? Really, just using uptime, top, and a bit of WebMin to watch it.

The Core2Duo is older (probably about 6-7 years?). The two cores are probably holding my system back.

Before I throw hardware at the system, I want to make sure I'm addressing it smartly.

olol · Jan 4, 2015

2.24 means that you currently have 2.24 processes in queue for computing. with two cores (two threads) if you have 2.0 the cpu is fully used, with 1.5 it's no problem and with over 2.0 you have something that will havbe to wait untill another computation is completed. so it seems that your processor is having trouble keeping up with the load of the system.

omniscence · Jan 4, 2015

I find it unusual that this type of workload can bring down such a processor.
On the other side multiple python programs can eat away a lot of CPU power.

On my server the top-output looks like this (please use CODE tags)

Code:

top - 18:06:48 up 57 days,  1:56,  1 user,  load average: 2.38, 2.31, 2.45
Tasks: 1847 total,   3 running, 1844 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.4 us,  6.1 sy,  0.0 ni, 81.9 id,  1.5 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  32948208 total, 30200412 used,  2747796 free,       12 buffers
KiB Swap: 33552252 total,  2383016 used, 31169236 free.    71532 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10397 root      20   0 7678280 5.883g   4064 S  66.9 18.7  60214:06 qemu-system-x86
24614 root      20   0 7251428 4.866g   3932 S  35.3 15.5  26795:39 qemu-system-x86
21753 root      20   0 2983412 2.034g  10776 S   8.6  6.5  31:54.98 qemu-system-x86
26235 root      20   0       0      0      0 S   3.6  0.0   0:02.03 kworker/6:0
26263 root      20   0       0      0      0 R   3.6  0.0   0:01.30 kworker/2:1
24340 root      20   0       0      0      0 S   3.0  0.0   1:54.10 kworker/6:2

What is also interesting here is the %Cpu line as it shows what the CPU does.

To see if you are IO-limited you can run 'iostat -xmd 1' and see if there is a device that is constantly at near 100 %util during the slowdown times.

A common problem with Linux servers is that buffered writes (like large copy operations from samba, which does not do any syncs in the default configuration) can fill a multiple gigabyte sized page cache and as soon as the memory gets full, the system starts to flush the entire page cache while blocking all writes for tens of seconds. This can cause timeouts and system hangs. You can circumvent this with careful tuning. I also recommend to start snapraid with a lower io priority (using 'ionice').

steakman1971 · Jan 6, 2015

Running iostat, I'm getting:

Code:

Linux 3.13.0-36-generic (minty)         01/06/2015      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.12    0.02    3.33   20.88    0.00   70.66

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.01         0.00       2169        192
sdb               7.94      1200.76      1079.28  404056961  363177732
sdc               1.96         0.01       998.79       2765  336094388
sdd               4.61      1002.64         6.81  337389097    2292648
sde               2.28         0.01       998.79       2740  336094264
sdf               3.46       822.60         6.78  276804837    2281504
sdg              44.68      2782.63      2913.01  936356233  980231188
sdh               3.37        17.85        28.94    6005307    9739212
sdi               0.00         0.01         0.00       2189        192

Followed by "iostat -xmd 1, I can see my drive sdg is getting some activity.

Code:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     1.00    0.00    2.00     0.00     0.01    12.00     0.03   14.00    0.00   14.00  14.00   2.80
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg              15.00  5039.00   78.00   67.00    11.31    19.95   441.60    14.68  101.46   26.46  188.78   6.01  87.20
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

JoeComp · Jan 7, 2015

What is the output of `snapraid -T` ? That should give us an idea of how fast snapraid can compute checksums and parity on your CPU.

snapraid reports the amount of RAM it is using when you run a sync (maybe you need the -v option, I can't remember). What does it say? It can easily use many GB of RAM depending on the size and number of files you have and the snapraid setting for blocksize.

Is snapraid, combined with your other programs, using enough RAM to cause the system to start swapping and thrashing?

Another consideration is that snapraid achieves the best performance if it can have sole access to your drives when it is running a sync. If another program is hitting any of your data drives hard during a sync, it will slow snapraid down a lot because snapraid will only read from each data drive at the rate of the slowest of the data drives.

omniscience makes a good point that if you have a large page cache (essentially a RAM cache for drive writes) that it can fill up and then require high priority writing to disk, which can block all other processes trying to do I/O. I have always drastically reduced the dirty_bytes limit on my linux servers to minimize this kind of blocking:

Code:

# echo "67108864" > /proc/sys/vm/dirty_bytes
# echo "16777216" > /proc/sys/vm/dirty_background_bytes

You can monitor the page cache, as well as memory and swap, with a terminal running this command:

Code:

$ watch -n 1 cat /proc/meminfo

omniscence · Jan 7, 2015

The system does not seem to be CPU-limited.

The throughput seems fine for concurrent operations on a mechanical drive.
Reading 10 MB/s while writing 20 MB/s, you won't get much more.

If that is not enough for your application you need an SSD.
Or distribute the accesses to different drives.

steakman1971 · Jan 10, 2015

The output of "snapraid -T":

Code:

snapraid v7.0 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Compiler gcc 4.6.3
CPU GenuineIntel, family 6, model 23, flags mmx sse2 ssse3
Memory is little-endian 64-bit
Support nanosecond timestamps with futimens()

Speed test using 8 data buffers of 262144 bytes, for a total of 2048 KiB.
Memory blocks have a displacement of 1792 bytes to improve cache performance.
The reported values are the aggregate bandwidth of all data blocks in MiB/s,
not counting parity blocks.

Memory write speed using the C memset() function:
  memset    3522

CRC used to check the content file integrity:
   table     782
   intel

Hash used to check the data blocks integrity:
            best murmur3 spooky2
    hash spooky2    2482    4968

RAID functions used for computing the parity with 'sync':
            best    int8   int32   int64    sse2   sse2e   ssse3  ssse3e    avx2   avx2e
    gen1    sse2            4668    5613    8674
    gen2   sse2e            1517    3219    7255    8344
    genz   sse2e            1144    1941    2687    2725
    gen3  ssse3e     313                                    3990    3249
    gen4  ssse3e     277                                    2812    2756
    gen5  ssse3e     226                                    1813    1934
    gen6  ssse3e     188                                     626    1213

RAID functions used for recovering with 'fix':
            best    int8   ssse3    avx2
    rec1   ssse3     276     334
    rec2   ssse3      93     195
    rec3   ssse3      46     205
    rec4   ssse3      31     151
    rec5   ssse3      22      98
    rec6   ssse3      16      69

If the 'best' expectations are wrong, please report it in the SnapRAID forum

I've not been having issues the last few days. This might be a "duh" moment, but since I was downloading files (downloads go to a drive out of of the Snap array) and copying large files (anywhere from 3GB-15GB) - yeah, there could be some performance issues. It's also possible the raid was running parity. I adjusted my cronjob to start at 3am - not sure how long it takes.

drescherjm · Jan 10, 2015

Here is what I get for a core2quad.

Code:

jmd0 ~ # uname -a
Linux jmd0.comcast.net 3.17.8-gentoo-jmd0-zfs-0.6.4-git-20150109 #1 SMP Fri Jan 9 21:30:50 EST 2015 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
jmd0 ~ # snapraid -T
snapraid v7.0 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Compiler gcc 4.8.3
CPU GenuineIntel, family 6, model 23, flags mmx sse2 ssse3
Memory is little-endian 64-bit
Support nanosecond timestamps with futimens()

Speed test using 8 data buffers of 262144 bytes, for a total of 2048 KiB.
Memory blocks have a displacement of 1792 bytes to improve cache performance.
The reported values are the aggregate bandwidth of all data blocks in MiB/s,
not counting parity blocks.

Memory write speed using the C memset() function:
  memset   16817

CRC used to check the content file integrity:
   table     958
   intel

Hash used to check the data blocks integrity:
            best murmur3 spooky2
    hash spooky2    2942    8050

RAID functions used for computing the parity with 'sync':
            best    int8   int32   int64    sse2   sse2e   ssse3  ssse3e    avx2   avx2e
    gen1    sse2            6895   12275   20815                        
    gen2   sse2e            2052    4057   10697   12019                
    genz   sse2e            1467    2542    5524    5886                        
    gen3  ssse3e     467                                    5420    5902        
    gen4  ssse3e     345                                    4185    4293        
    gen5  ssse3e     274                                    3268    3112        
    gen6  ssse3e     226                                    1233    1594        

RAID functions used for recovering with 'fix':
            best    int8   ssse3    avx2
    rec1   ssse3     486     891
    rec2   ssse3     225     478
    rec3   ssse3      60     290
    rec4   ssse3      38     191
    rec5   ssse3      26     134
    rec6   ssse3      18      92

If the 'best' expectations are wrong, please report it in the SnapRAID forum

jmd0 ~ #

JoeComp · Jan 10, 2015

Look how much faster drescherjm's memory throughput is than steakman1971's. About 5 times faster.

Not that steakman's 3522 MiB/s is likely to be a bottleneck for him, but it makes me wonder if something else is wrong.

omniscence · Jan 10, 2015

Yes, 3.5 GB/s is somewhat slow even with single-channel DDR2 memory.
You should check the BIOS settings.

drescherjm · Jan 10, 2015

BTW the motherboard I am using with the core2quad q9550 is an ASUS P5Q pro.

steakman1971 · Jan 11, 2015

I also have an ASUS P5Q Pro. I'll take a look at my bios settings - I doubt I've messed with these in quite some time.
I ordered a Core2Quad 8300 from Ebay (for $30). I've had a few cases where uptime is showing me the system is taxed. I think it will be worth the money for 2 extra cores. (There was a big discuss in the Intel CPU forum on HardOCP - interesting comments!)
Appreciate you sharing your info.

omniscence · Jan 12, 2015

Do you use an asymetric or even single channel memory configuration?
I doubt that you will get much more out of the quad core for your setup, at least for SnapRaid. It is only single-threaded.

TCM2 · Jan 12, 2015

What relevance has the SnapRAID speed here? It's not even in the process list under normal circumstances - only when you actually update the parity.

drescherjm · Jan 12, 2015

delete.

TCM2 · Jan 12, 2015

Obviously, this host is running some VMs itself.

drescherjm · Jan 12, 2015

My fault that was not even the OP I was replying to.. Sorry for the noise.

JoeComp · Jan 12, 2015

TCM2 said:
What relevance has the SnapRAID speed here? It's not even in the process list under normal circumstances - only when you actually update the parity.

Good catch. I thought the OP meant that he was having I/O problems during the times when SnapRAID was running a sync or scrub, but now that you point it out, that is NOT what the OP actually wrote. It seems SnapRAID is irrelevant to this thread and it was misleading to put it in the subject line.

SnapRaid server - slow performance

steakman1971

2[H]4U

olol

n00b

omniscence

[H]ard|Gawd

steakman1971

2[H]4U

JoeComp

[H]ard|Gawd

omniscence

[H]ard|Gawd

steakman1971

2[H]4U

drescherjm

[H]F Junkie

JoeComp

[H]ard|Gawd

omniscence

[H]ard|Gawd

drescherjm

[H]F Junkie

steakman1971

2[H]4U

omniscence

[H]ard|Gawd

TCM2

Gawd

drescherjm

[H]F Junkie

TCM2

Gawd

drescherjm

[H]F Junkie

JoeComp

[H]ard|Gawd