SSD Suggestions for KVM storage, and L2ARC and SLOG?

tycoonbob

Long story short, I have this box:

Chassis: Supermicro SC846E16-R1200B (PSUs replaced with PWS-920P-SQ)
Motherboard: Supermicro X8DTE-F
CPUs: 2x Intel Xeon L5640 (hexa-core, 2.26GHz, HT)
RAM: 96GB DDR3 (12 x 8GB PC3-10600, registered ECC)
HBA: Adaptec 5805 (all drives in JBOD passthrough)
SAS Expander/Backplane: Supermicro BPN-SAS-846EL1 (the SAS expander is the chassis backplane)
OS Drives: 2x Crucial MX100 128GB SSD (ZFS mirror)
Storage Drives: 8x Toshiba PH3500U-1I72 5TB 7200RPM (ZFS striped mirrors)

This box runs Proxmox and is my primary hypervisor and storage server (basically Debian + KVM + ZoL). Right now I have ~10TB of data on the main ZFS pool, including the 15 or so VMs I have running. My plan is to create a new ZFS mirror to use for VM storage, and I'm looking at ~500GB SSDs for that; right now the front-runner is the 500GB MX200.
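For reference, here's roughly what I have in mind for the new VM pool (just a sketch; the device names and the pool name are placeholders):
Code:
# Mirror of the two ~500GB SSDs; "tank_vm_01" and the by-id names are placeholders.
zpool create -o ashift=12 tank_vm_01 mirror \
    /dev/disk/by-id/ata-Crucial_CT500MX200SSD1_AAAA \
    /dev/disk/by-id/ata-Crucial_CT500MX200SSD1_BBBB

# Settings that generally suit VM images on ZoL.
zfs set compression=lz4 tank_vm_01
zfs set recordsize=16K tank_vm_01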

Also, I'm looking to add an L2ARC and a SLOG to this box. I have my ARC limited to 48GB of RAM, but intend to raise that to 64GB before adding an L2ARC. I limited the ARC to half the system memory to see what performance I got and to leave plenty of room for my VMs. Right now the system is consuming ~66GB of RAM (48GB for ARC, 18GB for VMs), leaving 32GB free. Increasing my ARC limit to 64GB will still leave 14-16GB of free RAM, which gives me room for a few more VMs while keeping at least 8GB free.
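For anyone curious, this is how I'm capping the ARC on ZoL (a sketch from memory; values are in bytes, 48GB shown):
Code:
# Runtime change (48GB = 51539607552 bytes); use 68719476736 for 64GB later.
echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max

# Persistent across reboots via the module options file.
echo "options zfs zfs_arc_max=51539607552" > /etc/modprobe.d/zfs.conf
# (run update-initramfs -u afterwards if ZFS is loaded from the initramfs)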

Performance on the 8 x 5TB drives is pretty good already:
Drive Test Results
Sequential Writes:
Code:
root@mjolnir:~# time sh -c "dd if=/dev/zero of=/tanks/tank_data_01/dd_test_8g_128k.tmp bs=128k count=62500"
62500+0 records in
62500+0 records out
8192000000 bytes (8.2 GB) copied, 6.23245 s, 1.3 GB/s

real	0m6.236s
user	0m0.024s
sys	0m3.748s

root@mjolnir:~# time sh -c "dd if=/dev/zero of=/tanks/tank_data_01/dd_test_8g_128k.tmp bs=128k count=62500 conv=fdatasync"
62500+0 records in
62500+0 records out
8192000000 bytes (8.2 GB) copied, 42.656 s, 192 MB/s

real	0m43.282s
user	0m0.035s
sys	0m6.443s

root@mjolnir:~# time sh -c "dd if=/dev/zero of=/tanks/tank_data_01/dd_test_256g_128k.tmp bs=128k count=2000000 conv=fdatasync"
2000000+0 records in
2000000+0 records out
262144000000 bytes (262 GB) copied, 448.095 s, 585 MB/s

real	7m28.098s
user	0m1.417s
sys	2m38.794s

root@mjolnir:~# time sh -c "dd if=/dev/zero of=/tanks/tank_data_01/dd_test_256g_1m.tmp bs=1M count=256000 conv=fdatasync"
256000+0 records in
256000+0 records out
268435456000 bytes (268 GB) copied, 461.685 s, 581 MB/s

real	7m41.688s
user	0m0.288s
sys	2m34.050s

Sequential Reads:
Code:
root@mjolnir:~# time sh -c "dd if=/tanks/tank_data_01/dd_test_8g_128k.tmp of=/dev/null bs=128k count=62500"
62500+0 records in
62500+0 records out
8192000000 bytes (8.2 GB) copied, 12.0287 s, 681 MB/s

real	0m12.031s
user	0m0.010s
sys	0m2.722s

root@mjolnir:~# time sh -c "dd if=/tanks/tank_data_01/dd_test_256g_128k.tmp of=/dev/null bs=128k count=2000000"

2000000+0 records in
2000000+0 records out
262144000000 bytes (262 GB) copied, 413.405 s, 634 MB/s

real	6m53.411s
user	0m0.435s
sys	1m22.216s

root@mjolnir:~# time sh -c "dd if=/tanks/tank_data_01/dd_test_256g_1m.tmp of=/dev/null bs=1M count=256000"
256000+0 records in
256000+0 records out
268435456000 bytes (268 GB) copied, 387.86 s, 692 MB/s

real	6m27.866s
user	0m0.130s
sys	1m18.439s

But when I'm doing large sequential file copies (above 20GB), I notice a decrease in VM performance. Moving my VMs onto a dedicated SSD pool will largely remove that concern, but I still want to add a SLOG and L2ARC.

SLOG: It's my understanding that ZFS flushes the ZIL to the pool every 5 seconds. Even though I have 4 GbE NICs, I will rarely be saturating all four from multiple devices at the same time, so theoretically I shouldn't have more than ~500MB of data sitting in the SLOG before it's flushed to the pool. Am I interpreting this correctly? Even allowing for over-provisioning for write endurance, I really won't need an SSD larger than 8GB. The problem with smaller SSDs is that their write performance is always lower; it seems you don't get solid write performance unless you go to at least a 400GB drive, which is definitely overkill for what I need. I don't want to spend more than $400 on a SLOG device, and I'll only spend that much if there's a performance reason to do so. I'm considering an Intel S3710 200GB ($300), an S3510 480GB ($365), or possibly a PCIe alternative (limited to PCIe 2.0 bandwidth, I believe).
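Whichever drive I end up with, attaching it should be straightforward; a rough sketch of what I'm picturing, with a placeholder device name and a small partition so most of the SSD stays unallocated for over-provisioning:
Code:
# Only a few seconds' worth of sync writes ever sit on the SLOG, so a small slice is plenty.
parted -s /dev/disk/by-id/ata-INTEL_S3710_XXXX mklabel gpt mkpart slog 1MiB 16GiB

# Attach it as a dedicated log device to the spinning-disk pool.
zpool add tank_data_01 log /dev/disk/by-id/ata-INTEL_S3710_XXXX-part1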

L2ARC: Want to stick under $400 for this one as well. Intel S3510 480GB?

These Intel drives definitely give the endurance, but their performance isn't quite up there with the Samsung 850s, Crucial MX200s, etc.


Questions:
1) VM drives (two in a ZFS mirror): the 500GB Crucial MX200 seems like a solid choice at $200 each. Are there any others I should be considering in that price range?
2) SLOG: What are some good options under $400? Intel DC SSD offerings? Any NVMe offerings I should be considering? How about NVRAM options? I just found a "Curtiss Wright 5453 1GB NVRAM PCIe Card", which seems to be a battery-backed NFS accelerator and can be bought used for under $40. 1GB should be plenty for a SLOG, and I'd assume it would be fast. Or how about a Fusion-IO 320GB PCIe drive? I can get one on eBay for under $300.
3) L2ARC: Is the Intel S3510 480GB sufficient? Are there any concerns with L2ARC size? Better options under $400?

As you can clearly see, I'm willing to put money into this build. It's my only box, and really the only critical use it gets, other than backing up files, is storing my fiancée's RAW images as she edits. She works on her pictures directly on the server, and that is what I'd consider mission critical.

Thanks!
 
I just found a "Curtiss Wright 5453 1GB NVRAM PCIe Card", which seems to be a battery-backed NFS accelerator and can be bought used for under $40.
That does sound very interesting to me as well :)
However, I don't think it would be usable as a block device without first setting up something like nvramdisk. Any actual experience with this would be helpful.

In your usage scenario it will probably not be an issue, but the endurance of the Intel S35xx could be insufficient for L2ARC. Note that L2ARC writing is capped at 8MB/s by default, so that provides an upper limit for the calculation.
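If you want to check or raise that cap on ZoL, it's exposed as module parameters (a sketch; paths from memory):
Code:
# Sustained L2ARC fill rate in bytes/s (default 8388608 = 8MB/s);
# l2arc_write_boost applies while the L2ARC is still cold.
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_write_boost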

Of course, L2ARC is pointless if your pool already consists of SSDs, and even a SLOG device might be.
 
To clarify, the SLOG and L2ARC would only be for the 8 x 5TB 7200RPM pool, assuming you can set L2ARC and SLOG per pool.

The 2 x 500GB SSD mirror pool would not have SLOG or L2ARC.


I am willing to go for the 200GB Intel S3710 for the L2ARC if that's enough space; its endurance is a lot higher than the S3510's at a comparable size.


I may buy one of those 1GB NVRAM PCIe cards, just to see if it shows up as a block device. I think it would make an awesome SLOG, and 1GB should be plenty for my needs (I think).
 
If the disk pool only receives async writes, the SLOG device won't give any benefit. If you're serving files via SMB (and have all VMs on the other pool), that is the case.

The Intel S3610 already has an order of magnitude more endurance than the S3510, so it should suffice.
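A quick way to sanity-check that on ZoL (a sketch; the kstat path is from memory): if the ZIL counters barely move under normal use, a SLOG won't buy you much.
Code:
zfs get sync tank_data_01
watch -d cat /proc/spl/kstat/zfs/zil   # ZIL commit counters while the workload runs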
 
"These Intel drives definitely give the endurance, but their performance isn't quite up there with the Samsung 850's, Crucial MX200's, etc."

This is just flat-out wrong in most cases, and it's not even a fair comparison... consumer drives vs. enterprise??? Especially for a SLOG, Intel is better hands down.
The Intel has much lower latency, and that's what's REALLY important for a great SLOG. I'd go with an Intel S3700 for SLOG and another for L2ARC. I've had great luck with Intel in servers, with performance that is predictable and consistent.

If you're up for it, the Fusion-IOs are still faster than the Intel S3700s and, being PCIe, have lower latency. They're easily had for under $300 for the 350GB MLC version; run one for SLOG and one for L2ARC and you get great performance for the money.
 
If the disk pool only receives async writes, the SLOG device won't give any benefit. If you're serving files via SMB (and have all VMs on the other pool), that is the case.

The Intel S3610 already has an order of magnitude more endurance than the S3510, so it should suffice.

All read/write activity to the data pool (8 x 5TB mirrored) will be over NFS; none over SMB.


"These Intel drives definitely give the endurance, but their performance isn't quite up there with the Samsung 850's, Crucial MX200's, etc."
This is just flat out wrong in most cases and is not even a fair comparison... consumer drives vs enterprise??? Especially for a SLOG, intel is better hands down.
The intel has much lower latency and that's what's REALLY important for a great SLOG. I'd go with Intel S3700 for SLOG and another for L2ARC but I have had great luck with Intel in servers, and performance that is predictable and consistent.

I said that only because the sequential and random spec numbers are higher on the 850 than on the 200GB S3710. For example, the 200GB S3710 is rated for sequential R/W at 550/300 MB/s, while the 250GB 850 Evo lists those same numbers as 540/520. However, after posting that, I did some more research and found that latency is what matters most for a SLOG. I don't know how much the sequential and random speeds come into play.

If you're up for it, the Fusion-IOs are still faster than the Intel S3700s and, being PCIe, have lower latency. They're easily had for under $300 for the 350GB MLC version; run one for SLOG and one for L2ARC and you get great performance for the money.
Yeah, since I have a free PCIe slot and know I won't be needing any more NICs (I already have 8 in the box), I'm thinking a Fusion-IO drive for the SLOG. For the L2ARC, I'm torn between a 480GB S3610 and a 200GB S3710. The 480GB S3610 seems to have higher sequential speeds (550/440), while the 200GB S3710 is rated at 550/300. The 480GB S3610 has lower 4KB random write IOPS at 28K, versus 43K for the 200GB S3710. Random read IOPS is a wash, at 84K vs. 85K, and endurance is also a wash: 3.6PB for the 200GB S3710 and 3.7PB for the 480GB S3610. I'm not sure what size L2ARC I could really benefit from (or at what point I have so much that it eats up a lot of RAM that would otherwise be used for the ARC).

I'd love to get 400GB S3710s to use for both my L2ARC and SLOG (one for each), but that's double my price range, especially considering I still need to buy two 500GB SSDs for my VM pool. The S3710 seems best suited for write-intensive workloads (10 drive writes per day endurance), so if I went with a SATA SSD over a Fusion-IO card for the SLOG, the 200GB S3710 would be the smart choice. I may just go ahead and consider the 200GB S3710 for my L2ARC as well, mainly for the endurance; it's good to know it should last at least 5 years under heavy use.

So I guess the question now, is used 350GB Fusion-IO card, or new 200GB S3710 for SLOG?


EDIT: What about Intel's new PCIe SSDs? Aren't those essentially the same thing as the Fusion-IO ioDrives?
 
The new drives are better -- but only certain motherboards support them.

Having used S3700s, NVMe, and Fusion-IO, for the money a Fusion-IO is much better for SLOG due to its low latency, and in fact it's faster in transfers too :) If you can, find an SLC version; even faster!
 
Be careful about specs.

Samsung really pushes their random IOPS specs high, and drives with specs 10x lower can outperform them. That's because Samsung rates their drives at a much higher queue depth than anyone is likely to actually hit, and Samsung SSDs also have some long-latency transactions, which makes them not good at all for a SLOG (better than nothing, but not good for consistent performance).

You do know that L2ARC only helps with repeated access? You really need a large working set of data, and a lot of the time people's data is a lot smaller than they think it is.

Have you been tracking your ARC measurements? Your MRU/MFU and ghost values? Your hit/miss rates?
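On ZoL you can pull those straight from the arcstats kstat (a sketch; field names from memory); high ghost-list hits suggest a bigger ARC, or an L2ARC, would actually help:
Code:
grep -wE 'hits|misses|mru_hits|mfu_hits|mru_ghost_hits|mfu_ghost_hits' \
    /proc/spl/kstat/zfs/arcstats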
 
Be careful about specs.

Samsung really pushes their random IOPS specs high, and drives with specs 10x lower can outperform them. That's because Samsung rates their drives at a much higher queue depth than anyone is likely to actually hit, and Samsung SSDs also have some long-latency transactions, which makes them not good at all for a SLOG (better than nothing, but not good for consistent performance).

You do know that L2ARC only helps with repeated access? You really need a large working set of data, and a lot of the time people's data is a lot smaller than they think it is.

Have you been tracking your ARC measurements? Your MRU/MFU and ghost values? Your hit/miss rates?

I'm not interested at all in using Samsung SSDs, and I'm primarily looking at the 200GB Intel S3710 for my SLOG.

FYI, I just bought my VM SSDs and decided to go with the Mushkin Striker 480GB drives (bought two), which I will likely put in a hardware RAID 1 instead of a ZFS mirror (I expect hardware RAID 1 to give better performance, and I'll be able to test this tomorrow once my drive trays show up).

Anyway, I have only been running ZFS for about 2 weeks now. My ARC is limited to 64GB, my system is constantly using all of it, and it takes about 30 minutes to get back up to that after a reboot. I'm pretty confident this is because I'm running my VMs in the main storage pool, and I expect different results once I move those VMs onto their own dedicated pool/array.

Here are the stats I've been going off of for my ARC lately:
Code:
root@mjolnir:~# arcstat.py 10 10
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
11:20:16     0     0      0     0    0     0    0     0    0    63G   64G
11:20:26    67     1      2     1    2     0    8     1   18    63G   64G
11:20:36   411     4      1     1    0     2   70     1   17    63G   64G
11:20:46   217     5      2     0    0     4   97     0    6    63G   64G
11:20:56    98     3      3     1    1     1   76     1   15    63G   64G
11:21:06    87    30     34     1    2    28  100     1    6    63G   64G
11:21:16    66     5      7     1    2     3  100     1   11    63G   64G
11:21:26    82     4      5     1    1     3   13     1   14    63G   64G
11:21:36    72     8     12     1    2     7   60     1    9    63G   64G
11:21:46    39     1      2     1    2     0    5     1   15    63G   64G

I'm still not fully up to snuff on how the ARC actually works behind the scenes (everyone says it's a read cache, so the more ARC the better -- which I know isn't always true), but these stats show me it's being used to some extent. That said, I already have a 128GB Crucial MX100 I could use for L2ARC instead of buying something new. Once my drive trays come in, I may add that 128GB MX100 as L2ARC just to see whether my performance improves. If it does, I will likely stick with that drive. If performance decreases, I'll re-evaluate and buy something different (200GB Intel S3710, 400GB Intel S3610, or 480GB Intel S3510, depending on the size needed and the performance of those drives, but I'm open to other suggestions).
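Trying the spare MX100 is cheap, since cache vdevs can be added and removed without touching the pool data; roughly what I'm planning (placeholder device name):
Code:
zpool add tank_data_01 cache /dev/disk/by-id/ata-Crucial_CT128MX100SSD1_XXXX
zpool iostat -v tank_data_01 10    # watch how much of the cache device actually fills
# If it doesn't help, it can just be pulled back out:
zpool remove tank_data_01 /dev/disk/by-id/ata-Crucial_CT128MX100SSD1_XXXX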

For SLOG, based on my research the 200GB Intel S3710 seems to be a solid choice, and no one has recommended anything different (as I'd hoped they might). The S3710 has about as low a latency as you can get from a SATA SSD, its 4K random write IOPS are good, and the drive really only lacks in sequential write speed, which is spec'd at 300MB/s. That's still fast, but also still less than my theoretical maximum of 500MB/s writing to this pool (limited by the quad GbE NICs in LACP). The alternative I've been looking at is a Fusion-IO card, but I know little about them and haven't found any good information on whether they're ideal for SLOG.
 
Your max write speed to the SLOG over 4x gigabit is going to be down around 4 x 600Mbit or so. That's because you're doing sync writes: every write must go over the gigabit network, then be processed and written to the SLOG, then a confirmation has to go back, and only THEN is your system allowed to write more data.

Basically, your network is half duplex now, waiting on confirmation packets. You would need several Proxmox systems to get over that limit, and that's if you're using iSCSI; if you're using NFS, it's limited to one gigabit link anyway.
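You can see that round-trip penalty from a client easily enough (a sketch; the mount path is a placeholder). The second run forces a flush per write and will crawl by comparison:
Code:
dd if=/dev/zero of=/mnt/nfs_share/synctest.tmp bs=128k count=8192
dd if=/dev/zero of=/mnt/nfs_share/synctest.tmp bs=128k count=8192 oflag=sync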
 
Your max write speed to the SLOG over 4x gigabit is going to be down around 4 x 600Mbit or so. That's because you're doing sync writes: every write must go over the gigabit network, then be processed and written to the SLOG, then a confirmation has to go back, and only THEN is your system allowed to write more data.

Basically, your network is half duplex now, waiting on confirmation packets. You would need several Proxmox systems to get over that limit, and that's if you're using iSCSI; if you're using NFS, it's limited to one gigabit link anyway.

Thanks for the clarification on the theoretical SLOG limits.

I would also like to clarify that I am NOT storing VMs on this pool, so there's definitely no iSCSI usage, and no NFS to Proxmox storage either (well, at least not for VMs -- I do have a dataset that's used for backups).

Now, something to note: I got my two new SSDs in today, threw them in a RAID 1 on my Adaptec controller, and moved my ~15 VMs to that array. My ARC usage dropped from the 64GB max limit to only 12GB, a clear indicator that most of my ARC usage was those VMs, which were only there temporarily from the start. Since then I've dropped my ARC max size from 64GB back to 48GB, and I will continue to monitor it as my fiancée starts working more with her pictures (which is where the pool will likely see its most use). Because of this, I'm questioning whether I even need to consider an L2ARC drive at this point.

I also added two more 5TB drives to the zpool, so it's now five 2-way mirrors in the pool. Those two extra drives did make a difference in pool performance, and I suspect that will continue to improve as I add up to 10 more 5TB drives to the pool, though that will be a bit in the future. I'm now on the fence about whether I should bother adding a SLOG.
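For reference, growing the pool a mirror pair at a time looks like this (placeholder device names). Existing data stays where it is; only new writes spread across the added vdev:
Code:
zpool add tank_data_01 mirror \
    /dev/disk/by-id/ata-TOSHIBA_PH3500U-1I72_XXXX \
    /dev/disk/by-id/ata-TOSHIBA_PH3500U-1I72_YYYY
zpool status tank_data_01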


Code:
root@mjolnir:~# arcstat.py -f time,read,hits,hit%,miss,miss%,arcsz,c 5 10
    time  read  hits  hit%  miss  miss%  arcsz     c
17:34:49     0     0   100     0      0    12G   12G
17:34:54    58    53    90     5      9    12G   12G
17:34:59    41    41   100     0      0    12G   12G
17:35:04  1.1K  1.0K    91    89      8    12G   12G
17:35:09   10K   10K    99    47      0    12G   12G
17:35:14  1.2K  1.1K    95    51      4    12G   12G
17:35:19    50    50   100     0      0    12G   12G
17:35:24    57    57    99     0      0    12G   12G
17:35:29    55    55   100     0      0    12G   12G
17:35:34    52    52    99     0      0    12G   12G

Code:
root@mjolnir:/# time sh -c "dd if=/dev/zero of=/tanks/tank_data_01/dd_test_256g_128k.tmp bs=128k count=2000000 conv=fdatasync"
2000000+0 records in
2000000+0 records out
262144000000 bytes (262 GB) copied, 803.636 s, 326 MB/s

real	13m23.647s
user	0m1.499s
sys	2m41.972s
root@mjolnir:/# time sh -c "dd if=/tanks/tank_data_01/dd_test_256g_128k.tmp of=/dev/null bs=128k count=2000000"
2000000+0 records in
2000000+0 records out
262144000000 bytes (262 GB) copied, 302.518 s, 867 MB/s

real	5m2.524s
user	0m0.563s
sys	1m25.764s
 