Storage Tiering - Spinners and NVMe

pclausen

I'm looking for advice on how to consolidate my current storage solution to better leverage some of my existing hardware as well as make some upgrades.

My main storage is configured as follows:

Supermicro X10SRL-F motherboard
E5-2680 v3 CPU - 12 Core @ 2.5GHz
32GB of DDR4 RAM
Intel X520 dual SFP+ 10G NIC
Areca ARC-1882IX-16 RAID Controller

The storage is spread across 2 Supermicro 846 chassis and contains a pair of RAID60 arrays configured as follows:

areca_01.PNG


The drives in each chassis are as follows:

areca_02.PNG


I have had this setup for years and it has been rock solid. Sure, I'll lose the occasional drive from time to time, but I always have cold spares on hand, and replacing a failed drive has always brought the array back to normal.

In addition to the main storage server above, I have a 3 Node Storage Spaces Direct cluster where each node consists of the following:

Supermicro X10SRL-F motherboard
E5-2680 v3 CPU - 12 Core @ 2.5GHz
32GB of DDR4 RAM
Mellanox ConnectX-4 CX4121A SFP28 Dual 25Gbps NIC
2x Supermicro AOC-SLG3-2M2 PCIe to dual NVMe adapters
4x Toshiba 256GB NVMe flash drives

The X10s have bifurcation enabled so that each of the 4 NVMe drives in each node gets 4 PCIe 3.0 lanes to the CPU.

Each node has both 25Gbps LAN connections going to a Dell S5148F-ON switch and has the following vSwitch configuration:

s2d_00.PNG


This then allowed me to validate the cluster and deploy S2D across the 3 nodes, which gave me the following:

s2d_01.PNG


s2d_02.PNG


I created a single VD using all the available space, which ended up being only about 900GB, because with only 3 nodes S2D insists on doing a 3-way mirror. This means my storage efficiency is only 33%.

If I go to 4 nodes, I get 50% efficiency, which still sucks compared to my RAID60 arrays, which have an efficiency of 83% ((12 - 2) / 12 × 100).

S2D is pretty slick, but the 50% efficiency bites: it would take 7 nodes to get to 67%, 9 nodes to reach 75%, and you only max out at 80% efficiency with 16 nodes.
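For what it's worth, here's the quick efficiency math I'm working from (a rough sketch; the S2D figures assume the usual mirror and dual-parity layouts):

# Rough storage-efficiency math for the numbers above (sketch, not authoritative).
def raid6_span_efficiency(drives_per_span):
    # RAID6 burns 2 parity drives per span; RAID60 just stripes several spans,
    # so the ratio is the same as a single span.
    return (drives_per_span - 2) / drives_per_span

def mirror_efficiency(copies):
    # S2D three-way mirror keeps 3 copies of every block.
    return 1.0 / copies

def dual_parity_efficiency(data, parity=2):
    # Reed-Solomon style data+parity group, e.g. 4+2 at seven nodes.
    return data / (data + parity)

print(f"RAID60, 12-drive spans : {raid6_span_efficiency(12):.0%}")   # ~83%
print(f"S2D 3-way mirror       : {mirror_efficiency(3):.0%}")        # ~33%
print(f"S2D 2-way mirror       : {mirror_efficiency(2):.0%}")        # 50%
print(f"S2D dual parity 4+2    : {dual_parity_efficiency(4):.0%}")   # ~67% (7 nodes)
print(f"S2D dual parity 6+2    : {dual_parity_efficiency(6):.0%}")   # 75% (9 nodes)
# S2D tops out around 80% at 16 nodes with its LRC layouts.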

So moving all my spinners to S2D is not an option. Besides, my NVMe cache is way too small to support that anyway.

So other than going to a real SAN, are there any solutions one can assemble using off-the-shelf components to get tiered storage?

Alternatively, would upgrading from my Areca 1882 to a Broadcom MegaRAID 9580-8i8e be a viable option?

That way I could stand up my RAID60 arrays on it as my fixed large-capacity tier, and also move all my NVMe storage to it as a high-speed tier for VMs.

If I went this route, I would need to track down a chassis with a backplane that supports 32x NVMe drives and has an 8x SFF-8654 connector to match the internal connector on the MegaRAID 9580.

Supermicro has this guy:

supermicro_nvme.PNG


But they only sell it turn-key, starting at around $5,500, which is way too rich for my blood.

Any less expensive solutions out there?

In a nutshell, I'm looking for a single-server solution that can handle my RAID60 arrays as well as a single NVMe array that I can grow over time. A 32-slot solution like the SMC pictured above would fit the bill nicely.

I would probably build this new solution on this soon to be released SMC motherboard:

https://www.supermicro.com/en/products/motherboard/H12SSL-i

Thanks!
 
Have you looked at Nutanix? You can do a single node with it if you really want to stick to a single server, but three-node clusters are usually where you start. You can also do a two-node cluster with a separate witness.
 
Are you limited to Windows and hardware RAID?
For a single-server solution I would always recommend modern software RAID over hardware RAID, as on a modern system it is faster: you can use much more RAM for caching, and a current CPU is faster than the one on the RAID adapter. This was different 10 years ago, but not anymore.

The only OS where hardware RAID is still a decent option is Windows with NTFS. On all other OSs, like the Unix options BSD and Solarish or Linux, you will always use software RAID and a modern filesystem like ZFS, developed by Sun. Windows has ReFS with some of its features, but it is not yet on par regarding most features or performance.

A move to ZFS would give you much better data security due to Copy on Write (no corrupt filesystem after a crash during a write, no chkdsk needed), ransomware-safe snapshots (thousands if needed), data and metadata checksums to detect any silent data problems, state-of-the-art RAM caching for reads and writes, and special vdevs as a tiering alternative. A special vdev is a raid array, e.g. an NVMe mirror, that holds data based on type, such as metadata, dedup data, or single filesystems (similar to partitions but without a fixed size) selected by recordsize.
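To make that concrete, a pool with a special vdev is typically built along these lines on OpenZFS (just a sketch driving the usual zpool/zfs commands; the pool, dataset and device names are placeholders, so check the man pages on your platform before running anything):

# Sketch: ZFS pool with an NVMe special vdev (all names below are placeholders).
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Capacity tier: two 6-disk RAIDZ2 vdevs (roughly the shape of a RAID60).
run(["zpool", "create", "tank",
     "raidz2", "sda", "sdb", "sdc", "sdd", "sde", "sdf",
     "raidz2", "sdg", "sdh", "sdi", "sdj", "sdk", "sdl"])

# Special vdev: mirrored NVMe for metadata (and optionally small data blocks).
# zpool may want -f here because the mirror does not match the raidz2 redundancy level.
run(["zpool", "add", "tank", "special", "mirror", "nvme0n1", "nvme1n1"])

# Route all records <= 64K of this filesystem to the special vdev.
run(["zfs", "create", "tank/vms"])
run(["zfs", "set", "special_small_blocks=64K", "tank/vms"])

Metadata lands on the special vdev automatically; the special_small_blocks property is what pulls small data blocks onto it as well.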

On the OS side, your best options are based on Solaris, where ZFS comes from and where you get native ZFS. For open-source options you can use OpenZFS on FreeBSD, Linux and Illumos, a free Solaris fork with OmniOS as the main distribution for production storage. The special vdev feature has been available on Linux and OmniOS for around 1.5 years; on FreeBSD it will be available soon.

On the hardware side you would need two modifications. The first is to move from your hardware RAID to an HBA like a Broadcom 3008 or the equivalent Supermicro AOC-S3008L. The second: since ZFS uses RAM massively for caching, an upgrade to 64 GB RAM is strongly suggested for best performance.

If you want to use an NVMe-only array, at least additionally, you should stay with a ready-to-use solution from Supermicro. As a cheaper alternative you can use as many NVMe drives as your mainboard supports, either in M.2 slots, OCuLink ports, or via PCIe adapters.

As an alternative to a massive NVMe array, which on most systems may deliver data rates way below what the drives are capable of, you can use 12G SAS SSDs like the WD SS530. They are close to NVMe without the hassle. You would need a 12G expander for maximum sequential transfers; with a 6G expander you should link HBA and expander with 2 mini-SAS cables per expander.
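Rough numbers on why the expander uplink matters (standard per-lane SAS rates after encoding, just a sanity check):

# Back-of-the-envelope SAS uplink bandwidth between HBA and expander.
MB_PER_LANE = {"6G": 600, "12G": 1200}   # usable MB/s per SAS lane after encoding

def uplink_gb_s(sas_gen, cables, lanes_per_cable=4):
    return MB_PER_LANE[sas_gen] * cables * lanes_per_cable / 1000.0

print("12G expander, 1 x4 cable :", uplink_gb_s("12G", 1), "GB/s")  # 4.8
print(" 6G expander, 1 x4 cable :", uplink_gb_s("6G", 1), "GB/s")   # 2.4
print(" 6G expander, 2 x4 cables:", uplink_gb_s("6G", 2), "GB/s")   # 4.8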

I have collected some manuals and ideas, in case ZFS is an option:
https://napp-it.org/manuals/index.html

All ZFS solutions on any OS (even Windows, where a ZFS driver is in early beta) are available for free or with paid support and extras for the management tools or the OS; e.g. on Solaris a support contract is mandatory for professional use, while OmniOS offers one as an option. Some, like Nexenta or iX, bundle OS, management tool and hardware.
 
_Gea, appreciate the feedback!

I started out with hardware RAID about 20 years ago with a BusLogic RAID adapter with PATA interfaces and a bunch of 300GB drives. Moved up from that, switched to FreeNAS 8 years ago, then gravitated back to hardware RAID maybe 5 years ago using Areca 1882 adapters (currently have 3) and have stayed there ever since.

I'm not limited to Windows and hardware RAID. I've been following the new TrueNAS development and might be ready to give it another try.

I still have a bunch of LSI SAS9211-8i and SAS9200-8e cards lying around from back when I was running a 60x 2TB-drive FreeNAS array on my Supermicro 846 chassis with SAS2 expanders. But I realize those are pretty ancient by today's standards.

A friend of mine (who is running very large arrays on Areca 1883 controllers; guess he didn't get the memo either that hardware RAID is obsolete in 2020, lol) looked into the TrueNAS solution and provided me with the following feedback:

--- cut ---
FreeNAS is not smart enough to migrate hot data to an SSD. All ZFS storage is just block storage of varying speeds, there's no disk-speed based autotiering. Before considering an SSD L2ARC, plan to up your RAM instead - this *is* used for primary cache to speed things up. There are posts around on when/why adding L2ARC (and ZIL and SLOG) are useful, and more importantly when they are not, go read around the topic because it's workload and hardware dependent.

Even shorter answer: Set yourself up a separate SSD pool and manually use it for SSD-type tasks.

If you want really lightning fast storage for your hot spots on HDD, there are two basic things to do:

1) For read, have sufficient ARC+L2ARC to cover the working set. So if your working set is 1TB of data out of 10TB total data, have at least 128GB main memory and a pair of 500GB SSDs. Ideally 256GB main memory. Your working set reads will *fly*. This does not actually cause "storage tiering" in the sense that it moves data off of the HDD and onto SSD, but the effect is basically the same, with the added advantage that you do not need RAID or mirroring for the SSDs, unlike a typical tiering solution. All data is always on the HDD in the pool. The stuff you use frequently is on SSD.

2) For write, have GOBS of free space on the pool. Never fill a pool more than 50%. Keep it below 25% if you can. Once you get below that, it really starts to feel like an all-SSD pool even though it is HDD. This is because ZFS is converting all your writes, random, sequential, etc., into sequential writes. This isn't storage tiering either, but it will feel fast, so maybe that doesn't matter.
--- cut ---
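
Putting his rules of thumb into numbers for a setup like mine (pure back-of-the-envelope; the working set size is a guess):

# Back-of-the-envelope check of the ARC/L2ARC and free-space advice above.
working_set_tb = 1.0      # hot data I actually touch regularly (guess)
total_pool_tb = 10.0

ram_gb = 256              # suggested main memory, most of it usable as ARC
l2arc_gb = 2 * 500        # pair of 500GB SSDs as L2ARC

cache_tb = (ram_gb + l2arc_gb) / 1024.0
print(f"ARC + L2ARC roughly {cache_tb:.2f} TB vs a {working_set_tb} TB working set")

# Write side: keep the pool mostly empty so ZFS can keep turning writes sequential.
for fill in (0.25, 0.50):
    print(f"at {fill:.0%} fill that's {total_pool_tb * fill:.1f} TB of data on a {total_pool_tb:.0f} TB pool")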

As for flash storage, those SAS SSDs sure are expensive (like the WD SS530 you linked to).

My current thinking is to keep my Areca 1882 on my X10 Supermicro mobo, stand up a new server based on the H12 Supermicro EPYC mobo that will soon be released, and get a PCIe 4.0 to 4x M.2 adapter card for it with four 1TB NVMe flash drives. 4TB will be plenty for all my VMs, and I would back that up on a regular basis to one of the RAID60 spinner arrays. This would give me very fast storage for my VMs (on the order of 20 GB/s read and about 17 GB/s write using PCIe 4.0 x4 flash drives).
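The 20 GB/s figure is in the ballpark of what four PCIe 4.0 x4 drives can do in aggregate; quick sanity check (theoretical bus ceiling, real drives land lower):

# Aggregate bandwidth estimate for 4x PCIe 4.0 x4 NVMe drives.
gb_s_per_lane_gen4 = 16 * (128 / 130) / 8   # ~1.97 GB/s per lane after 128b/130b encoding

lanes_per_drive = 4
drives = 4

per_drive_ceiling = gb_s_per_lane_gen4 * lanes_per_drive   # ~7.9 GB/s bus limit per drive
aggregate_ceiling = per_drive_ceiling * drives             # ~31.5 GB/s bus limit total

print(f"bus ceiling per drive : {per_drive_ceiling:.1f} GB/s")
print(f"bus ceiling, 4 drives : {aggregate_ceiling:.1f} GB/s")
# Real Gen4 drives spec closer to ~5 GB/s reads, so ~20 GB/s aggregate assumes the
# load is actually spread across all four drives.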

At some point down the road, I could then look at upgrading my X10 storage server to a H12 Epyc as well, and get one with an embedded S3008 and explore OS options other than Windows.

I'll have to look into this new special vdev feature you speak of that's available on Linux and OmniOS. I'll also study your collection of manuals and ideas on ZFS. I really liked ZFS when I was running FreeNAS and never had any data loss issues. One minor gripe I had with it was that it would not turn on the red LED on my Supermicro SAS2 backplanes when a drive failed, which is something my Areca controller does, but that is very minor in the grand scheme of things. LOL.
 
I'll have to look into this new special vdev feature you speak of that's available on Linux and OmniOS. I'll also study your collection of manuals and ideas on ZFS. I really liked ZFS when I was running FreeNAS and never had any data loss issues. One minor gripe I had with it was that it would not turn on the red LED on my Supermicro SAS2 backplanes when a drive failed, which is something my Areca controller does, but that is very minor in the grand scheme of things. LOL.

A hardware RAID controller knows when a disk fails and has access to the backplane to switch the LED on. With software RAID, only the OS knows when a disk is bad, but the OS has no standard way to switch an alert LED on. For some Broadcom HBAs you can use sas2ircu (6G adapters) or sas3ircu (12G adapters) to switch the alert LED on for a failed/given WWN disk.
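If you go that route, the LED call is easy to script; something like this (controller index and enclosure:slot are placeholders, and you should verify the exact locate syntax against your sas3ircu version first):

# Sketch: turn the locate LED for one slot on/off via Broadcom's sas3ircu utility.
# Get the controller index from "sas3ircu list" and the Enclosure:Slot pair from
# "sas3ircu <index> display" before using this.
import subprocess

def set_locate_led(controller, encl_slot, on):
    state = "ON" if on else "OFF"
    subprocess.run(["sas3ircu", str(controller), "locate", encl_slot, state], check=True)

set_locate_led(0, "2:5", True)    # light the LED on enclosure 2, slot 5
set_locate_led(0, "2:5", False)   # and switch it off again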

The special vdev idea in modern ZFS (developed by Intel) is an intelligent alternative to tiering. Tiering means you copy active or performance-sensitive data over to a faster part of the array (and back later). Under load, such a copy itself has performance impacts, especially with dynamic data. A special vdev follows a different approach: it automatically stores data on the faster part of the array based on performance-sensitive data structures, like small I/O, data with a recordsize below a threshold, metadata, or, in the case of realtime dedup, the dedup table itself instead of holding it in RAM.
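Boiled down, the placement rule per block looks roughly like this (simplified model; the 64K special_small_blocks threshold is just an example setting):

# Simplified model of where ZFS places a block when a special vdev exists.
def placement(block_kind, block_size_kb, special_small_blocks_kb=64):
    # block_kind: "metadata", "dedup" or "data"
    if block_kind in ("metadata", "dedup"):
        return "special vdev (NVMe)"              # always, once a special vdev exists
    if 0 < special_small_blocks_kb and block_size_kb <= special_small_blocks_kb:
        return "special vdev (NVMe)"              # small records, per special_small_blocks
    return "regular vdevs (spinners)"             # everything else stays on the big array

print(placement("metadata", 4))      # special vdev (NVMe)
print(placement("data", 16))         # special vdev (NVMe)
print(placement("data", 1024))       # regular vdevs (spinners)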
 