IDEA: Using 3D XPoint memory to improve SMR HDD array performance?

Gigas-VII

A few months ago I had an idea. SMR disks tend to struggle with writes, especially random writes, so what about using an Optane NVMe stick (3D XPoint memory) as an insanely good write cache?

I've configured a pair of 5TB SMR disks as a mirrored pool of disks using Storage Spaces, running on a current build of Windows 10 Enterprise.

I'm trying to figure out how to add additional tiers of disks (happy to use PowerShell, just unsure what to do), ideally one SATA SSD as a read cache and my 32GB Optane stick as a write cache. I figure that makes good use of the PCIe bandwidth I have, while giving me a large amount of redundant storage without the performance penalties of SMR disks.

I can wipe the disks at any time as this is only a test. I'm not trying to make an enterprise-grade pool here. I recognize I might lose some data from the write cache during an unexpected reboot. I recognize I might gain nothing from this whole expedition. But I figured that for the price of a couple disks, it was a worthwhile venture.
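For reference, the existing mirrored pool was built with roughly the following PowerShell (the pool and volume names, the size, and the disk filter are placeholders rather than my exact commands):

Code:
# Grab the two poolable 5TB SMR disks (sanity-check this list before creating anything)
$smr = Get-PhysicalDisk -CanPool $true | Where-Object { $_.Size -gt 4TB }

# Create the pool, then a two-way mirrored ReFS volume on top of it
New-StoragePool -FriendlyName "SMRPool" `
    -StorageSubSystemFriendlyName (Get-StorageSubSystem).FriendlyName `
    -PhysicalDisks $smr
New-Volume -StoragePoolFriendlyName "SMRPool" -FriendlyName "SMRMirror" `
    -ResiliencySettingName Mirror -FileSystem ReFS -DriveLetter S -Size 4TB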
 
I'm sure it would deliver a decent performance improvement over the current flash-based "SSHD" hybrid drives.

It's a fairly narrow price niche you can wedge yourself into between cheap-storage HDDs and high-performance SSDs, though. Existing hybrid drives have managed to find only a fairly small market.
 
I'd love to actually bench this idea, though, to find out whether it's actually worth doing. I already own the hardware.
 
Shingled drives have their own on-disk cache, so random writes are fast, as long as the cache doesn't fill up (it's emptied in the background).
 
Yep. The Archive 8TB drives have ~20 GB of cache per StorageReview's review of the drive. I was under the impression this was a portion of the disk written as PMR, which the drive then spools out to the rest of the SMR area, rather than any form of flash-based cache. I have a bunch of the Archive drives, and their performance bears this out: if they had a NAND-based cache I would expect *lightning* quick writes for the first 20 GB or so, but instead it's "normal 5400/7200 RPM drive speed" writes for the first 20 GB, and then it slows down after that.
 
Shingled drives do have on-disk caching, but if you need to write more data than will fit in the available fresh blocks and pages (forgive my loose use of terms, someone can correct me later), then a cleanup cycle may occur during writing. If we can coalesce writes in advance on even better, more durable storage media, we can make sure the SMR drives have much more time to do their job cleanly.

You can dismiss my idea all you want, but I'd like to implement it anyway just to give it a try.

When SMR drives hit their bottleneck, they go from "oh this is great" to "oh this is terrible...", and it takes a while for them to sort themselves out. By design. But all of that happens independently of the OS, because neither the OS nor the filesystem is ready to issue SMR-aware commands, despite SMR drives being on the market for over four years.
 
For this kind of performance tuning you would need "host-managed shingled magnetic recording" drives.
The problem with these drives comes from the fact that for every changed bit they would need to rewrite the whole "track of shingles" (128MB, if I remember correctly), which involves a read-modify-write cycle. This is the cause of the abysmal performance.
So any kind of caching can only postpone the performance collapse with these SMR drives.
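To put a rough number on that read-modify-write cost, taking the quoted band size at face value (a back-of-the-envelope sketch; actual zone sizes vary by drive):

Code:
# Worst case: one small random write forces a rewrite of an entire shingle band
$bandSize    = 128MB
$randomWrite = 4KB
$bandSize / $randomWrite   # ~32768x more data rewritten than actually changed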
 
Postpone or prevent entirely, depending on the scenario. If you write out ~100 GB a day to this drive over the course of several hours, even randomly, then you'll likely never notice the SMR nature of the drive, because it has enough on-disk cache area to handle the writes combined with enough idle time to spool that data out to the shingled area of the drive. If you're slamming the drive with writes 100% of the time, then you bought the wrong drive. There's a reason Seagate puts the word Archive in the name of the drive: the design goal is "write once, read many".
 
You're right that caching only postpones the collapse, but while the pool is working from the cache rather than the disks, the disks are left with more idle cycles to rearrange data into contiguous blocks and mark old blocks as available for reuse. Right now that happens internally to the disks, since they are self-managing rather than managed by the OS.

So if the OS is touching the disks less often, and the touches are generally already organized, the SMR disks should find themselves struggling a lot less.

I figure having a 5:1 HDD:SSD capacity ratio, or even a 10:1, would handle the read operations fairly well. And then a substantial write cache (32GB in my case, as a test) should dramatically improve write performance, while penalizing the SMR disks less. Their own on-board caches will smooth things a little more but we already know those caches aren't sufficient in terms of capacity.
 
Exactly, that write-once-read-many design is the intent. But what if you could get really nice performance out of pools of these disks, in a small form factor, just by adding another storage tier? That's what I'm trying to find out.

I'm not debating the theory. I'm ready to test it and provide some benchmark results. I need some help to configure the pool correctly.
 
I don't think there's any way to fully bypass the "SMR penalty" if you exceed the average sustained write speed of the drives over the long term. We once had an array of Archive drives with a 1TB SSD write cache put in front of them, and they worked great for a very short period of time (days), but our overall average writes were just too much; eventually the SSD filled up and couldn't spool it out to the Archives at an appropriate rate, and the whole thing came crumbling down performance-wise.

On the other hand, if you *don't* exceed the average write speed, then you may or may not even need the cache.

It's a very... niche performance category where having a big write cache can help you sustain large instantaneous write loads, while still having enough idle time to actually commit that write to disk. Of course, the larger the brief write load is the longer you'll need the array to be idle in order to commit that burst of activity to disk. That's where we ran into trouble; we didn't have enough downtime to spool the write cache out, so eventually it filled up.

Adding Optane into the mix, especially at a nominal capacity like 32 GB, isn't likely to help much. It'll extend your write buffer by a little bit, and maybe that's enough to get you over the hump (maybe you write data in 50 GB bursts with two hours of downtime in between during which there is zero activity). But it's risky.
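As a rough illustration of that "over the hump" math (the destage rate here is purely an assumption; background rewrite speeds vary a lot between drives and workloads):

Code:
# Idle time needed to drain one burst out of the cache into the shingled area
$burst       = 50GB
$destageRate = 30MB    # assumed bytes/second the drives can fold back in the background
[math]::Round(($burst / $destageRate) / 60, 1)   # roughly 28 minutes of quiet time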

On the other hand, non-SMR disks have come down dramatically in price. We originally bought the Archives because they were $250 for 8TB and the closest PMR 8TB drives were $400+. Now, new non-SMR 8TB drives are less than that and very competitive with the Archives in price, so the Archives don't make much sense anymore. If they pop out 12/14 TB Archives for $250 and establish price/TB superiority again, then maybe it's worth considering, but not until then.
 
The write scenario you described is exactly the kind of workload I have in mind.

I'm a photographer. I sometimes download 200GB of images from memory cards all at once, and then need to get into editing them. The RAW files are perfect candidates for a read cache because the data in them doesn't change; the database that contains the edits lives on a different volume. The write cache would help smooth out performance during large download operations, and there's time for the whole pool to catch up once all the files are done. Also, during a bake of a set, where we're loading RAWs, applying transforms in memory, and creating high-res JPEGs from them, the pool will once again see a burst of activity, but it's more like 2-10MB JPEGs flowing out super fast, and that write cache would hold all of them at once without crippling the disks.


For a server-type workload, this isn't going to fix SMR. I'm not talking about that use case at all.

But I think I can get 5TB usable, with SSD-like performance, out of a well-cached pool like I'm describing. And if it turns out not to work, I won't be too upset.
 
I don't know enough about the technical side of things, but I see a problem.
We assume the type of cache you are implementing, be it a regular SSD, XPoint, or even RAM, is a FIFO.
The RAM part behaves this way, I'm sure: it slowly fills up as more and more data comes in, and writes the received data out to disk.

I don't believe SSDs work this way, not by themselves.

I would say you need the storage layer to do this abstraction for you. By that I mean: write primarily to the SSD, and only start writing to the disks as the SSD fills up (since reading back from the SSD to write to the HDD reduces write performance).
This would work perfectly for your scenario, but the storage layer would need to know that the amount of data about to be written will fit entirely into the cache SSD.
And it should only begin writing to the HDD once the SSD receives no more writes.
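A crude way to picture that behavior at the file level (just an illustration of the idea, with made-up drive letters and paths, not a real storage-layer implementation):

Code:
# Land incoming files on the fast disk, then destage to the slow pool when idle
$landing = "O:\landing"     # Optane volume (assumed drive letter)
$archive = "S:\archive"     # mirrored SMR volume (assumed drive letter)

# Ingest: new files go to the fast landing area first
Copy-Item -Path "E:\card\*.CR2" -Destination $landing

# Destage: later, when the pool is quiet, move everything across in one sequential pass
Move-Item -Path "$landing\*" -Destination $archive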
 
I hear what you're saying. In other filesystems, the behavior you're describing might be possible, but in Storage Spaces + ReFS I don't think I can tune the pool to work exactly that way. However, I also think it's no big deal to test it out.

Here's what I need. I need someone experienced with configuring Storage Spaces via PowerShell to walk me through how to configure a specific SSD to be a write cache for the pool. I know for certain this can be done, but I don't know the exact calls.

I can destroy and rebuild the pool as necessary. I just don't want to consume the Optane stick as a read cache, because I intend to use a SATA SSD for that role.
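For what it's worth, here's my best guess at the shape of those calls, untested; the "SMRPool" name, the Optane friendly-name match, and the 30GB cache size are all assumptions:

Code:
# Mark the Optane stick as a journal disk (the dedicated write-back cache device)
$optane = Get-PhysicalDisk | Where-Object { $_.FriendlyName -like "*Optane*" }
Set-PhysicalDisk -InputObject $optane -Usage Journal

# Create the mirrored space with an explicit write-back cache size
# (a two-way mirror may insist on two journal disks so the cache itself stays redundant)
New-VirtualDisk -StoragePoolFriendlyName "SMRPool" -FriendlyName "CachedMirror" `
    -ResiliencySettingName Mirror -Size 4TB -WriteCacheSize 30GB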
 
I wanted to follow up on this. I found some instructions on how to create Storage Spaces tiers manually using PowerShell. I've run a preliminary test of my mirrored storage space and saw approximately 60-70MB/s write speeds and 150-200MB/s read speeds. That's not amazing, but not horrific. I want to add my Optane stick as a fast tier, which I'll be experimenting with today.
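If anyone wants to reproduce or sanity-check numbers like these, a diskspd run along these lines (Microsoft's free I/O benchmark) is what I have in mind; the drive letter, file size, and flags are just examples:

Code:
# 64K writes for 60 seconds, 2 threads, queue depth 8, caches bypassed, against a 20GB test file
.\diskspd.exe -b64K -d60 -t2 -o8 -w100 -Sh -c20G S:\testfile.dat
# Read pass against the same file
.\diskspd.exe -b64K -d60 -t2 -o8 -w0 -Sh S:\testfile.dat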

I'm also thinking about adding a SATA SSD tier in between, per my original idea. I just haven't installed the disks into the rig. (To be honest, this project stalled because of a GPU problem, which I resolved just this week.)
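The general shape of the tiering commands should be something like this (tier names, sizes, and the drive letter are placeholders, and parameter availability can differ between Windows 10 builds):

Code:
# Check what the pool thinks each disk is before defining tiers
Get-PhysicalDisk | Format-Table FriendlyName, MediaType, Usage, Size

# Define a fast (Optane/SSD) tier and a capacity (SMR HDD) tier inside the pool
New-StorageTier -StoragePoolFriendlyName "SMRPool" -FriendlyName "FastTier" `
    -MediaType SSD -ResiliencySettingName Simple
New-StorageTier -StoragePoolFriendlyName "SMRPool" -FriendlyName "SlowTier" `
    -MediaType HDD -ResiliencySettingName Mirror

# Carve a tiered ReFS volume out of the two tiers
New-Volume -StoragePoolFriendlyName "SMRPool" -FriendlyName "TieredSpace" `
    -FileSystem ReFS -DriveLetter T `
    -StorageTierFriendlyNames "FastTier", "SlowTier" `
    -StorageTierSizes 28GB, 4TB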
 
Turns out, to do what I want to do, I need 2x Optane sticks installed in my system, and I'm out of M.2 ports. So now I'm thinking I'll continue this experiment on my Threadripper build, where I have more PCIe lanes and M.2 slots.

I know I could get a PCIe-to-M.2 adapter card, but that would put each Optane stick on a different link (one would be PCIe 3.0 and the other 2.0).
 
Have you looked at DrivePool? It allows you to create a landing disk, which could be your Optane stick, and then migrates the data to the HDDs during idle time. While it may not be the best option, it might allow you to test the benefit. According to the DrivePool website, it uses a plug-in:
SSD Optimizer

Author: Covecube Inc.

  • With this plug-in you designate one or more disks as SSD disks.
  • SSD disks will receive all new files created on the pool.
  • It will be the balancer's job to move all the files from the SSD disks to the Archive disks in the background.
  • You can use this plug-in to create a kind of write cache in order to improve write performance on the pool.
  • Optionally, you can also set up a file placement order for your archive disks, so that they will be filled up one at a time.
 