Need new drives to upgrade our 10K SAS 6Gbps 2.5" hot-plugs; recommendations?

RavinDJ

We have a Dell PowerEdge R720, and we currently have three (3) 900GB 10K RPM SAS 6Gbps 2.5in Hot-plug Hard Drives in RAID 5 for a total of 1.8TB of storage. The server has 32GB of RAM and an Intel Xeon E5-2620 CPU @ 2.00GHz.

We would like to upgrade our drives for more storage and could use an increase in speed as well. Looking for a total of 2.5-3.0TB.

We have the 2.5" Chassis with up to 8 Hard Drives and a Dell Perc H710 RAID controller.

I would like to add new drives at a reasonable price (not looking to go under $400 but not looking to go over $4,000).

Any recommendations on which drives, what number of those drives, and in what RAID configuration to get?

Any help will be greatly appreciated.

Thanks!!!
 
Move to consumer SSDs. If you're looking for ~3TB of space, then I would add 4x 1TB Samsung 860 EVO SSDs in RAID 5. That'll give you 3TB of capacity (2.73 TiB as the OS reports it), and compared to any spinning disk it will seem *lightning* fast. The 1TB drives are currently ~$150 each, so that puts your upgrade in the $600 range. You also might have to buy one drive sled, because a lot of Dells come with 'dummy' trays in the slots that aren't populated.

Obviously if you need more space, then you can just buy more drives - if you spend $1200 you can get 8x 1TB SSDs and run them in RAID6 for extra redundancy.
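
If you want to sanity-check the capacity and cost math, here's a rough sketch (Python, purely illustrative; the ~$150/drive figure is just the street price mentioned above, and the layouts are the standard ones):

Code:
# Rough RAID capacity / cost calculator - illustrative only.
def usable_tb(drives, size_tb, level):
    """Approximate usable capacity for a few common RAID levels."""
    if level == "raid5":
        return (drives - 1) * size_tb      # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * size_tb      # two drives' worth of parity
    if level == "raid10":
        return drives * size_tb / 2        # everything is mirrored
    raise ValueError(level)

for n, level in [(4, "raid5"), (8, "raid6"), (8, "raid10")]:
    tb = usable_tb(n, 1.0, level)
    tib = tb * 1000**4 / 1024**4           # what the OS will report
    print(f"{n}x 1TB {level}: {tb:.1f} TB usable ({tib:.2f} TiB), ~${n * 150} in drives")
# 4x 1TB raid5: 3.0 TB usable (2.73 TiB), ~$600 in drives
# 8x 1TB raid6: 6.0 TB usable (5.46 TiB), ~$1200 in drives
# 8x 1TB raid10: 4.0 TB usable (3.64 TiB), ~$1200 in drives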

My company has put a ton of 850 and 860 EVO drives in R610/620/710/720 servers, and they're all chugging along to this day.

If you have any questions, I'd be happy to answer.
 

Thanks soooo much, Sinister!! I was doing some research online, and many posts said RAID 5 is bad and that I should go with RAID 10. Was that only the case with spinning disks, or should I still spend a little more and go RAID 10?

I definitely need tray caddies because we have the 'dummy' trays in the 5 empty slots. But what do you mean "one drive sled"? You mean the caddies?

In the photo, it's the R720 on top.
IMG-1539.JPG
 
Thanks soooo much, Sinister!! I was doing some research online, and many posts said RAID 5 is bad and that I should go with RAID 10. Was that only the case with spinning disks, or should I still spend a little more and go RAID 10?
RAID5 is 'bad' because as capacity increases you run the risk of suffering a second drive failure during a rebuild operation.
RAID10 is faster than RAID5 or 6 because the math is simpler for the RAID controller to perform, since it doesn't involve calculating parity. It also avoids the parity write penalty, which affects RAID5 and doubly affects RAID6.

With all that said, at the capacity levels you're discussing, and given that you're considering SSDs instead of mechanical drives, RAID5 would still be fine.
If you move to 8 drives, I would use RAID6. RAID6 fully mitigates RAID5's potential for failure during a rebuild, while only sacrificing one additional disk's capacity to parity data. It'll be slower than RAID10 (while still being way faster than what you have now), but you'll have more capacity than RAID10, and I personally prefer the more predictable fault tolerance of RAID6: with RAID10, if you get extremely unlucky, a second drive failure can nuke your array, whereas RAID6 always tolerates two failed drives.

An 8-drive RAID10 array would (typically) consist of four RAID1 pairs striped together. In that arrangement you could potentially survive up to four simultaneous drive failures, as long as each failure landed on a different RAID1 pair; but if you got super unlucky and lost both drives from any one pair, the entire array would be lost.
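
To put a number on the 'extremely unlucky' part, here's a quick illustrative sketch (nothing official, just counting combinations) of which two-drive failures an 8-drive array survives, assuming a standard four-pair RAID10 layout:

Code:
# Count how many 2-drive failure combinations kill an 8-drive array.
# RAID 6 tolerates ANY two failures; RAID 10 (four 2-drive mirrors)
# only dies if both failures land in the same mirror pair.
from itertools import combinations

drives = range(8)
mirrors = [(0, 1), (2, 3), (4, 5), (6, 7)]   # assumed RAID 10 pairing

pairs = list(combinations(drives, 2))
fatal_r10 = sum(1 for pair in pairs if pair in mirrors)

print(f"RAID 6 : 0 of {len(pairs)} two-drive failure combos are fatal")
print(f"RAID 10: {fatal_r10} of {len(pairs)} two-drive failure combos are fatal "
      f"({fatal_r10 / len(pairs):.0%})")
# RAID 6 : 0 of 28 two-drive failure combos are fatal
# RAID 10: 4 of 28 two-drive failure combos are fatal (14%)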


I definitely need tray caddies because we have the 'dummy' trays in the 5 empty slots. But what do you mean "one drive sled"? You mean the caddies?
Sorry, I was using tray / caddy / sled interchangeably. I meant you might have to buy one if you put in 4x SSDs to replace your 3x existing drives, because you currently only have 3 trays/caddies/sleds. Obviously, in reality, you need to buy as many trays as the number of drives you want to add :)
 
RAID 6 does not fully mitigate the risk; it can fail as well, it's just a smaller chance with two drives of tolerance vs one. It also must read all the other drives to rebuild the array, and on very large drives that takes some time. With RAID 10 you *could* lose two drives and be down, but it would have to be 2 specific drives; any other 2nd drive failure and it can still function. Also, it only needs to rebuild from one mirror rather than hammer all the disks, which means less to rebuild. I run RAID 10 in my home server (Dell R710) and get around 1GB/s sustained (sequential) read speeds from my 5.4K RPM drives. I have 6 500GB drives; if I lose one, I need to read/rebuild 500GB. If I put them into RAID 5, I need to read 2.5TB to rebuild 500GB of data. My array is small enough that RAID 5 or 6 would have been OK, but it was mostly for playing and learning.

That said, depending on how critical your data is, 4 2TB SSDs in RAID 10 will give you 4TB with read/write speeds near 2GB/s and great random I/O as well. You would also have room to upgrade if needed. I don't think you're likely to have too many issues with such small drive counts, but RAID 6 with 4 drives doesn't net you anything over RAID 10.
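
As a rough sketch of the rebuild-read point, using the 6x 500GB example above:

Code:
# How much data has to be read back to rebuild one failed drive.
# Rough sketch using the 6x 500GB example above.
drives, size_gb = 6, 500

reads_gb = {
    "RAID 10": size_gb,                  # only the failed drive's mirror partner is read
    "RAID 5/6": (drives - 1) * size_gb,  # every surviving member is read to recompute it
}
for level, gb in reads_gb.items():
    print(f"{level}: read ~{gb} GB to rebuild one {size_gb} GB drive")
# RAID 10: read ~500 GB to rebuild one 500 GB drive
# RAID 5/6: read ~2500 GB to rebuild one 500 GB drive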
 
RAID 6 does not fully mitigate

On a smart RAID controller, it 100% does.

The reason is this: RAID 5 rebuilds can fail because the mathematical chance of an unrecoverable read error (URE) starts getting pretty high as the capacity of an array goes up. All drives have a rated URE rate, and if you take the mathematical chance of any particular read failing and combine it with the need to read all the data from an array of big disks, your odds of running into a URE get uncomfortable. With RAID 5, if a URE is encountered during a rebuild, the rebuild will fail, because without the data from the rest of the drives the missing drive can't be reconstructed - in effect you've lost a tiny bit of a second drive, thus violating RAID 5's 1-drive fault tolerance.

Given an intelligent RAID controller in an array missing a single disk, RAID 6 fully mitigates this. Here's why.

If you're rebuilding a RAID 6 array that's missing one disk and you encounter a URE, you can keep on trucking. Whichever drive hit the URE can be 'worked around', because the controller can reconstruct the data the URE missed by polling the other drives and using the second set of parity.

Now then, there are two kinds of RAID controllers here: smart and dumb. Dumb controllers, upon encountering a URE, will drop a disk from the array because it encountered a failure. In the context of RAID 5 this sort of made sense, because there's nowhere else to recover that bit of data the URE messed up. In RAID 6, though, this doesn't make sense, because a URE on a single bit of data is not an indicator that the other 99.999% of data on the drive has anything wrong with it, so dropping the drive is ridiculous. A smart controller, upon encountering a URE during a RAID 6 rebuild, will simply use the remaining parity in the array to correct it. In order for a single-drive RAID 6 rebuild to fail, you would need to get *two* UREs (mathematically unlikely) and, assuming a smart controller, both UREs would need to affect the same bit of data being read - in other words, the second URE would need to happen *while attempting to correct for the first URE*. That scenario - a second URE when correcting for the first one - is functionally a 0% chance of happening.

RAID 6 with two fully dead drives is in the same position as RAID 5 with one dead drive - just as likely to die during a rebuild. RAID 6 with 1 dead drive, given a smart controller, should have essentially a 100% chance of a successful single-drive rebuild.
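
If you want to put rough numbers on those URE odds, here's a back-of-the-envelope sketch; the 10^-14 and 10^-15 rates are typical datasheet figures for consumer and enterprise drives, and the array sizes are just examples, not anyone's exact setup:

Code:
# Back-of-the-envelope odds of hitting at least one URE while reading an
# entire degraded array during a rebuild. URE rates are typical datasheet
# figures (1e-14 per bit for consumer HDDs, 1e-15 for many enterprise parts).
import math

def p_at_least_one_ure(read_tb, ure_rate_per_bit):
    bits_read = read_tb * 1e12 * 8                        # decimal TB -> bits
    return 1 - math.exp(-bits_read * ure_rate_per_bit)    # Poisson approximation

for read_tb, rate in [(2.7, 1e-14), (2.7, 1e-15), (12.0, 1e-14)]:
    p = p_at_least_one_ure(read_tb, rate)
    print(f"read {read_tb:4.1f} TB at URE rate {rate:.0e}: ~{p:.0%} chance of hitting a URE")
# read  2.7 TB at URE rate 1e-14: ~19% chance of hitting a URE
# read  2.7 TB at URE rate 1e-15: ~2% chance of hitting a URE
# read 12.0 TB at URE rate 1e-14: ~62% chance of hitting a URE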
 
It highly reduces the chance of complete failure, but doesn't eliminate it completely (just as RAID 0 or 10 doesn't). RAID 10 is still considered safer, and unless you have a lot of drives you don't lose much capacity vs RAID 6, while gaining speed. With many drives, RAID 6 will give you more space at the expense of speed, and it's more likely to encounter issues while rebuilding. A 12-drive RAID 6 array with 500GB drives would need to read over 5TB worth of data to rebuild itself, which is slow and leaves a good amount of time where a second failure could (unlikely) occur. RAID 10 only needs to read 500GB to rebuild, so it will rebuild in roughly a tenth of the reads, meaning a faster rebuild and less chance of a second drive failure during it. Again, we're talking highly unlikely events, but these things can come into play depending on the needs.

I've seen things like RAID 50 and RAID 60 for similar reasons. Chances of 2 RAID 5 arrays to both lose 2 disks each is low, chance of 2 raid 6 arrays losing 3 drives each is even lower. Anyways, this really didn't seem like much of a concern for the OP, so I'll leave it at this unless you want to discuss further. Large companies will even try to use drives from different batches just to help reduce the chances of them dying at the same time, because two drives from the same batch with the same mechanical wear have a higher chance of failing at the same time. Again, this doesn't seem pertinent in this case; sorry if I sidetracked this post too much.
 
I cannot thank you enough guys!! I was confused by what to get... and Dell is charging some obscene $800+ for a single 1.2TB spin disk... I mean, WTF?!?!?

After reading your posts (twice, actually), I decided to go with four 2TB Samsung 860 EVO drives in a RAID10 configuration. This allows me to leave the original three 900GB 10K drives in place. That way, I can back up and restore the data to make sure I don't screw anything up.

I'll keep you posted on the results and I'll post the BEFORE and AFTER results from ATTO :D
 
Sounds great! You'll like the performance for sure! And still significantly under your original maximum budget lol. Dell's prices for storage are *ridiculous* full stop.
 
Just wanted to thank again all you guys for the help!

I finally installed the drives a few weeks ago... haven't had the chance to run Atto and post here due to COVID.

Look at the BEFORE and AFTER!!

BEFORE: Three (3) 900GB 10K RPM SAS 6Gbps 2.5in Hot-plug Hard Drives in RAID 5 for a total of 1.8TB of storage

atto_sas.jpg


AFTER: Four (4) 2TB Samsung 860 EVO drives in RAID 10 for a total of 3.63TB of storage

atto_ssd.jpg


NIGHT and DAY difference!!!

Thank you again for the help with the recommendations!!
 
Chances of 2 RAID 5 arrays to both lose 2 disks each is low, chance of 2 raid 6 arrays losing 3 drives each is even lower

For some reason, I was reading over this thread again and noticed something I missed before. RAID 50 and RAID 60 are RAID 0 arrays comprised of two (or more) RAID 5 or RAID 6 arrays, respectively. The simplest RAID 50 array is two 3-disk RAID 5 arrays, so six disks total, and the simplest RAID 60 is two 4-drive RAID 6 arrays, so 8 drives total.

The part I missed before is Ready4Dis's stated failure scenario: "2 RAID 5 arrays to both lose 2 disks each" - that's not how RAID 50 dies. RAID 50 dies when either (not both) of the RAID 5 arrays loses two drives, because when you lose one of the components in a RAID 0 array the entire array is lost. For RAID 60 it's similar; if you lose 3 disks from either of the sub-arrays, then the entire RAID 60 array will fail. This means RAID 50 is still more reliable than RAID 5, since it stands a chance of sustaining multiple disk failures without losing the entire array, but it's not guaranteed to be able to sustain multiple disk failures; toss in some bad luck and as few as two dead disks can doom a RAID 50 array.
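
Sticking with the simplest 6-disk RAID 50 above, here's a quick illustrative count of how many two-drive failure combinations are fatal:

Code:
# For the simplest RAID 50 (two 3-disk RAID 5 legs striped together),
# count which 2-drive failure combinations take the whole array down.
# Fatal = both failures land in the same RAID 5 leg.
from itertools import combinations

legs = [{0, 1, 2}, {3, 4, 5}]   # assumed layout of the two RAID 5 legs

pairs = list(combinations(range(6), 2))
fatal = sum(1 for a, b in pairs if any({a, b} <= leg for leg in legs))

print(f"{fatal} of {len(pairs)} possible two-drive failures kill a 6-disk RAID 50")
# 6 of 15 possible two-drive failures kill a 6-disk RAID 50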

Obviously none of this ended up being relevant since Ravin went with RAID10, but it's still good knowledge to have.
 
Man, I missed that completely, lol. I meant to say RAID 51 and 61!!! So mirrored RAID arrays, not striped. Good catch, my apologies for the mistake. I still had RAID 10 in my head.
 
I don't know why you'd use consumer drives when enterprise drives like the Micron 5300 series drives are so inexpensive to begin with. I use RAID 5'd Microns in my backup server; I have 14 of them in two arrays. If a drive fails, I'll pause backups and let it rebuild; it doesn't take that long when it's an SSD.
 
I don't know why you'd use consumer drives when enterprise drives like the Micron 5300 series drives are so inexpensive to begin with.

For me, three reasons:
1. Your 5300 series drives are relatively inexpensive, but still not as cheap as the 860 EVO series of equivalent capacity.
2. From a practical performance and reliability standpoint, there is no advantage for the enterprise drives in many usage profiles. Your 5300 drives have double the MTBF and higher write endurance, but common usage scenarios will not stress the specs of the 860 EVO drives (see the rough endurance sketch below the list). When you're already tall enough to ride the ride, being extra tall isn't of much value.
3. I can drive down the street to Microcenter and buy an 860 EVO; I can't for a Micron.
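
On point 2, here's a rough endurance sketch; the ~600 TBW figure is the commonly quoted rating for a 1TB 860 EVO, so treat it as an assumption and check the datasheet for your exact model and capacity:

Code:
# Rough check of how long a drive's TBW rating lasts at a given daily
# write volume. 600 TBW is the commonly quoted rating for a 1TB 860 EVO -
# treat it as an assumption and verify against the datasheet.
TBW_RATING = 600

def years_at(gb_per_day):
    return TBW_RATING * 1000 / gb_per_day / 365

for daily_gb in (50, 200, 1000):
    print(f"{daily_gb:5d} GB written/day -> rating lasts ~{years_at(daily_gb):.0f} years")
#    50 GB written/day -> rating lasts ~33 years
#   200 GB written/day -> rating lasts ~8 years
#  1000 GB written/day -> rating lasts ~2 years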
 
Doesn't Microcenter have a 2 drive limit on them though? I guess you can always go back a few times if you're lucky enough to have a store near you. https://www.microcenter.com/product...-sata-iii-6gb-s-25-internal-solid-state-drive
Bulk discounts from CDW bring the 5300 Pro to price parity to the 860 EVO: https://www.cdw.com/product/micron-5300-pro-solid-state-drive-3.84-tb-sata-6gb-s/5813370?pfm=srh
If you don't need the endurance, the 5210 Ions are much cheaper: https://www.cdw.com/product/micron-5210-ion-solid-state-drive-3.84-tb-sata-6gb-s/5359199?pfm=srh
And if you need them next day, Amazon Prime has you covered and you don't need a CDW account to negotiate a discount: https://smile.amazon.com/Micron-5200-Solid-State-Drive/dp/B07JQDBX87/
 
I don't know about Microcenter's drive limits; the on-site availability is useful in a failure scenario (typically only 1 drive) rather than when I'm buying new disks. I live around Houston so we're lucky enough to have one! The Samsungs are sold in Best Buy too though, so they are generally available at retail across the country.

I'd certainly be willing to give the Microns a try, but I'm pretty sure the difference between the two would be small, and thus far I've just gone with what I have personal experience with. I only know what I know, and until this thread I was barely aware of either of those series of drives. Maybe I'm underinformed, but I've always been satisfied with the performance, price, and reliability of the drives I've used so far, so I haven't had a reason to look elsewhere.
 
Also... not to drag this out, but... are those numbers right on the RAID 5 configuration for WRITE speeds?!?!? 35/36/37 MB/s??? That's so slow...
 
Also... not to drag this out, but... are those numbers right on the RAID 5 configuration for WRITE speeds?!?!? 35/36/37 MB/s??? That's so slow...

There's something wrong; what are your controller's settings for that drive?
 
There's something wrong; what are your controller's settings for that drive?

I know!!! Right?

Dell_OpenManage.PNG


But where would I look into the settings? It's actually not an issue anymore because we don't use those drives; we just use the SSDs. I might as well just format them, wipe them, and remove them from the server.

But still... it's interesting. I wonder if it's a setting, or something with the drives, or an option I selected while setting up the RAID 5. Weird, though... we'd been using those drives for a few years already.

Needless to say, the SSDs are like NIGHT and DAY.

[edit]
Went into the logs and saw this:

2095 | Fri Apr 3 03:38:03 2020 | Storage Service | Unexpected sense. SCSI sense data: Sense key: B Sense code: 0 Sense qualifier: 0: Physical Disk 0:1:6 Controller 0, Connector 0

There are A LOT of them. Thoughts?
[/edit]

[edit 2]
Physical Disk 0:1:6 is the new SSD, anyway... so the above log doesn't explain the slowness of the SAS drives.
[/edit 2]
 
Do you have write-back cache enabled on that controller? It looks like you have a battery, so you'd want that on. How do the individual drives perform outside of the array?
 
The status of the write-back cache is the most likely explanation for the slow write speed, and that setting is likely set per-virtual drive and may be enabled on the SSD array and not on the HDD array.

With that said, a big reason the SSDs will be faster has nothing to do with raw read or write speed; it's latency, which will be orders of magnitude better on SSDs than on HDDs. Lots of reads and writes to disk are very small amounts of data, and on SSDs the seek time is functionally zero because there's no spinning physical disk that has to be manipulated to find any particular bit of data. For most folks, this latency reduction is a bigger part of the perceived speed upgrade from moving to an SSD than the raw throughput is.
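
As a rough illustration of the latency point (ballpark figures, not measurements from this server):

Code:
# Why latency, not throughput, dominates the "feel" of an SSD upgrade:
# time to service a burst of small random reads at queue depth 1.
ops = 10_000
per_op_seconds = {
    "10K RPM HDD (~7 ms seek + rotate)": 0.007,    # assumed ballpark figure
    "SATA SSD (~0.1 ms)":                0.0001,   # assumed ballpark figure
}
for name, t in per_op_seconds.items():
    print(f"{name}: ~{ops * t:.0f} s to service {ops} random reads")
# 10K RPM HDD (~7 ms seek + rotate): ~70 s to service 10000 random reads
# SATA SSD (~0.1 ms): ~1 s to service 10000 random reads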
 
Thanks for the heads up! I'll check on the write-back cache. I don't know what that setting is, and I can't check it remotely... at least I don't think I can. Is there a way to check it other than rebooting the server and going into the controller menu via CTRL-S or CTRL-M or whatever the key combination is to enter the controller setup?
 
You should be able to install the MegaRAID Storage Manager and view/manage the array from within Windows, assuming you're running Windows. The Dell PERC cards are rebadged LSI controllers, so the LSI management utilities tend to work on them.
 
Should look something like this with MegaRAID. This is four 8TB WD Reds in RAID 5; there's no reason your SAS drives shouldn't outperform them in writes.


Screen Shot 2020-04-10 at 12.38.58 PM.png
 
Dell_OpenManage2.PNG


Here's the info from Dell OpenManage. But I'll download MegaRAID, as well.

"Write Policy" is Write Through for the old RAID 5. It's Write Back for the new SSDs.

[edit]
Never knew about MegaRAID. Thanks, guys!!!
[/edit]
 
Write through is secret RAID controller code for "disable the write cache"

So that'll explain the piss poor performance of the old RAID 5. Assuming you have a battery backup, go ahead and set the policy to Write Back if you're going to keep the old RAID 5 array. It'll perform much better!
 
CURRENT SETTINGS:

Dell_OpenManage3.PNG


These are my options:

Code:
Read Policy:
    Read Ahead
    Adaptive Read Ahead
    No Read Ahead

Write Policy:
    Write Through
    Write Back
    Force Write Back

Disk Cache Policy:
    Disabled
    Enabled

What are the best options?

Also, what do all the options mean? How did you learn all this? I'm more and more interested in data, storage, etc. Would love to learn more... online course, books, or learn by doing?

Thanks again!!

[edit]
Battery looks okay:
Dell_OpenManage4.PNG

[/edit]
 
On the write policy, your three options:
1 - Write Through. This bypasses the write cache entirely and sends data straight to disk.
2 - Write Back. This uses the write cache, as long as the battery is connected and in good shape.
3 - Force Write Back. This forces the write cache on, even if there is no battery or the battery is bad.

The battery exists because the write cache is stored in RAM on the RAID controller, and memory contents are lost on power loss. If the array is in the middle of writing some data and power is suddenly lost, there might be data the OS thinks has been written - stored in the cache - that hasn't actually been put on disk yet. If power is lost, that data will be lost, and you may experience some data loss.

The battery fixes this by keeping the cache powered up so it doesn't lose its data, and when you power the server back on, the controller completes the final writes properly.
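
If it helps to picture the difference between the policies, here's a tiny conceptual sketch; this is just the idea, not how the PERC firmware actually works:

Code:
# Toy model of the three write policies - purely conceptual.
class ToyController:
    def __init__(self, policy, battery_ok):
        self.policy = policy
        self.battery_ok = battery_ok
        self.cache = []

    def write(self, block):
        caching = (self.policy == "Force Write Back" or
                   (self.policy == "Write Back" and self.battery_ok))
        if caching:
            self.cache.append(block)   # ack now, flush later; the battery protects the cache
            return "acked from cache (fast)"
        return "flushed to disk before ack (slow but always safe)"

print(ToyController("Write Through", battery_ok=True).write(b"data"))
print(ToyController("Write Back", battery_ok=True).write(b"data"))
print(ToyController("Write Back", battery_ok=False).write(b"data"))  # no battery -> behaves like Write Through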
 
That explains why the HDDs were going super slow. Disk Cache disabled and "Write Through" is the safe way to do it, as it practically guarantees no data loss (up to a point).

Interesting that it defaulted to Write Cache Enabled and "Write Back" on the SSD RAID array; maybe that's because RAID 10 was used (or because of the way the RAID 5 was originally set up).
 

It's also possible that the battery was not working, or was in a relearn cycle or some such, at the exact moment the RAID 5 array was created, which would change the defaults.
 
Agreed, or it was created in the boot-up RAID BIOS menu, where the defaults might be the safer settings. (Some controllers warn when you try to enable Write Cache and Write Back if there is no battery; some just flat-out gray out the option with a note saying the battery is missing.)
 