Confirm configuration for 2950/MD1000/ZFS

med

Hi all- first post, been lurking and admiring for some time and thought this was a good time to join in. I'm not super well-versed in hardware let alone large storage solutions, so I figured this would be a good way to learn some new tricks.

Currently I have an 8-port Dell PE 2950 and a PowerVault MD1000, which came with a PERC 5E raid card. I've tested it out with a handful of disks on hand and everything seems to be working as expected. This is all for a media server build, so speed really is not an issue-- it's all about capacity and failure tolerance for me. The 2950 is running Ubuntu Server 14.04 LTS since I'm fairly familiar with it.

Originally, I thought I'd fill the MD1000 with 2TB drives in a raid-10 off the 5E card, but if at all possible I'd love to go with 4TB drives in Z3. The problem is of course that the 5E supports neither 4TB drives nor JBOD for zfs. Last week, before deciding to go for ZFS, I went ahead and picked up a PERC H800, which would do 4TB drives but is still hardware raid, not JBOD. From what I've read, the H800 could only do a series of 1-disk RAID-0 vdisks, which is not only not the right thing for ZFS, but might actually invalidate the benefits of it. So scratch that. From what I can tell, two options might actually attain what I'm looking for: a PERC H200e (which can do JBOD for sure, though I can't determine 4TB compatibility?) or an LSI 2008 HBA, which I believe would check all the boxes for me, but tbh I'm not entirely sure which specific one would be the right choice for the 2950.

For disks, I'm thinking WD Red 4TB, though a buddy of mine recommended the comparable HGST 4TB based on its reputation for a low failure rate. (I think from Backblaze's most recently published reports.) As long as neither is a terrible decision (I have read to avoid Green drives for disk arrays), I think I'm good there.

Interposers. What the hell do they do? I have read conflicting reports that interposers are required / not required for SATA disks, and that they do or do not limit recognition of disk capacity down to 2TB in the MD1000. I figured worst-case scenario, I could go ahead and buy a couple of 4TB disks and a couple of interposers, try it both ways, and see which one works out properly-- unless you guys know for sure.

Configuration... 14-disk z3 plus a hot spare = 44 TB usable? It gets a little fuzzy for me when trying to determine how this all relates to block size and what constitutes a good idea vs a bad idea. I've seen a few setups on here with 11-disk z3 plus hot spare, but wouldn't quite be able to articulate why that would be better or worse than a 14-disk setup.

The OS on the 2950 is on a 2-disk raid 0-- my other tinge of paranoia was using software-based ZFS and then having the OS disk(s) go belly-up on me, thereby losing the config for the zfs pool. Is there any way to abate this risk?
Last but not least, the 2950 has dual quad-core Xeons (max spec, I don't remember precisely atm) but only 8GB of RAM, so I'm unsure if that will be sufficient for what will effectively become a NAS, or if I should look at more. Probably will.

I feel like I've done most of my homework at this point, but can't quite confirm compatibility on the setup without some advice. Muchos thanks for any help pointing me in the right direction.

Included potato-phone picture for reference. Why the hell is it so dusty? Why is that blue LED so damn bright? I have no idea in either case. Cheers.

 
From my quick look at the MD1000 spec sheets, it's an external 15x 3.5" drive enclosure, connected via SAS, with an internal expander chip. Herein lies the problem, I think: that internal expander may not handle disks larger than 2TB (maybe that can be fixed by a firmware update, but I doubt it, because then there wouldn't be so many of them showing up on the second-hand market).
The same problem applies to the PERC 5i, 5e, 6i and 6e: they're all capped at 2TB per disk, as they don't have enough register bits to store addresses beyond the 2TB boundary.

The H800, like every hardware RAID controller, is usable in principle, as long as you know that you lose the plug-and-play/hotplug capability.
On a replacement, you would need to create a new 1-disk RAID-0, so you'd either need the controller's management software on your storage server or a reboot to do the task in the controller BIOS (so not really hotplug).

The controller chip from LSI is the 2008, which comes in tons of flavors; two of them are the H200 and the H200e
(LSI SAS 2008 RAID Controller / HBA Information).
The older LSI 1078 chip also has the 2TB-per-drive limit.

As for disks, the WD Green's main problem is vibration in a 15-bay enclosure; they are simply not built for that. The WD Red, if I recall correctly, is spec'd for up to 8 disks in one enclosure. (I personally run 16 of them in one Supermicro 836 case ;) )

In principle, interposers are not needed, as every SAS backplane can work with SATA drives... except, of course, that there are some firmware "tweaks" from Dell to force you to use "certified" Dell hardware or some such.
So maybe you really do just have to try it, as a different backplane firmware could make a difference.

As for configuration, most users here just follow ZFS best practices.
So, to not waste storage space and to get maximum performance, it is best to use a power of 2 for the number of data disks (2, 4, 8 or 16 data disks works out to RAIDZ1 vdevs of 3, 5, 9 or 17 disks, RAIDZ2 vdevs of 4, 6, 10 or 18 disks, and RAIDZ3 vdevs of 5, 7, 11 or 19 disks).
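
If you want to sanity-check those widths, the arithmetic is just data disks plus parity:

```python
# Vdev width = power-of-2 data disks + parity disks.
PARITY = {"RAIDZ1": 1, "RAIDZ2": 2, "RAIDZ3": 3}

for level, parity in PARITY.items():
    print(level, [data + parity for data in (2, 4, 8, 16)])
# RAIDZ1 [3, 5, 9, 17]
# RAIDZ2 [4, 6, 10, 18]
# RAIDZ3 [5, 7, 11, 19]
```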

ZFS stores all configuration regarding the pool on the pool itself. So if you f*** up your system, install a new one, import the pool and you are good to go. (Of course only the storage access is restored; you still have to reinstall the OS, services and applications on the new OS instance.)

As long as you don't do deduplication, 8GB is more than enough. Just try it and keep an eye on your utilization charts. CPU-wise, those Xeons are monsters for a simple NAS appliance.
 
Good stuff, thanks for the reply. After looking at it closer I discovered what you mentioned above, that the H200e is an LSI 2008-based chip, which is apparently the way to go. I'll pick one of those up. I know the H800 and various other Dell cards had the "Dell-only disks" restriction removed in later firmware updates (once they were old enough to fall out of Dell support), but of course I won't be using that one, so it's a bit of a moot point.
As far as the MD1000 itself supporting 4TB goes, I'm pretty sure it can (?), if only because I seem to recall seeing some setups configured that way. For example, a guy on eBay is selling something pretty close to what I'm talking about building (MD1000/H800/4TB disks), already preconfigured-- though his would have to be hardware RAID and not ZFS. Since I don't have enough posts I don't believe I can post links, but it's item #221265161885.

I'll try to get my hands on an h200e card and a 4tb disk to test it out...
 
Looks like this is my jam.. H200e SAS HBA / 8 Port (2 x External SFF-8088 Connection) Plain HBA Card...
If I've got the plain HBA card, no need to reflash, yes? Just grab a mini-SAS to SAS (8088 to 8470) cable to go with it...?
Getting closer, I can smell it...
 
Hi all- first post, been lurking and admiring for some time and thought this was a good time to join in.

Hey med. I stumbled across your post a few days ago while traveling down essentially the same path.


I'm not super well-versed in hardware let alone large storage solutions, so I figured this would be a good way to learn some new tricks.

I'm a bit ahead of you here. I've been into computers since I was a kid, and got involved in Linux in the early 90's. I transitioned from enthusiast to sysadmin about a decade later, specializing in part in storage. This is however my first experience with the MD1000 and SAS in general.

I'm upgrading a pair of Dell 1950 servers that are currently hosting several SCSI RAID cabinets. They're a bit long in the tooth, and I'm looking to move to something I can continue expanding for a while. I'm in an academic setting, and needless to say, there isn't the budget for enterprise storage.


Originally, I thought I'd fill the MD1000 with 2TB drives in a raid-10 off the 5E card,

I've never been a fan of mirrors. Not enough fault tolerance.

but if at all possible I'd love to go with 4TB drives in Z3.

Z3 may be overkill depending on your vdev size and number of drives. If you break your array into two 7-disk vdevs and add them to the same pool, you should be quite safe with a Z2 array. I realize this results in slightly (one drive) less storage than Z3/14, but it gives you more flexibility in upgrading later.

[edit] It also affects performance. I just read on the Wikipedia ZFS page:

"If the zpool consists of only one group of disks configured as, say, eight disks in RAID Z2, then the write IOPS performance will be that of a single disk. However, read IOPS will be the sum of eight individual disks. This means, to get high write IOPS performance, the zpool should consist of several vdevs, because one vdev gives the write IOPS of a single disk. However, there are ways to mitigate this IOPS performance problem, for instance add SSDs as ZIL cache — which can boost IOPS into 100.000s. In short, a zpool should consist of several groups of vdevs, each vdev consisting of 8–12 disks. It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives)."

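To put rough numbers on that rule of thumb (the 100 IOPS per spinning disk below is my own ballpark, not from the article):

```python
# Toy model of the quoted rule: write IOPS scale with the number of
# vdevs, read IOPS with the total number of disks.
DISK_IOPS = 100  # assumed per-disk random IOPS for a 7200rpm drive

def pool_iops(vdevs, disks_per_vdev):
    writes = vdevs * DISK_IOPS                  # one disk's worth per vdev
    reads = vdevs * disks_per_vdev * DISK_IOPS
    return writes, reads

print(pool_iops(1, 14))  # (100, 1400) -- one wide 14-disk vdev
print(pool_iops(2, 7))   # (200, 1400) -- two 7-disk vdevs
```
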

The problem is of course that the 5E supports neither 4TB drives nor JBOD for zfs.

I had no SAS HBA before starting this project. I looked at the Dell HBAs, but they seemed to use older chipsets and limited what you could and couldn't do. I went with a pair of LSI 9205-8e cards, which use the SAS2308, the bigger brother of the PERC's SAS2008 controller. I found the firmware and flashing utility on LSI's website to update them to the newest 'IT' (JBOD-capable) firmware. This was probably the easiest part to figure out.


Last week before deciding to go for ZFS I went ahead and picked up a PERC H800 which would do 4TB drives but is still hardware raid, not JBOD. From what I've read the H800 could only do a series of 1-disk RAID-0 vdisks, which is not only not the right thing for ZFS, but might actually invalidate the benefits of it. So scratch that. From what I can tell, two options might actually attain what I'm looking for.. a PERC H200e (which can do JBOD for sure, though I can't determine 4TB compatibility?) or LSI 2008 HBA, which I believe would check all the boxes for me, but tbh I'm not entirely sure which specific one would be the right choice for the 2950.

I've seen quite a bit of chatter about using the H800 in JBOD, but no real details about how to do it. Have you explored its BIOS settings? You may be able to reflash it with an LSI firmware as well. For what it's worth, I paid about half of what I see the H800 going for when I bought the 9205. It might be worth flipping it.

For disks, I'm thinking WD Red 4tb, though a buddy of mine recommended comparable HGST 4tb based on reputation for low failure rate. (I think from BackBlaze's most recently published reports..)

I totally concur. I originally had Seagate 2TB drives in the SCSI cabinets, and experienced a couple of early deaths back to back (2nd failure during the rebuild!). RAID6 saved me, and the array kept on rolling despite being down two disks at once. When I upgraded to 4TB disks, I went with the HGST drives, and they're all still running solid today. Once this new array is up and data migrated, I'm looking forward to recovering them for my own build.

Interposers. What the hell do they do.

I haven't the slightest clue! Mechanically and electrically SAS and SATA are the same. They only differ logically.

Have read conflicting reports that interposers are required / not required for sata disks, and that they do or do not limit recognition of disk capacity down to 2TB in the MD1000.

As have I, but I worked out that they're ONLY required if you're mixing SAS and SATA disks on the same backplane. Since you're running all SATA you shouldn't need them. I didn't on mine. As for the capacity limitations, that's entirely dependent on the controller. The expanders and backplane are agnostic.

I figured worst-case scenario, I could go ahead and buy a couple 4tb disks, a couple interposers, try it both ways, and see which one works out properly-- unless you guys know for sure.

Try it without. It's cheaper. I have all 15 slots populated without them, and I see every drive.

Configuration... 14-disk z3 plus a hot spare = 44 TB usable? It gets a little fuzzy for me when trying to determine how this all relates to block size and what constitutes a good idea vs a bad idea.

Rough rule of thumb: RAIDZ2 is two drives of parity, and RAIDZ3 is three. How many drives they're protecting determines utilization. Since the backplane in the MD1000 is split 7/8, it makes sense to assume each vdev will have 7 drives, with 2 or 3 of those drives going to parity. With 4TB drives, you'd get either 20TB or 16TB per vdev for Z2/Z3. I'm going with 6TB drives and Z3, because I really can't afford to lose this data. By splitting into two vdevs, I can upgrade incrementally, replacing drives in only one vdev as needed, plus the resilver time will be lower (7 vs 14 drives).
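
For the capacity side, a quick back-of-the-envelope comparison (this ignores ZFS metadata and slop space, so real numbers come out a bit lower):

```python
# Usable space = vdevs x (disks per vdev - parity disks) x drive size.
def usable_tb(vdevs, disks_per_vdev, parity, drive_tb):
    return vdevs * (disks_per_vdev - parity) * drive_tb

print(usable_tb(1, 14, 3, 4))  # 44 TB -- the single 14-disk Z3 above
print(usable_tb(2, 7, 2, 4))   # 40 TB -- two 7-disk Z2 vdevs
print(usable_tb(2, 7, 3, 4))   # 32 TB -- two 7-disk Z3 vdevs
```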

I've seen a few setups on here with 11-disk z3 plus hot spare, but wouldn't quite be able to articulate why that would be better or worse than a 14-disk setup.

There is a lot of debate over how many drives of data can be reliably protected by 2 or 3 drives of parity. In my current setup I'm limited to 8 drives per volume, so RAID6 consumes 25% of the raw capacity (2 of 8). With 3 of 14 that drops to about 21%, but with a ridiculous increase in reliability.

I found a calculator "RAID RELIABILITY CALCULATOR" on Serve The Home (dot com) website. (I'd post the link, but because I'm a new user here I can't post links.) Play with the values a bit to plan your vdev.

The OS on the 2950 is on a 2-disk raid 0-- my other tinge of paranoia was using software-based ZFS and then having the OS disk(s) go belly-up on me, therefore losing the config for the zfs pool.. is there any way to abate this risk?

I haven't explored this. I lost the system disk on a system that hosted the metadata for an LVM volume that consisted of multiple disks. I don't remember the specifics, but the LVM tools allowed me to scan the individual pieces and rebuild the metadata with little effort. Let's hope ZFS is capable of similar. With my MD1000 up and running, I'm only just now turning to exploring ZFS for the first time. I'll keep you posted.

Last but not least, the 2950 has dual quad core xeons (max spec, I don't remember precisely atm) but only 8gb ram, so I'm unsure if that will be sufficient for what will effectively become a NAS, or if I should look at more. Probably will.

From what I've read about ZFS, the more memory the better. One server already has 32GB, but the other only has 8. I'll be maxing out the second. Memory for these things is stupid cheap these days. I won't be running dedupe, which is supposedly the biggest memory consumer in ZFS.

I feel like I've done most of my homework at this point, but can't quite confirm compatibility on the setup without some advice. Muchos thanks for any help pointing me in the right direction.

Plug together what you've got and see where it gets you! I don't have the final disks for my setup; 30 6TB drives are going to run more than $6,000! For the purposes of testing, I found that Microcenter had 40GB drives for $3.99. They looked at me like I was crazy buying that many used hard drives, but it did the job, and for a whopping $64.

Included potato-phone picture for reference.

Looks great!

Why the hell is it so dusty?

Fans. Get used to it.

Why is that blue LED so damn bright?

To distract you from the ever present drone of the fans.
 
Hey there, thanks for the reply. I have some more info I can add along the way here, but you might also be interested in a later post of mine, where I have this thing more fully-baked:
27tb zfs with Dell 9th gen hardware

You mentioned not being too hot on mirrors above, and originally I was in the same boat, but I think ultimately I'm good with the fault tolerance of 7x 2-disk mirrors... technically I could lose up to 7 drives (wow), but it's also true that a very specific 2 disks could kill the pool-- the difference in resilver time is significant. I'm looking at 1 hour (maybe a little more) to resilver a vdev, as opposed to several times more if I were using Z3. I was a little bit influenced by some reading like this article: ZFS: You should use mirror vdevs, not RAIDZ. | JRS Systems: the blog
But of course, different strokes for different folks. This is a pretty low-utilization system, so I think the risk is similarly low. All data is replaceable, non-mission-critical stuff. I'd be interested to know what you see for resilver time once yours is up and going. I didn't do any practical tests of Z3, just estimating based on other comparable systems.

Regarding H800 in JBoD, it turns out it isn't meant to be. You can make 1-disk RAID-0 devices out of it, but that's not the same as what zfs wants for a jbod. I ended up just flashing an H200 according to this walkthrough: A Cheaper M1015 - the Dell H200 and a HOWTO Flash - Overclockers Australia Forums ... much cheaper card, not a difficult process. I imagine you might be able to reflash it, but there might not be much point in going with the more expensive H800 when the H200 is a proven path.

I still don't know exactly what purpose the interposers serve, other than that they are actually needed in my specific case. Plugging sata disks directly into the MD1000 (using the flashed H200 card) just resulted in a solid activity light, and didn't ever present the disk to the system. It could be specific to what card is in use, so I wouldn't be surprised if a different setup wouldn't require them. I'm using all SATA disks. For example the perc/5i doesn't have a problem with straight sata disks without interposers in the main 2950 chassis.

One thing I wasn't able to find was the max disk size for the H200/LSI card that I ended up with. 4TB? 8TB? 10TB? Could get pretty crazy.

I found I actually ended up with 45% of the raw storage available after formatting in my mirror-vdev setup.. If I do another one of these, I might try a 14-disk z3 just for fun.

Turns out for me, 8GB of RAM is entirely sufficient. No dedup. But I'm going to go ahead and add plenty more, as I'm aware that ZFS can take advantage of more available RAM for caching and such. Plus, as you mentioned, it's stupid cheap.

I actually did get some reprieve from the fans, and it was a combination of multiple things:
- Firmware hack for 2950 that allowed me to adjust the lower fan threshold (so that consumer fans could be used without setting off alarms)
- Swap of the 4 main fans with slower, quieter consumer fans
- Adding a resistor to the +12V lead of the fan in each power supply (stepping it down to about 7V) to reduce speed/noise -- this does not actually set off alarms (rough resistor math sketched after this list)
- Swapped out the MD1000 power supply fans for quieter consumer fans, though I did have to add a dummy device to simulate a tach signal to keep the MD1000 from freaking out
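
If anyone wants to size that resistor themselves, the rough Ohm's-law math is below. The 0.2A draw is just an example figure, not my actual fans; check the label, and keep in mind the current drops as the fan slows, so treat it as a starting point and measure.

```python
# Rough resistor sizing for dropping a 12V fan feed to ~7V.
V_SUPPLY, V_TARGET = 12.0, 7.0
I_FAN = 0.20                               # assumed fan current in amps

r_ohms = (V_SUPPLY - V_TARGET) / I_FAN     # R = V / I
p_watts = (V_SUPPLY - V_TARGET) * I_FAN    # heat dissipated in the resistor

print(f"~{r_ohms:.0f} ohm, rated for at least {p_watts:.1f} W")
# ~25 ohm, rated for at least 1.0 W
```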

After all that, it's actually ok to have in the same room as my home theater. If you are interested in more info there I have a few good links for the 2950 hack.

Cheers
 
I just wanted to rant a bit about your "rebuild" times with 4TB drives in mirror mode... The current maximum drive write speed is about 200MB/s (actually, every current drive is slower!), which gives a maximum of about 720GB/h of rebuild throughput. So I wanted to ask how you could rebuild a 4TB drive in 1 hour... until I remembered that you probably have "nothing" stored on the drive... :p
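
The arithmetic, for anyone following along (the 0.7TB of used data is just an assumed figure to show why a mostly empty mirror resilvers so quickly):

```python
# Rebuild-time math at 200 MB/s sequential write.
SPEED_MB_S = 200
print(SPEED_MB_S * 3600 / 1e6, "TB/h")                    # 0.72 TB/h
print(4 * 1e6 / SPEED_MB_S / 3600, "h for a full 4TB")    # ~5.6 h

# A mirror resilver only copies allocated blocks, so a lightly used
# vdev finishes much faster (0.7 TB used is an assumption):
print(0.7 * 1e6 / SPEED_MB_S / 3600, "h for 0.7TB used")  # ~1 h
```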

But in all seriousness, current-gen CPUs barely have to wake up to calculate Z3 parity (of course I'm exaggerating, but we're talking about gigaFLOP CPUs here).

The recommendation for mirrors comes from pools that can't function properly while a drive reconstruction has to read from every drive to rebuild the data; the limiting factor for hard drives is I/O operations.
Mirror/stripe pools don't need to involve the whole pool to reconstruct their data, just a small portion of it (the one drive they rebuild/mirror from).
Frankly, a mirror/stripe pool is just a waste of space in a home environment or on "a pretty low-utilization system".

Any SSD will kick a 100-drive mirror/stripe pool's butt. So if you need IOPS, use an SSD!


A few more words on that 2TB limit...

In ancient times, hardware manufacturers and software developers tried to save everywhere... so they built an MBR partitioning scheme that could address 2^32 sectors (which is about 2.2TB with 512-byte sectors).
Hardware manufacturers also tried to save on every register...
Since all current hard drives present a 512-byte sector size on their SATA interface, this is where the 2TB limit comes from.

So what happens if you try to address space beyond LBA 2^32? Best-case scenario: nothing... but there are plenty of reports all over the net where every write beyond the 2^32 LBA limit just went to that "last sector", or even worse, wrapped around and started again at LBA 0...

So the general claim "works with >2TB drives" is not something I would rely on (and it can be pretty hard to verify/test). Every component along the way -- controller chip, controller BIOS, SAS backplane expander chip, backplane ROM -- may clip the number of usable address bits. To test this, you would have to exceed the 2^32-sector address limit on every single component, so write more than 2TB to a single disk, more than 4TB to a 3-disk Z1 pool, etc.
And you'd need a way to verify that what was sent to the drive/array is really the same as what you get back when you read it, after crossing those boundaries...
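
If someone really wants to run such a test, here is a rough sketch of the idea in Python: write a known pattern just past the 2TiB mark of a blank disk, read it back, and check that it did not wrap around to the start. /dev/sdX is only a placeholder, and this will destroy data, so point it at a disk you do not care about:

```python
# DESTRUCTIVE: only run against a blank test disk.
import os

DEV = "/dev/sdX"                    # placeholder device path
BOUNDARY = 2 * 1024**4              # 2^32 sectors x 512 B = 2 TiB
OFFSET = BOUNDARY + 4096            # just past the 32-bit LBA limit
pattern = os.urandom(1024 * 1024)   # 1 MiB of random data

fd = os.open(DEV, os.O_RDWR | os.O_SYNC)
try:
    os.pwrite(fd, pattern, OFFSET)                 # write past the boundary
    readback = os.pread(fd, len(pattern), OFFSET)  # read it back
    start = os.pread(fd, len(pattern), 0)          # did it wrap to LBA 0?
    print("readback matches:", readback == pattern)
    print("wrapped to start:", start == pattern)
finally:
    os.close(fd)
```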


So, back on topic (kind of) of what the maximum disk size is... I remember reading about a 2^48-sector limit implemented in current tech (48-bit LBA), which at 512-byte sectors works out to 2^48 x 512 bytes = 128 PiB,
so a pretty damn fuckload of storage space for one single drive... times 8 if the switch to 4K sectors happens in the future ;)
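
For the curious, what that 48-bit limit actually works out to:

```python
# 48-bit LBA ceiling for the two common sector sizes.
for sector_bytes in (512, 4096):
    pib = 2**48 * sector_bytes / 2**50
    print(f"{sector_bytes}B sectors: {pib:.0f} PiB")
# 512B sectors: 128 PiB
# 4096B sectors: 1024 PiB (= 1 EiB)
```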
 