ZFS NetApp replacement

@Nex7:

Regarding the ZIL, I have purchased your recommended STEC ZeusRAM, but I have been told it doesn't HAVE to be mirrored, and that if it failed, the pool would just be in a degraded state until it was replaced.

Can you please touch on that in more detail? Would love to get a definitive word on this.
 
@stevebaynet:

(word of warning: I substitute 'ZIL device', 'slog device', 'slog', and so on pretty regularly, sorry about that -- it is technically a LOG device, and a 'ZIL' is the ZFS Intent Log, something that can exist on the pool data drives or on specific log devices devoted to it).

So here I am writing sync data at a ZFS zpool with a slog device -- the ZIL device is in use, doing its thing, and >wham<, it goes dead. In a vacuum where that's my only failure, the repercussions of this are two-fold:

1 - for the next 'x' seconds (however long until the next txg commit), the incoming write data is being held solely in RAM; I've lost my slog device backup of that data.
2 - once the txg commit has come and the data's been written out, ZFS is going to start using the pool itself as the ZIL device (as it does when no slog has been dedicated), which is likely going to impact (possibly quite severely) my environment's performance.

So in theory, assuming the only thing to fail is the ZIL device itself, and the above issues are understood, then hey, I survived. I'm now impacted on performance until I get another slog device, but my data's protected by the ZIL now utilizing the pool. This is where the logic comes from that you don't *NEED* a mirrored set of ZIL devices.
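
To make that concrete, here's roughly what recovering from a dead (but cleanly failed) slog looks like; the pool and device names are made up for illustration:

  # the failed log device shows up under the 'logs' section as FAULTED/UNAVAIL
  zpool status -v tank
  # drop the dead slog and add its replacement (log vdevs are removable on current pool versions)
  zpool remove tank c2t0d0
  zpool add tank log c2t1d0

Until that replacement goes in, the ZIL lives on the pool disks as described above.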

But there are some obvious problems with the above scenario. It makes a lot of assumptions. Among them:

- I'm only going to lose the ZIL device.
- I'm going to lose my single ZIL device in a 'nice' way (it's going to stop working, not go bonkers).
- I'm not going to lose power before the next txg_commit.
- I'm not going to lose power or intentionally shut down or export before replacing my ZIL device, or my pool and OS are recent enough -- older versions of ZFS lacked the ability to import a pool if the ZIL device was missing (see the sketch after this list).
- My setup has enough leeway in it that the sudden influx of cache flushing random write I/O to the pool data drives doesn't quickly lead to my entire environment going offline.
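
On that import point: newer ZFS (recent illumos / Nexenta) can force an import with a missing log device, at the cost of whatever sync writes existed only on the dead slog. A rough sketch, pool name made up:

  # -m tells ZFS to import even though a log device is missing
  zpool import -m tank
  # any sync writes that were only on the dead slog at crash time are gone

Older pool/OS versions simply refuse the import, which is exactly the scenario that assumption is about.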

The thing is, some of these assumptions end up not being true in quite a few production environments I've seen. I've seen boxes lose the ZIL as part of a power event, for example (either causal or just unfortunate timing). I've seen more than a few environments where the loss of the ZIL took the whole 'cloud' offline shortly after, because the SAN couldn't keep up, clients started timing out too much, then went read-only, and boom. While in that latter example data was effectively safe, it still led to eventual total downtime until the ZIL was replaced -- better than data loss, for sure, but still something they'd have preferred to avoid (they now have 2 slogs mirrored).

This is why you will see enterprise grade solutions really pushing for mirrored slogs -- all of the above assumptions have to be true for it to be unnecessary, and they very rarely are all true.
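
For reference, the difference at administration time is a single word; device names here are hypothetical:

  # single slog
  zpool add tank log c2t0d0
  # mirrored slog pair
  zpool add tank log mirror c2t0d0 c2t1d0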

That said, personally, if you've got a budget of $3000 to spend on slog devices, I'd prefer you buy 1 STEC ZeusRAM instead of 2 mirrored SLC SSD's. The ZeusRAM being RAM, its failure rate and longevity are both far better than even the best SLC SSD's, so in terms of risk, you end up safer on that 1 ZeusRAM than you would on mirrored SSD's.
 
@Nex7:

Makes perfect sense, thanks! We are not putting anything production on this build yet. So for now we will keep the single STEC and consider a second once we start putting production data on it.

Thanks for the great write up!
 
Lots of great info from Nexenta! Thanks a lot! More please! :)

Regarding the ZIL, I heard of one guy who ran 3-5 virtual machines and his zpool froze now and then. After adding a ZIL device, it never froze again. It was butter smooth. So it seems that a dedicated ZIL is necessary for smooth operation.

"I also have to mention that I use VMware ESXi for virtualization and the backend storage runs over NFS against the file server. NFS uses sync writes and this is probably where I saw the biggest difference when I added the ZIL devices. Earlier the VMs would pretty much be unresponsive whenever I copied a big file, because they couldn't get their sync writes through to the disks. This is gone now, thankfully."
 
Thanks for the detailed response. So to sum it up, for my situation I should:

1. Not use local disks for ZIL or L2ARC if I want to use the HA module. Instead I could use a 2.5" external array to house SSDs.
2. Max out the RAM (288GB on an R710 or 1TB on an R810).
3. I am aware of the issues with the Dell Broadcom onboard NICs; there will be Intel NICs in the boxes.


Some questions:
1. You say the Dell MD 1200 is only supported using the Dell PERC RAID cards. That is what Dell supports, but I have seen numerous posts of people running them on LSI HBA cards. So are you saying that since Dell doesn't support them Nexenta won't support them? Or simply warning that Dell won't support them?
2. You mention the cookie cutter solutions from Nexenta's partners as being the only ones supported by Nexenta for HA. Does that mean that I cannot create this custom solution and still use the HA module? I would be fine with using one of those configs, but as I stated I already have a ton of Dell hardware and the organization I work for prefers to only buy Dell equipment, hence why I decided to use the MD 1200 arrays. From an aesthetic perspective it would also look much nicer to have a rack full of similar-looking hardware.
3. I hear what you are saying about iSCSI vs NFS. We already have a fiber backbone we use with the NetApp and two fiber cards in each VM server. They run at 8 Gb in an active-active config, essentially giving each VM server 16 Gb/sec with redundancy. It would be nice to leverage that fiber infrastructure rather than going to 10GbE.
 
@brutalizer: Happy to help!

So that scenario could actually have been any number of things, but yes, especially in latency-sensitive and/or high write IOPS environments, a dedicated fast slog device is important. I want to be clear -- it needs to be fast. I've seen customers use spinning rust, even 15K, and that ends up being worse than not having one at all. It needs to be an SLC SSD or preferably a ZeusRAM (since it's RAM with a supercapacitor & flash backup for power loss coverage).

There are a few reasons for this, and they all have to do with what ZFS is doing with the ZIL. I cover this a bit on the website I linked -- in essence, ZFS ingests all writes into RAM, always. For async writes, that's it; at that point ZFS responds saying 'OK, I wrote that'. On a sync write, however, what it will also do is write it to the ZIL, flush the cache on the ZIL, and only then respond saying 'OK, I wrote that'.
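
If you want to see or steer this behavior per dataset, reasonably recent ZFS exposes it as the 'sync' property (the dataset name below is made up):

  zfs get sync tank/vmstore
  # standard = honor sync requests through the ZIL (default)
  # always   = treat every write as sync
  # disabled = ack sync writes from RAM only -- skips the ZIL, dangerous on power loss
  zfs set sync=always tank/vmstore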

Then, every 'x' seconds (it used to be 30; now I think it's usually 5), a 'txg commit' occurs and ZFS lays out all the writes it's holding in RAM to the data disks. Note I said from RAM. ZFS never reads from the ZIL in typical day-to-day use. It only writes to it. It is in this way a lot like a SQL transaction log. Data goes in, but it isn't referenced ever again unless there's a failure.
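
For the curious, on illumos-based systems that commit interval is the zfs_txg_timeout tunable (in seconds). A rough sketch of inspecting and adjusting it -- this is illustration, not a recommendation:

  echo zfs_txg_timeout/D | mdb -k        # read the current value
  echo zfs_txg_timeout/W0t5 | mdb -kw    # set it to 5 until the next reboot
  # persistently, in /etc/system:  set zfs:zfs_txg_timeout = 5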

The reason a dedicated device for the ZIL is so critical in some environments comes from understanding the above -- especially the part about how sync writes are going to get into the ZIL *immediately* instead of as part of a txg commit AND it's going to by default FLUSH the cache of the device holding the ZIL (your pool, if you don't have a dedicated log vdev). A dedicated log device means that no matter what kind of writes are coming in, your poor pathetic (because they really are all pathetic) spinning disks get to deal only with the kind of write load they are best at -- large, sequential writes every 'x' seconds. If you don't dedicate a slog, they get to do that workload AND they get slammed every millisecond with random write I/O that's also constantly telling them to flush their cache, something your average spinning disk is just not going to like you doing to it at all.

It has some extra benefits -- for starters, ZFS responds affirmatively to the sync write once it's in the ZIL. When your ZIL is on a fast SLC SSD or RAM, its latency to respond to a write request is far lower than your spinning disks', meaning your clients get lower write latency, which has a host of ancillary benefits. It also means that even in scenarios where your pool may actually have been able to handle the ZIL traffic and normal usage, the fact that it doesn't have to means the disks will almost undoubtedly live longer, potentially much longer.

How the ZFS Intent Log works and how that behavior is modified by having dedicated slog devices is a topic I didn't get around to covering well enough, I think, in the link I sent. I'll try to get around to providing better information at some point. Suffice it to say, the ZFS Intent Log as a workload is a terrible, terrible thing for spinning disks, but also very necessary if you like your data and don't want to lose it due to power or hardware loss events -- so if you can offload it to SSD or RAM, you very much should, and you will not regret it.

A word of warning: at least on Nexenta, and I believe by default on other appliances and OpenSolaris, the COMSTAR block target subsystem defaults to 'write cache enable' on LU's. It is super important to understand this. I cannot tell you how many customers I've seen who either have a log device sitting there unutilized, or never bought one because they never saw a performance problem to warrant it -- and in both cases it was because they were using solely iSCSI, with things like VMware, and never even knew about WCE. Enabling the write cache on a COMSTAR LUN effectively tells COMSTAR that IF the incoming data is an async write, it can go into RAM alone, and not the ZIL. If the data coming in is specifically sync, it still goes to the ZIL. This sounds very reasonable. There's a problem. Most clients? They don't specify sync on iSCSI. In fact, by default, very few do. VMware, for instance, does not. Your VM on VMware may be doing SQL traffic and doing it sync to the VMware-provided disk, but VMware itself, when talking iSCSI back to us, may very well be doing async. Note that with write cache disabled on the LU, all writes go into the ZIL, and the problem is avoided.
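
If you're on Nexenta/OpenSolaris COMSTAR, checking and turning off WCE on a LU looks roughly like this (the LU GUID below is made up):

  # look for the 'Writeback Cache Disabled' property on each LU
  stmfadm list-lu -v
  # wcd=true means the write cache is disabled, so all writes go through the ZIL
  stmfadm modify-lu -p wcd=true 600144f0c8e1d500000050a1b2c30001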

I routinely talk to customers with VM or other environments and they tell me that when they lost their SAN, after rebooting it they found that some/all of their VM's had issues ranging from complaints about filesystem integrity requiring a chkdsk/fsck all the way up to flat-out corrupted disks that they couldn't boot or access any more at all. Inevitably, they are using iSCSI with WCE, or they had gone into the Nexenta system or their OpenSolaris install and turned off ZFS' cache flushing (don't do that, btw, if you do not know what you are doing). Some of these customers had a ZIL device, so they were very confused -- and I was able to show them that in active use, with all the VM's running full tilt, the ZIL device was sitting at 0 bytes/s write, because they were all iSCSI with write cache enabled. Be very careful with that setting, know it exists, and unless you tune your clients to specify sync writes, turn it off.
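
An easy way to catch this yourself is to watch per-vdev throughput while the workload runs; if the log vdev sits at zero during what you believe is sync-heavy traffic, your writes are arriving async and bypassing the ZIL (pool name made up):

  # the 'logs' section shows slog write activity, refreshed every second
  zpool iostat -v tank 1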
 
@Supernaut:

1 - Yes - snagging a 2.5" array or just getting some 3.5" bay to 2.5" disk trays would work.
2 - Yes -- assuming you're planning to have a significant pool. 288 GB of RAM is some pretty silly overkill for a 5 TB pool, even in a cloud, usually. I routinely see customers with 5 to 20 TB pools doing VMware with anywhere from 48 to 96 GB of RAM and still getting very appreciable 80%+ ARC hit rates (a quick way to check your own hit rate is sketched after this list). That said, if you've got the cash, go for it. A 100% ARC hit rate would be cool. :)
3 - Yeah; again, feel free to use for management, just not data.
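
If you're curious what your own ARC hit rate looks like, the kernel keeps the raw counters on Solaris/illumos; the hit ratio is just hits / (hits + misses):

  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # the arcstat.pl script, if you have it handy, does the same math per interval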

1 - There's a pretty big difference here between what the companies SUPPORT, and what is technologically possible. I have personally plugged an MD1000 into an LSI SAS HBA and had no problem running that for a year. But would Dell SUPPORT it? No. If I'd told them that's what I was doing, they'd have told me they couldn't help if there was any issue. Let me say now that I don't make policy at Nexenta or Dell, so this is just my understanding of these things. The issue is that even if we at Nexenta support it, the fact that Dell does not means that if something goes wrong, what do you do? If there ends up being a firmware issue that Dell COULD resolve on the MD1220, how do you get them to do that? You won't be able to -- because they have no intention or desire to fix a firmware bug that only exhibits when an MD1220 is plugged into an LSI HBA, something they specifically do not support. So while Nexenta may not technically care one way or the other, I'd suggest that you really want support from ALL parties involved in a SAN. This is one of the stickier points of using Nexenta or any DIY SAN as opposed to a NetApp -- getting buy-in from all parties. :)

My last job, we had to use Dell and only Dell, so if that's the case with you I empathize. That said, we ended up breaking our own rule and using an SM JBOD plugged into an LSI card in a Dell chassis for our Nexenta SAN's -- the only time we ever violated the no-Dell rule.. and it was because our IT department put their foot down and said "no unsupported by all vendor configurations". But as you say, you've got a lot of existing hardware -- this is where I'd definitely reach out to a Nexenta Sales person and get an SE involved and discuss what we are willing to do.

2 - No, my strong suggestion was just using a partner/reseller to do this, and that their cookie cutter builds often have the virtue of being well-tested, if only because of how often they're deployed. Quite a few of our resellers, like (and this isn't a specific recommendation for them) PogoLinux, would likely work with you to build whatever you want, within the confines of the HSL and the vendors they resell. Every component on the box MUST be on our HSL for it to qualify for 'Gold', and 'Gold' is a requirement of an HA license (you'll notice we don't sell Silver HA licenses). You can get Gold HA doing it yourself in terms of hardware purchase, but if you are going that route, you will want to engage a Nexenta SE. At the end of the day, what you do not want is to buy hardware, get it, install Nexenta, and then have your system go through what we call a 'Support Acceptance Check' and have us come back with a 'no, we won't accept this'. :)

3 - This is a perfect example of when an existing environment overrides the guidelines. If you don't even have an existing 10GbE setup but do have an existing 8 Gb FC network, then I would by all means suggest that you use FC LUN's from the Nexenta -- just go into it understanding the differences (be sure to note what I said in the last post about WCE, for example; it applies to FC as well as iSCSI). Note also that Nexenta does not have any client-side plugins at this time to deal with 'freeing blocks' on block-level targets... you either live with the idea that eventually Nexenta LUN's are going to report 100% used even when the client says 5%, or you get used to running 'sdelete'. :) Having said the above, if you are open to the idea of NFS or you already have a 10GbE network that you just don't presently utilize for NFS, it should be discussed. There are a lot of wins for NFS over block-level when it comes to VM's, ESPECIALLY as the world moves to NFSv4.
 
What sort of performance issues are you running into with your NetApps? What model filer, disk shelf model & disk type, and what version of Data ONTAP are you running?

In one of my development labs we have a small NetApp cluster running 1000 VM's, a dozen databases, and a ton of NFS/CIFS shares.

I'm more than happy to help get things sorted out.

As for the Oracle ZFS storage, you may want to hold off on that idea as there are a couple of nasty bugs that don't have a fix ETA yet, and the workaround is ugly.
 
Our NetApp has:
A FAS3170 head unit with two 1TB PAM cards and two 8Gb fiber HBA cards, running ONTAP version 8.0.1P4-7
DS4243 shelves

The performance issue is mainly in our wallet. But we do see IO issues, lots of latency. To be fair though we are running many heavy IO virtual hosts. It is a nice system, but charging tens of thousands of dollars for something as simple as NFS is ridiculous. Then there are the costs of new shelves and support contracts etc...
 
"As for the Oracle ZFS storage, you may want to hold off on that idea as there are a couple nasty bugs that dont have a fix eta yet, and the workaround is ugly. "

Can you quantify that? What bugs? Oracle 7000 appliance specific bugs, or more generic Solaris/ZFS bugs? I'd be happy to look into any of them to verify that they do not affect Nexenta.
 
Glad my NetApp days are behind me for now. Been there, done that. Gobs of cash for more disk shelf after disk shelf after disk shelf that don't touch the I/O req's. The joys of WAFL & RAID DP. Very happy with my EMC arrays/licensing model.
 
"As for the Oracle ZFS storage, you may want to hold off on that idea as there are a couple nasty bugs that dont have a fix eta yet, and the workaround is ugly. "

"Can you quantify that? What bugs? Oracle 7000 appliance specific bugs, or more generic Solaris/ZFS bugs? I'd be happy to look into any of them to verify that they do not affect Nexenta."

Not sure if it goes beyond the 7000 appliances or not, as the bug report they sent us is still private, but when it was opened a few months ago they said they were going to have to completely rewrite the way resilver works and didn't have an ETA. The workaround for systems that could not tolerate a lengthy disk rebuild time (7-14 days) was to set up the appliance as a double or triple mirror. And no, I'm not kidding, that was their suggestion to us. But that is enough to have me hold off on running anything important on large-scale ZFS for a while.
 
"Our NetApp has:
A FAS3170 head unit with two 1TB PAM cards and two 8Gb fiber HBA cards, running ONTAP version 8.0.1P4-7
DS4243 shelves

The performance issue is mainly in our wallet. But we do see IO issues, lots of latency. To be fair though we are running many heavy IO virtual hosts. It is a nice system, but charging tens of thousands of dollars for something as simple as NFS is ridiculous. Then there are the costs of new shelves and support contracts etc..."

Are you using ASIS on your VM volumes? Make sure your SIS schedule on VM volumes is not during heavy workload hours (ours were Mon-Fri 6am-6pm); set your SIS scan times outside those windows to ensure a SIS scan isn't impacting your business.
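
For reference, I believe the 7-mode syntax for checking and moving that schedule is roughly this (volume name and times are made up; check the sis man page for the exact schedule format):

  sis status /vol/vmvol               # is dedupe enabled, and when did it last run
  sis config /vol/vmvol               # show the current scan schedule
  sis config -s sat,sun@2 /vol/vmvol  # example: only scan on weekend nights at 2am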

We have 1000 active running VM's split across the two 3140 filers (running 8.0.2P4), using ASIS on all VMware volumes, so we see a very high cache hit ratio -- all this with no PAM cards. We have a mix of the new and cool DS4243 shelves and some old crappy DS14 shelves doing the work. I'm shocked you see any I/O issues at all, especially with the PAM cards.

You'll probably need to do a deep dive to really find the cause of the issues you're seeing.

I agree NetApp prices are high, but that may be more an issue with your reseller than NetApp; tell them they need to bring the price down. Besides, NFS is supposed to be part of the base services shipped with the filer. Having datacenters with storage from every major vendor out there, the only one that doesn't give me a headache with service calls is NetApp.

Also, be vocal with your NetApp rep; tell them your frustration with the prices and such. Give them a chance to retain your business -- they may pull out some big guns to get your issues resolved :)
 
I've also had great luck with NetApp. We've really only run into performance issues in two situations, and both were clearly our fault: first, by filling volumes and aggregates nearly or completely full, and second, by putting way too many files in a single directory (someone had tweaked the max dir option to be much higher than it ever should be). I have occasionally been annoyed with the lack of upgrade paths between old and new technologies (like the move from trad vols to flex vols, or now the move from 32-bit aggregates to 64-bit aggregates), but not so much as to look for other options. You basically get most all of the ZFS goodness plus some extras in a very nice easy to maintain package.

The pricing is definitely up there but there is also a good amount of room for them to work with you on price. I agree with Mtnduey that if price is your primary concern push back hard on your reseller, find a new reseller, or (if you're big enough) just deal with NetApp directly.
 
Another thought occurred to me. Have you thought about using OpenFiler HA for a really inexpensive HA NFS/CIFS option that's still really easy to manage, and just using your NetApp for the important stuff like VMware, database LUNs, etc.? OpenFiler can also do iSCSI/FC, among other things like block replication.
http://www.openfiler.com/


Also, not sure if you've seen the Backblaze blog, but they go into pretty good depth on how they build their massive-scale PB backup devices. Pretty neat stuff, and they even have a parts list at the bottom if you want to make your own:

http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/#more-337

food for thought :)
 
Well, I spent a couple of hours trying to install and configure the EMC Unisphere VSA and never did get it working right. I deployed the same identical OVA 4 times and saw 4 different sets of messages on the console (including some failure cases). VMware tools were not running by default (although supposedly installed via RPM when the OVA was created). Although it seems to be a Red Hat install and Google seems to indicate a 64-bit kernel, the OVA has a guest type of '32-bit Other'. I was unable to install VMware tools from the vSphere client, since apparently the prebuilt modules were not correct and no gcc is installed. There are apparently multiple versions of this appliance, and Google was largely finding howtos that didn't match what I was seeing. Maybe I'm just stupid :) I'm going to give this one more shot, and if no joy, I'll recover some disk space...
 
@spankit: My pleasure!

@Mtnduey:

Hmm, I'm unaware of any bugs with raidz resilvers in Nexenta or illumos (I assume that's where the issue lies, implied by the suggestion that mirrored pools are unaffected). In fact, recent versions of Nexenta expose some tunables for how fast you want resilvers to go (obviously also available on any illumos-based distro, which is frankly still ahead of us code-wise until Nexenta 4.0, when we move to an illumos core from our current OpenSolaris-based core) in the form of "sys_zfs_resilver_delay", which sets the number of ticks to delay resilver, and "sys_zfs_resilver_min_time", which tunes the minimum number of milliseconds to spend resilvering per ZFS txg commit.
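
For anyone not driving this through NMC, I believe the underlying illumos tunables are zfs_resilver_delay and zfs_resilver_min_time_ms, which you can inspect or poke directly -- a rough sketch, and only do this if you understand the trade-off between resilver speed and foreground I/O:

  echo zfs_resilver_delay/D | mdb -k         # ticks to delay resilver I/O
  echo zfs_resilver_min_time_ms/D | mdb -k   # min ms spent resilvering per txg
  echo zfs_resilver_delay/W0t0 | mdb -kw     # example: make resilver more aggressive (until reboot)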

I don't know how much this may affect things, but I'd take a close look at a few things:

1) Every top ZFS developer I'm aware of has left Oracle.
2) Every top ZFS developer that left Oracle that I'm aware of now works for a company that is contributing code to illumos.
3) A number of the top ZFS developers I'm aware of are working on ZFS in part or full-time at their new companies; new features, bug fixes, and so on that they make -- it is all going into illumos, not Solaris, and I have no idea if Oracle is snagging that code back into Solaris, but I suspect not.

@apnar:

I had the joy of dealing with NetApps at my last job -- I still remember the actual feeling of excitement I had when I was shipped off to get training and certification on NetApp. To this day, I have great respect for NetApp (and EMC) and what they do, and one thing I think it is safe to say is that their product is, and has been for a long time, extremely stable, which to me is a big deal as an administrator. Your and Mtnduey's advice on pushing the reseller or going NetApp direct (we were direct) is all sound.

That said, I seriously disagree with your sentence, "You basically get most all of the ZFS goodness plus some extras in a very nice easy to maintain package.". That's actually sort of the reverse of the truth, my friend. I somehow suspect NetApp didn't go litigate Sun over ZFS because it wasn't a threat, and they were just bored that day. To suggest you get "most all" of the ZFS goodness from WAFL is a gross misrepresentation of the two technologies. There's not only already a decent bit of goodness in ZFS that you won't find in WAFL, there's far more on the horizon for ZFS than I'm aware of for WAFL. And likely at a faster clip, too, would be my guess.

And no matter how hard you push NetApp on pricing, and how far they drop their shorts for you, the fact remains they are going to charge you for every protocol you want that filer to talk past the first one, they are going to charge you for almost every extra software feature, etc. I'm not aware of any ZFS storage appliance vendors who charge cash for protocol support -- they're all flat rate to my knowledge. Nexenta, for example, only charges extra for an HA plugin (which has nothing directly to do with ZFS) -- CIFS, NFS, iSCSI, FC, and so on are all simply included, as are snapshots, clones, deduplication, compression, the ability to use SSD or RAM-based log and cache devices, etc. And perhaps more importantly, NetApp is going to lock you to them. Period. You put a couple PB of data on NetApp and then try to move those disks into an EMC if your NetApp relationship sours, and let me know how that goes for you. :)

If you decide you hate Nexenta, fine, the hardware you're running it on isn't ours in the first place (which is part of why it was almost inevitably less expensive to build out than an equivalent EMC or NetApp array, despite NetApp arrays really just being an x86-based server with a few extra customizations anyway), so just install OpenIndiana or whatever else that comes along (heck, FreeBSD). If you later decide you love Nexenta again, install it back.

Or move the disks into a new system running a different OS that supports ZFS. This is all doable -- completely OK -- and completely impossible with EMC or NetApp. One of Nexenta's primary goals as a company is to break open the old model of vendor lock-in storage, as was part of the credo for ZFS and the 'Open Storage' movement Sun began years ago. This is a goal I personally share. :)
 
@Mtnduey:

Be careful with OpenFiler. In my opinion, the technologies underlying it are crap. And that pains me greatly to say, you understand -- I didn't spend the last decade of my life as a Solaris admin. I spent it in the webhosting industry (ServerBeach, Site5, ThePlanet, etc.) as a Linux guy. Linux is still my first love, and bluntly, I still prefer it to illumos-based distributions for everything but storage stuff. But for storage stuff? Wow.

I've lost 10+ TB of data thanks to XFS "design decisions". I've had arrays utilizing lvm2 keel over because we dared to take snapshots on heavily loaded systems, even momentarily, and the resulting performance drop was enough to tip the scales. I've had to fsck ext filesystems so large it took a week to complete. I've used large Linux systems as SAN/NAS's, numbering in the tens of TBs per system. Sometimes they were even backed by solid storage (EMC, for example), but Linux was providing the filesystems.

No, sir, I say. No. I used to do it, and didn't even realize there was a better way. Now? My strong opinion on this subject is do not trust any _storage_ system that relies on XFS, nor on ext2/3/4, nor on lvm for anything production or enterprise. Your best case scenario is it works, but far more poorly than a ZFS solution would have on much the same hardware. That's your best case scenario. And you're going to be lucky to get it. Your worst case scenarios are all disasters, and far more likely to occur on those technologies than on any ZFS-based storage system.
 
That said, I seriously disagree with your sentence, "You basically get most all of the ZFS goodness plus some extras in a very nice easy to maintain package.". That's actually sort of the reverse of the truth, my friend. I somehow suspect NetApp didn't go litigate Sun over ZFS because it wasn't a threat, and they were just bored that day. To suggest you get "most all" of the ZFS goodness from WAFL is a gross misrepresentation of the two technologies. There's not only already a decent bit of goodness in ZFS that you won't find in WAFL, there's far more on the horizon for ZFS than I'm aware of for WAFL. And likely at a faster clip, too, would be my guess.

Nex7, I think you read a little more into my comment than I intended. I was just making a high-level feature comparison: things like snapshots, cloning, dedupe, mirroring to another array including snapshot history, etc. In comparison to classical block-level storage I'd say they're pretty similar. If anything, the litigation bolsters the argument of their similarities.

Regardless, I'm a huge ZFS proponent and run it myself in multiple locations. I've also run Nexenta Core in the past, was happy with it, and I look forward to trying it again when you guys finish rebasing. If you look back in this thread you'll see that I even pointed the poster to the Nexenta HA info page. I am very much a proponent of the open storage approach you mention with ZFS (I'm on my 3rd different OS at home on top of my ZFS pool). My only hope is that all the various groups working to move illumos (and ZFS) forward can work together and coordinate (not too encouraging at the moment with the various packaging directions being taken). With all of them working together I can easily see ZFS surpassing just about every other storage technology out there.

As an aside, I think you're adding a ton of great info to this thread and I hope you continue to contribute.
 
@apnar: Haha, no worries, I apologize if I got a little overzealous in there.. proposing ZFS over legacy arrays is something I do (and often have to fight about) weekly (and I'm not even in sales!), so it just comes out naturally.

I appreciate the positive comments from all, and I do intend to continue monitoring this thread. I thought about poking into the other ZFS-related threads, but the 100+ page one is a little overwhelming; I think diving into that one may have to wait until I'm on my 2nd or 3rd glass tonight.. :)
 
I wish the legacy hardware I was told to use had been on Nexenta's HCL when I was working out the details of the HA cluster with you guys.
 
Can't believe I completely missed this thread.

Lots of valuable ZFS info.

Thx to @NexSeven for his posts, both here & on his blog
 
Thanks Nex7! Great info. I'm a ZFS beginner, but I feel like I'm getting a better understanding of the 'root' of it.

I'm a bit overwhelmed, but I'm still going to attempt a backup server with ZFS...
 
Just a little input to return all the help from the others: in order to use the maximum RAM, you need 2 CPUs. The R710 is actually a nice solution, but it's LOUD. Does your company already have a 10Gb switch?


I'm putting together a proposal to replace our NetApp. My company has complained about the cost of disk shelves and about performance. I would prefer to use a ZFS solution. The storage will mainly be used for an ESXi cluster.

The company prefers to buy Dell equipment so that is what I have to work with. Please tell me if you think this configuration is viable or any suggestions. Thanks.

What I'm thinking is:
Server: Dell PowerEdge R710
>>>>> Single 6-core Xeon processor
>>>>> Max RAM (I think 128GB)
LSI SAS 9201-16e HBA adapter
2 QLogic 8Gb fiber HBA cards
Intel dual 10Gb copper NIC card
Local hard drives to boot Solaris and possibly SSDs or PCIe card for cache.

Storage Shelves:
As many PowerVault MD1220s filled with 24 1TB 2.5" drives as we need.
 
One more piece of input. Yes, NetApp nickel-and-dimes you on every software option (although you can push a VAR hard to get the cost down), but you are paying for astonishingly good hardware, support, and availability. I've rolled disk firmware updates out live across hundreds of spindles at once without even a pang of nervousness... it's something special to see an array simply off-lining, upgrading, then online-ing spindles one at a time in live, high-throughput production volumes. I've had to take down cluster nodes actively serving 6-figure sustained IOPS across multiple protocols without anything going down. The value proposition really shows up when you move to an HA cluster, and this is something that will be very time-intensive to replicate. And support has to be factored in; it's great that you're reading up on this forum, but "management" will be looking for a system that can be supported by someone off the street making a phone call to vendor support in a pinch, should you leave someday.

Lastly, make sure your storage subsystem is supported by VMware. They will drop you like a hot potato if you have a storage issue and they see that you are running something not on their official HCL.
 
No, we don't have any 10Gb switches yet. They strongly prefer Cisco over anyone else, but it has been hard to find a solution with many SFP+ ports. Brocade and other companies seem to have much better offerings for 24-port 10Gb SFP+ solutions.
 
"No, we don't have any 10Gb switches yet. They strongly prefer Cisco over anyone else, but it has been hard to find a solution with many SFP+ ports. Brocade and other companies seem to have much better offerings for 24-port 10Gb SFP+ solutions."

Check out Cisco Nexus 5500 series
 
I have read that NetApp just uses off-the-shelf commodity hardware. I think it was in here:
http://www.theregister.co.uk/2011/09/22/vmworld_hol_sakac/

I'm guessing that is referring to base components like hard disks, CPUs, memory, FC and SAS controllers, etc. The cases for their filers and shelves sure seem custom, as do the logic boards in each. I'd also assume that even for off-the-shelf components they often have custom firmware.
 
"I'm guessing that is referring to base components like hard disks, CPUs, memory, FC and SAS controllers, etc."
That is my guess too. Bottom line, NetApp also uses commodity hardware, not some tailor-made super stuff only for them.
 
Yeah, I don't think so. The controllers have on-board integrated Li-Ion packs to battery-back the RAM banks. All the quad 6Gb SAS cards are pretty custom. The basic architecture is commodity -- you've got Xeon processors, PCIe, and the like, but there is some pretty significant custom engineering going on. Don't forget things like the service processors, instrumented PSUs, and flash modules, either. I've opened up quite a few controllers and there is stuff in there you won't find in your local PowerEdge. And let's face it, if you are anything more than a beginner at this sort of thing, you can negotiate the controllers to what is probably not significantly over their cost. NetApp makes its money on the software. I'll definitely vouch for the hardware being more over-engineered than your typical Dell, ProLiant, etc. boxes.
 