ZFS based storage

titusc

n00b
Joined
Feb 26, 2014
Messages
22
Hi, I need some guidance for putting together a storage unit to serve 2-3 ESXi hosts via iSCSI and act as a large store via NFS.

I knew next to nothing about storage and have been researching a lot for the past few weeks. Here are the requirements:
1) Pretty much sure it has to be ZFS based because I want the native checksumming, raidz, scalability, compression, etc. I don't think Btrfs is production ready, or that it's here to stay.
2) Good enough to run the servers via iSCSI mounts so I don't get issues with ESXi or timeouts, etc., but it doesn't have to handle huge IOPS. If we use this forum's traffic as a benchmark, we are good.
3) The majority of the space will be for NFS, shared across the VMs.
4) Want to have active-active HA via Cluster in a Box (i.e. 2 nodes in a 1U unit).
5) Start with 8 x 4TB in raidz2 in a 24-bay 2U JBOD and grow.


Looked at the following:
Nexenta
- Too expensive because they charge per storage space unit.

Eon NAS
- Doubtful of their performance because, like most of these pre-built devices, they have a puny amount of RAM available (32GB for their high-end units).

FreeNAS
- Don't like the fact there is no ability to do global hot spare.

TrueNAS
- Ridiculous money they are charging.

Napp-It
- The first time I looked I wasn't able to find many English pages. The second time I looked I found some, but I'm not sure how widely it's deployed and tested.

ZFS on Linux
- Read it's production ready now, but some features available in the Solaris and FreeBSD variants are still not available on Linux? And I think I read that some people ran into issues with it as well?

My background:
- Extremely familiar with Linux via CLI but not so much with Solaris or FreeBSD.
- No experience building servers myself, so I wouldn't know things like whether it is possible to plug in hard drives in advance and power them up via IPMI. I need to do some more research on this, but if someone knows, please let me know. Also, I wouldn't know whether I should buy Dell or Super Micro. What are the pros and cons?

If there is anyone who has built scalable storage for servers and can give me some steer, comments, etc., it's much appreciated.
 
ZFS on Linux
- Read it's production ready now, but some features available in the Solaris and FreeBSD variants are still not available on Linux? And I think I read that some people ran into issues with it as well?

Last month I moved ~35TB of my work's ~70TB of storage on my Linux servers from btrfs on top of mdadm raid6 to raidz2 or raidz3 (3 servers total). So far so good. I am doing a similar migration at home, but there I do not use raid (each disk is a pool) but back up data to other disks instead. At work I use a tape archive for backups.
 
Power them up via IPMI?

Do you need cold spares that are plugged in? You'd need a special backplane with power control to do that if so. If you just want to toggle the drive controller, SATA hot plug will do that just fine attached to a SAS or AHCI SATA controller. Were you looking to do L2ARC? What kind of ZIL?

How are you going to do traffic separation/aggregation? Do you have dedicated cards for iSCSI or do you want to aggregate network and disk traffic on a single link? Are they iSCSI HBAs or ethernet cards with software iSCSI implementations?
 
Well, it looks like you've ruled out just about everything. However, rest assured, Napp-It has been used a lot--just look at the forums, or the thread about it. There's tons of stuff. I believe Gea has said it's used quite a bit in enterprise, but don't quote me on it.

My money's on native FreeBSD, but I've used it a long time. I have installed Napp-It, and it was super easy. I think at the time I used OpenIndiana. I have no Solaris experience whatsoever, and it was incredibly easy with Gea's documents to be up and running in pretty much minutes with a test system. ETA: OpenIndiana's documentation originally pointed (mostly) to Oracle's. For the most part, I believe a lot of the documentation is useful for any Solaris-based system, and I've even used their documentation for general ZFS help. It's very good, and they have a lot of it.

I've never used IPMI, so can't help there. :(
 
I've never used IPMI, so can't help there.

I was actually saved (from having to drive 20 miles to work at midnight) by IPMI on Sunday when I remotely rebooted a server (to fix a stale NFS lock) only to have it get stuck at the GRUB prompt. Although I guess I would never have tried the reboot without it. Anyway, after virtually mounting a sysrescue ISO in the iKVM I was able to fix the problem with GRUB after a few forced reboots via IPMI.

No experience building servers myself so I wouldn't know things like whether it is possible to plug in hard drives in advance and power them up via IPMI.

Hmm. My IPMI (via Supermicro motherboard) allows me to power on, power off or reset the system + adds an iKVM and a few other features like access to the system temps and voltage sensors in the IPMI web interface. I have no control of drive power there. However, again, this is what I get from a Supermicro Xeon server board.
 
I was actually saved (having to drive 20 miles to work at midnight) by IPMI on Sunday when I remotely rebooted a server (to fix a stale nfs lock) only to have it stuck at the grub prompt. Although I guess I would have never tried the reboot without it.. Anyways after virtually mounting a sysrescue iso in the iKVM I was able to fix the problem with GRUB after a few forced reboots via IPMI.

When my server dies, I will probably make sure the next one has IPMI. I've even thought about it for a firewall upgrade. I think you can do some of the same things with vPro, but I've never actually used it. It was just becoming available at my last job where I had a lot of access to that kind of stuff, but that was years ago now.
 
remember napp-it is really just a web-gui / management tool, not a distro. I've got several storage boxes on omnios now (with napp-it), I'm pretty happy with those. Very lean, no desktop stuff, built just for being a server. Most things I just do from the CLI, every now and then I'll use napp-it to configure something I'm not as familiar with.
 
drescherjm said:
Last month I moved ~35TB of my work's ~70TB of storage on my Linux servers from btrfs on top of mdadm raid6 to raidz2 or raidz3 (3 servers total). So far so good. I am doing a similar migration at home, but there I do not use raid (each disk is a pool) but back up data to other disks instead. At work I use a tape archive for backups.
Can you tell me a bit more about your setup at both home and work? What hardware and software do you use and what is the topology like? How long have you been running the Linux based ZFS at work?



FrankD400 said:
Power them up via IPMI?

Do you need cold spares that are plugged in? You'd need a special backplane with power control to do that if so. If you just want to toggle the drive controller, SATA hot plug will do that just fine attached to a SAS or AHCI SATA controller. Were you looking to do L2ARC? What kind of ZIL?

How are you going to do traffic separation/aggregation? Do you have dedicated cards for iSCSI or do you want to aggregate network and disk traffic on a single link? Are they iSCSI HBAs or ethernet cards with software iSCSI implementations?
Yes, I want the ability to place another pool of 8 disks in the JBOD so that in case I run out of space I can activate them remotely. However, until the existing group of disks is running out of space I don't want to power them up, to prolong their life and save power. So I guess cold spare is the correct terminology. Can you tell me what type of backplane I need? Is there a name for this? Not following you on the SATA hot plug.

In terms of L2ARC and ZIL, I think if I use 2 SSDs in RAID 1 mode for the ZIL it's called a SLOG, and it'd help if data is being written to the storage and suddenly power is lost? I don't think I'd need L2ARC for read cache. My understanding of ZFS is that I need 1GB of RAM per 1TB of storage. So if I want to start with 8 x 4TB disks in raidz2 and grow to 24 x 4TB eventually, I need 32GB of RAM to start and 96GB of RAM eventually. Would having a ZIL and L2ARC soften this requirement?
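From what I've read so far, adding those later would be something roughly like this (the pool and device names below are placeholders I made up, so please correct me if I've got it wrong):
zpool add tank log mirror c1t5d0 c1t6d0   # mirrored SSD pair acting as the SLOG for sync writes
zpool add tank cache c1t7d0               # an L2ARC device, if it ever turns out to be needed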

Good question about the traffic separation / aggregation topic. I never thought of that! I guess I'd have everything aggregated through a 1Gbps NIC. Any concerns for this? What would you suggest?



iroc409 said:
My money's on native FreeBSD, but I've used it a long time. I have installed Napp-It, and it was super easy. I think at the time I used OpenIndiana. I have no Solaris experience whatsoever, and it was incredibly easy with Gea's documents to be up and running in pretty much minutes with a test system. ETA: OpenIndiana's documentation originally pointed (mostly) to Oracle's. For the most part, I believe a lot of the documentation is useful for any Solaris-based system, and I've even used their documentation for general ZFS help. It's very good, and they have a lot of it.
Any reason why you switched from OpenIndiana to FreeBSD? Given Oracle Solaris is free to download, and they have good documentation, any reason not to use Oracle Solaris instead?



drescherjm said:
Hmm. My IPMI (via Supermicro motherboard) allows me to power on, power off or reset the system + adds an iKVM and a few other features like access to the system temps and voltage sensors in the IPMI web interface. I have no control of drive power there. However, again, this is what I get from a Supermicro Xeon server board.
Is it correct to say IPMI is like what HP's OpenView is for? So OpenView is what HP uses and they don't use IPMI? Or does OpenView have features in addition to IPMI?



dave99 said:
remember napp-it is really just a web-gui / management tool, not a distro. I've got several storage boxes on omnios now (with napp-it), I'm pretty happy with those. Very lean, no desktop stuff, built just for being a server. Most things I just do from the CLI, every now and then I'll use napp-it to configure something I'm not as familiar with.
Can you tell me why you selected OmniOS over other OSes like FreeBSD or OpenIndiana? Can you share with us your topology and the size of your storage?



Just browsing online I came across a product called Open-E. It appears to be a very good and comprehensive storage solution. They even have an active-active feature available and prices appear to be affordable. If they used ZFS rather than XFS, and supported hot-swap disks (I don't know why they didn't implement this feature), I'd give it serious consideration.



Also, any advice on why I'd choose to go with Dell or HP or SuperMicro? I saw some nice JBODs and CiB units from DataOn but they never replied to my mails so I gave up on them.



Finally, is there a way to try out OpenIndiana or FreeBSD or ZFS on Linux without having a few physical disks and a physical HBA available? I have just bought VMware Workstation but have read I need to have physical devices and operate them in pass-through mode? Since I'm still experimenting I don't even know what hardware I'm going to buy!
 
No you don't need actual disks :) You can just create, say, 2 32GB virtual disks, create a smaller one to install the zfs distro on, and once it's up and running, tell it to create, say, a raid1 on the two 32GB disks and play around...
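For example, once the guest is up it's only a couple of commands to get a pool to play with (the device names below are illumos-style placeholders - on FreeBSD/Linux they'd be da1/da2 or sdb/sdc):
zpool create testpool mirror c2t1d0 c2t2d0     # mirror the two 32GB virtual disks
zfs create -o compression=lz4 testpool/stuff   # carve out a dataset and try compression
zpool status testpool                          # both halves of the mirror should show ONLINE
zpool scrub testpool                           # exercise the checksumming/self-healing
Then detach a disk, re-attach it, watch it resilver, etc. - all without buying any hardware.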
 
I use omni simply because I've used zfs since around 2008 when it was only really available in opensolaris, and I wanted to stick with what I know. Openindiana works fine also, but it includes a desktop/gui, and I don't need or want that on my servers. I use zfs primarily as backup servers at some client sites and then 1 primary system at my office that archives all client data that is rsync'd each night, variety of systems, raw space: 20tb, 10tb, 4tb, 4tb. The last 3 are a variety of HP microservers, my main system is a whitebox.

While oracle solaris is free to download, it's not technically free to use, it's supposed to be for eval only. Unless you can afford to pay for ongoing support, I wouldn't go that direction. zfs is forked now, you can't switch between oracle and illumos (omnios, openindiana, eon etc) anymore.
 
remember napp-it is really just a web-gui / management tool, not a distro. I've got several storage boxes on omnios now (with napp-it), I'm pretty happy with those. Very lean, no desktop stuff, built just for being a server. Most things I just do from the CLI, every now and then I'll use napp-it to configure something I'm not as familiar with.

I want to highlight this comment. Napp-it is not a distro. It is a frontend. It can only do what the system it is installed on can do. As dave99 suggested, OmniOS is pretty solid. Give it (and Napp-it) a try.
 
Greetings

While oracle solaris is free to download, it's not technically free to use, it's supposed to be for eval only. Unless you can afford to pay for ongoing support, I wouldn't go that direction. zfs is forked now, you can't switch between oracle and illumos (omnios, openindiana, eon etc) anymore.

There's nothing stopping you from creating your Solaris pool at a common ZFS version, say version 28, up front; the default action when you create a pool is to use the highest current version, which in Solaris 11.1 is version 34. This would give you the ability to move the pool later if you so desire, while still letting you use all the other benefits of the latest Solaris OS. This will only continue to apply provided you do not upgrade the ZFS version in whatever OS you are using at that time, as otherwise you will become locked in.
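For example, something along these lines on Solaris 11 (the pool name and disk names are only placeholders):
zpool create -o version=28 tank raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0   # pin the pool to the last widely shared on-disk version
zpool get version tank                                                # confirm it reports 28 rather than the Solaris 11.1 default of 34
Just never run zpool upgrade on that pool afterwards, or you will be locked in again.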

Cheers
 
No you don't need actual disks :) You can just create, say, 2 32GB virtual disks, create a smaller one to install the zfs distro on, and once it's up and running, tell it to create, say, a raid1 on the two 32GB disks and play around...
Ah, I didn't know that! I had always thought I must have a proper HBA and disks and use pass-through mode in VMware in order to try out this software. Let me try it over the weekend.



dave99 said:
I use omni simply because I've used zfs since around 2008 when it was only really available in opensolaris, and I wanted to stick with what I know. Openindiana works fine also, but it includes a desktop/gui, and I don't need or want that on my servers. I use zfs primarily as backup servers at some client sites and then 1 primary system at my office that archives all client data that is rsync'd each night, variety of systems, raw space: 20tb, 10tb, 4tb, 4tb. The last 3 are a variety of HP microservers, my main system is a whitebox.

While oracle solaris is free to download, it's not technically free to use, it's supposed to be for eval only. Unless you can afford to pay for ongoing support, I wouldn't go that direction. zfs is forked now, you can't switch between oracle and illumos (omnios, openindiana, eon etc) anymore.
Okay, I thought most of these OSes would have a front end. Speaking of which, wouldn't these native front ends have some sort of storage management functions? I use CentOS and it does have a GUI for its LVM.

I notice you mostly use it for backups and archives and the sizes aren't really large. I'm looking to scale to hundreds of TBs eventually as primary storage. Do we know if any of this software has issues with that?

Just checked with Oracle. Their "Oracle Solaris Premier Subscription for Non-Oracle Hardware (1-4 socket server)" support is USD $1,000 / socket. There is no software cost mentioned. Their cluster edition does have a separate software cost of USD $1,050 as well. Does this mean that if I run a single node, it'd be free and I only pay for support if and when I need it?

Just checked Napp-It. It says
Commercial OS-Support is available for:
- OmniOS (Omniti)
- Solaris 11 (Oracle)

So ... curious why these 2 OSes. Any idea what the cost of commercial support for Napp-It is like?



fields_g said:
I want to highlight this comment. Napp-it is not a distro. It is a frontend. It can only do what the system it is installed on can do. As dave99 suggested, OmniOS is pretty solid. Give it (and Napp-it) a try.
Thanks yes I know. I have never thought of it as a separate OS.
 
Nexenta
- Too expensive because they charge per storage space unit.
they charge per managed raw TB. idk what exactly you're trying to build and how critical it is, i can tell you nexenta's support engineers kick ass.
Napp-It
- The first time I looked I wasn't able to find many English pages. The second time I looked I found some, but I'm not sure how widely it's deployed and tested.
this is a great management utility to learn with. If you're staffed correctly and understand what it really means to go it alone then you can use napp-it on top of OmniOS very happily in production environments.
ZFS on Linux
- Read it's production ready now, but some features available in the Solaris and FreeBSD variants are still not available on Linux? And I think I read that some people ran into issues with it as well?
the thing with ZoL is that first, foremost, and primarily (for now) the focus is on iscsi. there probably won't be much focus on NAS performance/stability for awhile.

- No experience building servers myself
as long as you understand hardware you should be ok. if you went the nexenta route you can work with their partners on the building of the system you want.

whether it is possible to plug in hard drives in advance and power them up via IPMI.
IPMI no. uhmm ... hmm you can power down a drive if it isn't part of any pools but not with (not directly with) IPMI. You would need to know commands I don't know (possibly sesctl or smartmon?) or a third party toolset like SANTools. If the drive is part of a pool though powering it down will cause a fault. Not entirely sure what would happen if you power off a spare that is in a pool as a spare ... might be fun to test. I will almost guarantee you will have some sort of fault/failure if you lose a drive and then try to activate a powered off pool spare ... kinda curious what happens there.

generally speaking though outside of a SOHO environment powering down a drive that costs so little to power 24/7 isn't worth the hassle. if you're amazon and building out glacier .. ok saving a few pennies per month per drive adds up when you're talking 10K drives.

Also, I wouldn't know whether I should buy Dell or Super Micro. What are the pros and cons?
i prefer dell, they were the first server i worked with that had really good integration with linux (firmware updates from yum was sexy).

nothing 'wrong' with supermicro per se. again though, if you're going to white box but have no experience managing firmware etc in a production environment you should look into whats involved in that. SM support is good, not great but good. their RMA process is also good.

Dell JBODs destroy supermicro IMO. I've had a few SM dense JBODs/servers and it is really frustrating having 5 of the 36 drive slots that are just a huge PITA to work with. I had one where I had to grab vice grips because the damn caddy just refused to come out.

Haven't had a single issue with the serviceability of dell's JBODs.

you should quote out a SM build though and use it to beat dell across the face over price. dell WILL get REALLY close to matching SM prices you just need to properly motivate them to do so :)
If there is anyone who has built scalable storage for servers and can give me some steer, comments, etc., it's much appreciated.
i have 5 nexenta SANs in production and building out 4 more. not going to go into the exact design in public.

advice i would give:

2 cables to each JBOD, always. 48gbps of bandwidth to a JBOD is better than 24 and redundancy etc.

whatever you're thinking of building, consider building 2 or 3 smaller units to attain whatever goal you're attempting to hit. horizontal is almost always superior to vertical here and the larger the system the more can and will go wrong with it.

while zfs can use 'X' GB of ram i would wait (i am in fact waiting) for a few things to be resolved before i ever go above 128GB again.

L2arc ... is ALWAYS inferior to ARC. Meaning, build some sort of inexpensive dev system. run some workloads against it and see what your ARC stats are. if you're 85%+ on arc cache hits spending a few thousand on SSDs which may only get you another 10% or so on cache hits really isn't worth it.
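e.g. on an illumos based box you can eyeball the hit rate straight from the arcstats kstats (on ZoL the same counters live in /proc/spl/kstat/zfs/arcstats):
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | awk '{v[NR]=$2} END {printf "%.1f%% ARC hit rate\n", v[1]/(v[1]+v[2])*100}'
run that while a representative workload is hammering the dev box and you'll know whether l2arc is even worth discussing.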

with PCI-e 3.0, multiple 10gbps NICs, and multiple HBAs you will run into IRQ limitations (when was the last time you worked with IRQs?). as an example you can fit 2 x 10gig cards, and 3 x LSI 9206-16E (6 total HBA chips) in a dual CPU Dell R720. Trying to add a third NIC or 4th HBA = cannot bind to an IRQ. As I discovered, these cards require their own IRQ, can't share IRQs, and there are only so many to go around. also the HBAs were somewhat finicky depending on which PCIe slots they were in. Rock solid once you get all that figured out.

Just mentioning this as a follow up to the 'build more and smaller' i mentioned above. there are trade offs and limitations hardware wise that you can get around, sometimes, but there is always a trade off. just more reason why you should think smaller and more numerous, IMO.
 
the thing with ZoL is that first, foremost, and primarily (for now) the focus is on iscsi. there probably won't be much focus on NAS performance/stability for awhile.
This is not true. Zvols have never been the primary focus of ZoL. If you look at the last year of ZoL development activity the vast majority of performance and stability improvements are generic. If at all, zvols are the least mature part of ZoL.

whether it is possible to plug in hard drives in advance and power them up via IPMI
You can control the drive power states with hdparm.
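For example (sdX is a placeholder for whatever the drive enumerates as, and make sure it is not part of a pool first):
hdparm -C /dev/sdX       # report the current power state (active/idle/standby)
hdparm -y /dev/sdX       # put the drive into standby (spin it down) immediately
hdparm -S 242 /dev/sdX   # or have it spin down on its own after roughly an hour of idle time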
 
madrebel said:
they charge per managed raw TB. idk what exactly you're trying to build and how critical it is, i can tell you nexenta's support engineers kick ass.
Sorry, is there a difference between "managed raw TB" vs "TB"? If I manage it entirely on my own do I get to use the non-community edition software? I think they put a cap of 18TB on the community edition and if you want to go above that you have to pay for a license key. Just so I get an idea, how much data do you have in your pool and how much do they charge you?



madrebel said:
this is a great management utility to learn with. If you're staffed correctly and understand what it really means to go it alone then you can use napp-it on top of OmniOS very happily in production environments.
Yes, I think I can handle it. I was looking to buy an appliance like the QNAP ones but they only go up to 0.5PB vertically because they use EXT4. Even the ZFS ones like the Eon NAS claim to be able to grow by adding JBODs, but with only 32GB max per node I can't see how that'd work beyond 32TB of storage. Because these are out of the question, I need to do it myself. Happy to do that, I just didn't know I could play around in VMware Workstation. I thought I had to have physical HBAs and disks and enable pass-through, but danswartz corrected me above.



madrebel said:
the thing with ZoL is that first, foremost, and primarily (for now) the focus is on iscsi. there probably won't be much focus on NAS performance/stability for awhile.
This is sad because Linux is second nature to me. I guess it's still early days for them. But if I use a Solaris-based OS for the storage tier to offer iSCSI and my iSCSI initiator is Linux, so it's EXT4 based, this still means the iSCSI targets offered by the Solaris ZFS-based storage will be formatted as EXT4. Unless there's something I'm not aware of, this means the Oracle DB I'm running on top of CentOS 6.5 in the ESXi VMs will not get any ZFS features.



madrebel said:
you should quote out a SM build though and use it to beat dell across the face over price. dell WILL get REALLY close to matching SM prices you just need to properly motivate them to do so
The issue is with Dell I can quickly see how much it's going to cost. With Super Micro I don't think they have a configurator or anything. The thing I like about SM is they offer these 2-node-in-1U servers, so if I manage to get an active-active setup, I can use those.



madrebel said:
2 cables to each JBOD, always. 48gbps of bandwidth to a JBOD is better than 24 and redundancy etc.
You mean like how they show it in this PDF?
http://i.dell.com/sites/doccontent/...ts/en/Documents/SAS_Cabling_revised5_1_12.pdf



madrebel said:
whatever you're thinking of building, consider building 2 or 3 smaller units to attain whatever goal you're attempting to hit. horizontal is almost always superior to vertical here and the larger the system the more can and will go wrong with it.
You mean completely different storage units controlled by completely independent heads? Okay, I don't mind that, but if I am to build an application that looks for files at a common location such as \\nfs.mydomain.com\files, I now have to look across \\nfs1.mydomain.com\files, \\nfs2.mydomain.com\files, and \\nfs3.mydomain.com\files if I have 3 horizontal storage units?



madrebel said:
while zfs can use 'X' GB of ram i would wait (i am in fact waiting) for a few things to be resolved before i ever go above 128GB again.
Okay, now I'd really love to hear what sort of problems you ran into. And what I have to be wary of.



madrebel said:
L2arc ... is ALWAYS inferior to ARC. Meaning, build some sort of inexpensive dev system. run some workloads against it and see what your ARC stats are. if you're 85%+ on arc cache hits spending a few thousand on SSDs which may only get you another 10% or so on cache hits really isn't worth it.
Okay great. Was planning to use a SLOG but wasn't planning to use L2ARC.



madrebel said:
as an example you can fit 2 x 10gig cards, and 3 x LSI 9206-16E (6 total HBA chips) in a dual CPU Dell R720.
The LSI 9206-16E "supports up to 1024 SAS or SATA end devices" each. Why 3? Because you have 3 JBODs?
 
Regarding very large RAM for Arc use (>128 GB) and deadlocks

ARC RAM > 128 GB is not used very often, and as a common ZFS recommendation should be avoided or at least intensively evaluated prior to use.
Follow the discussions from the Illumos mailing lists about very large RAM:

http://www.listbox.com/member/archive/182191/2013/11/sort/time_rev/page/4/entry/19:173/
http://www.listbox.com/member/archi...0154148:F51B8FC4-C960-11E2-BEB6-DF1BADB14CBE/
http://www.listbox.com/member/archi...9180421:45E384F8-091B-11E3-9319-A79E4289FE01/

Illumos is working on that, so this may not be true forever.
For my own part I have no experience, as my servers do not use that much RAM.
 
Regarding very large RAM for Arc use (>128 GB) and deadlocks

ARC RAM > 128 GB is not used very often, and as a common ZFS recommendation should be avoided or at least intensively evaluated prior to use.
Follow the discussions from the Illumos mailing lists about very large RAM:

http://www.listbox.com/member/archive/182191/2013/11/sort/time_rev/page/4/entry/19:173/
http://www.listbox.com/member/archi...0154148:F51B8FC4-C960-11E2-BEB6-DF1BADB14CBE/
http://www.listbox.com/member/archi...9180421:45E384F8-091B-11E3-9319-A79E4289FE01/

Illumos is working on that, so this may not be true forever.
For my own part I have no experience, as my servers do not use that much RAM.
_Gea, thanks. I'll have a look at those links you provided. Out of curiosity, if we have issues above 128GB, then how can we scale above 128TB of storage based on the "1GB RAM per 1TB storage" rule? What is the largest deployment you have done, and how much RAM did you use in that case?
 
_Gea, thanks. I'll have a look at those links you provided. Out of curiosity, if we have issues above 128GB, then how can we scale above 128TB of storage based on the "1GB RAM per 1TB storage" rule? What is the largest deployment you have done, and how much RAM did you use in that case?

Well that "rule" is not a rule but a guideline and is not for ZFS in general but with ZFS using deduplication.
 
_Gea, thanks. I'll have a look at those links you provided. Out of curiosity, if we have issues above 128GB, then how can we scale above 128TB of storage based on the "1GB RAM per 1TB storage" rule? What is the largest deployment you have done, and how much RAM did you use in that case?

There is no such rule.
All RAM needs are related to the ARC cache hit rate that you want to reach to serve all or most reads from cache (besides dedup, where you need more). If data is not in ARC you fall back to pure disk/pool performance. A real need for RAM > 100 GB is very rare. I use 64 GB in my machines (OK, up to 50 TB, so not far away from 1 GB per 1 TB) but this is mostly because RAM is cheap and I prefer to add RAM instead of a slower L2ARC.
 
Okay, thanks. In that case I should be fine for the NFS file shares even if every read has to go to the disks. But for the iSCSI targets, would this be an issue? I don't know enough about iSCSI to tell if there are special requirements, say whether there is a maximum time within which the storage must respond or something. The iSCSI will be used for VMs and databases running in those VMs.

Just to be clear, I don't think I need an L2ARC for read cache. But I do want a mirrored SSD SLOG for synchronous writes in case of power failures. The RAM will therefore be entirely for ARC (i.e. not L2ARC and not ZIL) for the overall function of the storage. I do need compression on for the NFS share, however. I don't think I can have compression for the iSCSI targets because they will be for ESXi VMs. I don't think ESXi will create the VMDKs using ZFS. Or will it?
 
You are confused. iSCSI presents a zvol (or maybe a flat file) to ESXi as a LUN (read: block device). ESXi then formats it using whatever version of VMFS is relevant. It has no visibility into what underlies the block device (LUN) being presented. If the datastore is NFS, then ESXi uses NFS commands to the server, which operates on files etc. using ZFS - never visible to ESXi.
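On an illumos-based box using COMSTAR the whole chain looks roughly like this (the pool/zvol names are made up, and for real use you'd restrict access with host/target groups rather than an open view):
zfs create -s -o compression=lz4 -V 2T tank/esxi-lun0   # sparse zvol; compression and checksums happen here, below the LUN
stmfadm create-lu /dev/zvol/rdsk/tank/esxi-lun0         # register the zvol as a SCSI logical unit
stmfadm add-view <GUID printed by create-lu>            # make that LU visible to initiators
itadm create-target                                     # create the iSCSI target ESXi logs into; ESXi then formats the LUN with VMFS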
 
Also, depending on how often you back up, you may be able to get away with skipping an SLOG device and disabling sync. I do that because I do hourly replications and daily backups, so if there is a hard crash and one or more VMs get hosed, I can easily restore them. This is especially a concern when the datastore is NFS, since ESXi forces sync mode writes, which absolutely kills performance without a very good SLOG device.
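If you go that route it's one property per dataset and easy to flip back later (the dataset name here is just an example):
zfs set sync=disabled tank/nfs-datastore   # ESXi's forced sync writes get acknowledged from RAM; a crash can lose the last few seconds
zfs get sync tank/nfs-datastore            # verify it; set it back to 'standard' if/when you add a proper SLOG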
 
This is not true. Zvols have never been the primary focus of ZoL. If you look at the last year of ZoL development activity the vast majority of performance and stability improvements are generic. If at all, zvols are the least mature part of ZoL.
i've heard differently. LLNL is the force behind ZoL and they don't use NFS. I would be happy to be wrong but from the guys i've talked to working on it, NAS just isn't a concern yet.
You can control the drive power states with hdparm.
even if it is in a pool? fairly sure if the drive is a pool spare ZFS will keep pinging it every so often, waking it up.

Sorry, is there a difference between "managed raw TB" vs "TB"? If I manage it entirely on my own do I get to use the non-community edition software? I think they put a cap of 18TB on the community edition and if you want to go above that you have to pay for a license key. Just so I get an idea, how much data do you have in your pool and how much do they charge you?
managed TB in the nexenta world means simply how many raw TBs are behind the filers. my systems range from 240TB to 720TB.
Yes, I think I can handle it. I was looking to buy an appliance like the QNAP ones but they only go up to 0.5PB vertically because they use EXT4. Even the ZFS ones like the Eon NAS claim to be able to grow by adding JBODs, but with only 32GB max per node I can't see how that'd work beyond 32TB of storage. Because these are out of the question, I need to do it myself. Happy to do that, I just didn't know I could play around in VMware Workstation. I thought I had to have physical HBAs and disks and enable pass-through, but danswartz corrected me above.
just so you know, nexenta will not support HBAs passed through to a VM. they will support a VM that manages iscsi block targets. meaning, if you pass the HBA/drives directly to the VM you're going to run into more issues than if you passed iscsi blocks to the VM. many here have done it without issues, i believe gea_ even runs some all in one stuff in prod? you're not gea_ though :). the reason nexenta doesn't support it is that pass through like that still isn't production ready in their opinion. some cards don't like being passed through, some are finicky etc.

imo great way to learn but i wouldn't do prod with an all in one and passed through HBAs.
This is sad because Linux is second nature to me. I guess it's still early days for them. But if I use a Solaris-based OS for the storage tier to offer iSCSI and my iSCSI initiator is Linux, so it's EXT4 based, this still means the iSCSI targets offered by the Solaris ZFS-based storage will be formatted as EXT4. Unless there's something I'm not aware of, this means the Oracle DB I'm running on top of CentOS 6.5 in the ESXi VMs will not get any ZFS features.
incorrect. so, an iscsi target is more or less a virtual sector table right. you tell SCST or COMSTAR or whatever that you're exporting a 'container' that is 2TB. This container then is nothing, its just a file. when the client connects to whatever you're using as a scsi target the scsi target (based on access rules etc) then says oh, hi, i have this disk that you're allowed to access. here is the sector map/table etc so you know how to read and write to it.

at this point whatever OS you're using will partition and format the drive. if this happens to be an OS with ZFS then you format that iscsi block device with a ZFS file system.

on your filer be it nexenta/omni/etc its still ZFS on your side and you can still enable compression, dedup, disable atime etc etc on the zvol itself. whatever data is in that zvol though you can't read unless you mounted the zvol. if your user encrypted the FS on that zvol you won't ever see anything without the key.
The issue is with Dell I can quickly see how much it's going to cost. With Super Micro I don't think they have a configurator or anything. The thing I like about SM is they offer these 2-node-in-1U servers, so if I manage to get an active-active setup, I can use those.
i'll say it one last time. Dell will match SM prices (or get really damn close) if you make them.
sort of. that PDF shows a lot of JBOD to JBOD daisy chaining. don't daisy chain unless you have to. even if you have to for some reason i would prefer using SAS switches (and i fucking hate SAS switches). I greatly prefer direct connections between HBA/JBOD or switch/JBOD.
You mean completely different storage units controlled by completely independent heads? Okay, I don't mind that, but if I am to build an application that looks for files at a common location such as \\nfs.mydomain.com\files, I now have to look across \\nfs1.mydomain.com\files, \\nfs2.mydomain.com\files, and \\nfs3.mydomain.com\files if I have 3 horizontal storage units?
depends, in that exact example, probably yes. however if you're using VMs then you should instead be looking at 3 large datastores for VMs. these VMs are then set up however you like. if a single filer dies 2/3rds of your VMs are unaffected. don't focus on the literal how until you have to. the point is more baskets with a smaller percentage of eggs in each is a good thing when talking storage.
Okay, now I'd really love to hear what sort of problems you ran into. And what I have to be wary of.
the links gea_ provided cover most of it. basically the CPUs enter kind of a race condition and can hang the system for 30+ seconds reclaiming huge chunks of ram. i'm not 100% positive this occurs on AMD systems as those have significantly higher CPU to CPU bandwidth and lower latency. Intel QPI is great and all but a 2 CPU intel system smokes a 4 CPU intel system for ZFS. reason being the intel CPUs get caught in cross call hell managing memory and the QPI link above 2 CPUs just can't handle it.

For the record, 128GB of ram is a LOT of ram. i have big systems in a big data center and a fairly large vcloud environment plus colo etc. all 10gig etc etc. we aren't pushing enough yet to really tax these systems. point is, it takes a LOT to even use 128GB of ARC.
Okay great. Was planning to use a SLOG but wasn't planning to use L2ARC.
there are other issues with L2arc and high availability. kind of a bug but probably an oversight. basically the crux of the issue is on export, zfs used to tear down the entire arc table before export. this was yet another reason to avoid large memory however this behavior was resolved in nexenta 3.1.4. what they did was say 'fuck the arc table. we aren't going to use it so instead of worrying about purging all those entries just export the damn pool and we can clean up ram later'.

and this worked exceptionally well. pool exports used to take 10 minutes, they now take like 45s ... unless you have more than 2 l2arc devices. what happens is there are supposed to be more than 2 threads for this l2arc clean up but for some reason 2 of the 4 threads are busy doing other things, leaving only two threads to manage l2arc device clean up. if you have more than 2 devices the 3rd through X devices get stuck waiting for an available thread and once they get a thread, the 3rd through X l2arc devices are purged in a serial manner instead of in parallel like the first two devices. this can cause 2-5 minutes (or more with a lot of large l2arc devices) of failover time. not good, this is being resolved very soon though.

still, ram is always better than l2arc. l2arc is great for mostly static read data though.
The LSI 9206-16E "supports up to 1024 SAS or SATA end devices" each. Why 3? Because you have 3 JBODs?
i have 6 JBODs :). the 9206-16E is technically 2 x 9207-8Es on one PCIe card with a PCIe bridge chip. so 4 ports per card. 3 x 4 = 12. 6 JBODs x 2 connections each = 12.

with that setup you can get 48gbps of throughput to each JBOD. its fast :)

couple of reasons for doing it like this, redundancy and speed. the first is obvious: if you have two connections, one from each of two different HBAs, you have no single point of failure.

the second seems obvious but often times people forget something very important about mirror setups.

so, for my mirror setups i carefully create the vdevs using 1 drive in JBOD1 and 1 drive in JBOD2. again, keep in mind the physical back end means when i talk to that mirror vdev i am using two different physical HBA chips/paths to do it. in addition this involves 2 different PCIe slots as well, so my bandwidth throughout this whole example is as fast as it can be.

reason this is important for mirroring is say i have 1GB of data coming into me over the wire. in order to write that 1GB to disk i have to write 2GB worth of data. 1GB to one side of the mirror, and 1 GB to the other side. some call this mirror bloat. if you're building something that has to be screaming fast and don't account for write bloat it can slow your whole system down as you simply run out of bandwidth.

this isn't typically a concern until you reach a pretty good sized system however it is worth thinking about as it gets your brain focusing on the ins n outs of the physical build which imo is always useful.
 
Sorry, I have gone missing for a few days. I have been searching to see if there are alternatives to a commercial vendor like Nexenta, except without the price they charge. There appears to be another one available soon but pricing is still out of my budget. Strangely enough, the first time I came across ZFS I looked at the CLI on Oracle's site and thought it was straightforward. But I want a GUI with support and HA, blah blah blah, and that seems out of reach. Then with the open source crowd, there are a lot of choices and the more I dig around online the more confused I get because of things like should I go with OI, FreeBSD, or OmniOS, how to do HA, etc. Unless I spend a considerable amount of time understanding the pros and cons of each distro, I can't seem to make a decision. And it seems like Gea only supports specific distros for commercial support?

danswartz said:
You are confused. iSCSI presents a zvol (or maybe a flat file) to ESXi as a LUN (read: block device). ESXi then formats it using whatever version of VMFS is relevant. It has no visibility into what underlies the block device (LUN) being presented. If the datastore is NFS, then ESXi uses NFS commands to the server, which operates on files etc. using ZFS - never visible to ESXi.
Okay, I didn't think ESXi would use ZFS either. But even if it's a zvol exported over iSCSI and ESXi formats it using its proprietary VMFS, that block device iSCSI target is still ZFS based and will have raidz, compression, etc. available. Right?

madrebel said:
incorrect. so, an iscsi target is more or less a virtual sector table right. you tell SCST or COMSTAR or whatever that you're exporting a 'container' that is 2TB. This container then is nothing, its just a file. when the client connects to whatever you're using as a scsi target the scsi target (based on access rules etc) then says oh, hi, i have this disk that you're allowed to access. here is the sector map/table etc so you know how to read and write to it.

at this point whatever OS you're using will partition and format the drive. if this happens to be an OS with ZFS then you format that iscsi block device with a ZFS file system.

on your filer be it nexenta/omni/etc its still ZFS on your side and you can still enable compression, dedup, disable atime etc etc on the zvol itself. whatever data is in that zvol though you can't read unless you mounted the zvol. if your user encrypted the FS on that zvol you won't ever see anything without the key.
Okay, so you are confirming my suspicion. Even though it's a zvol formatted by the iSCSI initiator (an ESXi-powered VM in my case) using its own file system (i.e. VMFS), it'll still be running on top of raidz and compression.

madrebel said:
i'll say it one last time. Dell will match SM prices (or get really damn close) if you make them.
I hear you, but I don't think Dell has anything like SuperMicro's SBB. I want a 2-node-in-a-box solution with shared disk. SuperMicro has the 6037B-DE2R16L as a $3,500 barebone package in 3U. I can start with 16 x 4TB SATA disks and save on the cost of the JBOD, plus run all my application VMs alongside my storage VM in the same ESXi per node.

madrebel said:
the links gea_ provided cover most of it. basically the CPUs enter kind of a race condition and can hang the system for 30+ seconds reclaiming huge chunks of ram. i'm not 100% positive this occurs on AMD systems as those have significantly higher CPU to CPU bandwidth and lower latency. Intel QPI is great and all but a 2 CPU intel system smokes a 4 CPU intel system for ZFS. reason being the intel CPUs get caught in cross call hell managing memory and the QPI link above 2 CPUs just can't handle it.
Okay, so in my case I'm looking at a dual Xeon E5-2609 v2 setup per node in the SuperMicro SBB 6037B-DE2R16L. I guess I can start off by letting all VMs have access to all 8 cores from the 2 E5-2609s, and if it becomes a problem I can set affinity so the storage head gets only 4 vCores bound to the same physical CPU.

madrebel said:
For the record, 128GB of ram is a LOT of ram. i have big systems in a big data center and a fairly large vcloud environment plus colo etc. all 10gig etc etc. we aren't pushing enough yet to really tax these systems. point is, it takes a LOT to even use 128GB of ARC.
What is the largest ZFS deployment you have and how much RAM do you use for it just so I can have a feel for this?

madrebel said:
there are other issues with L2arc and high availability. kind of a bug but probably an oversight. basically the crux of the issue is on export, zfs used to tear down the entire arc table before export. this was yet another reason to avoid large memory however this behavior was resolved in nexenta 3.1.4. what they did was say 'fuck the arc table. we aren't going to use it so instead of worrying about purging all those entries just export the damn pool and we can clean up ram later'.
Okay, I'm going to want to use active-active. I don't know how yet but am counting on looking at http://zfs-create.blogspot.nl/ later when I have more time. But I am a bit lost here. If the node is dead (i.e. power failure, board fried, etc.), surely it can't clean up the ARC table and do any export, right? But at this point in time I don't know enough about HA for ZFS to know how it works underneath. Is the same SLOG or L2ARC even shared between the 2 nodes?
 
You mentioned earlier scaling to hundreds of terabytes, and above about active-active config. Frankly, trying to do a roll your own massive mission critical here with limited knowledge of zfs is a recipe for trouble. The smart answers here are:
1) Find a supported solution and pay for it (nexenta, oracle etc)
2) Find an appropriate employee who has a lot of experience with zfs and hire&pay him/her accordingly.
3) Spend a lot of time learning zfs yourself over many months, building/destroying/fixing/repairing/unplugging things etc with non-critical data until you really understand it.
 
If you're considering giving all your VMs 8 vCPUs, you're gonna have a bad time.
 
You mentioned earlier scaling to hundreds of terabytes, and above about active-active config. Frankly, trying to do a roll your own massive mission critical here with limited knowledge of zfs is a recipe for trouble. The smart answers here are:
1) Find a supported solution and pay for it (nexenta, oracle etc)
2) Find an appropriate employee who has a lot of experience with zfs and hire&pay him/her accordingly.
3) Spend a lot of time learning zfs yourself over many months, building/destroying/fixing/repairing/unplugging things etc with non-critical data until you really understand it.
My preference is to find a commercial solution with support. But either I stick with an EXT4 solution from QNAP or XFS from Open-E, or I go with the only commercial ZFS vendors now, Nexenta, InforTrend, etc., and they all charge silly money. While I can pick up any one of the Solaris derivatives and run the commands on my own, I want to understand the implications of any choices made, and yes, with lots of testing like what I typically do in order to have procedures precise enough that anyone can follow, which requires a lot more research than I originally anticipated (I have other development tasks to focus on). Anyway, I would in fact be relieved to know I can contract this out to someone who can do it, but at the same time I want a bit more understanding and control. If I'm already compiling applications and libraries from source for everything I install, I want to do similar for storage. I have in fact paid someone to do some consulting but found it to be insufficient. Having said that, is there anyone here who would be able to put together an active-active setup as a contract? I'm interested to know because if I end up not being able to do this for whatever reason, I want to know who I can contract it out to. There are quite a few experienced people here, so better to do this here than hiring someone blind.
 
If you're considering giving all your VMs 8 vCPUs, you're gonna have a bad time.
Okay, 8 is a bit extreme and there may be limitations. What sort of issues are you referring to? I'm running 4 VMs each with 1 vCore on a quad-core processor and figured it's better to overcommit a bit so at least we can multithread.
 
It is in no way better to overcommit. Never give your VMs more vCPUs than they need. I was at a seminar at some point where this was discussed in great detail by someone very knowledgeable about this stuff, but I've forgotten the specifics. The gist of it is, however, that your VMs will wait for each other to finish their workloads. I'll try to exemplify by using extremes:
VM1 and VM2 both have 2 vCPUs and your host has 1 dualcore CPU - 2 cores total.
If VM1 is running a workload that requires 1 core 100%, and VM2 starts a workload that requires 2 cores - it gets none. VM2 will have to wait for VM1 to finish its workload.

Obviously, this is an extreme that will never happen in real life, but giving all VMs access to all your CPU resources will cause havoc on your hypervisor. Start with one or two vCPUs per VM and if any of them get CPU starved - add another core.

I'm sure they can give you exact details in the virtualization forum.
 
It is in no way better to overcommit. Never give your VMs more vCPUs than they need. I was at a seminar at some point where this was discussed in great detail by someone very knowledgeable about this stuff, but I've forgotten the specifics. The gist of it is, however, that your VMs will wait for each other to finish their workloads. I'll try to exemplify by using extremes:
VM1 and VM2 both have 2 vCPUs and your host has 1 dualcore CPU - 2 cores total.
If VM1 is running a workload that requires 1 core 100%, and VM2 starts a workload that requires 2 cores - it gets none. VM2 will have to wait for VM1 to finish its workload.

Obviously, this is an extreme that will never happen in real life, but giving all VMs access to all your CPU resources will cause havoc on your hypervisor. Start with one or two vCPUs per VM and if any of them get CPU starved - add another core.

I'm sure they can give you exact details in the virtualization forum.
You are right. Now that you mention this I do recall having seen something like this before. How many cores would be sufficient for the storage heads?
 
Personally I use 4 vCPUs for my omnios VM, but only because I've actually tested it and this is how my particular setup works best at the moment. I would (and did) start out with 2 vCPUs and then check the ESXi performance monitor during peaks and take it from there.
 
Personally I use 4 vCPUs for my omnios VM, but only because I've actually tested it and this is how my particular setup works best at the moment. I would (and did) start out with 2 vCPUs and then check the ESXi performance monitor during peaks and take it from there.
Thanks. Just so I can have a perspective on how many cores I'll need, can you tell me what type of load you are seeing? Say IOPS or MB/s?
 
I've seen 30,000+ IOPS from ARC + 600-ish from disk, and this is where the CPU horsepower is really needed. Usually my workload is a LOT lower than this, though, and one or two cores would suffice and give my other VMs better performance, but I measure lower disk latencies when I assign the extra vCPUs to the omnios VM, even with the lower workload.
 