ZFS and SAS expanders with SATA drives a toxic combo?

I beg to differ; I just built a high-end ZFS server. If you read some of the recent white papers, specifically from last VMworld, Nexenta is showing off a lot of really high-end servers.

As for DDN, I know for a fact (the DDN guys told me this) that folks are fronting those arrays with ZFS because of DDN's front ends ... I don't think they do iSCSI. I forget exactly what the issue was, but when I was talking with them recently their front ends were lacking something that would have been extremely useful to me, and I think it was iSCSI.

I know of a few 10+ PB (one is 80PB) installations using Nexenta/ZFS. Very high-utilization installations too, btw.
 
Yeah, it was probably iSCSI. I actually know someone who was using iSCSI for their worldwide video surveillance network and replaced it with the DDN WOS products, which were much more efficient (estimated lifespan, speed, rackspace, power and cooling costs) than the iSCSI gear they were replacing.
 
The WOS thing is great, but it has a well-defined use case. We had no need for what WOS does well.

We were looking at the NAS scalers and they're great boxes, just expensive.
 
I don't know if this fits your budget or use scenarios, but have you looked at any equipment from Scale Computing?
 
The DDN box we are most likely going with has an obscene amount of bandwidth, which is exactly what we need for this application. But bandwidth is just one part of the equation. Throughput, IOPS and especially latency all come into account, depending on the application. Especially with internationally connected boxes, latency is the true killer for some applications and bandwidth (or lack thereof) is the true killer for others. This is as true for internally connected devices as for world-connected ones. Some companies and traders are spending billions to bring lower-latency connections between Japan, Europe and the US, because 50-100ms can mean the difference between making money and losing it. A recent article covers some of the basics: http://www.extremetech.com/extreme/...-cost-of-cutting-london-toyko-latency-by-60ms

An exchange, as an example, has differing needs based on the system. Their trading systems have high-IOPS/low-latency needs. Their global backup, DR and general loads have higher bandwidth needs. Like I said, no one thing is perfect for everything.
50-100ms? That is extremely slow. The fastest stock exchanges in the world today have a latency of 0.1ms. To achieve such low latency, the trader must use co-location (place the server next to the stock exchange's server); in other words, he only trades on that particular exchange and is probably a low-latency trader.

If you are a trader that is connected to several exchanges (with such a cable as in the link) then you are likely an arbitrage trader, not a low-latency trader.
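As a back-of-envelope sanity check on those numbers, here is a small sketch. The ~9,600 km great-circle distance and the roughly 2/3-of-c speed of light in fibre are assumptions for illustration; real cable routes are longer and add switching delay.

```python
# Rough propagation-delay estimate for a London-Tokyo path.
# Assumptions: ~9,600 km straight-line distance, light at ~66% of c in fibre.
C_KM_PER_S = 299_792        # speed of light in vacuum, km/s
FIBRE_FACTOR = 0.66         # approximate refractive-index penalty for glass
DISTANCE_KM = 9_600         # assumed great-circle distance; real routes are longer

one_way_ms = DISTANCE_KM / (C_KM_PER_S * FIBRE_FACTOR) * 1000
print(f"one-way: ~{one_way_ms:.0f} ms, round trip: ~{2 * one_way_ms:.0f} ms")
# ~49 ms one way / ~97 ms round trip before any switching or queueing,
# which is why intercontinental figures land in the 50-100 ms range,
# while 0.1 ms is only reachable by co-locating at the exchange itself.
```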
 
I don't agree with this... There are extremely HA and wicked-fast ZFS box options.
If you look at the Oracle ZFS benchmarks, there is no entry for high-end storage servers.

Do you know more about these servers you talk of, madrebel? Any links?
 
50-100ms? That is extremely slow. The fastest stock exchanges in the world today have a latency of 0.1ms. To achieve such low latency, the trader must use co-location (place the server next to the stock exchange's server); in other words, he only trades on that particular exchange and is probably a low-latency trader.

If you are a trader that is connected to several exchanges (with such a cable as in the link) then you are likely an arbitrage trader, not a low-latency trader.

50-100ms worldwide is SLOW? Are you kidding me? And did you actually read the article? The cable is specifically being constructed for high-frequency algorithmic stock trading (the very case you mentioned in post 159). Your local latency could be .0000001ms (by colo'ing a box as you suggest), but if you're in Japan, 190ms away, then that is how far the command and control is. They don't just set up a box and leave it to run on its own; regardless of the algo, they are continually in control and updating things. You have to look at end-to-end latency just as you do bandwidth.
 
Worldwide it is not slow, no. But traders concerned with low latency exclusively use co-location. This cable is for arbitrage traders.
 
Do you know more about these servers you talk of, madrebel? Any links?
http://www.slideshare.net/NexentaWebinarSeries/nexenta-at-vmworld-handson-lab

BTW, that is kind of a sandbag config. Those JBODs are not what you would use if performance is your main priority. Also, there is no reason they couldn't have run 3 HBAs in each head and dual-homed directly to each JBOD. Performance wouldn't have been that much better, but the whole stack would be more fault tolerant.

Those particular JBODs can only take SSDs in certain slots too, and further, the SSD count was really quite low.

There will likely be a white paper done on the system I just built; still waiting for all the components to hit the ground. It's going to be pretty stupid though. I'll link it here once all that is done.

BTW, the EMC solution failed during the show, which is why that Nexenta stack ran 8 of 12 labs for the duration of the show. The NetApp couldn't have handled the load.
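To make the dual-homing point concrete, here is a minimal sketch of how such a pool might be laid out. The device names (c1t*d* for one JBOD, c2t*d* for the other) and the pool name are hypothetical; the idea is simply that every vdev pairs a disk from each shelf, so a whole shelf can drop without taking the pool down.

```python
import subprocess

# Hypothetical device names: c1t*d* = JBOD 1, c2t*d* = JBOD 2.
# Each mirror spans both JBODs, so an entire shelf can fail and the pool survives.
cmd = [
    "zpool", "create", "labpool",
    "mirror", "c1t0d0", "c2t0d0",
    "mirror", "c1t1d0", "c2t1d0",
    "mirror", "c1t2d0", "c2t2d0",
    "log",   "mirror", "c1t8d0", "c2t8d0",   # mirrored SLOG devices (e.g. ZeusRAM)
    "cache", "c1t9d0", "c2t9d0",             # L2ARC SSDs, one in each shelf
]
subprocess.run(cmd, check=True)
subprocess.run(["zpool", "status", "labpool"], check=True)
```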
 
Wow. Interesting reading for us ZFS fans. And the NetApp and EMC gear cost $1.5 million, and the Nexenta server cost $0.32 million? Wow.

But the NetApp couldn't handle the load? That sounds a bit strange. NetApp has more high-end servers than ZFS does?

Please link it here when the final white paper arrives. :)
 
NetApp claims to have all kinds of high-end shit. They tend to fall down under load, though; WAFL does not scale to high IOPS well at all.

The reason Nexenta did so well is all the ARC. 144GB (18x8GB in each head) of cache does great things for latency. The ZeusRAM ZIL drives are also awesome.
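As a rough way to see what all that ARC is actually buying you, here is a small sketch that reads the ARC counters. It assumes an illumos/Solaris-based head (such as NexentaStor) where the stats are exposed via kstat as zfs:0:arcstats; on other platforms the counters live elsewhere.

```python
import subprocess

def arcstats():
    """Parse `kstat -p zfs:0:arcstats` output into a dict of counter values."""
    out = subprocess.check_output(["kstat", "-p", "zfs:0:arcstats"], text=True)
    stats = {}
    for line in out.splitlines():
        key, _, value = line.partition("\t")
        name = key.rsplit(":", 1)[-1]
        if value.strip().isdigit():          # skip non-numeric stats like "class"
            stats[name] = int(value.strip())
    return stats

s = arcstats()
hits, misses = s["hits"], s["misses"]
print("ARC size:   %.1f GiB" % (s["size"] / 2**30))
print("hit ratio:  %.1f%%" % (100.0 * hits / (hits + misses)))
print("L2ARC size: %.1f GiB" % (s.get("l2_size", 0) / 2**30))
```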
 
I'm not sure there are (m)any ZFS storage arrays which could really be classed as high end - not yet anyway!

There's more to a "high end" storage array than just being fast, though I suppose it depends on how you define "high end".
ZFS itself, while being a huge advance on traditional filesystems/volume managers, can't make storage high end on its own - there are other hardware and feature requirements.

As a value proposition, ZFS can be hard to ignore though - especially in the current economic climate!
 
I'm not sure there are (m)any ZFS storage arrays which could really be classed as high end - not yet anyway!
A lot more than you might think. They have something like 4500 customers now; granted, most of those are going to be small, but they have a lot of very large installations. Very large, very available, and very fast.

there are other hardware and feature requirements.
Like active/active HA controllers dual-homed to multiple JBODs? That is fairly trivial to accomplish.

You can also do scale-out namespace, similar to what EMC does with Isilon and what many other vendors do.

Stretch HA clusters can be done, but distance (latency) is critical.

Replication ... part of ZFS.

Nexenta doesn't lack much of anything, really, compared to the other enterprise storage vendors.
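On the replication point, it really is built in via snapshots and send/receive. Here is a minimal sketch of one incremental replication cycle; the dataset name, the "standby" host, and the snapshot labels are hypothetical, and a real setup would schedule this and rotate snapshots.

```python
import subprocess

DATASET = "tank/vmstore"          # hypothetical dataset
REMOTE  = "standby"               # hypothetical SSH-reachable partner head
prev, curr = f"{DATASET}@rep-1", f"{DATASET}@rep-2"

# Take the new snapshot, then stream only the delta since the previous one.
subprocess.run(["zfs", "snapshot", curr], check=True)
send = subprocess.Popen(["zfs", "send", "-i", prev, curr], stdout=subprocess.PIPE)
recv = subprocess.Popen(["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
                        stdin=send.stdout)
send.stdout.close()
recv.communicate()
if recv.returncode == 0 and send.wait() == 0:
    print("incremental replication of", DATASET, "complete")
```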
 
Like active/active HA controllers dual-homed to multiple JBODs? That is fairly trivial to accomplish.


Yes it is, but it's only the start!
For instance, under failover what happens to the ARC contents of the failed head?
What happens to the ARC on the surviving head?


Don't get me wrong, I'm not knocking Nexenta or ZFS - I'm just saying that IMO it's really up to midrange storage at the moment - it will improve/evolve, especially as it pushes upwards, but it's not quite there yet.
There are no black and white lines in determining what is high end and what isn't though - and especially in this economic climate, ZFS storage can make a lot of sense in many situations.
 
Yes it is, but it's only the start!
For instance, under failover what happens to the ARC contents of the failed head?
What happens to the ARC on the surviving head?
Is this a problem? The ARC will populate again, but it will take some time.


Don't get me wrong, I'm not knocking Nexenta or ZFS - I'm just saying that IMO it's really up to midrange storage at the moment
This is also my understanding: only up to midrange as of today, but in the future, high end. I would like to see companies selling Lustre + ZFS storage solutions; surely that would be classed as very high end?
 
Is this a problem? The ARC will populate again, but it will take some time.

Depends - the pool(s) moving over to the surviving head will not be cached at all at first, so you could have performance issues on those pools until the ARC gets warmed. There's also the issue that the ARC available to the pool(s) already on the surviving head will have to shrink in order to accommodate ARC space for the pool(s) coming over.

To a certain extent this can be true of high-end arrays too, but they often have more granularity of control over their caches than ZFS currently offers. This may change in the future of course... but we've been waiting for "block pointer rewrite" for years now, so who knows!
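For what it's worth, the granularity ZFS does offer today is per-dataset rather than per-block: the primarycache and secondarycache properties control whether a dataset uses the ARC and L2ARC at all. A small sketch, with hypothetical dataset names:

```python
import subprocess

# primarycache / secondarycache accept: all | metadata | none.
# Keep a backup dataset from flushing hot VM data out of the ARC,
# while letting the VM dataset use both ARC and L2ARC fully.
subprocess.run(["zfs", "set", "primarycache=metadata", "tank/backups"], check=True)
subprocess.run(["zfs", "set", "secondarycache=none",   "tank/backups"], check=True)
subprocess.run(["zfs", "set", "primarycache=all",      "tank/vmstore"], check=True)
subprocess.run(["zfs", "set", "secondarycache=all",    "tank/vmstore"], check=True)

print(subprocess.check_output(
    ["zfs", "get", "-r", "primarycache,secondarycache", "tank"], text=True))
```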
 
Yes it is, but it's only the start!
It is a fairly big deal, actually. NetApp and EMC don't dual-home to each shelf by default. Having a dedicated path to each shelf means an entire shelf can fail and you're still fine. Lose a shelf in a daisy-chained config and anything below it is gone.
For instance, under failover what happens to the ARC contents of the failed head?
Who cares? It's just read cache. Your first and far-and-away most important concern is what happens to the data. The answer: nothing, it's on disk. Further, since all your L2ARC drives are down in the shared JBODs, whatever is in them is accessible.
What happens to the ARC on the surviving head?
Nothing; it continues to do its job, presuming you were active/active. In active/passive it will begin to populate.
Don't get me wrong, I'm not knocking Nexenta or ZFS - I'm just saying that IMO it's really up to midrange storage at the moment - it will improve/evolve, especially as it pushes upwards, but it's not quite there yet.
There are no black and white lines in determining what is high end and what isn't though - and especially in this economic climate, ZFS storage can make a lot of sense in many situations.
You're underestimating what ZFS is capable of right now.
 
Nothing terribly fancy about this; just a 1U with lots of RAM and SSDs. They're probably using pNFS or some other type of scale-out to reach the 50 million IOPS claim.
Wouldn't a couple of these count as high end? I mean, a ZFS server with 50 million IOPS? Isn't that high end?

What IS high end?
 
SSDs blur that line a great deal. It isn't clear if they're using SATA or SAS SSDs, which is or isn't important depending on the workload, the load, and your required HA needs.

From an IOPS perspective, 6 cutting-edge SATA SSDs appear to be really high end on their own. However, throw 100k sustained random writes at them and then try to read from them, and things will begin to slow down. However, with a large ARC, ZFS can mask a great deal of that.

Still, that scenario is going to be a shitload faster than traditional spinning disks, this side of a few hundred of them anyway.

The interesting thing is the scale-out and how they're accomplishing it. Scale-out is nice for performance; however, IMO there is a lot of wasted space due to replication.
 
It is a fairly big deal, actually. NetApp and EMC don't dual-home to each shelf by default. Having a dedicated path to each shelf means an entire shelf can fail and you're still fine. Lose a shelf in a daisy-chained config and anything below it is gone.

Well you said it's trivial, and now you are saying it's a big deal :confused: :)
How would you support thousands of drives?


Who cares? It's just read cache. Your first and far-and-away most important concern is what happens to the data. The answer: nothing, it's on disk.

Nobody said anything happens to the data - but if decent performance relies on data being in the ARC, then you are out of luck on the incoming pool(s) until the ARC gets warmed again.


Further, since all your L2ARC drives are down in the shared JBODs, whatever is in them is accessible.

Is it?
How will the surviving node know what's in them, AIUI you lost the L2ARC cache directory tables when you lost the ARC on the failed head.

Nothing; it continues to do its job, presuming you were active/active. In active/passive it will begin to populate.

Apart from the fact that it now has to cache data for double the amount of storage space.
Some existing ARC data will have to be evicted to make way for new ARC data from the incoming pool(s) - some may make it to L2ARC, but it depends how fast it's all happening!

If the failover head was passive, then its ARC is fully cold from the get-go.


You're underestimating what ZFS is capable of right now.

Perhaps - or perhaps there's some over-estimation happening too :)
 
Well you said it's trivial, and now you are saying it's a big deal :confused: :)
How would you support thousands of drives?
Supporting thousands of drives is great in theory. In practice you run into cable-length limitations; SAS cables only go so far, and you can only cram so much into a rack.

However: cable the heads to two SAS switches, and cable the controllers in the JBODs with one path to each switch.
Nobody said anything happens to the data - but if decent performance relies on data being in the ARC, then you are out of luck on the incoming pool(s) until the ARC gets warmed again.
Performance will fall back to whatever your spindles can handle, worst case. In practice, prefetching et al. happens very quickly; a cold ARC gets hot in a few cycles.
Is it?
How will the surviving node know what's in them, AIUI you lost the L2ARC cache directory tables when you lost the ARC on the failed head.
It will learn what blocks are where quickly.
Apart from the fact that it now has to cache data for double the amount of storage space.
Some existing ARC will have to be evicted to make way for new ARC data from the incoming pool(s) - some may make it to L2ARC, but it depends how fast it's all happening!
So size your ARC accordingly. RAM is dirt cheap; there is no reason to have less than 256GB if you're building a high-end system.

Further, if you drop a VNX7500 node, or a NetApp node, etc., whatever it was caching is gone. In a failed state, optimal performance is NOT your primary concern. Integrity first, availability (in whatever state) second.

Performance suffers in any failure scenario. Depending on the environment this may be very little (like a single node in a 20-node scaled-out cluster) or a lot (one node of an active/active pair where total load is 80% or something). This is true of any vendor in the enterprise space; not really sure why you expect different from ZFS.
If the failover head was passive, then its ARC is fully cold from the get-go.
Uh huh ... and? Cold to hot for DRAM happens really quickly.

Perhaps - or perhaps there's some over-estimation happening too :)
I've done the Pepsi challenge. There is very little ZFS is NOT capable of doing across all of the storage tiers.
 
Supporting thousands of drives is great in theory. In practice you run into cable-length limitations; SAS cables only go so far, and you can only cram so much into a rack.
Uh, SAS has a cable-length limit of 10m. If you have that many enclosures or racks of drives, you will likely have a switch or more, which will give you another 10m. As to "learning blocks quickly", it will build a new cache based upon usage. That can take quite a while depending on the application.
And as to RAM being dirt cheap, yeah, basic 4GB DDR3 DIMMs for your desktop box are historically cheap. That goes away when you want to stick 256GB of RAM into something like a ProLiant. Take a DL360 G7 for example: 18 DIMM slots (comes with 4GB out of the box), so 18 x 16GB registered DIMMs. Let's start with third-party stuff: http://www.ramcity.net/product/KIN1-HPTDL3-48G-04.htm - $1150 for 48GB (you'll need 6 of these kits) for just under $7,000. Go with the real HP certified/supported stuff, and then you are talking 18 x 16GB kits http://www.provantage.com/hewlett-packard-hp-nl674aa~7HEWG0NC.htm which will run you a cool 30 grand.
 
Uh, SAS has a cable-length limit of 10m. If you have that many enclosures or racks of drives, you will likely have a switch or more, which will give you another 10m.
and?

Have you tried running a max-length SAS cable to, say, a 60-drive JBOD which already has internal signaling issues you need to take into account?

There is also the problem with LSI switches and non-LSI JBOD controllers randomly rebooting the switch, which is still unresolved.

Lastly, at some point you need to ask yourself how much data you're willing to risk in a failure scenario.

What if the power in a rack dies?

Point being, at some point it makes more sense to just have another HA cluster, or stand up single racks with one server and a bunch of JBODs and use namespace clustering to replicate out individual datasets, the entire data structure, or whatever you want.

In the HPC space, OK, thousands of drives off a single head or cluster makes sense (presuming you can deliver the bandwidth required), but we're well outside the enterprise use case here. The HPC guys have much different tolerances for failure scenarios.
 
And as to RAM being dirt cheap, yeah, 4GB DDR3 DIMMs are historically cheap. That goes away when you want to stick 256GB of RAM into something like a ProLiant. Take a DL360 G7 for example: 18 DIMM slots (comes with 4GB out of the box), so 18 x 16GB registered DIMMs. Let's start with third-party stuff: http://www.ramcity.net/product/KIN1-HPTDL3-48G-04.htm - $1150 for 48GB (you'll need 6 of these kits) for just under $7,000. Go with the real HP certified/supported stuff, and then you are talking 18 x 16GB kits http://www.provantage.com/hewlett-packard-hp-nl674aa~7HEWG0NC.htm which will run you a cool 30 grand.

7 grand is pennies in this space. EMC charges $330,000 for 25 100GB solid-state drives, and $65K for a shelf full of 900GB 10k drives.

When building storage, even off the shelf, your most significant cost will ALWAYS be the drives. Well ... I don't know, 32GB DIMMs are pretty silly still and likely won't come down for a while. However, if you're budgeting for high-end enterprise storage and 7 grand makes you blink ... you're in for some sticker shock.

BTW, you should know this. I would be surprised if the quote you got back from DDN was under 7 figures.
 
7 grand is pennies in this space. EMC charges $330,000 for 25 100GB solid-state drives, and $65K for a shelf full of 900GB 10k drives.

When building storage, even off the shelf, your most significant cost will ALWAYS be the drives. Well ... I don't know, 32GB DIMMs are pretty silly still and likely won't come down for a while. However, if you're budgeting for high-end enterprise storage and 7 grand makes you blink ... you're in for some sticker shock.

BTW, you should know this. I would be surprised if the quote you got back from DDN was under 7 figures.

Actually, for a few VM and storage controllers we just built, they were outfitted with 384GB of RAM, and there are 6 head-end boxes. So yeah, 200K+ on just RAM is more than just a blink.
 
Have you tried running a max-length SAS cable to, say, a 60-drive JBOD which already has internal signaling issues you need to take into account?
Yes, anyone with a brain doing cable calculations takes into account total end-to-end lengths. And what signaling issues are you referring to?

There is also the problem with LSI switches and non-LSI JBOD controllers randomly rebooting the switch, which is still unresolved.
Sorry to hear you are having those problems; we are not.


Lastly, at some point you need to ask yourself how much data you're willing to risk in a failure scenario.
We are not willing to lose any. That is why our HA, backup and DR plans cover us end to end with primary and secondary plans.


What if the power in a rack dies?
That is why each rack has primary and secondary redundant power.
 
You do things your way, I'll do things my way. In a few years we'll see where storage has gone.
 
And what signaling issues are you referring to?
The 4U 60-bay JBOD has some issues with certain drives; the signal in some of the bays isn't ideal.
Sorry to hear you are having those problems; we are not.
OK, I'll tell LSI they can forget about the problem then.
We are not willing to lose any. That is why our backup and DR plans cover us end to end with primary and secondary plans.
Which is great, but it doesn't help you keep portions of your data online if all your eggs are in one basket.
That is why each rack has primary and secondary redundant power.
PDUs fail. UPSs fail. Hell, data centers fail.
 
The 4U 60-bay JBOD has some issues with certain drives; the signal in some of the bays isn't ideal.
Then that is something they should/will remedy.


OK, I'll tell LSI they can forget about the problem then.
I never said there was or wasn't a problem, just that we haven't experienced it (again, I don't know which particular enclosure you are referring to).


Which is great, but it doesn't help you keep portions of your data online if all your eggs are in one basket.
We don't keep all our eggs in one basket. I have local and worldwide failover clusters.

PDUs fail. UPSs fail. Hell, data centers fail.
Yes they do. All racks have primary and secondary UPSs at the bottom, and primary and secondary PDUs up the sides connected separately to PSU1/PSU2. And see above for failover on mission-critical applications. Everything that can be done for HA is being done given monetary realities. I would love to have primary, secondary and tertiary live failover systems for every single application down to KVM, but you balance out needs, wants, priorities and available capital dollars.
 
and?

Have you tried running a max-length SAS cable to, say, a 60-drive JBOD which already has internal signaling issues you need to take into account?

One other benefit (well, one among a great many) that should be shipping to the general public later this year is SAS 3.0 (12Gb/s) supporting optical cables, single- and multi-mode, with hundreds of meters between ports and support for up to 90-degree bends with little or no signal degradation. We are currently beta testing an LSI3008 HBA with a 3236 expander (not under NDA anymore). With the active transceivers, the cable run is just a wound 50m right now.
 
From what I'm hearing from LSI, they aren't expected to ship any 12Gbps gear till the 2nd quarter of next year. They're testing the cards now; I haven't heard anything about the switches, nor have I heard anything about FRUs/cables.

Supposedly the first PCIe 3 cards will be out in a few months, but they're still 6Gbps.
 
From what I'm hearing from LSI, they aren't expected to ship any 12Gbps gear till the 2nd quarter of next year. They're testing the cards now; I haven't heard anything about the switches, nor have I heard anything about FRUs/cables.

Supposedly the first PCIe 3 cards will be out in a few months, but they're still 6Gbps.

We just have 2 cards and 2 expanders now; they haven't even given us any specs on any upcoming switches yet. Their SAS3 ROC is a beast though. Whether for HW RAID or just as a pump for SW, it is remarkable. As to availability, they told us to expect December availability; I don't know if that is just because we are a beta site or for everyone. Areca is also working on some SAS3 stuff using the LSI ROC. One pair is using a hybrid electrical/optical transceiver-based connection (50m), the other is a 20m electrical/electrical SFF-8644 cable.
 
Not that I want to get into a pissing contest but...

(in reference to L2ARC on a head failure)
It will learn what blocks are where quickly.

You lose your L2ARC. The system is going to assume the SSD is empty and start repopulating it, as if the system was rebooted.

That said, I'm fairly deep in the ZFS storage game at this point and I'm rather happy with it. It way outperforms our SAN, is more stable (even the single nodes vs our shitty "N+1 SAN") so far, and is massively cheaper for the same performance/quality SAN storage space, but it isn't without flaws.
 
Supporting thousands of drives is great in theory.

It's not theory in many enterprise datacentres ;)


I've done the Pepsi challenge. There is very little ZFS is NOT capable of doing across all of the storage tiers.


Hmm - perhaps all the enterprise storage vendors should give up then and just license ZFS from Oracle, or follow Nexenta's path :D

Seriously though - even Oracle doesn't currently consider ZFS the final word in storage technologies - hence their marketing of alternatives in the Axiom range and the Exadata storage server.
I'm not knocking Nexenta either BTW - just not sure they'd even consider themselves as direct competitors in the high end storage market, at least not quite yet anyway (there's more than enough low-midrange market for them to go after) - still, maybe I'm doing them a disservice!

As I've said, ZFS is a huge advance in filesystem/volume management, and it will no doubt evolve. Also, not all features of high-end storage platforms should necessarily be built into the filesystem - some should be separate functions, in this case probably of the OS itself (just as multipathing isn't a ZFS function).

At the end of the day, we obviously have differing opinions here.
If I understand correctly, your approach is to build a pair of powerful-ish x86 server heads stuffed full of memory, connect those via SAS HBAs to a number of SAS JBODs (possibly via switches or directly connected - depends on numbers I suppose), and install a Nexenta ZFS HA cluster on them.

I'm not saying that you cannot achieve good/great performance and good availability etc., and you'll certainly score very high on the bang/buck scale - what I'm saying is that's not enough to make it a "high end" solution, at least not for most customers in the market for such.
 
Very interesting discussion guys, I learn a lot listening to you pros! :)

If I understand correctly, your approach is to build a pair of powerful-ish x86 server heads stuffed full of memory, connect those via SAS HBAs to a number of SAS JBODs (possibly via switches or directly connected - depends on numbers I suppose), and install a Nexenta ZFS HA cluster on them.
But, as I hear it, NetApp and many other enterprise storage vendors do the same: they use simple COTS hardware and put their proprietary software on top. So is there any difference, in practice, from what Nexenta users are doing?

An EMC guy says everybody (NetApp, Nexenta, etc.) uses cheap x86 stuff. He flames Nexenta. Quite fun to read:
http://www.theregister.co.uk/2011/09/22/vmworld_hol_sakac/

OpenStorage vendors (Nexenta, etc.) talk about RAIS (Redundant Array of Inexpensive Servers). If one server crashes, then another server takes over. They don't talk about two very expensive servers anymore; it is easier and cheaper to use many cheap servers, just as Facebook, Google, the Amazon cloud, etc. do. More companies are moving to many cheap servers and building the software so it can handle a server crash seamlessly, because servers will crash, even expensive high-end ones. Better to scale out than scale up. Better a cluster of cheap servers than an expensive mainframe. If the mainframe crashes (which happens every few years) then everything is down. If a server in the cluster crashes, nothing happens if you have built the system correctly.
 
Hmm - perhaps all the enterprise storage vendors should give up then and just license ZFS from Oracle, or follow Nexenta's path :D
No license required. However, you can't just turn the ship around; EMC, NetApp, etc. have billions invested in their platforms. Further, it would be impossible for them to justify the insane premium using an open-source base.

If I understand correctly, your approach is to build a pair of powerful-ish x86 server heads stuffed full of memory, connect those via SAS HBAs to a number of SAS JBODs (possibly via switches or directly connected - depends on numbers I suppose), and install a Nexenta ZFS HA cluster on them.
This is the same thing EMC/NetApp do, only they charge an insane premium.
I'm not saying that you cannot achieve good/great performance and good availability etc., and you'll certainly score very high on the bang/buck scale - what I'm saying is that's not enough to make it a "high end" solution, at least not for most customers in the market for such.
So performance and availability do not qualify as a high-end solution ... unless it says EMC or NetApp and you spend over a million dollars for it?
 