ZFS and SAS expanders with SATA drives a toxic combo?

With all due respect, I think you need to do some more reading. You are confusing filesystems with their underlying hardware (ZFS is a filesystem and volume manager, not a connection technology). A single SAS domain can handle a theoretical limit of 65,535 devices, given the proper controller hardware and infrastructure. As to whether expanders are or are not continuing to be a problem, I think I made my point in my posts above. I do manage Sun boxen at work. I also manage NetApp and EMC. They are all good, each installed based on its benefits for a particular application or use. No one is better or worse than another in general; they each have their benefits and pitfalls. As to Oracle ZFS boxes, one of the boxes I administer is a Sun 7320 (with ZFS as the filesystem, because we had a particular need for audited checksums). Want to know how they support the 96 drives in the shelves? EXPANDERS integrated into the controller logic. Lastly, your 300-device claim: with expanders, even basic LSI SAS RAID cards support up to 512 drives. I know what you are going to say... "I didn't say with expanders, I said to a single card"... Well, I invite you to show me a standard PCIe card with 300-drive support by itself :) Not enough room for 75 SFF-8087 ports, is there? This is the reason expanders exist.
 
I am talking about a single hw-raid card controlling 300 disks, with or without expanders. That will not work, because the bandwidth will not suffice. Typically, when you use expanders, all disks share the bandwidth, which is a problem when you have more than, say, 20 disks.

ZFS is in charge of every disk, so ZFS has no problem controlling 300 disks, nor any problem with bandwidth.

What I am trying to say is that hw-raid does not scale. After adding, say, 20 disks, the performance drops off. You also have a single point of failure: what happens if that hw-raid card dies?

Software raids such as ZFS are superior to hw-raid: they scale better (many petabytes, up to 1 TB/sec, thousands of disks), are safer against data corruption, etc.

The Sun x4500 Thumper server has 48 SATA disks in 4U. In the server there are 6 HBAs, each with 8 disks connected to it. When you build a raidz2 vdev, you select one disk from each HBA; if one HBA dies, it does not affect the zpool. The zpool consists of several raidz2 vdevs, and each raidz2 vdev is spread over all the HBAs. There is no single point of failure in a properly built ZFS server.
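To make that layout concrete, here is a minimal sketch in Python (the HBA and disk counts match the description above, but the device names and the zpool command it prints are hypothetical illustrations, not the actual x4500 configuration):

Code:
NUM_HBAS, DISKS_PER_HBA = 6, 8   # x4500-style layout: 6 HBAs, 8 disks each

# vdev N takes target N from every HBA, so each 6-disk raidz2 vdev contains
# exactly one disk per controller. Device names are hypothetical.
vdevs = [[f"c{hba}t{tgt}d0" for hba in range(NUM_HBAS)]
         for tgt in range(DISKS_PER_HBA)]

print("zpool create tank " +
      " ".join("raidz2 " + " ".join(disks) for disks in vdevs))
# raidz2 tolerates two missing disks per vdev; losing an entire HBA removes
# only one disk from each vdev, so the pool stays online (degraded, not dead).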

You mentioned that Oracle 7320 has expanders? I did not know that. Do you have information on that?
 
............

What I am trying to say is that hw-raid does not scale. After adding, say, 20 disks, the performance drops off. You also have a single point of failure: what happens if that hw-raid card dies?
...

As far as I know, with Adaptec hardware RAID and LSI (92xx) H/W RAID:
if the H/W RAID card dies... you just replace it with the same card, or a better one from the same vendor,
and the new H/W RAID card will detect all the drives and the configuration :D
Like S/W RAID does, H/W RAID saves the configuration as a backup on each drive :D, call it "metadata" or something.

Are you sure the performance drops after 20 disks?..
What is true on H/W RAID is that performance degrades when there is not much free space left...
and as I understand it, that is true of any S/W RAID too.
 
I am sure LSI, Areca and many other manufacturers are going to just throw in the towel tomorrow, since as you say "HW RAID can't scale well to greater than 20 disks". As to how to choose drives when building an array, thanks. With very few exceptions (in non-clustered systems), there is always a single point of failure: the motherboard.
In our 7320, we have 4 shelves (24 drives per shelf, internal expanders).
Let's talk more about scalability... Let's say you are right, and that STR were the most important thing in life (it isn't). A PCIe 2.0 x8 controller can theoretically move 5 GT/s (roughly 500 MB/s) per lane, yielding about 4,000 MB/s for the slot. If you took the average drive, which we'll say does 100 MB/s, you could have 40 drives pushing full throttle (more than 20). Use a PCIe 3.0 x8 card and you double that to 80. Now let's say that you are using large arrays with random 4K QD32 requests (more representative of a server, but lower sequential rates)... you could multiply the number of drives by 5 or more before you run out of bandwidth. All of this is before you start striping data across multiple HBAs (RAID 60) or doubling down, creating redundant arrays of redundant arrays.
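For anyone who wants to check that arithmetic, a quick back-of-the-envelope sketch in Python (the per-lane rate, drive speeds and the random-workload figure are the same rough assumptions used above, not vendor specs):

Code:
LANE_MB_S = 500        # rough usable PCIe 2.0 throughput per lane
LANES = 8              # x8 card
DRIVE_SEQ_MB_S = 100   # assumed average sequential rate per spindle
DRIVE_RAND_MB_S = 20   # assumed per-spindle throughput at random 4K QD32

slot = LANE_MB_S * LANES   # about 4,000 MB/s for a PCIe 2.0 x8 slot
print(f"PCIe 2.0 x8: {slot} MB/s -> ~{slot // DRIVE_SEQ_MB_S} drives at full sequential speed")
print(f"PCIe 3.0 x8: {slot * 2} MB/s -> ~{slot * 2 // DRIVE_SEQ_MB_S} drives")
print(f"Random 4K QD32: ~{slot // DRIVE_RAND_MB_S} drives before the x8 slot saturates")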

The Sun "Thumper" was announced in 2006, was EOL'd 8 months ago with no drop-in replacement, and is old tech. As to the expanders on the 7320, here is a link to the service manual, see p. 18: http://docs.oracle.com/cd/E22471_01/pdf/821-1792.pdf You will see the expander in the head unit, and each shelf has an integrated expander.

In the end, all RAIDs are software. Whether that is software that lives in firmware and has an integrated parity processor, or software that runs in system memory and uses the host processor for parity computation, is irrelevant. Hardware RAID has its place. Software RAID has its place. ZFS has its place. Which one to choose is all about your application and its particular needs.

As to scalability, you mention 1 TB/s (damn, I would love a system that can offer that; BTW, outside of government and academia they don't exist, most/all of the systems in that class are Linux-based clusters NOT running ZFS as the primary filesystem, and no single system needs that kind of throughput). When you are talking about hundreds or thousands of spindles, sequential transfer speed is generally irrelevant: you are talking many users, extremely deep queue depths and random access. Response time and IOPS are what matter most. You are evangelizing ZFS regardless of the use case, which is incorrect, and you are slamming hardware RAID, and with no disrespect intended it seems you have very little real-world experience with any of this (as you yourself say, you have never run ZFS in production).

When it comes to ZFS in a production environment, if you want a single vendor who will supply the hardware and the software (and in this case they are the authors and owners of ZFS), you will definitely pay for it. There is an old saying that no one ever got fired for buying IBM. If you have your own software people on staff who can fix issues, or you pay through the nose for the Oracle hardware, software and service guarantees, then ZFS (in the right circumstance for the application) can be a great thing. If you are selling this to a client (or even your own boss) as a solution and are responsible for problems, they are not going to like hearing that a fatal bug popped up and will be fixed when someone who is not you gets to it. Many hardware vendors (HP, Dell, EMC, NetApp, etc.) offer hardware RAID solutions; not all that many that aren't Oracle offer ZFS.

ZFS is also just one filesystem, and there are others coming down the pike (or here already) with data integrity checksums in the filesystem, snapshots, versioning, etc., such as btrfs, NILFS and GPFS. Microsoft's own NTFS replacement will likely contain these features as well, opening up the benefits to a much wider audience. OS X was supposed to get ZFS as its next filesystem, but Apple went from openly talking about it to removing any reference to it because of patent issues with Sun.
 
As far as I know, with Adaptec hardware RAID and LSI (92xx) H/W RAID:
if the H/W RAID card dies... you just replace it with the same card, or a better one from the same vendor,
and the new H/W RAID card will detect all the drives and the configuration :D
Yes, but when the HW-raid card dies, the server dies. When a disk controller in the Sun x4500 dies, nothing happens; it continues to run. Therefore the HW-raid card is a single point of failure. (And of course, the motherboard, the CPU and so on are also single points of failure.) But the x4500 has fewer SPOFs.


I am sure LSI, Areca and many other manufacturers are going to just throw in the towel tomorrow, since as you say "HW RAID can't scale well to greater than 20 disks". As to how to choose drives when building an array, thanks. With very few exceptions (in non-clustered systems), there is always a single point of failure: the motherboard.
In our 7320, we have 4 shelves (24 drives per shelf, internal expanders).
Let's talk more about scalability... Let's say you are right, and that STR were the most important thing in life (it isn't). A PCIe 2.0 x8 controller can theoretically move 5 GT/s (roughly 500 MB/s) per lane, yielding about 4,000 MB/s for the slot. If you took the average drive, which we'll say does 100 MB/s, you could have 40 drives pushing full throttle (more than 20). Use a PCIe 3.0 x8 card and you double that to 80. Now let's say that you are using large arrays with random 4K QD32 requests (more representative of a server, but lower sequential rates)... you could multiply the number of drives by 5 or more before you run out of bandwidth. All of this is before you start striping data across multiple HBAs (RAID 60) or doubling down, creating redundant arrays of redundant arrays.

The Sun "Thumper" was announced in 2006, was EOL'd 8 months ago with no drop-in replacement, and is old tech. As to the expanders on the 7320, here is a link to the service manual, see p. 18: http://docs.oracle.com/cd/E22471_01/pdf/821-1792.pdf You will see the expander in the head unit, and each shelf has an integrated expander.

In the end, all RAIDs are software. Whether that is software that lives in firmware and has an integrated parity processor, or software that runs in system memory and uses the host processor for parity computation, is irrelevant. Hardware RAID has its place. Software RAID has its place. ZFS has its place. Which one to choose is all about your application and its particular needs.

As to scalability, you mention 1 TB/s (damn, I would love a system that can offer that; BTW, outside of government and academia they don't exist, most/all of the systems in that class are Linux-based clusters NOT running ZFS as the primary filesystem, and no single system needs that kind of throughput). When you are talking about hundreds or thousands of spindles, sequential transfer speed is generally irrelevant: you are talking many users, extremely deep queue depths and random access. Response time and IOPS are what matter most. You are evangelizing ZFS regardless of the use case, which is incorrect, and you are slamming hardware RAID, and with no disrespect intended it seems you have very little real-world experience with any of this (as you yourself say, you have never run ZFS in production).

ZFS is also just one filesystem, and there are others coming down the pike (or here already) with data integrity checksums in the filesystem, snapshots, versioning, etc., such as btrfs, NILFS and GPFS. Microsoft's own NTFS replacement will likely contain these features as well, opening up the benefits to a much wider audience. OS X was supposed to get ZFS as its next filesystem, but Apple went from openly talking about it to removing any reference to it because of patent issues with Sun.
Sure, you can go beyond 20 disks in a hw-raid, maybe 40, but I doubt anyone will do that. It is not sane to do. Better to use several hw-raid cards to minimize single points of failure. And the difference between 20 and 40 is only a factor of 2. Hardly any difference.

If we talk about the Oracle 7420, it has ~600 disks, and that is quite a reasonable number of disks when using ZFS. If we talk about Lustre + ZFS, then we talk about thousands of disks and 1 TB/sec of bandwidth. That is much more than a factor of 2. That is true scalability.

Hw-raid does not scale. Sure, it scales up to 50 disks or so, but I doubt anyone does it, because it becomes a SPOF. ZFS scales way beyond that, and it is practical and feasible to handle several hundred disks with ZFS; people do it all the time. ZFS servers compete with NetApp enterprise storage solutions; I don't see a single hw-raid card competing with a whole NetApp server. Hw-raid does not scale. ZFS does. 50 disks is not scalability.

And yes, I slam hw-raid. They are legacy and dying out. Just like dinosaurs. They are inferior to sw raid in almost every aspect: scalability, data corruption, price, performance, etc. I don't know of any hw-raid card that gives 1 TB/sec or handles 55 petabytes of disk. You need a software system to do that, not a single card.


Regarding the SAS expanders in Oracle servers, of course they exist. I just forgot; quite dumb of me to forget that. The thing is, I was so focused on the SAS expanders + SATA disks discussion, which is a major no-no in the ZFS world, that I got confused by your statement that the 7320 uses SAS expanders. But of course the ZFS servers use SAS expanders. What they don't do is mix SATA disks with SAS expanders, because that is a no-no in the ZFS world. So everything is well.

And order is restored.
 
"Yes, but when the HW-raid card dies, the server dies" Well, not if you have mirrored arrays on separate cards or say a raid6 of raid6's on multiple cards. If you ran a 32 drive ZFS volume off a single 32 port HBA, that system would be down too. You are saying that one system is better because it has multiple HBA's when you spec the other system as having just 1 HBA. Again, you are confusing volume management with disk management.

I guess I should run out and sell my HP, NetApp and EMC shares because they don't resell ZFS systems and they are selling "Legacy, dinosaur and dying out servers and storage systems (as per you)." You can believe what you like.... Regarding the 50PB / 1TB/s I expect you are just parroting what you read in the Lawrence Livermore presentation and saying "HA, look here" . As I also mentioned above, systems like this are not found outside of government or academia. They are the exception rather than the rule. Like I said, ZFS has its place just as software RAID and hardware RAID depending on the application. Also, as to SATA drives on SAS controllers being a "no-no in the ZFS World" that is just wrong, and you are pointing to a few posts from a few people quite some time ago. I think you will find that the vast majority of people here on HOCP that are setting up home/office ZFS boxes are using SATA disks, some of them using SAS controllers and expanders and some not. The people running these systems are not having these problems currently, or you would find post after post about it and specific warnings about it everywhere. You blurt out absolutes and you self-admittedly have no practical experience working with what you speak of. I am done trying to express any more to you about this.
 
As to scalability, you mention 1 TB/s (damn, I would love a system that can offer that; BTW, outside of government and academia they don't exist, most/all of the systems in that class are Linux-based clusters NOT running ZFS as the primary filesystem, and no single system needs that kind of throughput).

With quad Socket R (LGA 2011) motherboards just now coming out, and presuming there are 4 QPI links in total, you're still only at 320 GB/s of max theoretical throughput from the motherboard, not counting anything for the network.

That is an extremely high number, though.

However, in the not-too-distant future LSI will ship 12 Gbps dual-port SAS controllers (Q1/Q2 2013), and possibly a quad-port, since PCIe 3 has more than enough bandwidth.

I'm told not to expect 12 Gbps JBOD controllers, though, so you'll be aggregating a bunch of downstream 6 Gbps links into fewer upstream 12 Gbps links.
 
There are a lot of very high performance HW SAS controllers (both add-on and embedded) coming down the pike. He doesn't seem to be interested, as he has labeled them "legacy and dying out. Just like dinosaurs... inferior to sw raid in almost every aspect" and believes that ZFS is the end-all, be-all of life. It's a free country; he can think whatever he likes :)
 
Well, "software raid" is superior. The only place it is not superior is if, for some reason, you're running a heavy compute load on the same box.

Of the software RAIDs out there, ZFS is in many ways superior and is the leader in "modern" all-encompassing filesystems. You're seeing everyone moving to the COTS "pooled" hybrid approach; even EMC is doing it with FAST pools, they're just marking up the COTS hardware by 2000%.

HW RAID has its place, but it is diminishing. Everyone is shifting to COTS and/or object-based scale-out storage. The last holdout is the enterprise, which for some reason loves paying obscene amounts of money to EMC/NetApp for old-style technology, but whatever.
 
Well, "software raid" is superior. The only place it is not superior is if, for some reason, you're running a heavy compute load on the same box.

Of the software RAIDs out there, ZFS is in many ways superior and is the leader in "modern" all-encompassing filesystems. You're seeing everyone moving to the COTS "pooled" hybrid approach; even EMC is doing it with FAST pools, they're just marking up the COTS hardware by 2000%.

HW RAID has its place, but it is diminishing. Everyone is shifting to COTS and/or object-based scale-out storage. The last holdout is the enterprise, which for some reason loves paying obscene amounts of money to EMC/NetApp for old-style technology, but whatever.

Please look at both sides, at what H/W and S/W RAID each provide to us.

H/W is dying? Well, this is debatable.
S/W is rising? Well, this is debatable.
H/W and S/W RAID have existed and coexisted together.
It is up to us to pick which is the best solution.

As already said, H/W and S/W each have pros and cons.

"Hybrid" is the current marketing, but it does not abandon H/W :D

Imagine that you are running a company: would you prefer H/W (hybrid or not) or S/W?
This is all about marketing and your decision.

I use both S/W and H/W RAID on my systems at home and in the office.

H/W is not mostly about wasting money... you would know if you were facing it in the real world.
Most of us here, as far as I know, are home or SOHO users, and that environment is totally different from "enterprise".

My thoughts:
1) H/W is not dying, it is just adapting to current technology, for example hybrid (S/W + H/W).
2) S/W is not dying; it has always run alongside H/W.
3) H/W is rising, and I am not a marketing guy :)
4) S/W is rising, see #3.

Whether X is superior or not comes back to your preferences and how objective/subjective you are :D

This is your decision, not mine...

EMC/NetApp/Solaris... they all have marketing departments pushing to sell their products... it is up to us to decide whether to believe them and strike the deal.
 
"Yes, but when the HW-raid card dies, the server dies" Well, not if you have mirrored arrays on separate cards or say a raid6 of raid6's on multiple cards.
Ok, this is news to me. You are saying that a hw-raid can consists of several other hw-raid cards, in a hierarchy? So you can use multiple hw-raid cards?

One hw-raid controls 10 other hw-raid cards, which controls 10 other hw-raid cards, etc - which means hw-raid card can scale to 1000s of disks? I did not know this. I thought a hw-raid card only controlled some disks, without knowing what other hw-raid cards are doing. But if this is true, then hw-raid cards does scale up to 1000s of disks - just as you claim. Do you have links on this? And how good is the synchronization between all the 100s of hw-raid cards? Is anybody doing this in practice, or is it just marketing and nobody does it? Interesting info, thanx for this. I will stop saying that hw-raid does not scale - if you are correct.


I guess I should run out and sell my HP, NetApp and EMC shares because they don't resell ZFS systems and they are selling "Legacy, dinosaur and dying out servers and storage systems" (as per you). You can believe what you like...
Yes, that sounds like a good idea. NetApp sued Sun over ZFS because NetApp was afraid of ZFS, which does essentially the same thing, but for free. Coincidentally, NetApp shares have lost 50% of their value, whereas Oracle is growing ZFS server sales by 42%. So selling your shares sounds good.
http://www.theregister.co.uk/2012/06/04/netapp_vulnerable/

Nexenta grows over 400% each year:
http://www.channelregister.co.uk/2012/01/18/nexenta_c_round/

It is probably the fastest growing storage company ever:
http://www.theregister.co.uk/2011/03/04/nexenta_fastest_growing_storage_start/

Here we see another ZFS company that rewrote dedupe and is selling ZFS storage servers cheap:
http://www.theregister.co.uk/2012/06/01/tegile_zebi/

We see that NetApp is declining rapidly and ZFS companies are growing heavily. So if I were you, I would short NetApp stock and go long on ZFS company stock, just as you suggested.


Regarding the 50 PB / 1 TB/s, I expect you are just parroting what you read in the Lawrence Livermore presentation and saying "HA, look here". As I also mentioned above, systems like this are not found outside of government or academia; they are the exception rather than the rule. Like I said, ZFS has its place, just as software RAID and hardware RAID do, depending on the application.
Well, I am not necessarily talking about ZFS + Lustre. For instance, IBM GPFS is also a sw-raid system, able to scale to many disks and high bandwidth. What I am trying to say is that sw-raid systems such as ZFS can scale to thousands of disks and high bandwidth, whereas hw-raid doesn't.

But now you are saying I am wrong, that hw-raid cards also scale well. So yes, if you show me links, I will admit that I am wrong. Please show me at least one installation storing petabytes at high bandwidth with hw-raid. Companies routinely sell ZFS petabyte servers.

I don't count NetApp as hw-raid; they are using sw-raid in ONTAP, very similar to ZFS. In fact, ZFS heavily copied features from NetApp, as I have understood it, but improved on them: for instance, NetApp only allows 255 snapshots whereas ZFS has unlimited snapshots, ZFS allows triple parity whereas NetApp does not, etc.


Also, as to SATA drives on SAS controllers being a "no-no in the ZFS world", that is just wrong, and you are pointing to a few posts from a few people quite some time ago. I think you will find that the vast majority of people here on HOCP who are setting up home/office ZFS boxes are using SATA disks, some of them using SAS controllers and expanders and some not. The people running these systems are not having these problems currently, or you would find post after post about it and specific warnings about it everywhere. You blurt out absolutes, and you self-admittedly have no practical experience working with what you speak of. I am done trying to express any more to you about this.
When credible Solaris kernel developers say there are problems mixing SAS expanders with SATA disks, I listen to them. If credible Solaris kernel devs say the problems are solved, then I will consider the problem solved. If somebody here says the problems are solved, that does not cut it; he needs to be a Solaris kernel dev or similar. He needs to be credible. Would you trust a lot of data on the strength of hearsay? Someone says "I heard the SAS + SATA problem is solved": is that enough for you to trust your data to it? Not for me.


Please look at both sides, at what H/W and S/W RAID each provide to us.

H/W is dying? Well, this is debatable.
S/W is rising? Well, this is debatable.
If we extrapolate and look at history, then H/W is dying. New functionality always appears in hardware first, because the computers of the day don't cut it. As the tech matures, it becomes possible to run it in software. For instance, modems: long ago everybody bought a modem, but now a modem can be emulated in software.

Cryptography is best run on hardware when software is too slow. But when CPUs are strong enough, there is no need for crypto hardware accelerators anymore. Software will be fine in the future.

Graphics are best run in hardware today. But Intel was tinkering with the Larrabee CPU, which would let a CPU run graphics. Also, Euclideon has gotten a lot of attention with their new graphics engine, which they claim lets a CPU create brutal graphics 100,000 times faster than today's graphics engines. If Euclideon ever makes it, then we won't need graphics cards anymore, and all graphics can be run in software.
http://www.youtube.com/watch?v=JWujsO2V2IA&feature=related

Regarding RAID: in the beginning you needed a hardware card to offload the XOR parity calculations, because the PC was not strong enough. A hw-raid card is essentially a PC with RAM, CPU, I/O, etc.; that is why they are expensive. But today's CPUs are very strong, and ZFS checksumming requires something like 5% of one core in a quad-core CPU. Thus there is no need for extra hardware doing the calculations; a CPU is fine. Also, NetApp is using software raid. I suspect all larger storage servers are using software. I doubt any larger storage server is managed by a hw-raid card exclusively; I believe they all have software raid managing the large servers. But I might be wrong, as mwroobel says hw-raid can scale and handle many disks.
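To illustrate how small that parity workload really is, here is a toy sketch of RAID-5-style XOR parity in Python (pure Python for clarity; real implementations use SIMD and dedicated XOR engines and are far faster still):

Code:
import os

def xor_blocks(blocks):
    """XOR equal-sized blocks together (RAID-5-style parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [os.urandom(4096) for _ in range(4)]   # four 4 KiB data blocks
parity = xor_blocks(data)                     # the parity block

# Lose block 2, then rebuild it from the survivors plus the parity block.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
print("block 2 rebuilt correctly:", rebuilt == data[2])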

Even earlier, the very old Atari game console with Asteroids, King Kong, etc., had the games burned into hardware. When you wanted to change games, you swapped the hardware cartridge. Each game was a different piece of hardware. Today you just load another game, and the CPU runs all these different games for you. No need for special hardware.

Regarding music creation: earlier you needed a lot of hardware, a tape recorder, a mixer, effect pedals for guitar, a piano, etc.; now you can run everything in software. There are software pianos that are better than most normal pianos (Pianoteq), and you can create drum patterns on the PC without any drums. You can do virtually everything in software today. You don't need extra hardware anymore. Many musicians have stopped buying hardware synths; they only buy software, which you can upgrade easily. You only need a cheap digital piano, or you can even use the computer keyboard.

We see that more and more functionality is moving into software. A CPU that is strong enough can be programmed to do virtually anything, even to emulate another computer. You need less and less hardware. In the future, a strong enough PC with many cores can emulate everything you need, so you won't need a graphics card, a RAID card, a modem, an Intel NIC, etc. Only a single strong CPU.

Do you understand why I am saying that hardware stuff is dying? It applies not only to RAID cards, but to everything. When the CPU is strong enough, the dedicated hardware is not necessary anymore. Do you agree? Who buys modems today?
 
When credible Solaris kernel developers say there are problems mixing SAS expanders with SATA disks, I listen to them. If credible Solaris kernel devs say the problems are solved, then I will consider the problem solved. If somebody here says the problems are solved, that does not cut it; he needs to be a Solaris kernel dev or similar. He needs to be credible. Would you trust a lot of data on the strength of hearsay? Someone says "I heard the SAS + SATA problem is solved": is that enough for you to trust your data to it? Not for me.


Sun/Oracle marketed ZFS appliances based on switched SAS disk trays (J4400, J4500) which only ever had SATA drives installed (the ST7210/7310/7410).
The J4200/J4400 SAS trays themselves supported a mix of SATA and SAS drives, while the J4500 only ever supported SATA drives (officially, at least).

The latest models are SAS2 trays with SAS2 drives (no SATA), though whether this is for technical, performance or marketing reasons I don't know (multipathing doesn't work on SATA drives, so maybe that's a factor!)



As for the rest, offerings based on things like ZFS may be eating into the traditional RAID array market (certainly at the lower end), but even then they still need hardware to run on!
Even so, in the enterprise, large hardware RAID arrays will continue to have a presence for quite some time yet, at least IMO - BTW Sun's own ST9990V (OK it's really a Hitachi :) ) scaled to over 1000 drives internally, and can scale to many times that using external storage.

At the end of the day though, if you analyze what an Oracle ZFS appliance is, it's really not all that different from a proprietary hardware RAID array - it has some processors, memory, flash and storage and is self contained - the difference really is that it's made up from commodity parts (and to an extent software, even though the actual OS running on them is proprietary, it's based on Solaris) rather than proprietary units, and (the big bonus for us) it can be pretty much duplicated on a smaller scale in the home.

However, at least for the moment, even Oracle recognise that their ZFS appliance range isn't the end-game just yet - they also market FC based block access hardware RAID arrays in the form of the Axiom 600.
The Exadata Storage Server isn't running ZFS either (at least not yet anyway)!
 
I promised myself I wasn't going to continue in this thread, but what the hell, here goes: You are taking this as if I am attacking ZFS. That is completely wrong. I have said that whether you choose hardware parity or software parity depends completely on your application. If you are running Windows Server, for example, and need DAS, then ZFS isn't really an option, and there are a lot of people running Windows servers. Also, Solaris isn't ideal for all (many) applications (it is nicknamed Slowlaris for a reason), and ZFS on Linux isn't perfect yet; see the Stability section of http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf. If you are building a huge storage infrastructure, ZFS is just one option among many, whether you go with hardware or software solutions. You make it sound like I am advocating HW RAID to the exclusion of all else in every circumstance and wiping the floor with ZFS. This is simply not the case. You, however, seem to blindly extoll the benefits of software RAID regardless of the application or circumstance.

Quote:
Originally Posted by mwroobel
"Yes, but when the HW-raid card dies, the server dies" Well, not if you have mirrored arrays on separate cards or, say, a RAID 6 of RAID 6s on multiple cards.
Ok, this is news to me. You are saying that a hw-raid setup can consist of several hw-raid cards, in a hierarchy? So you can use multiple hw-raid cards?

Actually, Broadcom has a chipset that supports this. And the HW RAID vendors aren't sitting idly by; they are coming up with some excellent products. But that isn't what I said. I said that if you had mirrored arrays on separate HBAs, or nested-RAID-level arrays that could survive a loss, you would stay up.

One hw-raid card controls 10 other hw-raid cards, which each control 10 other hw-raid cards, etc., which means hw-raid can scale to thousands of disks? I did not know this. I thought a hw-raid card only controlled some disks, without knowing what other hw-raid cards are doing. But if this is true, then hw-raid cards do scale up to thousands of disks, just as you claim. Do you have links on this? And how good is the synchronization between all the hundreds of hw-raid cards? Is anybody doing this in practice, or is it just marketing that nobody does? Interesting info, thanks for this. I will stop saying that hw-raid does not scale, if you are correct.

Scalability has nothing to do with putting 1,000 drives on a single controller; no one in their right mind is doing that, for a whole slew of reasons. Just as generally no one is putting 1,000 drives in a single pool (want to guess how long a scrub on that would take?). Many large installs do pools of 90-120 drives (our Sun box has 96 SAS drives in one pool). Most of our HW R6 arrays are 24-96 drives per HBA/expander set, and then mirrored or nested filesystems on top of that. In any case, large ZFS installs aren't putting 1,000 drives onto one controller (HW or software); they are attaching shelves of drives to storage controllers and then meshing them in one manner or another. You can do the same thing with multiple HW arrays. Again, I come back to this: no one product, topology, filesystem or controller is perfect for every instance.
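To put a rough number on the scrub question, a small sketch (every figure is an assumption for illustration, not a measurement): scrub time is roughly the allocated data per spindle divided by the effective per-spindle scrub rate, and that rate collapses on busy, fragmented pools:

Code:
DRIVE_TB, FILL = 3, 0.7             # assumed 3 TB spindles, pool ~70% full
data_per_drive_mb = DRIVE_TB * FILL * 1_000_000

# Assumed effective per-spindle scrub rates: near-streaming when idle,
# far lower when the pool is busy and fragmented.
for label, mb_s in (("idle, mostly sequential", 100), ("busy, fragmented", 15)):
    hours = data_per_drive_mb / mb_s / 3600
    print(f"{label:24s}: ~{hours:4.0f} hours per scrub pass")
# The bigger the single pool, the more of your storage pays that performance
# penalty at once, which is one reason installs cap pools at ~90-120 drives.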


Quote:
I guess I should run out and sell my HP, NetApp and EMC shares because they don't resell ZFS systems and they are selling "Legacy, dinosaur and dying out servers and storage systems" (as per you). You can believe what you like...
Yes, that sounds like a good idea. NetApp sued Sun over ZFS because NetApp was afraid of ZFS, which does essentially the same thing, but for free. Coincidentally, NetApp shares have lost 50% of their value, whereas Oracle is growing ZFS server sales by 42%. So selling your shares sounds good.
Well, that isn't exactly the case. Sun went after NetApp first: https://communities.netapp.com/comm...5/netapp-sues-sun-for-zfs-patent-infringement
They tried to use their patent portfolio as a battering ram to make some quick dollars. The NetApp lawsuit is a counterclaim asserting some of their own patents, not because NetApp is "scared" of Sun and/or ZFS.


Nexenta grows over 400% each year:
Nexenta is doing well. They are a pre-IPO company and have nowhere to go but up. They are also a closely held private company, so you don't know the reality of their finances. While I would expect them to do very well, neither you nor I know how they are doing.

Here we see a another ZFS company rewrote dedupe and selling ZFS storage server cheap:
Selling something cheap doesn't make it better.

We see that NetApp is declining rapidly, and ZFS companies increasing heavily. So, if I were you, I would short NetApp stock, and go long ZFS company stock - just as you suggested.
Thanks, I'll inform my broker.

Quote:
Also, as to SATA drives on SAS controllers being a "no-no in the ZFS World" that is just wrong, and you are pointing to a few posts from a few people quite some time ago. I think you will find that the vast majority of people here on HOCP that are setting up home/office ZFS boxes are using SATA disks, some of them using SAS controllers and expanders and some not. The people running these systems are not having these problems currently, or you would find post after post about it and specific warnings about it everywhere. You blurt out absolutes and you self-admittedly have no practical experience working with what you speak of. I am done trying to express any more to you about this.
When credible Solaris kernel developers say there are problems mixing SAS expanders with SATA disks, I listen to them. If credible Solaris kernel devs say the problems are solved, then I will consider the problem solved. If somebody here says the problems are solved, that does not cut it; he needs to be a Solaris kernel dev or similar. He needs to be credible. Would you trust a lot of data on the strength of hearsay? Someone says "I heard the SAS + SATA problem is solved": is that enough for you to trust your data to it? Not for me.
Then don't do it. Remember my whole risk/reward description from a previous post?


Cryptography is best run on hardware when software is too slow. But when CPUs are strong enough, there is no need for crypto hardware accelerators anymore. Software will be fine in the future.

You don't seem to grasp the fact that software runs on hardware. Whether you use a CPU- or GPU-based system, your software is constrained by the limits of the hardware. As to running it on the mega-general CPU that you extoll, FPGA-based crackers wipe the floor with general x86 CPUs for cryptography.


Graphics are best run in hardware today. But Intel was tinkering with the Larrabee CPU, which would let a CPU run graphics. Also, Euclideon has gotten a lot of attention with their new graphics engine, which they claim lets a CPU create brutal graphics 100,000 times faster than today's graphics engines. If Euclideon ever makes it, then we won't need graphics cards anymore, and all graphics can be run in software.
http://www.youtube.com/watch?v=JWujs...eature=related

Yes, and what a failure the Larrabee platform was; it was envisioned as an Nvidia killer for gamers: http://semiaccurate.com/2009/12/04/intel-kills-consumer-larrabee-focuses-future-variants/


Regarding RAID: in the beginning you needed a hardware card to offload the XOR parity calculations, because the PC was not strong enough. A hw-raid card is essentially a PC with RAM, CPU, I/O, etc.; that is why they are expensive. But today's CPUs are very strong, and ZFS checksumming requires something like 5% of one core in a quad-core CPU. Thus there is no need for extra hardware doing the calculations; a CPU is fine. Also, NetApp is using software raid. I suspect all larger storage servers are using software. I doubt any larger storage server is managed by a hw-raid card exclusively; I believe they all have software raid managing the large servers. But I might be wrong, as mwroobel says hw-raid can scale and handle many disks.

You are morphing between a single machine and a large storage system. As to whether it is beneficial to run SW RAID on the processor or to have a hardware adjunct, a lot depends on what else is running on said box.

Even earlier, the very old Atari game console with Asteroids, King Kong, etc., had the games burned into hardware. When you wanted to change games, you swapped the hardware cartridge. Each game was a different piece of hardware. Today you just load another game, and the CPU runs all these different games for you. No need for special hardware.

They had the game burned into hardware because it was cheaper to offload the cost of a 1K PROM onto the software purchases than to include the up-front cost of adding a removable-media device like a floppy (which wasn't generally available in anything but a consumer-unfriendly 8" format in 1977 when the 2600 shipped). The Atari also only had 128 bytes of RAM, so it was easier to have an almost instantly available path to the software, and bank switching in later cartridges offered even more available storage. Were cheap, reliable external storage available, I am sure they would have used it. Again, progress.

Regarding music creation: earlier you needed a lot of hardware, a tape recorder, a mixer, effect pedals for guitar, a piano, etc.; now you can run everything in software. There are software pianos that are better than most normal pianos (Pianoteq), and you can create drum patterns on the PC without any drums. You can do virtually everything in software today. You don't need extra hardware anymore. Many musicians have stopped buying hardware synths; they only buy software, which you can upgrade easily. You only need a cheap digital piano, or you can even use the computer keyboard.

Yes, this is called progress. Though I doubt anyone serious about their music is using their computer keyboard to input their piano notes.

We see that more and more functionality is moving into software. A CPU that is strong enough can be programmed to do virtually anything, even to emulate another computer. You need less and less hardware. In the future, a strong enough PC with many cores can emulate everything you need, so you won't need a graphics card, a RAID card, a modem, an Intel NIC, etc. Only a single strong CPU.
Well, what CPU are you talking about? Some CPUs do some kinds of math very well, and suck at others.

Do you understand why I am saying that hardware stuff is dying? It applies not only to RAID cards, but to everything. When the CPU is strong enough, the dedicated hardware is not necessary anymore. Do you agree? Who buys modems today?

Well, if you are referring to an analog telephone modem used for a PPP connection, that's correct. But just about every electronic communication device you have has a modem of some sort or another, something to modulate and demodulate analog and digital signals: your cellphone, your cable modem, etc. As to analog modems, just about every fax machine has one, and they still sell a shitload of them!
 
Sun/Oracle marketed ZFS appliances based on switched SAS disk trays (J4400, J4500) which only ever had SATA drives installed (the ST7210/7310/7410).
The J4200/J4400 SAS trays themselves supported a mix of SATA and SAS drives, while the J4500 only ever supported SATA drives (officially, at least).

The latest models are SAS2 trays with SAS2 drives (no SATA), though whether this is for technical, performance or marketing reasons I don't know (multipathing doesn't work on SATA drives, so maybe that's a factor!)
You claim that the earlier Sun J4200 models mixed SATA and SAS disks; that might be true, I don't know. But did they use SAS expanders?



As for the rest, offerings based on things like ZFS may be eating into the traditional RAID array market (certainly at the lower end), but even then they still need hardware to run on!
Even so, in the enterprise, large hardware RAID arrays will continue to have a presence for quite some time yet, at least IMO - BTW Sun's own ST9990V (OK it's really a Hitachi :) ) scaled to over 1000 drives internally, and can scale to many times that using external storage.

At the end of the day though, if you analyze what an Oracle ZFS appliance is, it's really not all that different from a proprietary hardware RAID array - it has some processors, memory, flash and storage and is self contained - the difference really is that it's made up from commodity parts (and to an extent software, even though the actual OS running on them is proprietary, it's based on Solaris) rather than proprietary units, and (the big bonus for us) it can be pretty much duplicated on a smaller scale in the home.

However, at least for the moment, even Oracle recognise that their ZFS appliance range isn't the end-game just yet - they also market FC based block access hardware RAID arrays in the form of the Axiom 600.
The Exadata Storage Server isn't running ZFS either (at least not yet anyway)!
My suspicion is that all larger storage servers use hw-raid as a building block: they use software to manage all the separate hw-raid cards, just like ZFS manages all the separate HBA cards. Thus, software raid systems are used to manage hw-raid cards. That is my suspicion. And now some people claim that a single hw-raid card can control other hw-raid cards, so that hw-raid cards can control and manage thousands of disks all by themselves. I doubt that.

I believe that software raid systems such as ZFS, GPFS, NetApp, etc., do not require hw-raid cards; they run fine with JBODs. I don't see a future for hw-raid cards.

Regarding the ZFS servers: sure, they are just normal servers with RAM, CPU, etc., and all the software runs on the OS. Similar to a hw-raid card, which is, in effect, a tiny computer. Why run a tiny computer (hw-raid) inside a server? You don't need the extra hardware anymore. If XOR parity calculations consumed 50% of a CPU, there would be a need for a tiny computer to offload the server. But ZFS consumes maybe 5% of one core. There is no need to offload such a minuscule workload. If you redesigned ZFS to take advantage of a tiny computer offloading that work, your server would run at most about 5% faster, for the cost of a tiny computer. Who wants to spend a lot of money to get 5% higher performance?
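A quick sketch of that offload argument in Amdahl's-law terms (the 5% figure is the one quoted above; the rest is just arithmetic):

Code:
def best_case_speedup(offloaded_fraction):
    # Upper bound: the offloaded work becomes free, everything else unchanged.
    return 1.0 / (1.0 - offloaded_fraction)

for frac in (0.05, 0.25, 0.50):
    print(f"offload {frac:.0%} of the CPU work -> at most {best_case_speedup(frac):.2f}x faster")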
 
You claim that the earlier Sun J4200 models mixed SATA and SAS disks; that might be true, I don't know. But did they use SAS expanders?

Yep. See http://docs.oracle.com/cd/E19930-01/821-0817-10/821-0817-10.pdf pages 3 and 14.


My suspicion is that all larger storage servers use hw-raid as a building block: they use software to manage all the separate hw-raid cards, just like ZFS manages all the separate HBA cards. Thus, software raid systems are used to manage hw-raid cards. That is my suspicion. And now some people claim that a single hw-raid card can control other hw-raid cards, so that hw-raid cards can control and manage thousands of disks all by themselves. I doubt that.

I believe that software raid systems such as ZFS, GPFS, NetApp, etc., do not require hw-raid cards; they run fine with JBODs. I don't see a future for hw-raid cards.

Regarding the ZFS servers: sure, they are just normal servers with RAM, CPU, etc., and all the software runs on the OS. Similar to a hw-raid card, which is, in effect, a tiny computer. Why run a tiny computer (hw-raid) inside a server? You don't need the extra hardware anymore. If XOR parity calculations consumed 50% of a CPU, there would be a need for a tiny computer to offload the server. But ZFS consumes maybe 5% of one core. There is no need to offload such a minuscule workload. If you redesigned ZFS to take advantage of a tiny computer offloading that work, your server would run at most about 5% faster, for the cost of a tiny computer. Who wants to spend a lot of money to get 5% higher performance?

In any case, you don't seem to understand that hardware needs software, and software needs hardware. Whether you are running HW R6 or SW ZFS, you have a hardware infrastructure that is also running software. Whether that is software running on an embedded CPU at a low level, or running on the main CPU at a higher level doesn't matter. Will ZFS rule the roost someday? Who knows. Will some other filesystem/volume manager? Who knows. That doesn't make HW RAID a dinosaur, nor does it make ZFS the champion. There is room for all to coexist; it depends on your needs, application, SLA, etc. There, now I am done!
 
You claim that the earlier Sun J4200 models mixed SATA and SAS disks; that might be true, I don't know. But did they use SAS expanders?

It's not a claim, it's a fact - and "switched SAS" is another term for an array using expanders (I guess they thought that "expanded SAS" gave the wrong impression :))
 
Whether that is software running on an embedded CPU at a low level, or running on the main CPU at a higher level doesn't matter.

Sure it does; cost matters.

When inexpensive quad-core Xeons are more than capable of doing everything, why pay for dedicated RISC CPUs that do little more than manage parity?

When I can (relatively) inexpensively put 256-512 GB of RAM on the motherboard that manages everything in the storage system, why would I limit myself to 4 GB on a card?

'H/W' cards were born of necessity. CPUs weren't that fast back then. RAM was crazy expensive.

You mentioned crypto too, another case where CPUs (with AES-NI) absolutely dominate now. There is no need for expensive dedicated crypto acceleration chips anymore (unless your workload isn't accelerated by AES-NI).

Moving all the compute to the CPU makes more sense than spending money on dedicated cards that each manage one single function.

Virtualization was the last tool needed to fully turn compute into a commodity. The storage vendors are insisting that storage cannot be a commodity, but Nexenta et al. are proving storage should be a commodity.

When you can build enterprise-class storage for 1/4 the cost of EMC/NetApp/etc., why wouldn't you? Whose business is more important, yours or theirs?
 
Sure it does; cost matters.

I agree. Let's compare apples to apples, since brutal is proposing "ZFS" (as if it were a single item) vs. EMC/NetApp etc. in an enterprise situation. Now, I consider enterprise (since that is where I work) to mean well-supported commercial hardware with exceptional onsite service available nationally (or internationally). Something with guaranteed availability of specific parts based upon the manufacturer's roadmap (generally for 5 years, but that is not an absolute). So pick your vendor... HP, Dell, even Supermicro. Let's say we were just building a small SAN box: server, DAS chassis, 24 SAS drives, whatever processor, memory, etc. you wish to add. We're going to build two boxes... the only difference is one will run ZFS and the other will run whatever... In the grand scheme of things, whether you choose 1 (or 2 or 3) SAS HW RAID cards vs. 3 basic 8-port SAS HBAs (again, fully supported by the manufacturer), the difference in cost between the two options, when you take the total cost of hardware, software, service, etc. amortized over time, is so small as to be almost irrelevant. I am not talking about some guy taking a Supermicro motherboard, sticking it with some M1015s he got off some guy (or three) on eBay into a Norco with some drives, and calling that an enterprise server.


When inexpensive quad-core Xeons are more than capable of doing everything, why pay for dedicated RISC CPUs that do little more than manage parity?

Well, the RISC CPUs don't add all that much cost to the equation in the grand scheme of things. One reason a lot of people use hardware RAID cards, even if they don't use the RAID functionality (just JBOD mode), is that they are very fast and efficient cards and are supported or offered by the server manufacturer.



When I can (relatively) inexpensively put 256-512 GB of RAM on the motherboard that manages everything in the storage system, why would I limit myself to 4 GB on a card?
Well, if it is throughput limitations you are worried about... let me cite an example, and it is a very large ZFS system, which they expect will scale to over 10 PB: http://www.lsi.com/downloads/Public/Direct Assets/LSI_CS_PittsburghSupercomp_041111.pdf Each rack consists of two servers. Each of the two servers has a single LSI 9201-16e JBOD HBA (x8) connected to the SAS switches, which are then connected to 495 3 TB SATA drives. So that is one x8 card handling 247-odd drives. Screaming 4 GB/s sequential speed is not the rule in installs like these; it is high IOPS and lots of high-queue-depth random-access requests. Oh yeah, it uses SATA drives with those evil SAS HBAs and expanders :)
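A rough sketch of that oversubscription math (the drive count is from the install described above; the per-drive rates are assumptions):

Code:
SLOT_MB_S = 8 * 500              # PCIe 2.0 x8, roughly 4,000 MB/s usable
DRIVES = 247                     # spindles behind one x8 HBA in that install
SEQ_MB_S = 100                   # assumed sequential rate per spindle
RAND_IOPS, IO_BYTES = 80, 4096   # assumed 4K random IOPS per spindle

seq_demand = DRIVES * SEQ_MB_S
rand_demand = DRIVES * RAND_IOPS * IO_BYTES / 1e6
print(f"sequential demand: {seq_demand:,} MB/s vs {SLOT_MB_S:,} MB/s "
      f"({seq_demand / SLOT_MB_S:.1f}x oversubscribed)")
print(f"4K random demand : {rand_demand:,.0f} MB/s vs {SLOT_MB_S:,} MB/s "
      f"(IOPS and latency, not the link, set the ceiling)")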

You mentioned crypto too, another case where CPUs (with AES-NI) absolutely dominate now. There is no need for expensive dedicated crypto acceleration chips anymore (unless your workload isn't accelerated by AES-NI).

It all depends on what kind of math you are using in your codebreaking attempts. I am not talking about general filesystem encryption, I am talking codebreaking.

Moving all the compute to the CPU makes more sense than spending money on dedicated cards that each manage one single function.

When you can build enterprise-class storage for 1/4 the cost of EMC/NetApp/etc., why wouldn't you? Whose business is more important, yours or theirs?

Sure, even using HP equipment I can build a storage infrastructure myself for 1/3 of the cost. But the software we run has requirements (Oracle mostly) and recommended hardware. Our business continuity insurance has requirements as to what we run. Some of our clients have specific requirements. A lot of it is politics and covering your ass. I also don't want the responsibility if anything becomes an issue, as I have enough to do as it is. I love ZFS. I love hardware RAID in Windows. I love mdadm software R6 in Linux. I love variations of all of them based on the particular need or task. Each has its place, and will continue to.
 
Now, I consider enterprise (since that is where I work) to mean well-supported commercial hardware with exceptional onsite service available nationally (or internationally).
This is typical wasteful enterprise thinking: "if I pay for support, then if something breaks it isn't my fault." So you're going to pay an insane premium for a branded solution, then pay a large amount of money on top of that for premier support, and then you're going to pay FTEs to become experts on the product and likely pay to have those FTEs take tests for pieces of paper that say "I are expert".

Enterprise wastes so much money on this crap, all because people are terrified of something breaking. Build it fast, cheap and available, and back it up. If you lose data anyway, having EMC on the phone or on site won't matter, as your city likely got nuked and the whole world has bigger problems.

But the software we run has requirements (Oracle mostly) and recommended hardware.
So you're suggesting Oracle DBs can't run on Oracle ZFS storage? Really?
Our business continuity insurance has requirements as to what we run.
This is bullshit then. How you run things, obviously. But what you run? Tell them to buy it for you or find better insurance.
Some of our clients have specific requirements.
Give me examples, apart from Vblock (which can be replicated with different stacks).
A lot of it is politics and covering your ass.
Cowardly enterprise mindset; I know, I used to work in enterprise too. Now I consolidate in-house IT to managed IaaS, and people with these mindsets are losing their jobs because I can provide the same stuff at half the cost.
I also don't want the responsibility if anything becomes an issue, as I have enough to do as it is.
Nexenta and their partners have done all the grunt work.

I love mdadm software R6 in Linux.
mdadm sucks balls, lol. mdadm + LVM is worse, in comparison to ZFS of course. mdadm works, it's just such a pain to manage the whole solution.
Each has its place, and will continue to.
Everything "cloud" is moving away from H/W RAID and legacy storage vendors. All you have to do is ask EMC how much of their revenue last year was storage and how much was VM: 90% was VM... When a company that sells VM, which relies heavily on storage, gets only 10% of its revenue from its own storage, that should tell you something.

I get it, nobody ever lost their job buying EMC or Cisco... but if it all dies and you don't have backups, or you can't get the system back online, you will lose your job regardless of what you bought. I'd rather save millions of dollars, hire more intelligent people, and build in redundancy.

My bottom line is more important than EMC's or NetApp's or...
 
Now, I consider enterprise (since that is where I work) to mean well-supported commercial hardware with exceptional onsite service available nationally (or internationally).
This is typical wasteful enterprise thinking: "if I pay for support, then if something breaks it isn't my fault." So you're going to pay an insane premium for a branded solution, then pay a large amount of money on top of that for premier support, and then you're going to pay FTEs to become experts on the product and likely pay to have those FTEs take tests for pieces of paper that say "I are expert".

That is not what I said. I have equipment at 30 sites around the world, including satellite offices. At 16 of them I have HP hardware service contracts on the servers, so if anything hardware-related goes wrong I don't have to get on a plane at 3am, nor maintain full-time IT staff onsite where 99% of the time they would be twiddling their thumbs. They are mostly clustered systems which will maintain the uptime of the office while the failed hardware is replaced.

Enterprise wastes so much money on this crap, all because people are terrified of something breaking. Build it fast, cheap and available, and back it up. If you lose data anyway, having EMC on the phone or on site won't matter, as your city likely got nuked and the whole world has bigger problems.

No, I inherited my EMC SAN and NetApp equipment in acquisitions my company made of other companies; they are running the specific tasks those companies purchased them for. I didn't purchase or install them. That said, if it makes sense to migrate away from a platform, now or when it is time to replace it, I will evaluate what best serves our needs.

But the software we run has requirements (Oracle mostly) and recommended hardware.
So you're suggesting Oracle DBs can't run on Oracle ZFS storage? Really?

I never said that. In fact, if you read further up you will see we have a number of Sun boxes running Oracle, including M3 blades tasked to a 7420 storage system. I also have HP and Supermicro boxes running Oracle instances as well.

Our business continuity insurance has requirements as to what we run.
This is bullshit then. How you run things, obviously. But what you run? Tell them to buy it for you or find better insurance.
Again, there are requirements on availability and replacement equipment. That is a legal and insurance decision; I give my input and we go back and forth, but I'm not the final say. There are also requirements about our clients' data that we maintain for them, such as rules about outsourcing storage to third parties, encryption mandates on our backups taken offsite, etc. They don't tell us what particular devices we can and can't buy, but many rates are set based upon how we do what we do. I don't have a three-letter acronym on my card.


everything 'cloud' is moving away from h/w raid and legacy storage vendors. all you have to do is ask EMC how much of their revenue last year was storage and how much was VM. 90% was VM .... when a company who sells VM which relies heavily on storage only has 10% of its revenue generated by its own storage that should tell you something.

Honestly, I don't know what their breakouts are.... Everything runs in cycles, sometimes things go one way and sometimes the other. Remember when thin clients were going to destroy the PC business and move everything to the head end? Outside of large centrally planned VM headends and specific security needs, this isn't th way things shaked out. Server consolidation has taken off but client consolidation for local use hasn't. Even where it has it is most often run on real PCs and not thin clients. As to how fast things migrate to "the cloud", we'll see and again make our decisions based on needs, and not just jump on "the latest" new thing immediately.

I get it, nobody ever lost their job buying EMC or Cisco... but if it all dies and you don't have backups, or you can't get the system back online, you will lose your job regardless of what you bought. I'd rather save millions of dollars, hire more intelligent people, and build in redundancy.

But we do. We have built a very highly available infrastructure, and have the in-house talent to manage it whatever the needs. We have the backup and DR infrastructure in place to stay up. I don't get why you keep implying that I am suggesting EMC or NetApp vs ZFS; I am not. What I do suggest is buying good hardware, designing your infrastructure to stay highly available, and planning for what "might" happen to your infrastructure where you can.

My bottom line is more important than EMC's or NetApp's or ...

I agree completely; I never implied anything different.

In any case, this thread has gone wildly off the rails, ending with someone who self-admittedly has never set up a single ZFS install, let alone a large storage system, pointing to threads from some time ago and stating that something which is currently being sold and supported by a number of companies (SATA disks with SAS expanders) is toxic unless you prove it to "his" standards.

No one platform or filesystem is right for every situation. It is a case-by-case basis, depending on many variables.
 
Last edited:
...........



If we extrapolate and look at history, then H/W is dying. At first there is always new functionality in hardware, because the computers don't cut it. As the tech matures, it becomes possible to run it in software. For instance, modems: long ago everybody bought modems, but now it is possible to emulate a modem in software.

Cryptography is best run in hardware, because software is too slow. But when CPUs are strong enough, there is no need for crypto hardware accelerators anymore. Software will be fine in the future.

Graphics are best run in hardware today. But Intel was tinkering with the Larrabee CPU, which would allow the CPU to run graphics. Also, Euclideon has gotten a lot of attention with their new graphics engine, which allows a CPU to create brutal graphics, 100,000 times faster than today's graphics engines. If Euclideon ever makes it, then we won't need a graphics card anymore, and all graphics can be run in software.
http://www.youtube.com/watch?v=JWujsO2V2IA&feature=related

Regarding RAID: in the beginning you needed a hardware card to offload the XOR parity calculations, because the PC was not strong enough. A HW RAID card is essentially a PC with RAM, CPU, I/O, etc., which is why they are expensive. But today's CPUs are very strong, and ZFS checksumming requires something like 5% of one core in a quad-core CPU. Thus there is no need for extra hardware doing the calculations; a CPU is fine. Also, NetApp is using software RAID. I suspect all larger storage servers are using software; I doubt any larger storage server is managed by a HW RAID card exclusively. I believe they all have software RAID managing the large servers. But I might be wrong, as mrwroobel says HW RAID can scale and handle many disks.
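To illustrate how light the parity math is, here is a minimal Python sketch of RAID-5-style XOR parity over one stripe. The chunk size and disk count are made up for the example, and real implementations use vectorized/SIMD code rather than a byte loop:

```python
from functools import reduce
from operator import xor

def stripe_parity(chunks):
    """XOR equal-length data chunks column by column to get the parity chunk."""
    return bytes(reduce(xor, column) for column in zip(*chunks))

# Toy stripe: 4 data disks, 4 KiB chunks (sizes are illustrative only).
data_chunks = [bytes([i]) * 4096 for i in range(4)]
parity = stripe_parity(data_chunks)

# A lost chunk is rebuilt by XORing the parity with the surviving chunks.
rebuilt_first = stripe_parity([parity] + data_chunks[1:])
assert rebuilt_first == data_chunks[0]
```

The whole operation is just XOR over memory, which is why a modern general-purpose CPU barely notices it.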

Even earlier, the very old Atari game console with Asteroids, King Kong, etc. had the games burned into hardware. When you wanted to change games, you swapped a hardware cartridge; each game was a different piece of hardware. Today you just load another game, so the CPU can run all these different games for you. No need for special hardware.

Regarding music creation: earlier you needed a lot of hardware, such as a tape recorder, a mixer, effect pedals for guitar, a piano, etc.; now you can run everything in software. There are software pianos that are better than most ordinary pianos (Pianoteq), and you can create drum patterns on the PC without any drums. You can do virtually everything in software today; you don't need extra hardware anymore. Many musicians have stopped buying hardware synths and only buy software, which you can upgrade easily. You only need a cheap digital piano, or you can even use the computer keyboard.

We see that more and more functionality is moving into software. A strong enough CPU can be programmed to do virtually anything, even to emulate another computer. You need less and less hardware. In the future, a strong enough PC with many cores can emulate everything you need, so you don't need a graphics card, RAID card, modem, Intel NIC, etc. Only a single strong CPU.

Do you understand why I am saying that hardware is dying? It applies not only to RAID cards, but to everything. When the CPU is strong enough, the hardware is not necessary anymore. Do you agree? Who buys modems today?

I am not a believer in the idea that H/W is dying :D

You need H/W and S/W in our reality.
As I have learned, from the PC-AT 386 with the Turbo button all the way to current technology, H/W and S/W work together.

Even if you have a fast CPU, you cannot emulate everything :D

For example, graphics cards are a different animal; the chips are specialized for graphics workloads.

H/W RAID is not PC-like.
H/W RAID is a specialized chipset with an embedded RISC/MIPS processor.
That is the reason they call it an I/O engine/chip/processor in their technical white papers.

As I said, S/W and H/W get along; they rely on each other. I run both S/W and H/W RAID at home and they work. Everything is based on what I or the customer need(s).
 
Even if you have a fast CPU, you cannot emulate everything
If the emulation is too slow to be usable, then you need hardware to get up to speed. But if the emulation is fast enough, then I don't see the point of using hardware instead. I mean, if the emulation is fast enough, why not use the CPU?

Regarding HW RAID, the emulation is fast enough. The XOR parity calculations take a few percent of CPU power, so why would you want to offload that to an external CPU?

You don't believe that, in the future, everything that can be emulated is going to be emulated? Why keep old computers when you can emulate all of them in one VMware/VirtualBox server? I don't get it. I mean, virtualization is a hot topic today, and that is because you can emulate hardware in software. If emulation did not give huge gains, then virtualization would not be interesting, and everybody would sit with hundreds of servers instead of consolidating several servers into one. You don't believe that virtualization is going to spread even more? Hardware is the way to go?

Software RAID systems such as ZFS, GPFS, Btrfs, Windows ReFS, ... will kill off hardware RAID systems. Everybody is doing ZFS-like filesystems today. They are becoming more and more common.
 
I promised myself I wasn't going to continue in this thread, but what the hell, here goes: You are taking this as if I am attacking ZFS. That is completely wrong. I have said that whether you choose hardware parity or software parity depends completely on your application. If you are running Windows Server, for example, and need DAS, then ZFS isn't really an option. There are a lot of people running Windows servers. Also, Solaris isn't ideal for all (many) applications (it is nicknamed Slowlaris for a reason).
If you talk about Windows, ReFS is similar to ZFS, and ReFS will make HW RAID obsolete. Sell your HW RAID card while there is still a market for it.

Regarding Slowlaris, the nickname came from the old TCP/IP stack, which was slow. But it has been rewritten in Solaris 11 and is now very fast; it is called FireEngine. Early benchmarks showed it to be 30% faster than the Linux TCP/IP stack. I don't know of newer TCP/IP benchmarks.


In any case, large ZFS installs aren't putting 1000 drives onto one controller (HW or software). They are attaching shelves of drives to storage controllers and then meshing them in one manner or another. You can do the same thing with multiple HW arrays.
Yes, but those multiple HW arrays need something that controls them, right? They cannot control each other. Because you need software that controls the HW arrays, you might as well go the entire way and use software exclusively.


Well, that isn't exactly the case. Sun went after NetApp first https://communities.netapp.com/comm...5/netapp-sues-sun-for-zfs-patent-infringement
They tried to use their patent portfolio as a battering RAM to make some quick dollars. The NetApp lawsuit is a counterclaim asserting some of their own patents, not because NetApp is "scared" of Sun and/or ZFS.
If you read the comments, it seems that NetApp first tried to buy some storage patents from Sun. Sun looked at the patents and realised that NetApp was infringing those Sun patents, which is why NetApp wanted to buy them in the first place. Because NetApp infringed the patents, it is fair if NetApp pays for a license, like everybody else does, yes? I would hardly call that patent trolling.

I mean, if my company realizes we infringe HP patents, and we try to buy the patents from HP but HP rejects our offer, is HP patent trolling if it then demands a license from us? Or are we the ones trying to escape the license fee?

Some would say it is quite ugly of NetApp to first try to buy the patents, then refuse to pay a license fee, and then tell everybody that Sun is patent trolling. Sun never had a reputation for patent trolling, like IBM has:
http://www.forbes.com/asap/2002/0624/044.html


You don't seem to grasp the fact that software runs on hardware. Whether you use a CPU- or GPU-based system, your software is constrained by the limits of the hardware. As to running it on the mega-general CPU that you extol, FPGA-based crackers wipe the floor with general x86 CPUs for cryptography.
...
Yes, and what a failure the Larrabee platform was; it was envisioned as an Nvidia killer for gamers: http://semiaccurate.com/2009/12/04/intel-kills-consumer-larrabee-focuses-future-variants/
You forget my premise: if the emulation is fast enough, then the hardware is not necessary. If the emulation is slow, then you need hardware. But you need hardware only until the emulation is fast enough. Regarding Larrabee, I know it was killed, but my point is that we are now seeing the first attempts to get rid of the discrete graphics card and run everything on the CPU. In the past it was not possible, but in the future it will be possible to run everything on the CPU. It is coming. We see the first attempts; then we will see the next attempt, and a third, and somewhere along the way it will happen. And in the future no one will use a separate graphics card.

Do you remember the 3dfx graphics card? You connected a separate 3D card to your 2D card. You used separate hardware. Then everything got merged into one card. In the future everything will run only on the CPU. You don't agree with this?


They had the game burned into hardware because it was cheaper to offload the cost of a 1K PROM onto the software purchases than to include the up-front cost of adding a removable media device like a floppy (which wasn't generally available in anything but a consumer-unfriendly 8" format in 1977 when the 2600 shipped). The Atari also only had 128k of RAM, so it was easier to have an almost instantly available path to the software, and bank switching in later cartridges offered even more available storage. Had cheap, reliable external storage been available, I am sure they would have used it. Again, progress.
I think the Atari only had 128 bytes of RAM, not 128k. But yes, it is progress. In the future they will say "back in 2010, people had separate RAID cards. Today we use ReFS and the like. This is called progress".


As to analog modems, just about every fax machine has one and they sell a shitload of them still!
When it is possible to emulate analog modems on a cheap CPU, there will not be any need for analog modems anymore. Progress.



You had an interesting link about a large 10-petabyte ZFS installation. They talk about LSI switches. How do such switches work? Do they connect a disk array into the switch? Is it similar to an expander? What is the difference between an expander and a switch?
 
If you talk about Windows, ReFS is similar to ZFS, and ReFS will make HW RAID obsolete. Sell your HW RAID card while there is still a market for it.

I don't understand your hatred for hardware RAID (or, as I am finding, ANYTHING that isn't emulated in a CPU), but sure, I'll play. ReFS and ZFS have some similarities, but let's talk about the end-to-end data integrity and self-healing you extol as most important of all. If you have a corrupt file (silent or otherwise) in ZFS, it can for the most part fix it internally. ReFS (which, by the way, is not finished yet in anything but beta form, and anyone with a brain doesn't just jump into anything Microsoft puts out that is version 1.0) doesn't, and if the file(s) is/are internally corrupted you need to go to your backups and restore. From Wikipedia (since you love corroborating posts): "If nevertheless file data or metadata becomes corrupt, the file can be deleted without taking down the whole volume offline for maintenance, then restored from the backup. As a result of built-in resiliency, administrators do not need to periodically run error-checking tools such as CHKDSK when using ReFS." http://en.wikipedia.org/wiki/Windows_Server_2012

Regarding Slowlaris, the nickname came from the old TCP/IP stack, which was slow. But it has been rewritten in Solaris 11 and is now very fast; it is called FireEngine. Early benchmarks showed it to be 30% faster than the Linux TCP/IP stack. I don't know of newer TCP/IP benchmarks.

No, actually the reference came about when the newly released Solaris performed generally slower than the existing SunOS it replaced.

Yes, but those multiple HW arrays need something that controls them, right? They cannot control each other. Because you need software that controls the HW arrays, you might as well go the entire way and use software exclusively.

As I pointed out in a previous post in this thread, the cost difference between a high-performance name-brand RAID card and an equivalent JBOD card is inconsequential in the grand scheme of things. In any event you are still using HARDWARE to control the drives, regardless of the filesystem.

If you read the comments, it seems that NetApp first tried to buy some storage patents from Sun. Sun looked at the patents and realised that NetApp was infringing those Sun patents, which is why NetApp wanted to buy them in the first place. Because NetApp infringed the patents, it is fair if NetApp pays for a license, like everybody else does, yes? I would hardly call that patent trolling.

I mean, if my company realizes we infringe HP patents, and we try to buy the patents from HP but HP rejects our offer, is HP patent trolling if it then demands a license from us? Or are we the ones trying to escape the license fee?

Some would say it is quite ugly of NetApp to first try to buy the patents, then refuse to pay a license fee, and then tell everybody that Sun is patent trolling. Sun never had a reputation for patent trolling, like IBM has:
http://www.forbes.com/asap/2002/0624/044.html

Well, the lawyers will thrash that out; that is beyond my expertise, and in the end the lawyers are usually the only ones getting rich.


You forget my premise: if the emulation is fast enough, then the hardware is not necessary. If the emulation is slow, then you need hardware. But you need hardware only until the emulation is fast enough. Regarding Larrabee, I know it was killed, but my point is that we are now seeing the first attempts to get rid of the discrete graphics card and run everything on the CPU. In the past it was not possible, but in the future it will be possible to run everything on the CPU. It is coming. We see the first attempts; then we will see the next attempt, and a third, and somewhere along the way it will happen. And in the future no one will use a separate graphics card.

Do you remember the 3dfx graphics card? You connected a separate 3D card to your 2D card. You used separate hardware. Then everything got merged into one card. In the future everything will run only on the CPU. You don't agree with this?

Regardless of your premise, there are support chips and there are CPUs. Maybe one day everything will be run on a CPU, but that day isn't today (or in the near future). Regarding why people bought a second 3D card, it was because it was something new, an add-on that only some people wanted/needed. Obviously technological advances brought that to a single card... But people are still using graphics cards. And even the CPUs that have graphics integrated are still using either a separate die or a special area of a monolithic CPU with an ASIC for the graphics; they aren't putting the graphics processing through the main cores. And with the tech available today, and the power requirements and heat generation of a high-end ATI or Nvidia graphics card, do you really want to stuff that into a CPU??? For some boxes, onboard integrated graphics works. For most it doesn't. Also, the hardware controller vendors are not sitting still; you talk about the future, and maybe in the future there will be hardware-controlled storage arrays that are much more efficient than CPU-run software. I don't know, and neither do you. In 10 or 50 years, sure, there might be chips capable of it, but not now. But this is a discussion of now and the near term, not a thought exercise about what might be.


I think the Atari only had 128 bytes of RAM, not 128k. But yes, it is progress. In the future they will say "back in 2010, people had separate RAID cards. Today we use ReFS and the like. This is called progress".

Sorry, typo. Since you obviously know the future, can you give me some future sports winners (MLB, NFL, horse racing, you choose)?

When it is possible to emulate analog modems on a cheap CPU, there will not be any need for analog modems anymore. Progress.

It is better to keep your analog and digital separate, and it saves on your CPU load and design costs to add a separate chip that might cost $0.25.

You had an interesting link about a large 10-petabyte ZFS installation. They talk about LSI switches. How do such switches work? Do they connect a disk array into the switch? Is it similar to an expander? What is the difference between an expander and a switch?
You are free to investigate that at length yourself, but the easiest way to visualize it is that an expander can be thought of as a network switch with some basic intelligence for smart switching between a single host and its client drives/enclosures. A SAS switch is a much more intelligent device (generally with much more powerful HARDWARE :D SAS processor(s)) which allows multiple hosts to talk to multiple enclosures and handles the physical and logical routing. BTW, if you don't even understand how a SAS switch works and its uses in a large storage infrastructure, and have never even set up ANY ZFS infrastructure at all (let alone a large install), other than by regurgitating marketing releases from Nexenta, how do you know how it scales and how "GREAT" it is and how everything else sucks when compared to other options? Again, this comes down to: no one ANYTHING is right for EVERYTHING.
 
Last edited:
Regarding Slowlaris, the nickname came from the old TCP/IP stack, which was slow. But it has been rewritten in Solaris 11 and is now very fast; it is called FireEngine. Early benchmarks showed it to be 30% faster than the Linux TCP/IP stack. I don't know of newer TCP/IP benchmarks.

The term was originally coined when Solaris first came out (or rather was first branded as Solaris) as it was slower than the previous SunOS on the same hardware.

FireEngine was years ago now!
The "new rewrite" of TCP/IP for Solaris11 was codenamed "Crossbow" IIRC, and is really more aimed at virtualisation of the physical NICs.





You forget my premise: if the emulation is fast enough, then the hardware is not necessary. If the emulation is slow, then you need hardware. But you need hardware only until the emulation is fast enough.

You will always need hardware - without it there is nothing for the software to run on.
The nature of the hardware may change, as will that of the software, but it will always have to be there - even in the case of the "cloud"..... the hardware is still there, it's just somewhere else!
 
The term was originally coined when Solaris first came out (or rather was first branded as Solaris) as it was slower than the previous SunOS on the same hardware.
This is funny, but I read somewhere that Slowlaris stemmed from the slow TCP/IP stack. Googling further, I cannot find my link anymore. It seems the explanation you present is correct.

You will always need hardware - without it there is nothing for the software to run on.
The nature of the hardware may change, as will that of the software, but it will always have to be there - even in the case of the "cloud"..... the hardware is still there, it's just somewhere else!
Yes, but we are going from specialized hardware to general hardware that can mimic and emulate other hardware. My claim is that in the future, specialized hardware will be needed less. General hardware will always be slower than specialized hardware, but when the general hardware is fast enough I don't see the need for specialized hardware.

In the future we will only have a many-core CPU and the rest can be emulated. Some cores handle graphics, some cores handle sound, etc.



I don't understand your hatred for hardware RAID (or, as I am finding, ANYTHING that isn't emulated in a CPU),
I don't have a hatred; I am only saying that specialized hardware will become obsolete when it is possible to emulate it fast enough in software. I hope you agree with this projection?

No, actually the reference came about when the newly released Solaris performed generally slower than the existing SunOS it replaced.
Yes, it seems correct.

As I pointed out in a previous post in this thread, the cost difference between a high-performance name-brand RAID card and an equivalent JBOD card is inconsequential in the grand scheme of things. In any event you are still using HARDWARE to control the drives, regardless of the filesystem.
Yes, I still have to use hardware, but I don't need specialized hardware anymore. The general-purpose CPU suffices.

Well, the lawyers will thrash that out; that is beyond my expertise, and in the end the lawyers are usually the only ones getting rich.
When Oracle bought Sun, NetApp dropped everything. NetApp does not want to mess with Oracle. This ZFS case is settled.

Maybe one day everything will be run on a CPU, but that day isn't today (or in the near future).
So you agree with my projection? When it is possible to emulate hardware in software, the hardware will become obsolete? :)

And with the tech available today, and the power requirements and heat generation of a high-end ATI or Nvidia graphics card, do you really want to stuff that into a CPU???
No, the emulation of that kind of graphics is not good enough yet. But if Euclideon succeeds, they use a new approach to graphics which means the CPU can run everything. There is no need for a graphics card at all. In that case, we don't need those 400-watt graphics cards anymore. We can ditch them, because we can run heavy FPS games on the CPU only. If the game industry transitions to Euclideon's technology, I will be the first to ditch my graphics card, but I assume you will still buy expensive graphics cards even though they are not used? ;)

A SAS switch is a much more intelligent device (generally with much more powerful HARDWARE :D SAS processor(s))
Yes, it uses hardware, but if it is possible to emulate all that functionality on general-purpose hardware, then they will become obsolete.

BTW, if you don't even understand how a SAS switch works and its uses in a large storage infrastructure,
I know what a SAS switch is used for, but I don't know the detailed differences between one and an expander. Is it correct to say that a switch is basically an expander, but more intelligent?

and have never even set up ANY ZFS infrastructure at all (let alone a large install), other than by regurgitating marketing releases from Nexenta, how do you know how it scales and how "GREAT" it is and how everything else sucks when compared to other options? Again, this comes down to: no one ANYTHING is right for EVERYTHING.
Of course I have set up and used ZFS for many years; I have followed ZFS from the very beginning. But I have never set up or used it in production as you have. To say that I have never set up any ZFS infrastructure at all is wrong, but I don't have production experience.

Regarding the scaling of ZFS, well, it is obvious it scales well. It handles many hundreds of disks without problems or issues. I have asked you to show me a HW RAID card that handles many hundreds of disks in production, but you have not. Does that mean HW RAID does not scale, or does it mean you just don't want to show such links?

I don't understand why you are arguing against software RAID scaling better than a hardware RAID card. It is common sense; of course it is true. I have never seen a single HW RAID card handle several petabytes, but software RAID systems do, because they can span several HBAs.

But I agree there are many other solutions, and you should choose a solution depending on the workload.
 
I don't understand why you are arguing against software RAID scaling better than a hardware RAID card. It is common sense; of course it is true. I have never seen a single HW RAID card handle several petabytes, but software RAID systems do, because they can span several HBAs.

But I agree there are many other solutions, and you should choose a solution depending on the workload.

I have not said that it does or doesn't, nor have I extolled the benefits of one over the other in absolutes. I have said from the beginning that, depending on your particular application, you choose what is best for that application. As to the tremendous savings you suggest from using JBOD HBAs vs hardware RAID cards (whether or not you utilize the HW functionality), I have explained that when designing a system the cost differential is small in the grand scheme of things, amortized over the purchase and its lifetime and as a percentage of overall costs. As to a single HW RAID card handling multiple petabytes, I never said you would use a single card; nor would you use a single HBA in a ZFS system of that size. In a large infrastructure such as that, regardless of your filesystem or connection topology, you will have many servers connected with many HBAs, each managing its allocated segment of disks, with or without some kind of switch and/or expander. This is done for reasons of fault tolerance, bandwidth contention, physical separation for legal or other reasons, and many more.

This thread is not for, nor do I wish to get into, a pissing match about what is best for any particular application in absolutes. You said HW is dead and ZFS is king; I have said that they all have their place based on your needs and applications. I currently run systems with HW RAID and with SW RAID. I have systems with NTFS, ZFS, HFS, Ext4 and even LTFS (HSM and backup tape arrays). I even have one sorry box somewhere running HPFS on OS/2 (long story: they never approved the $$$ to replace some god-awful financial app and decided to flog it till it died). This is not a diatribe on what the future may bring, nor do I wish to be drawn into a discussion of what may or may not come to pass in said future. As things become available and stable we shall evaluate them and make our decisions based on those evaluations and their benefit to whatever application we have.

This thread has gone way off on a tangent. You continue to say that SAS+EXPANDER+SATA is toxic and to be avoided at all costs until you are satisfied that it is OK. I and others have pointed you to production systems being sold and supported that use exactly that combination. If you don't like it and don't trust it, don't use it! It is truly as simple as that. That's it.
 
Last edited:
As to a single HW RAID card handling multiple petabytes, I never said you would use a single card; nor would you use a single HBA in a ZFS system of that size. In a large infrastructure such as that, regardless of your filesystem or connection topology, you will have many servers connected with many HBAs, each managing its allocated segment of disks, with or without some kind of switch and/or expander.
Yes, but all these HW RAID cards in a large infrastructure must be governed by some software. The infrastructure must be run by some kind of software; the HW RAID cards don't run by themselves. Therefore such software scales, because it handles many HBA/RAID cards. A single HW RAID card does not scale; you need many of them, and even then they don't communicate with each other well enough to handle petabytes. But software RAID communicates with and manages each HBA. Thus software RAID (which spans several CPUs, several HBAs and terabytes of RAM) scales better than a single HW RAID card. It also scales better than multiple HW RAID cards, because they cannot reliably communicate well enough to handle petabytes and deliver extreme bandwidth. You need software RAID to scale.
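As a toy sketch of what "spanning several HBAs" looks like in software (the HBA count, disk counts and device names below are invented for illustration, not from any real system), the software layer can lay out redundancy groups so that each group takes at most one disk from each HBA:

```python
# Hypothetical inventory: 4 HBAs with 8 disks each (paths are made up).
hbas = {f"hba{h}": [f"/dev/hba{h}disk{d}" for d in range(8)] for h in range(4)}

def spread_groups(hbas):
    """Group disks so each redundancy group holds one disk per HBA.

    A raidz2-style group built this way survives the loss of an entire HBA,
    since that removes only one member from each group.
    """
    return [list(group) for group in zip(*hbas.values())]

for i, group in enumerate(spread_groups(hbas)):
    print(f"group {i}: {group}")
```

A hardware RAID card cannot make that kind of cross-controller decision on its own; only a layer that sees every HBA at once can.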


You continue to say that SAS+EXPANDER+SATA is toxic and to be avoided at all costs until you are satisfied that it is OK. I and others have pointed you to production systems being sold and supported that use exactly that combination.
No, I don't continue to say that it is toxic. I have seen links here about production systems mixing SAS expanders and SATA disks, which makes me confused. I don't really know what to believe anymore. In the future I will say:
"Some say that SAS expanders and SATA disks are toxic, but apparently there are servers sold, in production, with such a mix without problems. I don't know anymore. You decide if you want to mix or not."
Satisfied? :)
 
I will leave you with this, as this thread has now lost whatever interest it had for me. You keep telling me I said a single RAID card scales to petabytes; I never said that. Then somewhere madrebel said software RAID 6 (mdadm I think) sucked balls. As a parting note, here are some links to RAID 6 systems that scale to 1680 drives in a single controller domain. It uses a combination of custom hardware-based cards, software-based cards and a very intelligent fabric controller. Multiple installations scale to tens of PB and 200+ GB/s throughput. Their own filesystems are self-healing. The actual controller is what they call their "Hardware State Machine" and most of the controlling is done in a custom hardware ASIC. Again, I am not going to get into a pissing match about the difference between what is hardware and what is software, or between software running in the firmware of a particular card/CPU vs software running on the main CPU(s). This is a hardware-based controller... They quote expandability to over 1 TB/s with the right infrastructure.
http://www.ddn.com/pdfs/SFA12K-20_Family_Datasheet_V1.pdf
Here is a sales presentation with some more info and a comparison vs Lustre with ZFS.
http://www.hpcadvisorycouncil.com/events/2012/Switzerland-Workshop/Presentations/Day_2/11_DDN.pdf
Here is a more detailed layout about their functionality.
http://gridka-school.scc.kit.edu/2011/downloads/Storage_Architectures_060911_Toine_Beckers.pdf

Full disclosure: I do not work for DDN, I do not resell DDN. We are currently evaluating the SFA-10K system with 600TB of available storage to replace our EMC boxes.
 
The difference is that once a hardware RAID array needs to scale beyond what can be managed on a few host-based cards, the whole lot is usually moved off the host and onto a dedicated storage platform running whatever OS the manufacturer develops. Hence the hardware RAID array: these are available from all the main players in one form or another, ranging from small 8-12 disk systems right up to those which can scale to thousands of disks.

Host-based software RAID (even the more advanced offerings, such as ZFS) still has potential availability issues (unless perhaps you have a fast and reliable cluster offering), as it relies on the standard commodity system it's running on, and those systems aren't fault tolerant; many hardware RAID arrays are, certainly once you get to enterprise-class products. In the enterprise, data availability is almost as important as data integrity.


At the end of the day, if you are arguing about a home user running a relatively small multi-disk PC-based NAS, and whether he should run a ZFS pool on standard HBAs or some other filesystem on RAID HBAs, then I'd probably come down on the ZFS side, due to cost more than anything else TBH. However, there's nothing to stop you combining the two as well!

If you asked the same question on behalf of a customer who required, say, a 200-disk storage system supporting hundreds of users in a busy commercial environment, then the answer might be different!
 
This is a hardware-based controller... They quote expandability to over 1 TB/s with the right infrastructure.
http://www.ddn.com/pdfs/SFA12K-20_Family_Datasheet_V1.pdf
Cool server, but it looks like a software system manages several hardware RAID cards. It could just as well skip the HW and go software all the way: manage several HBAs instead. But I wonder how they do data integrity when they use HW RAID to handle the last step, saving to disk? The reason ZFS can guarantee data integrity is that ZFS handles everything end-to-end. It seems this is not end-to-end, as the HW RAID handles the last step...
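To make the "end-to-end" point concrete, here is a minimal sketch of the idea only (not ZFS's actual on-disk format): checksums are stored alongside the block references and verified on every read, so a silently corrupted copy is detected and the good copy is returned:

```python
import hashlib

class MirroredStore:
    """Toy end-to-end integrity: two mirror copies plus a checksum per block."""

    def __init__(self):
        self.copies = [{}, {}]   # two toy "disks"
        self.checksums = {}      # block id -> expected SHA-256 digest

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()
        for disk in self.copies:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for disk in self.copies:
            data = disk[block_id]
            if hashlib.sha256(data).hexdigest() == expected:
                return data              # verified copy
        raise IOError("all copies failed the checksum")

store = MirroredStore()
store.write("blk0", b"important data")
store.copies[0]["blk0"] = b"silently corrupted"   # simulate bit rot on one disk
assert store.read("blk0") == b"important data"    # the intact mirror is returned
```

If a separate RAID layer below the filesystem returns whichever copy it likes without such a checksum check, the filesystem above cannot tell a good block from a corrupted one; that is the end-to-end argument in a nutshell.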

But maybe we are actually saying the same thing in different words. I am trying to say HW RAID does not scale; to scale HW RAID you need software that manages all the cards. Then you might as well use software the entire way, including HBAs, and skip the RAID cards. All larger enterprise servers have some kind of software managing all the HW RAID (or HBA) cards: NetApp does, Oracle does, and this link does.

Anyway, I have stated my belief (that specialized HW becomes obsolete once we can emulate it in software on general-purpose CPUs) and you have stated your belief (that HW is sometimes better, even though it is possible to emulate). And we don't get further than this, so let us stop here. You know what I believe, and I know what you believe; we have debated some, but neither has convinced the other to change viewpoints.
 
I will leave you with this, as this thread has now lost whatever interest it had for me. You keep telling me I said a single RAID card scales to petabytes; I never said that. Then somewhere madrebel said software RAID 6 (mdadm I think) sucked balls. As a parting note, here are some links to RAID 6 systems that scale to 1680 drives in a single controller domain. It uses a combination of custom hardware-based cards, software-based cards and a very intelligent fabric controller. Multiple installations scale to tens of PB and 200+ GB/s throughput. Their own filesystems are self-healing. The actual controller is what they call their "Hardware State Machine" and most of the controlling is done in a custom hardware ASIC. Again, I am not going to get into a pissing match about the difference between what is hardware and what is software, or between software running in the firmware of a particular card/CPU vs software running on the main CPU(s). This is a hardware-based controller... They quote expandability to over 1 TB/s with the right infrastructure.
http://www.ddn.com/pdfs/SFA12K-20_Family_Datasheet_V1.pdf
Here is a sales presentation with some more info and a comparison vs Lustre with ZFS.
http://www.hpcadvisorycouncil.com/events/2012/Switzerland-Workshop/Presentations/Day_2/11_DDN.pdf
Here is a more detailed layout about their functionality.
http://gridka-school.scc.kit.edu/2011/downloads/Storage_Architectures_060911_Toine_Beckers.pdf

Full disclosure: I do not work for DDN, I do not resell DDN. We are currently evaluating the SFA-10K system with 600TB of available storage to replace our EMC boxes.
DDN is the only 'hardware raid' company worth considering IMO. They have some really crazy hardware. I recently evaluated them too but they're as expensive as EMC and frankly a lot of people run ZFS on top of their hardware anyway because their head units have some flexibility issues.

However, you can do more or less the same thing they do. They use a 4U 60-drive JBOD that they originally designed but that you can now buy from a company called DataON Storage.

With dual controllers you can then set up a Nexenta-based system with 6-8 9205 controllers and dual-home into each JBOD. That gives you 288-384 Gb/s (36-48 GB/s) of disk bandwidth. Still a FAR cry from what they can do, but unless you're doing HPC with petabyte-size data sets you never need that insane bandwidth.

Also worth noting: in a few months you could upgrade those controllers to PCIe 3.0 controllers with 12 Gb/s ports and double the bandwidth to 576-768 Gb/s, or 72-96 GB/s. For a fraction of the cost you can reach the same insane bandwidth via drop-in replacement HBAs.
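As a back-of-the-envelope check of those figures (treating each controller as 8 external lanes at the nominal SAS lane rate, and ignoring protocol overhead), a quick sketch:

```python
def aggregate_gbps(controllers, lanes_per_controller=8, lane_gbps=6):
    """Nominal aggregate SAS bandwidth in gigabits per second."""
    return controllers * lanes_per_controller * lane_gbps

for lane_gbps in (6, 12):
    for n in (6, 8):
        gbps = aggregate_gbps(n, lane_gbps=lane_gbps)
        print(f"{n} controllers x 8 lanes @ {lane_gbps} Gb/s "
              f"= {gbps} Gb/s (~{gbps // 8} GB/s)")
```

That reproduces the 288-384 Gb/s (36-48 GB/s) figure at 6 Gb/s per lane and 576-768 Gb/s (72-96 GB/s) at 12 Gb/s; real-world throughput will be lower once protocol overhead and PCIe limits are counted.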

I like DDN's hardware and they have some wicked smart people working there, but why would I spend the money when I have no need for the insane bandwidth? 99% of workloads are IOPS-bound. You tackle IOPS limitations two ways: tons and tons of spinning disk, or lots of SSDs. DDN, EMC, NetApp, etc. charge an insane premium for SSDs. EMC wanted over $330k for a DAE-25 filled with 100TB of SAS SSDs. I can get 400GB STEC IOPS drives for ~$2,500 each; 25 * 2500 = $62,500. How does EMC justify that highway robbery?
 
There has never been any question that rolling your own is MUCH cheaper in the end. We built a few Backblaze pods to play with and tested them with mdadm, as well as a slightly different version using SAS without the SATA port multipliers (and obviously different HBAs) with ZFS. We continue to evaluate new and novel entrants, but will not just jump into something/someone new because it is cheaper initially. We have to look at costs (for server equipment) over the depreciation schedule we have for servers (3-5 years depending on a number of variables and particular corporate entities). As far as cost, the equivalent DDN quote has come in quite a bit less than EMC (we were looking at DAE-60-based systems as well as quotes from NetApp and DDN, and it looks like DDN is the route we will be going for this). The particular application we have for this particular backend needs the bandwidth afforded; not every one will.

As to Nexenta in particular, they are a good company and write some good code, though I am personally not a huge supporter because of problems we had with them regarding two boxes that were on their QVL list and had endless problems that wasted a lot of my time, ending with them blaming the hardware suppliers (who blamed Nexenta back) that they themselves qualified. They ended up refunding the value of the replacement hardware against our contract with them. The problems were with NexentaStor 3.0, but after what we went through (and more importantly how poorly they handled the situation) I was not happy. That said, I will not cut off my nose to spite my face, and will continue to evaluate all relevant products based on their particular strengths.
 
Last edited:
So, you have an enterprise app that needs 768 Gb/s of throughput? You're dedicating 20 40Gb InfiniBand ports to this, or 77 10Gb ports?

Also, you need to beat up your EMC rep. Show them the DDN quote and tell them to beat it. They will.
 
So, you have an enterprise app that needs 768 Gb/s of throughput? You're dedicating 20 40Gb InfiniBand ports to this, or 77 10Gb ports?

Also, you need to beat up your EMC rep. Show them the DDN quote and tell them to beat it. They will.

Actually, we are dedicating 18 FDR ports to this particular need. It will handle very high-speed links for local tasks as well as Ka and Ku downlinks and uplinks. The connected boxes will be serviced by a Mellanox SX6018 that we have been evaluating in beta for a few months (expensive, but a DAMN good box). We are replacing some older Voltaire (now Mellanox anyway) IB infrastructure. We may upgrade to a larger Mellanox box so we have more future expansion options and the necessary ports for port mirroring for IDS and network monitoring needs. As to the final costs, I haven't started to beat up the reps yet, nor to play them against each other (and others). That will come once we have determined what any particular metric is worth dollar-wise to us, what we are willing to spend for those metrics, and what works out best in the end.
 
lawl, that is a boatload of bandwidth. WTF are you running that needs that much? Medical imaging?
 
In this particular application, total throughput was as important as (or more important than) IOPS, based on the number of clients (hardware or biological :) ) and processes. Without going into more detail about the actual use than I am comfortable with, it is a static- and dynamic-imaging-focused project.
 
That DDN server is quite wicked, and it has brutal bandwidth. The funny thing is that if you are running one of the largest stock exchanges in the world, you don't need even a fraction of that bandwidth. So that DDN server is truly powerful, and could probably serve all the stock exchanges in the world at the same time without breaking a sweat. That DDN server is high-end, yes. ZFS servers today are low-end / mid-range; there are no high-end ZFS servers. Maybe there will be in the future. When Lustre and ZFS are merged, then maybe.

(Regarding the TCP/IP traffic that the largest stock exchanges need: all traders connected to the exchange, including high-frequency traders / algo traders that generate thousands of orders per second, generate far less traffic than a 1 Gbit NIC can carry.)
 
The DDN box we are most likely going with has an obscene amount of bandwidth, which is exactly what we need for this application. But bandwidth is just one part of the equation. Throughput, IOPS and especially latency all come into play, depending on the application. Especially with internationally connected boxes, latency is the true killer for some applications and bandwidth (or lack thereof) is the true killer for others. This is as true for internally connected devices as for world-connected ones. Some companies and traders are spending billions to bring lower-latency connections between Japan, Europe and the US, because 50-100 ms can mean the difference between making money and losing it. A recent link talking about some basics: http://www.extremetech.com/extreme/...-cost-of-cutting-london-toyko-latency-by-60ms But the exchange, as an example, has differing needs based on the system. Their trading systems have high-IOPS/low-latency needs. Their global backup, DR and general loads have higher bandwidth needs. Like I said, no one anything is perfect for everything.
 
Last edited: