ZFS and SAS expanders with SATA drives: a toxic combo?

I have just posted this in the HP SAS Expander owners thread, but I thought it would be relevant here:
I have a Highpoint 4320 (firmware 1.2.26.5) and an HP SAS Expander (firmware 2.06) in a Supermicro C2SBA motherboard: RAID 5 (4 x WD 2002FYPS) and RAID 6 (7 x WD 2002FYPS).
Running Windows 2008 R2 Standard.

The setup had been solid up until recently. I was copying data from the RAID 5 to the RAID 6 volume, and halfway through the system crashed. As the system stayed responsive to pings, it seemed to be just the storage subsystem that had stopped responding.
I'm wondering if anyone using a Highpoint card has had any similar problems?
 
Under Linux the Marvell 88SE6480 chip used in the SuperMicro AOC-SASLP-MV8 works pretty well, but at least in BSD this controller sucks, with continuous timeouts and other problems.

This card is a turd in anything but Windows, AFAIK. I've got two of them rotting in a box somewhere here.
 
Do not run SATA disks in your expanders if reliability is a concern.

I worked for a vendor that tried this on numerous occasions with Seagate ES.2 SATA (enterprise grade) and had nothing but problems past the second shelf in the daisy chain. The drives had timeouts, disconnects, etc. This will be the death of any ZFS configuration.

SAS was solid in all of our configurations, and for good reason: SAS drives are built for high-IOPS enterprise environments.
 
+1

I didn't take the advice and went for SATA drives on a SAS expander.

I had 600 GB Raptors (WD6000BLHX) on an LSI 2008 HBA and an LSI-based SAS2 x36 expander. I kept getting the messages below in the system log, and drives kept dropping out of the zpools.

scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3020@0 (mpt_sas0):
mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120303
scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,3020@0 (mpt_sas0):
mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120303
scsi: [ID 365881 kern.info] /pci@0,0/pci8086,340a@3/pci1000,3020@0 (mpt_sas0):
Log info 0x31120303 received for target 14.
scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc



(Does anyone know what this is all about?)
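For anyone chasing similar messages, here is a rough sketch of commands that can help narrow things down on a Solaris/illumos box (assuming the standard FMA tooling is present; nothing here is specific to this exact setup):

# Per-device error counters (soft/hard/transport errors per disk)
iostat -En

# Raw FMA error telemetry; look for scsi/transport ereports around the time of the drops
fmdump -eV | less

# Any faults the fault manager has actually diagnosed
fmadm faulty

# Which disks are accumulating read/write/checksum errors at the pool level
zpool status -v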

I swapped them out for SAS drives and so far, so good: no issues. I'll post an update if I have any issues with the SAS drives.
 
Indeed, I would be interested as well...

I am planning on taking an Intel 2612UR server and turning it into a ZFS box. The midplane on this 12-disk server has 2 SAS ports. The documentation does say it uses an expander on the midplane. The funny thing is I've been using an Areca 1680i, and now a 1882i, with ZERO issues with 7 drives plugged into ONE port on the controller (1 port on controller --> 1 port on midplane --> 7 disks on backplane). It's running ESXi 5 at the moment.

I'm moving the VMs to a SuperMicro 6026-URF4+ chassis and using this Intel for backup. The Intel will get an LSI card of some kind (probably the 9211-8i). I'd be very interested to hear if there are any real problems here.

-J
 
Sun sells large numbers of JBOD chassis that are based on SAS expanders and explicitly support SATA disks in them.
This I doubt very much. As far as I know, all Sun/Oracle storage servers that use ZFS use HBAs only; none use expanders. Unless you can show links that confirm your statement, I will not accept it. I don't think your statement is correct, unfortunately.

For instance, the Sun X4500 Thumper ZFS storage server, with 48 disks in 4U, used six HBAs. Each HBA connected eight disks, like this:
HBA1: disk0 d1 d2 d3 d4 d5 d6 d7
HBA2: disk8 d9 d10 d11 d12 d13 d14 d15
...
HBA6: disk40 d41 d42 d43 d44 d45 d46 d47

Then you created a zpool with several vdevs. For instance a raidz2 vdev with
disk0
disk8
...
disk40

And another vdev with d1, d9, ..., d41. And another vdev with d2, d10, ..., d42. etc.

Then you collected all the vdevs into one zpool. If one HBA broke, for instance HBA1, it wouldn't matter, because every vdev would lose only one disk and the zpool could still function.
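In zpool terms, a rough sketch of that layout could look like the following (the device names are purely illustrative, with c0-c5 standing in for the six HBAs and t0-t7 for the eight disks on each, not the Thumper's real controller numbers):

# One raidz2 vdev per "column", each member disk hanging off a different HBA
zpool create tank \
    raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
    raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0
# ...and so on for the remaining columns. If one HBA dies, each raidz2 vdev
# loses at most one disk, so the pool stays online.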



Regarding the first post and the link about SATA behind SAS expanders being toxic, from Garrett D'Amore: Garrett is a former Sun kernel engineer and the man behind Illumos. He has lots of credibility; if he says something, it is a confirmed fact. He now works at Nexenta, the storage company.


Basically, the link says that SAS uses a different protocol than SATA. In the expander there is a conversion between SAS and SATA (the SATA Tunneling Protocol, STP), and information can be lost in the conversion. In the worst case, there can be problems. Thus, if you really want to be sure, use SAS disks with SAS expanders so nothing is lost in conversion.

Also, because ZFS detects all problems immediately, ZFS will expose problems with expanders that other filesystems do not notice. ZFS having problems with SAS expanders is not a sign of fragility, but a sign of ZFS's superior error detection. With other filesystems the errors are still there, but you will not notice them.

I believe (I need to confirm this) that if you use ZFS with SAS expanders and you run into problems, ZFS will detect the errors as usual, but it might not be able to repair them. The same goes for hardware RAID + ZFS: ZFS will detect errors but cannot repair them, because there is no ZFS-level redundancy to repair from. So you will get an error report, but ZFS cannot repair every error.
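A minimal sketch of that detect-versus-repair distinction (pool and device names are hypothetical, just for illustration):

# Pool on a single LUN exported by a hardware RAID controller: ZFS checksums
# still detect corruption, but there is no ZFS-level redundancy to repair from.
zpool create hwpool c2t0d0

# Pool with ZFS-level redundancy: a scrub can repair what the checksums catch.
zpool create mpool mirror c3t0d0 c4t0d0
zpool scrub mpool
zpool status -v mpool   # read/write/checksum counters, plus any unrepairable files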

I am trying to go for many HBAs instead. Much safer; expanders introduce another source of problems. KISS: keep it simple.
 
Makes sense. I may have to rethink my Intel plans. Intel does actually state that SAS expander to SATA disks is a supported config, but I am not sure how much that is worth. 'Supported' may just mean 'if your array blows up, we'll happily let you call us and tell us about it'.

I wonder if there is some sort of code in the midplane that helps negotiate the SAS to SATA conversion that some other PCIe card expanders don't have? Not sure...

-J
 
Originally Posted by mikesm
Sun sells large numbers of JBOD chassis that are based on SAS expanders and explicitly support SATA disks in them.

This I doubt very much.
...SNIP...

Since you were replying to a post from page one, you should have done some checking first, as this was posted just one post below mikesm's: http://www.nexenta.org/issues/214 [mpt-related driver issues with LSI SAS HBAs and expanders]

I'll quote the relevant passages here; feel free to check that thread:

post #4
Updated by Jay Worthington 3 months ago

Richard Elling wrote:

SATA devices directly connected to SAS expanders is toxic.

In case someone cares, Solaris 11, released yesterday, fixes this "hardware"-problem :-D.

Now all we need is for Larry to fork over that promised source code so it can be fixed here as well ;)

Regards,

Jay

post #5
Updated by Phillip Steinbachs 3 months ago

Jay Worthington wrote:

Richard Elling wrote:

SATA devices directly connected to SAS expanders is toxic.

In case someone cares, Solaris 11, released yesterday, fixes this "hardware"-problem :-D.

Now all we need is for Larry to fork over that promised source code so it can be fixed here as well ;)

Thanks Jay for the followup. I'm glad to hear that Oracle devoted resources to resolving the problem and I hope that given enough time, a fix can be incorporated into Illumos and ultimately NexentaStor. It certainly presents a compelling reason in the interim to consider Solaris 11, even if it means giving more chump change to Larry.

-phillip

Jay Worthington's post #3 in that thread has more info on the tests he ran with several different configs, and the reason he concluded it was not a hardware issue, i.e. that the problem lies with the SAS->SATA tunneling protocol (STP) handling rather than the hardware itself.
 
Nope, still broken. Getting "The requested URL was not found on this server. The link on the referring page seems to be wrong or outdated. Please inform the author of that page about the error. "
 
The link is definitely broken. Perhaps you have some login relationship with the site that permits access to the issues?

In any case, I am very interested in this. I claimed from the first day the Nexenta folks started posting on this that they were wrong - that it was not a fundamental fault of using SATA disks with SAS expanders - but that it clearly indicated some software fault and fragility of the ZFS or IO driver implementation. Of course, since they are part of the priesthood of ZFS they could not accept the idea that any fault could possibly be due to the ZFS implementation or the drivers. It would be sacrilege to even suggest that...even though no other use case for using SAS expanders with SATA drives suffered from this fault.

At the time, my suggestion that this uncovered an instability in ZFS and/or the OS drivers was answered with the thought that ZFS was "just so good" that it uncovered faults with the hardware unseen by "inferior" applications like hardware RAIDs. They've gone on to re-post their "it's toxic" thing over and over again here on [H] as well as on other forums.

I've got no real love of Oracle or what they've done to Sun since the purchase - but I would LOVE for them to put Garrett, et al., into a position where they need to eat those words.
 
It would be nice if someone posted what the actual bug is. I'm not a member of the priesthood, but I'd bet it's a bug in the mptsas driver or some such...
 
I claimed from the first day the Nexenta folks started posting on this that they were wrong - that it was not a fundamental fault of using SATA disks with SAS expanders - but that it clearly indicated some software fault and fragility of the ZFS or IO driver implementation. Of course, since they are part of the priesthood of ZFS they could not accept the idea that any fault could possibly be due to the ZFS implementation or the drivers. It would be sacrilege to even suggest that...even though no other use case for using SAS expanders with SATA drives suffered from this fault.

At the time, my suggestion that this uncovered an instability in ZFS and/or the OS drivers was answered with the thought that ZFS was "just so good" that it uncovered faults with the hardware unseen by "inferior" applications like hardware RAIDs. They've gone on to re-post their "it's toxic" thing over and over again here on [H] as well as on other forums.

Couldn't have said it better myself. I LMFAO'd when all this noise first started up, but it just wasn't worth the time questioning and dispelling it, especially when I read how thin-to-nonexistent their "testing" was that led to their conclusion. When people run around yelling 'the sky is falling', sometimes you just let them get it out of their system. It was the same with the TLER debacle.

Now, maybe a case could be made that some of the earlier SAS1 HBAs were shoddy in terms of cheap controller chips and poor drivers, and that they interacted poorly with SAS2-based expanders (which really had more to do with the lack of standardization and interop among manufacturers in their SAS1-based products), but that's really an independent issue.
 
Woo. Strong words in here, and though I note they lack any data to back them up, I am ill-equipped to rebut on this topic, as I was not involved with any of the testing and analysis that led to the conclusion that SATA over SAS is bad. I trumpet it as far and wide as I can because I have seen the symptoms and outcome of the problem, and they are not something you want to experience in a production environment.

I further am OK with it because I believe that in any case where a customer can procure SAS equivalents to enterprise SATA disks at anything near a reasonable cost similarity, they would be well-served to do so. SAS is simply a far better protocol than SATA, and I believe a full-SAS implementation is superior to one that involves SATA components.

In addition, my own research has led me to articles claiming issues with STP in a variety of environments; it boils down to a handful of potential deadlocks and other issues in its use. On top of that (or perhaps in direct relation), as you mention, many of the vendors involved don't necessarily follow the spec, especially in the older days (and there are still quite a few of those products floating around).

To me it boils down to:
- Nobody will argue SATA is better than SAS, except when trying to defend pricing-based purchasing decisions (and I sympathize).
- Nobody I've met will argue that if they can get a SAS disk equivalent to their SATA disk for $5 more, they would still buy the SATA.
- At present, most environments that end up mixing SAS and SATA do so in ways that make the SAS components step down due to the introduction of SATA into their fabric (3G SATA on 6G SAS, etc.), and in general it can complicate things.

For all of those reasons plus the number of very serious issues I've seen that had a root cause resolved by replacing SATA disks with SAS ones, I will for now continue to strongly urge and even enforce the use of SAS in enterprise, production environments when utilizing ZFS.

All of that said, and bearing in mind I have not had the time nor the resources to go dig into the problem on my own, I tend to agree with you on the culprit (not ZFS, 99.999% sure on that one -- y'all's other culprit). However, until that can be rectified, the status quo is often simply best served by eliminating SATA from the equation, or going out of the way to remove all expanders.

If anyone has ever seen a problem like I described in the other thread (one disk ultimately leading to problems showing up across the bus, timeouts, disconnects, reset storms, etc) on an Areca or an Adaptec card, I'd love to hear from you, in public or private.
 
At the time, my suggestion that this uncovered an instability in ZFS and/or the OS drivers was answered with the thought that ZFS was "just so good" that it uncovered faults with the hardware unseen by "inferior" applications like hardware RAIDs. They've gone on to re-post their "it's toxic" thing over and over again here on [H] as well as on other forums.
It is a fact that ZFS catches errors that no other filesystem or hardware RAID does. It is a fact that ZFS is much more sensitive and notices errors very quickly. It is a fact that people have used faulty hardware and gotten no error reports, until they switched to ZFS. These are facts.

Now, if Solaris has errors in the driver and they get fixed, then it is OK to use SAS expanders with SATA disks. But considering the success stories of ZFS detecting errors that nothing else detects, it is difficult to tell the difference: are there bugs in the Solaris driver, or are SAS expanders fragile? What do other OSes report when using SAS expanders with SATA? Do they notice problems, or are the problems confined to Solaris only?
 
Woo. Strong words in here, and though I note they lack any data to back them up, I am ill-equipped to rebut on this topic, as I was not involved with any of the testing and analysis that led to the conclusion that SATA over SAS is bad. I trumpet it as far and wide as I can because I have seen the symptoms and outcome of the problem, and they are not something you want to experience in a production environment.

You freely admit that you are not equipped with any data of your own, yet you trumpet the findings that you admit you don't actually understand because you have seen "symptoms" of "the problem". Even though the "symptoms" are equally well explained by a fault in MPIO (as has now been proven by Oracle) and the fact that these symptoms do not afflict any other known use of SATA drives with SAS expanders - even in applications that are every bit ZFS's equal in total performance and stress on the drives.

Cute.
 
It is a fact that ZFS is much more sensitive and notices errors very quickly. It is a fact that people have used faulty hardware and gotten no error reports, until they switched to ZFS. These are facts.

These "facts" may be true. But that is not what is in play here. We are discussing a hardware configuration that works perfectly in all reported applications outside of ZFS that can cause a complete meltdown of a ZFS filesystem. No rational commercial user would support the use of SATA disks with SAS expanders if what is suggested were true. They demand systems that are robust and stable. ZFS is, in fact, very robust and stable in the face of a wide variety of fault conditions. It is shocking to me that any experienced, rational thinking person would ever suggest that ZFS is "more sensitive" in a way that would cause a complete meltdown and somehow be proud of ZFS for doing this. The entire intent of a platform like ZFS is to survive and be robust - which is does admirably - except in this case. And because the wonderful folks at Nexenta (and @Nex7 above) can't accept that there might be a bug they blame the hardware. Completely silly.

My dispute is not with ZFS. I think it's wonderful. My complaint is with the people like the last two posters who will post FUD as if it is fact - even when they don't even have the faintest understanding of what they are posting.
 
You freely admit that you are not equipped with any data of your own, yet you trumpet the findings that you admit you don't actually understand because you have seen "symptoms" of "the problem". Even though the "symptoms" are equally well explained by a fault in MPIO (as has now been proven by Oracle) and the fact that these symptoms do not afflict any other known use of SATA drives with SAS expanders - even in applications that are every bit ZFS's equal in total performance and stress on the drives.

Cute.
He's trumpeting a solution that works to a problem he's seen. Geez.

You can rail all you want against what you see as the problem, but if we don't have your solution in hand, do you think it's a smart idea for companies, or anyone who wants a production-level setup, to risk using a configuration that people with the experience know may melt down? If it took Oracle this long to fix it (and Sun before them)... I'm sure the community and/or Nexenta can whip out a solution real quick.

As Nex7 said, there's no disputing SAS is a superior protocol/standard to SATA (from transport down to physical), and just like he said, the only reason to use SATA over SAS is price. Hell, he agreed with most of your points.

The future of ZFS and the various distros in the hands of us lowly home users seems shaky as it is (thanks, Oracle). Maybe it's just me, but I'd rather we not flame valuable resources on this board, lest they dry up.

PS: The only thing I'd like Nex7 to answer is about his contacts for SAS drive pricing. It seems as though the large-capacity drives are drying up; I need to expand my storage solution and have always preferred SAS drives, it's just too expensive.
 
He's trumpeting a solution that works to a problem he's seen. Geez.

If that's all he was doing, I'd back off. But it's not. He's making a broad, sweeping claim about something he admits (in his own posts) he does not fully understand. And then he has been out repeating this claim about how evil expanders are, far and wide, in forums and in topics unrelated to his specific issue.

If the post said "Nexenta does not support or suggest using SAS expanders with our software" that's one thing. But that's not what's been posted.

BTW, ZFS for "lowly home users" is still quite secure - and yes, thanks to Oracle. They still permit non-commercial use of their software (without support). They still freely post it. And while the fine points of their license might make some "home use" fit their definition of "commercial", they have pursued no enforcement action against any home users - nor are they likely to. And unlike Nexenta, the current version from Oracle has a working version of MPIO.

Also, I've been doing a bit more digging here. The bug they have fixed in MPIO was not specific to the use of expanders. It is a race condition that can occur after other drive signalling errors as well. Yes, the combination of SAS expanders and SATA disks did make it more likely to tickle the bug - but until this fix is propagated, Nexenta and other non-S11 ZFS implementations could still fault. With or without expanders. And as long as the folks at Nexenta keep their heads in the sand over this, they are putting their customers' data at risk.
 
Even though the "symptoms" are equally well explained by a fault in MPIO (as has now been proven by Orcacle)

Proven where? I've heard a few reports that this 'problem' is gone in the latest Solaris bits, but have yet to ever be linked to evidence of it, nor heard of a single story of it occurring somewhere and being resolved by moving to the latest Oracle Solaris build.

And because the wonderful folks at Nexenta (and @Nex7 above) can't accept that there might be a bug they blame the hardware.

Actually, if you'd read my post entirely, you'd see I said I suspected as much. I further reinforced it by asking if anyone had seen the problem on different model HBAs, since, at least for myself, I've only ever seen the problem on one brand (utilizing one driver in particular). I've never heard of it involving MPIO code before, though.

Still, the existence of a bug or poor driver implementation does not excuse SATA over SAS entirely. While I suspect that one or more bugs may be responsible for the symptoms that are so severe they warrant not using SATA (until they are resolved), I am quite sure that SATA over SAS is not a spectacular solution even when it functions "perfectly", and is still inferior to an all-SAS deployment.

If you are arguing that, I'm afraid you're fighting a losing battle. The only reason SATA over SAS is even supported is interoperability, likely driven by legacy inventory demands and price sensitivities (and, I suspect, a desire to allow wider adoption of SAS). It provides support for SATA due to industry demand, you might say. The story ends there, however. You may rest assured it is not because SATA is better than SAS, or because SATA is equivalent to SAS, or because SATA use within a SAS environment is identical to an all-SAS environment. None of that is true, I hope you'd agree.

SATA over SAS introduces complexities.. and I'm a big fan of KISS. Even were all issues with SATA over SAS gone today, I'd still press for SAS use first and foremost, even demand it in large production builds (especially ones after IOPS over capacity). I would, however, stop refusing solutions altogether that incorporated SATA (especially capacity plays at places that can't get SAS drives at near price-parity).

My complaint is with the people like the last two posters who will post FUD as if it is fact - even when they don't even have the faintest understanding of what they are posting.

My only suggestion would be that perhaps you should read the entirety of the posts prior to responding, and if you have a better understanding of these issues than other posters, please enlighten them/me/us, instead of making sweeping generalizations about our supposedly sweeping generalizations.

Speaking only for myself (since I have no control over others), my 'FUD' is anything but - I am intimately familiar with the pain this issue can cause, and am aware of no solution (and you have not provided one nor even provided sufficient data to find one), so to suggest that my comments not to use SATA over SAS are 'FUD' is not fair, if you cannot provide evidence to support such a claim.

You say your proof is in Solaris 11 and their fix -- prove it? Can you cite anything or link anywhere that lends credence to this claim? That's not a challenge to uphold my 'FUD' -- that's a plea because I'd personally love for it to be fixed. As a home user and knowing many others, acquiring SAS disks is not cheap at retail prices, and building a fully SATA system grows more difficult by the day.

I believe the purpose of this thread and this forum is to spread knowledge, best practices, etc. I am quite open to any insight anyone can provide to this issue, so long as it is actually useful (as opposed to borderline insults and insertion of words into other's mouths). My original post said "though I note they lack any data to back them up", and that still appears to remain the case.

In the absence of said facts, or evidence to chase down the SATA over SAS problem further and pinpoint what it is and say 'it is safe to use if you do not use this piece of code', and given the current status quo for most older Solaris and illumos-based distributions is that this bug will bite you square on the ass, the only safe and logical suggestion I can make is not to use SATA over SAS. That's not FUD, that's being logical, and providing the safest best practice I can with the tools I have at my disposal.

Also, I've been doing a bit more digging here. The bug they have fixed in MPIO was not specific to the use of expanders. It is a race condition that can occur after other drive signalling errors as well. Yes, the combination of SAS expanders and SATA disks did make it more likely to tickle the bug - but until this fix is propagated, Nexenta and other non-S11 ZFS implementations could still fault. With or without expanders. And as long as the folks at Nexenta keep their heads in the sand over this, they are putting their customers' data at risk.

Link? Bug number? Email thread? Citation needed. If you've done research into this that has borne fruit, please, share. :) I'm more than happy to dig into such things and poke resources within Nexenta and the illumos development community, if I'm given any specifics. Acting on this, I poked around, and I'm afraid my Google fu must be weak, because the closest I come is right back to this thread.

If you believe I find the SATA over SAS issues a minor problem, or one not worth fixing, or one that can't be fixed, I apologize for any misconceptions -- that is patently false.
 
These "facts" may be true. But that is not what is in play here. We are discussing a hardware configuration that works perfectly in all reported applications outside of ZFS that can cause a complete meltdown of a ZFS filesystem. No rational commercial user would support the use of SATA disks with SAS expanders if what is suggested were true. They demand systems that are robust and stable. ZFS is, in fact, very robust and stable in the face of a wide variety of fault conditions. It is shocking to me that any experienced, rational thinking person would ever suggest that ZFS is "more sensitive" in a way that would cause a complete meltdown and somehow be proud of ZFS for doing this. The entire intent of a platform like ZFS is to survive and be robust - which is does admirably - except in this case. And because the wonderful folks at Nexenta (and @Nex7 above) can't accept that there might be a bug they blame the hardware. Completely silly.

My dispute is not with ZFS. I think it's wonderful. My complaint is with the people like the last two posters who will post FUD as if it is fact - even when they don't even have the faintest understanding of what they are posting.
Well said. It is good that you show some balance in this matter. If Oracle has fixed this problem, all the better. However, I think we need more information on this, and confirmation, for instance from Garrett D'Amore, who blogged about this problem. Also, if SAS expanders + SATA disks are a problematic combo right now, shouldn't it be avoided until we have confirmed that it is an OK combo?
 
The only thing I'd like Nex7 to answer is about his contacts for SAS drive pricing. It seems as though the large-capacity drives are drying up; I need to expand my storage solution and have always preferred SAS drives, it's just too expensive.

I wish I could! I do not personally work in Sales or Sales Engineering so I don't get a lot of first-hand contacts in this area. I know that a few years ago at my last employer, we got 7200 RPM enterprise SAS high-capacity disks at a price difference from 7200 RPM enterprise SATA high-capacity disks that could only be described as 'completely negligible'. It would thus stand to reason that today, one could get them even closer to price-parity. However, the natural disasters curbing supply so heavily have broken a lot of that. I have heard that even some larger resellers have also had issues sourcing SAS disks at the moment. The supply just isn't there. :(
 
Also, I've been doing a bit more digging here. The bug they have fixed in MPIO was not specific to the use of expanders. It is a race condition that can occur after other drive signalling errors as well. Yes, the combination of SAS expanders and SATA disks did make it more likely to tickle the bug - but until this fix is propagated, Nexenta and other non-S11 ZFS implementations could still fault. With or without expanders. And as long as the folks at Nexenta keep their heads in the sand over this, they are putting their customers' data at risk.

I also want to respond to this on its own merits, because if there's a single case of FUD in this topic, it's this paragraph.

As I have mentioned previously, I possess no unique and total knowledge about what causes what I coin the 'SATA over SAS' problem. I have admitted to that, and admitted to my most likely suspects. That said, my year and a half of time spent in the Support department at Nexenta, probably one of the few or only companies in the entire world offering a storage appliance based on non-S11 ZFS that you can run on your own hardware (including SATA over SAS), makes me part of what is I believe a very small group of people who have real-world experience on a multitude of systems that have seen this problem. I have seen it often enough to recognize it almost immediately. I am intimately familiar with the SYMPTOMS and the potential fixes for this issue, it is only the root cause that I am not clear on (and let's be clear here -- I AM sure it involves SATA over SAS involving expanders, because that is always where it appears and so far changing that is always how it goes away).

Let me be perfectly clear. No equivocation. I have never seen, nor heard, of a single incident involving the symptoms (or even most of the symptoms) I equate with the SATA over SAS bug in an all-SAS environment. Not on Nexenta, nor on OpenIndiana, SmartOS, or any other illumos-derived distribution. Not one time. Not one.

I have further witnessed multiple examples where the issue was occurring and the removal of SATA component OR the removal of all expanders caused an immediate partial or total relief. I would never suggest to potential customers to go all-SAS as opposed to SATA if I'd seen the problem in both environments, and to suggest otherwise is, to be honest, a little insulting. :)

Based on the fact that, among the 1000's of systems I would have been in a position to become aware of had they ever seen this, and the multitude of systems incorporating SATA over SAS with expanders where I have seen or heard of the problem, I would suggest that either it is so exceedingly unlikely for this to occur on an all-SAS solution that it should be ranked up there with 'the system will just spontaneously turn into a shrubbery' (and suggesting otherwise is, in fact, the definition of 'spreading FUD'), OR the bug you're mentioning (but not actually providing links to or the ID of) is not, in fact, the cause of the SATA over SAS issues.
 
I really hope Illumos can fix this problem.

I also have a test machine with a Supermicro SAS 2008 controller, with an Intel RES2SV240 expander on one port
and a Supermicro 24 x 2.5" enclosure (with an expander based on the same LSI SAS2 chip) on the other.

It worked for several weeks without problems. Then one day the system froze, responsiveness got worse, and
a pool made of mirrored WD Raptors was not accessible at all (the pool was definitely lost).

A test with a WD tool reported and fixed a problem with one of the WD disks.
But after this experience and reading this thread, I would also say: for now, avoid expanders + SATA.
I will move my test machine to Solaris 11 now and do more tests. Maybe they fixed the problem (hard to verify).
But I will not use this configuration in production with SATA at all for now.
 
I wish I could! I do not personally work in Sales or Sales Engineering so I don't get a lot of first-hand contacts in this area. I know that a few years ago at my last employer, we got 7200 RPM enterprise SAS high-capacity disks at a price difference from 7200 RPM enterprise SATA high-capacity disks that could only be described as 'completely negligible'. It would thus stand to reason that today, one could get them even closer to price-parity. However, the natural disasters curbing supply so heavily have broken a lot of that. I have heard that even some larger resellers have also had issues sourcing SAS disks at the moment. The supply just isn't there. :(
:D No prob, it was a long shot, I just had to ask.

Yeah, I'm trying to hold off as long as I can and see if someone will put out some 3 TB SAS drives. I might just add 3-way mirrors to my existing zpool (currently using raidz2). That will take some of the sting out when I add a vdev.
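For reference, a sketch of what adding such a vdev could look like (pool and device names are hypothetical; not a recommendation for any particular layout):

# Add a 3-way mirror vdev alongside the existing raidz2 vdev(s). zpool will
# warn about mismatched replication levels when mixing mirror and raidz2;
# -f overrides that check if you accept the trade-off.
zpool add tank mirror c4t0d0 c4t1d0 c4t2d0
zpool status tank   # confirm the new vdev shows up as expected
# Keep in mind a top-level vdev cannot be removed again on these ZFS versions,
# so double-check before running the add.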
 
I also want to respond to this on its own merits, because if there's a single case of FUD in this topic, it's this paragraph.

As I have mentioned previously, I possess no unique and total knowledge about what causes what I coin the 'SATA over SAS' problem. I have admitted to that, and admitted to my most likely suspects. That said, my year and a half of time spent in the Support department at Nexenta, probably one of the few or only companies in the entire world offering a storage appliance based on non-S11 ZFS that you can run on your own hardware (including SATA over SAS), makes me part of what is I believe a very small group of people who have real-world experience on a multitude of systems that have seen this problem. I have seen it often enough to recognize it almost immediately. I am intimately familiar with the SYMPTOMS and the potential fixes for this issue, it is only the root cause that I am not clear on (and let's be clear here -- I AM sure it involves SATA over SAS involving expanders, because that is always where it appears and so far changing that is always how it goes away).

Let me be perfectly clear. No equivocation. I have never seen, nor heard, of a single incident involving the symptoms (or even most of the symptoms) I equate with the SATA over SAS bug in an all-SAS environment. Not on Nexenta, nor on OpenIndiana, SmartOS, or any other illumos-derived distribution. Not one time. Not one.

I have further witnessed multiple examples where the issue was occurring and the removal of SATA component OR the removal of all expanders caused an immediate partial or total relief. I would never suggest to potential customers to go all-SAS as opposed to SATA if I'd seen the problem in both environments, and to suggest otherwise is, to be honest, a little insulting. :)

Based on the fact that, among the 1000's of systems I would have been in a position to become aware of had they ever seen this, and the multitude of systems incorporating SATA over SAS with expanders where I have seen or heard of the problem, I would suggest that either it is so exceedingly unlikely for this to occur on an all-SAS solution that it should be ranked up there with 'the system will just spontaneously turn into a shrubbery' (and suggesting otherwise is, in fact, the definition of 'spreading FUD'), OR the bug you're mentioning (but not actually providing links to or the ID of) is not, in fact, the cause of the SATA over SAS issues.

So your position is that even though the Oracle folks have uncovered and fixed the real problem, even though it explains the symptoms you attribute to the SAS/SATA interaction (in fact, explains it better than the rather contrived version you are standing upon - including an explanation of why the SAS/SATA interaction triggers the race), and just because you haven't seen it, it's not there? Interesting. I sure do hope your approach here is not typical of Nexenta's support team. Because if it is typical, then your customers have every right to be frightened out of their minds right now by this arrogant, head-in-the-sand attitude.
 
No, I think he's saying the driver bug that was just root-caused is possibly not the only issue out there. You keep acting like you've been telling the truth all along and no one was listening, yet I never saw a shred of evidence that you posted aside from negative inference ("surely if it was a fundamental issue, someone else would have seen it by now" - paraphrasing you). I've worked on software long enough to see issues such as race conditions that never got hit until a customer upgraded to a fast enough CPU, despite the bug being there since day one (more than 20 years). Frankly, I have to wonder why you are reacting so nastily to this whole thing - it's not like anyone was attacking you personally.
 
So your position is that even though the Oracle folks have uncovered and fixed the real problem (...)

From the point of view of someone not involved in the discussion, it appears to me that the issue/position is more fundamental than that:

* Has Oracle actually uncovered and fixed "the problem"?
** This really requires someone who could reproduce the problem previously to now run Solaris 11 and see if the problem is resolved.
* This also brings up the question, "Is this the only problem?" It's entirely possible that something got fixed, but I would hesitate to call the situation "fixed" until the positive A/B test mentioned above is achieved.
 
I'm using SAS expanders, I guess - a SAS card with SATA drives in a Norco setup. Works fine, no errors.
 
I've got no real love of Oracle or what they've done to Sun since the purchase - but I would LOVE for them to put Garrett, et al., into a position where they need to eat those words.

I'm not sure if Oracle is in a position to make anyone eat their words regarding Solaris.

There is now more Solaris and ZFS talent outside of Oracle (Sun) than inside. Joyent scooped up the lion's share of Solaris talent and much of the rest is at places like Nexenta and Delphix.

I would love for Oracle to share whatever talent they have left, but their actions towards the community have been lackluster at best and downright hostile at worst.
 
So your position is that even though the Oracle folks have uncovered and fixed the real problem, even though it explains the symptoms you attribute to the SAS/SATA interaction (in fact, explains it better than the rather contrived version you are standing upon - including an explanation of why the SAS/SATA interaction triggers the race), and just because you haven't seen it, it's not there? Interesting. I sure do hope your approach here is not typical of Nexenta's support team.

You once again claim Oracle has fixed "the real problem", yet provide not one link or bug ID or citation to support this claim, which I repeatedly have requested. If there exists a fix, I'd love to see it, believe me! Have you seen the issue, and then seen it fixed by Oracle Solaris 11? Can you provide a bug ID or any sort of reference number for this 'fix'? An email thread? A link? Blog post? Anything at all? Have you ever even seen the issue at all, ever?

Because if it is typical, then your customers have every right to be frightened out of their minds right now by this arrogant, head-in-the-sand attitude.

What arrogant, head-in-the-sand attitude? I would think my presence and comments in this thread are proof enough that I am aware of, and agree, there is a problem. I have further agreed it could have a software component, likely a driver. That invalidates the 'head-in-the-sand' comment, as far as I can tell.

And arrogant? My original post agreed with you, at least to the point of there possibly being bug(s), had you bothered to read it. I have further made multiple comments and requests for any data to support the idea the issues are resolved on Oracle Solaris 11 as you claim, but you've replied multiple times now to fan the flames and provide nothing to substantiate what you are claiming. You've also never, I note, suggested that you've ever had any first-hand experience with this issue.

And "frightened out of their mind right now"? Spreading FUD, plain and simple. I have, again, not once seen nor even heard of this issue with all-SAS or all-SATA builds, nor on any system utilizing SATA over SAS but without expanding. That knowledge comes from a year of being the one called when there's a problem, as well as hanging out on zfs-discuss, various IRC channels, etc. There have been no reported cases of this except via environments utilizing SATA disks behind SAS expanders that I have seen, and now you claim it can affect other environments but again provide no link, no ID, no reference of any kind to a source for this claim and then made a comment like that.

I'm still more than happy to dig in to any evidence you can provide, because I would love to see a resolution to this, but based on this thread and on my whimsy of clicking your profile and looking at your last few posts, you appear to simply enjoy fanning flames and trolling and generally being accusatory and, well, rude. I can deal with that if you can add anything of any remote substance, but I'm not seeing any so far. A few insults, some wild accusations, some insertion of words in other's mouths and a bit of FUD (while simultaneously accusing others of the same when they are not), but no substance. Care to prove me wrong? Be as rude about it as you like, just provide me with any actionable intelligence I can take back to the engineering team to take a look at.
 
Pointing this @ nex7 and _Gea

Since I am the one who dragged this discussion back into the wilds of [H]ard discussion.

@ nex7
Can you remember, for the systems showing the "fault"... which controllers they had, and/or how the disks were being labeled (which disk labels)?

I.e., on a fully SAS system we have controllers talking WWN addresses, through a SAS expander that, like a network switch, holds a table of WWN addresses, on to the WWN address of the SAS drive.

No tunneling of SATA, no encapsulation within the SAS protocol and then de-encapsulation at the destination.

But we also have two types of controllers: some, i.e. older LSI 1068-type cards, where under ZFS the drives get c0t0d1-type labels, and newer cards that label the drives with WWN-type labels (even though the drives are SATA), even though the physical SATA drive knows nothing about such a naming convention.

Can you remember which labeling was in use on the systems that showed the symptoms? I.e., were they only of a specific type? Always WWN-labeled ones, or always c0t0d1-type setups?

@_Gea: what naming does the system of yours that faulted use?

What I am wondering is: is it in any way limited to only those controllers that, under ZFS, name the drives with WWN names?

.
 
@_Gea: what naming does the system of yours that faulted use?

What I am wondering is: is it in any way limited to only those controllers that, under ZFS, name the drives with WWN names?

.

LSI 2008 + IT mode = WWN
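
For what it's worth, a quick way to check which naming convention a given controller ends up with on Solaris/illumos (a rough sketch; output details vary by release):

# List the disks as the OS sees them: WWN-based names look like c0t5000C500XXXXXXXXd0,
# older-style names look like c0t0d0 / c0t1d0.
format < /dev/null

# The same names show up as device links
ls /dev/rdsk

# Per-disk identity (vendor, product, serial) to match the names back to physical drives
iostat -En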
 
Building on what stanza was asking, I wonder if this issue has been seen to crop up with 2.5" SATA drives, which do appear to have WWNs.
 