I wanted to post this to see if anyone else has run into the same issues I've been experiencing, before I go out and buy a new case.
Background
Norco RPC-4224 Case
Supermicro MBD-X9SCM-F-O
2 x LSI 9211-8i, flashed to IT mode and passed via VT-d into an OpenIndiana VM on ESXi 5
16 x 1TB WD RE4 (WD1003FBYX)
Two raidz3 ZFS storage pools of 6 disks each, plus 4 hot spares
Each shelf on the RPC-4224 is connected to a separate SFF-8087 port on the LSI cards, and the disks in each pool are split across cards/connectors (rough layout sketch below).
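For reference, the layout is roughly equivalent to this. These commands are a sketch only: the c#t#d# device names are placeholders, not my actual ones, and the exact split of disks across the two HBAs is from memory.

    # pool1: 6-disk raidz3, 3 disks per HBA (c1 = first 9211-8i, c2 = second)
    zpool create pool1 raidz3 c1t0d0 c1t1d0 c1t2d0 c2t0d0 c2t1d0 c2t2d0
    # pool2: same arrangement on the other connectors
    zpool create pool2 raidz3 c1t3d0 c1t4d0 c1t5d0 c2t3d0 c2t4d0 c2t5d0
    # the 4 hot spares, two assigned to each pool
    zpool add pool1 spare c1t6d0 c2t6d0
    zpool add pool2 spare c1t7d0 c2t7d0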
Issue
I noticed one day that ZFS had marked one of the disks in pool1 as bad and a spare had taken over, so I removed the bad drive and replaced it with a new one. Resilvering started and I walked away, expecting to come back later to a newly resilvered disk, only to discover that 4 disks were suddenly marked FAULTED or DEGRADED with "too many errors". My heart sank and I immediately started cursing Western Digital, until I had RMAed numerous drives and the replacements kept throwing errors and getting removed from the pool again. Looking into things further, I noticed that the failing drives were connected to two different LSI cards via two different SFF-8087 cables, while other drives on those same controllers worked fine without errors, even after a scrub. All of the drives that kept failing were installed in the top 3 shelves of the Norco case; the drives in the bottom two shelves never experienced any issues at all over countless resilvers and scrubs.
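The replacement itself was the standard procedure, something like the following (again, the device name is a placeholder for the actual disk):

    # swap the failed disk for a new one in the same bay, then resilver
    zpool replace pool1 c1t4d0
    # watch the resilver progress and the per-vdev error counters
    zpool status -v pool1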
I started to put two and two together when all of the drives in the very top shelf went from ONLINE to NOT CONNECTED. That made me think there is some sort of physical connectivity issue on the backplane, especially since the failing drives in the top shelves are connected to completely different SAS cards/cables, so I simply can't see how this is related to a card or cable.
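To convince myself the failures really did span both HBAs, I mapped each failing drive back to its controller and slot. On the 9211-8i in IT mode you can do this with LSI's sas2ircu utility (assuming you have the Solaris build installed; adapter indexes 0 and 1 here):

    # list the adapters, then dump enclosure/slot/serial for every attached drive
    sas2ircu LIST
    sas2ircu 0 DISPLAY
    sas2ircu 1 DISPLAY

Matching the serial numbers from the DISPLAY output against iostat -En (which prints each disk's serial) is what showed the failures clustering by shelf rather than by card.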
I tried re-seating the drives, re-seating the SAS backplanes, re-seating the SFF-8087 cables, and swapping in a new cable, yet the issues remained. The most prominent errors are transfer errors to the drive, which is what causes ZFS to keep marking the drives as bad; in some cases it got so bad that I couldn't even execute a zpool status on the pool.
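For anyone wanting to see the raw errors, the per-device counters on OpenIndiana show them (iostat reports them as Transport Errors):

    # per-device error summary; Transport Errors keeps climbing on the
    # drives in the top three shelves
    iostat -En
    # FMA telemetry has the detailed ereports behind each ZFS fault
    fmdump -eV | less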
I have read about the SAS backplanes on the Norcos not working at all, or even burning up, but I haven't come across people experiencing random transfer errors, which is why I am posting this. Has anyone else experienced similar issues with the Norco SFF-8087 backplanes? I have replaced everything possible with the exception of the motherboard/case. I have another RPC-4224 set up in a similar arrangement and am not seeing any issues there, though the case revisions are likely different.
Thanks for any input you may have
Jerrad