Norco 4224 + ZFS Transfer Errors

Jerrad · Sep 16, 2013

I wanted to post this to see if anyone else has run into the same issues I have experienced before I go out and buy a new case.

Background
Norco RPC-4224 Case
Supermicro MBD-X9SCM-F-O
2 x LSI 9211-8i IT Mode Flashed and VT-d passed into Openindiana VM on ESXi 5
16 x 1TB WD RE4 (WD1003FBYX)
Two raidz3 ZFS storage pools, each of them 6 disks each and 4 hot spares

Each shelf on the RPC-4224 is connected to a separate SFF-8087 port on the LSI cards and disks which are members of each of the pools is split across cards/connectors.

Issue
I noticed one day that ZFS marked one of the disks in pool1 as bad and a spare had taken over so I decided to remove the bad drive and replace it with a new drive. Resilvering started and I walked away expecting to come back later to a newly resilvered disk only to discover that 4 disks suddenly were marked as either FAULTED or DEGRADED "TOO MANY ERRORS" etc... My heart sank and I immediately started to curse Western Digital, until I RMAed numerous drives only to keep having drives suddenly start throwing errors and being removed from the pool again. As I looked into things further I noticed that drives which were failing were connected to two different LSI cards and two different SFF-8087 cables and other drives which were connected to the same controller were seemingly working fine without errors, even after a scrub. I noticed that all of the drives which kept failing were all installed in the top 3 shelf's of the Norco case and the drives which were installed in the very bottom two shelf's of the case never experienced any issues at all over countless re-silvers and scrubs.

I started to put two and two together when all of the drives in the the very top shelf all went from ONLINE to NOT CONNECTED and it made me think there is some sort of physical connectivity issue on the back plane, especially since the drives which are in the top shelfs and kept failing are connected to completely different SAS cards/cables, so I simply can't imagine this is related to the card/cable.

I tried a re-seating the drives, re-seating the SAS back-plane, re-seating the SFF-8087 cables, tried a new cable yet the issues remained. The most prominent errors are transfer errors to the drive which is what is causing ZFS to keep marking the drives as bad and in some cases they would be so bad that it wouldn't even let me execute a zpool status on the pool.

I have read about the SAS back planes on the Norco's not working at all or burning up, but I haven't come across people experiencing random transfer errors, which is why I am posting this. Has anyone else experienced similar issues with the Norco SFF-8087 back plane? I have replaced everything possible with the exception of the motherboard/case. I have another RPC-4224 case which is setup in a similar arrangement and I am not seeing any issues, however the case revisions are likely different.

Thanks for any input you may have

Jerrad

Firebug24k · Sep 16, 2013

I think you've answered your own question, the backplane is jacked up. If you've got some forward breakout cables, pull the drives out and plug them into the LSI cards directly and see if that helps (if you want to troubleshoot further before getting a new case). Monoprice has the cables for ten bucks.

staticlag · Sep 16, 2013

Man if you are wasting 5 disks per pool for reduncancy for 3 disks of data I would have gone triple mirrors and had awesome IOPS while I was at it.

But yeah backplanes have gone out in the past. Sometimes its just random bad luck.

odditory · Sep 16, 2013

Jerrad said:
I have read about the SAS back planes on the Norco's not working at all or burning up

Where did you read about that? The only one I'm aware of involved someone cheaping out on a crappy, underpowered PSU and then blaming the Norco backplane for some of their drives dying, and proceeding to write a long blog rant about how Norco was to blame, only to later admit it was the PSU that caused the problem.

You have said nothing about your PSU. What make/model? That's always the first thing I ask people when they're having these erratic kinds of problems where nothing else has (supposedly) changed in the config. It can be cabling as well, and it certainly can be backplane related, but troubleshooting should always start with power in cases like these.

danswartz · Sep 16, 2013

Yeah 4 hot spares is pretty overkill, IMO...

Jerrad · Sep 16, 2013

I would like to clarify a few things, the raidz3 saved my ass in this exact scenario and has in the past before as well due it's high tolerance for multiple disk failures. Now perhaps everyone scrubs their data weekly and has fast turn around on their RMAs and overly good luck, but I have personally experienced a resilver causing two drives to become DEGRADED and drop my entire pool in the past and vowed never to have this happen again. Everyone has different use cases and I would rather not get into the pro/con discussion around the choice of redundancy methods.

The reason I have 4 hot spares is because when I first started to experience this issue I actually had 8 drives in raidz3 and since 3 drives had already failed I zfs sent all of my file systems off to a backup location and rebuilt the pool from scratch using 6 disks. At the time I still hadn't isolated the pattern of what was going wrong, all I knew is that I wanted to avoid finding out the hard way by completely losing all data so I elected to throw in 4 spares as a precaution and once again this paid off as all 4 of my spares are currently in use right now due to nearly all of the drives in two top most shelfs throwing random IO errors.

I agree with the PSU comment and that is something I will look into as an option as I have a spare power supply for this exact type of scenario. I suppose a few rails could have gone bad that are powering the upper most shelfs resulting in this erratic behaviour so I will certainly try replacing the power supply before I fully commit to a back plane issue. The only reason I mentioned the Norco issue on the back planes was that in my googling to find answers to my symptoms I ended up coming across threads where people had drives simply not work due to bad SAS back planes or back planes which actually popped resistors. http://lime-technology.com/forum/index.php?topic=15589.0

I will replace the PSU and see if that rectifies the issue, thanks for the feedback.

brutalizer · Sep 17, 2013

Wow, you have a very safe storage solution indeed, and that ZFS saved your data fine. I like the ZFS ability to tell even the slightest problem that no hardware raid would detect. I have not really heard of anyone getting problems with two disks at the same time though. But, that is the reason I too, want to go for raidz3. From your tale, it seems that raidz3 can be needed.

BTW, you also do know that you can create a mirror with any number of disks you want? You could create a mirror consisting of three disks, or consisting of 11 disks if you wanted.

m1abram · Sep 17, 2013

3-way mirror will give you better performance and similar protection. One important factor with mirrors over RAIDZ3 is resilver times are faster and less load on the pool. With your number of drives you can setup a 3-way mirror with similar usable space and much better performance without reducing your risk.

Also 4 hotspares, if you get to the 2nd hotspare and it fails you better shut that system down and figure out WTF is going on. Cause at that point it most likely a bigger issue than any system can solve on its own.

madrebel · Sep 17, 2013

post the output of iostat -en

Jerrad · Sep 17, 2013

m1abram said:
3-way mirror will give you better performance and similar protection. One important factor with mirrors over RAIDZ3 is resilver times are faster and less load on the pool. With your number of drives you can setup a 3-way mirror with similar usable space and much better performance without reducing your risk.

Also 4 hotspares, if you get to the 2nd hotspare and it fails you better shut that system down and figure out WTF is going on. Cause at that point it most likely a bigger issue than any system can solve on its own.

I am going to look further into the 3+ mirror example you provided as the data itself doesn't require tons of storage and this may actually better suite my use case as the primary intent of this pool is such that I can go on vacation and not have to worry about a bad drive or drives causing a restore from backup headache. This data isn't large in nature, but does consist of 400,000+ small files I would simply rather not have to restore if I can help it.

I agree something is very wrong if drives are failing like this and the 4 hot spares was done as an intermediary measure while I attempted to perform some triage on the server. The challenge for me is this server is remotely located and I have to drive across the city to perform hardware troubleshooting and when I do each of these changes I have to wait for a re-silver/scrub to complete to observe the results to see if it helped or not. Added to this is trying to keep this server uptime somewhat reasonable as it does host a website as well on another VM. Taking this server for the weekend and doing nothing else but swapping out hardware would be ideal, but isn't currently an option so it's a rather slow process to go through all of the swap this and try again process, but it's really my only option at this point until I isolate the actual culprit.

My next steps are going to be swapping out the PSU and if that doesn't help try the reverse breakout cables to bypass the back plane in the top shelf. In the mean time I have zfs sent the file systems on the problematic pool to another pool which has been solid and unmounted the old ones. This should buy me some time to try the PSU/break out cables this weekend and hopefully isolate the actual cause of this, after which I will look into rebuilding the damaged pool as a multi disk mirror after some testing on my own setup.

Also the PSU is a Seasonic X850 Gold IIRC

Unfortunately iostat has been reset since the last recent reboot, however when I did have output there were between 50 and 500 transfer errors on the various problematic disks.

extended device statistics ---- errors ---
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 fd0
32.0 21.0 1288.5 460.9 0.2 0.1 3.2 2.1 0 0 0 0 0 0 gales
0.2 0.3 3.8 3.1 0.0 0.0 6.5 1.1 0 0 0 0 0 0 rpool
0.4 0.4 3.9 3.1 0.0 0.0 0.0 0.9 0 0 0 0 0 0 c4t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 66 0 66 c3t0d0
5.6 0.7 215.0 2.6 0.0 0.0 0.0 2.0 0 0 0 0 0 0 c6t50014EE25C7A1287d0
0.3 0.7 0.2 2.6 0.0 0.0 0.0 3.3 0 0 0 0 0 0 c6t50014EE2B1CFEBBAd0
2.7 1.8 145.6 135.0 0.0 0.0 0.0 4.6 0 0 0 0 0 0 c6t50014EE25C79EACBd0
1.8 1.8 74.5 135.0 0.0 0.0 0.0 4.3 0 0 0 0 0 0 c6t50014EE20724F515d0
1.7 1.8 71.5 135.0 0.0 0.0 0.0 4.3 0 0 0 0 0 0 c6t50014EE2072520ABd0
2.7 1.8 138.6 135.0 0.0 0.0 0.0 4.5 0 0 0 0 0 0 c6t50014EE2B1CFC2A9d0
1.7 1.8 67.6 135.0 0.0 0.0 0.0 4.4 0 0 0 0 0 0 c6t50014EE25C7A37A1d0
2.7 1.8 141.7 135.0 0.0 0.0 0.0 4.7 0 0 0 0 0 0 c6t50014EE25C7A45F7d0
5.6 0.7 215.1 2.6 0.0 0.0 0.0 2.3 0 0 0 0 0 0 c6t50014EE20724DE80d0
5.5 0.7 215.1 2.6 0.0 0.0 0.0 2.4 0 0 0 0 12 12 c6t50014EE2B1CF649Dd0
5.6 0.7 214.9 2.6 0.0 0.0 0.0 2.4 0 0 0 1 0 1 c6t50014EE25C79AC11d0
0.3 6.2 0.2 216.7 0.0 0.0 0.0 3.5 0 0 0 0 0 0 c6t50014EE2B1D004F8d0
5.6 0.7 214.9 2.6 0.0 0.0 0.0 2.1 0 0 0 0 0 0 c6t50014EE2B1D01277d0
0.3 0.0 0.2 0.0 0.0 0.0 0.0 0.1 0 0 0 0 0 0 c6t50014EE2B2CEA74Bd0
5.7 0.7 214.8 2.6 0.0 0.0 0.0 2.0 0 0 0 0 0 0 c6t50014EE2B39B9E5Bd0
0.2 0.0 0.1 0.0 0.0 0.0 0.0 0.1 0 0 0 3 0 3 c6t50014EE2B29B4B71d0

Thanks again for the feedback guys I really appreciate it

Firebug24k · Sep 17, 2013

Just a quick note - you want forward breakout cables, not reverse breakout cables.

Jerrad · Sep 17, 2013

Firebug24k said:
Just a quick note - you want forward breakout cables, not reverse breakout cables.

My mistake I actually have the correct forward breakout cables on hand already from a previous build, thanks for the clarification.

Lost-Benji · Sep 18, 2013

odditory said:
Where did you read about that? The only one I'm aware of involved someone cheaping out on a crappy, underpowered PSU and then blaming the Norco backplane for some of their drives dying, and proceeding to write a long blog rant about how Norco was to blame, only to later admit it was the PSU that caused the problem.

The Norco's are well known for shitty backplanes and the blaming of PSU's came for one dickhead who made a public idiot of himself.

Lets settle a few facts on the Norco's. The green boards are the issue prone version. These have been replaced with the yellows and have way less issues. The greens suffer from bad soldering and missing parts or parts soldered to wrong terminals.
How do I know this, because I have repaired plenty of them.

PSU sizing, if running only drives and an expander, a solid 500W is ample, if running a small-to-medium system in rear then a solid 600+W will do fine. The ideas of needing KW supplies is for the wankers with WOFTAM ideas.

OP, besides whats already been said about your choices/ideas of the RAID (not mine by far), take note of the backplanes, if green, pull them all and take to a repair shop to re-solder the lot with real solder (has lead in it, lead-free is the death of electronics) and ensure all parts are same positions on all the boards.
If the boards are yellow, then pull them and inspect them all for soldering/part placement issues.
Google is your friend:
https://www.google.com.au/search?q=...&sqi=2&ved=0CAcQ_AUoAQ&biw=1920&bih=972&dpr=1

http://lime-technology.com/forum/index.php?topic=15589.0

Rectal Prolapse · Sep 18, 2013

Lost-Benji: damn, now you got me all paranoid about my Norco 4224! I purchased mine in November of 2010. I am probably not safe.

danswartz · Sep 18, 2013

3-way mirrors are not only better than 4 hot spares (in terms of not wasting hardware), your random IOPs (particularly reads) will rock.

madrebel · Sep 18, 2013

do note however that your writes balloon by a factor of three which can saturate your SAS bus.

Lost-Benji · Sep 18, 2013

Rectal Prolapse said:
Lost-Benji: damn, now you got me all paranoid about my Norco 4224! I purchased mine in November of 2010. I am probably not safe.

No need to be paranoid, just be wise with purchases. The Greens were phased out years ago but there may still be some NOS floating around. There is also some different name/badged versions of the Norco's also doing the rounds that may still be covered under the same precautions.

Once repaired, they are good bits of kit. I have a couple and they are filled to brim with drives and systems on 650W PSU's, no issues for the past nearly two years.

Rectal Prolapse · Sep 18, 2013

Ok Benji - I can always practice my soldering skills if I ever run into issues.

haileris · Sep 19, 2013

Probably a little off topic but I think this was the blog on the "exploding" backplanes.

http://wsyntax.com/cs/killer-norco-case/

It doesn't appear to contain any comments putting the blame on the power supply. Is there another blog that I'm not aware of?

brutalizer · Sep 20, 2013

odditory said:
Where did you read about [bad norco backplanes]? The only one I'm aware of involved someone cheaping out on a crappy, underpowered PSU and then blaming the Norco backplane for some of their drives dying, and proceeding to write a long blog rant about how Norco was to blame, only to later admit it was the PSU that caused the problem.

Oh boy.

danswartz · Sep 20, 2013

Content-free reply?

Norco 4224 + ZFS Transfer Errors

Weaksauce

Weaksauce

[H]ard|Gawd

Supreme [H]ardness

2[H]4U

Weaksauce

[H]ard|Gawd

2[H]4U

Gawd

Weaksauce

Weaksauce

Weaksauce

Limp Gawd

Gawd

2[H]4U

Gawd

Limp Gawd

Gawd

Limp Gawd

[H]ard|Gawd

2[H]4U