ARECA Array Degraded...

PhoenixC5 · Oct 26, 2011

So I'm on my way upstairs to bed and I hear a loud, annoying beep from my server closet. I log in and find a degraded RAID array. One drive was listed as failed. I rebooted the server and now that drive is listed as free. I checked the log and this is what was there:

2011-10-25 22:50:09 Enc#2 Slot#6 Device Failed
2011-10-25 22:50:09 Raid Set # 000 RaidSet Degraded
2011-10-25 22:50:09 ARC-1880-VOL#000 Volume Degraded
2011-10-25 22:49:48 Enc#2 Slot#6 Time Out Error
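
The controller lists events newest first, so the sequence is easier to follow once sorted. A minimal sketch, assuming the four lines above are the complete log for the incident; the `parse_events` helper is made up for illustration and is not part of any Areca tool:

```python
from datetime import datetime

# Event lines copied from the controller log quoted above (newest first).
LOG = """\
2011-10-25 22:50:09 Enc#2 Slot#6 Device Failed
2011-10-25 22:50:09 Raid Set # 000 RaidSet Degraded
2011-10-25 22:50:09 ARC-1880-VOL#000 Volume Degraded
2011-10-25 22:49:48 Enc#2 Slot#6 Time Out Error
"""

def parse_events(text):
    """Return (timestamp, message) tuples sorted oldest first."""
    events = []
    for line in text.strip().splitlines():
        date, time, message = line.split(" ", 2)
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        events.append((stamp, message))
    # sorted() is stable, so same-second entries keep their log order.
    return sorted(events, key=lambda e: e[0])

events = parse_events(LOG)
first_stamp, first_message = events[0]
# Seconds between the first and the last logged event.
gap_seconds = (events[-1][0] - first_stamp).total_seconds()
print(first_message, gap_seconds)
```

Read this way, the timeout on Enc#2 Slot#6 comes first and the drive, raid set, and volume are all marked failed or degraded in the same second shortly afterwards.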

Now my main array is sitting here degraded. So first a few questions... I don't have a spare drive for the array, so I can order one real quick off Amazon, but in the meantime, I'm guessing the drive is fine and this is a fluke timeout. If that is the case, can I just rebuild the array with that drive again? Is there a quick and safe way to do this? Honestly, I've not had one go bad on this controller, so I'm not even sure of the steps. This is an 1880.

Help!!

xxan0xx · Oct 26, 2011

The safe way is to replace the drive, and the new drive should automatically go back into the array. While you're purchasing a drive, I'd suggest picking up another one as a hot spare so it can take over immediately.

Trepidati0n · Oct 26, 2011

xxan0xx said:
The safe way is to replace the drive, and the new drive should automatically go back into the array. While you're purchasing a drive, I'd suggest picking up another one as a hot spare so it can take over immediately.

^^^^

Anything less than this is just rolling the dice. Secondly, does the OP have a backup?

PhoenixC5 · Oct 26, 2011

Ugh...things I don't have time and money for right now. Always happens at the worst possible time doesn't it?

I have the important stuff backed up on my secondary array, but it will probably be Monday before a new drive could get here. So can I rebuild the array temporarily with the other drive? Or at least run a test on it somehow?

nitrobass24 · Oct 26, 2011

Put the *bad* drive in a different computer and test it.

What drives are you using?

PhoenixC5 · Oct 26, 2011

Hitachi HDS723020BLA642 - the 2TB 7200RPM 6Gb/s drives. First time I've had an issue since I installed the array in January.

nitrobass24 · Oct 26, 2011

Just a guess, but I am assuming you live in Phoenix, AZ?

There is a Fry's there; you could just go pick up a new drive.

PhoenixC5 · Oct 26, 2011

I'm actually in Dallas. I did find one locally and I'm going to run this afternoon to pick one (or two, or three) up.

So once I get these drives, I put them in....then what?

nitrobass24 · Oct 26, 2011

Well, just replace the bad drive.
Rebuild.
Then, if you want to add more capacity, you can add the additional drives, expand the raidset, expand the volume set, and expand the partition.

OldSchool · Oct 26, 2011

I fear the day that one of my drives dies, I am not sure how to identify them in the Norco RPC-4220...

drescherjm · Oct 26, 2011

I am not sure how to identify them in the Norco RPC-4220...

Doesn't that have activity LEDs for each drive?

PhoenixC5 · Oct 26, 2011

I believe you can have the Areca card give you a solid light and use that to find it. I document the serial number and location as I throw them in. So...I'm about to add in a pair of drives to replace the dead one. Now the question is...hot spare or RAID 6?

Thoughts? Also, I'm sure I'll have more questions as I go through the repair process...wish me luck!

PhoenixC5 · Oct 26, 2011

Drive added and the rebuild went on its own. Hooray! Now...I'll let that go and finish. Then, I will pull the other drive and most likely RMA it. In the meantime, I now have another drive, so...go hot spare or RAID 6?

odditory · Oct 26, 2011

Don't RMA before running at least a surface scan with HDTune or Hard Disk Sentinel, if not Hitachi DFT (or whatever their latest tool is that supports 3TB drives). And look at the SMART stats (hook it up to a plain motherboard SATA port). I'd lay money that the timeout was a false positive. I get them on occasion, never anything wrong with the drive. These Hitachi drives don't really fail. In the last 3 years I've had to RMA exactly one out of over 150 x 1TB and 2TB Hitachis. My hunch is that the sensitivity is too high in the way Areca's code determines a timeout and kicks a drive out of the array; I've been bugging them about this forever. No, this is not a TLER/CCTL issue, it's something else.

Also, if you're running a separate SAS expander card like the HP SAS Expander and the expander chip gets too hot, one of the byproducts is drive timeouts in the Areca event log. I was able to reproduce it in testing, and it's why I put a 40x40x10mm fan on all my expander heatsinks.

As to hot spare versus RAID6, it's no contest: RAID6. With drives as cheap as they are, running RAID5 these days is a crime. And if you travel a lot, RAID6 + hot spare.

PhoenixC5 · Oct 27, 2011

The rebuild completed in 7 hours and 20 minutes. The state is back to normal. I guess I will go ahead and pull the drive sometime soon and test it in my desktop. If it all looks good, then I will consider throwing it back in and converting the array to RAID 6 with the other new drive.
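
For a rough sense of what that rebuild time means, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes the replacement is a 2TB drive of exactly 2e12 bytes and that the controller wrote the whole drive during the rebuild, neither of which is confirmed in this thread.

```python
def rebuild_rate_mb_s(capacity_bytes, hours, minutes):
    """Average write rate in MB/s (decimal megabytes) over the rebuild."""
    seconds = hours * 3600 + minutes * 60
    return capacity_bytes / seconds / 1e6

# Assumption: 2 TB = 2e12 bytes, fully rewritten in 7 h 20 min.
rate = rebuild_rate_mb_s(2e12, 7, 20)
print(f"{rate:.1f} MB/s")
```

The same function gives a ballpark for how long a later rebuild or migration might take if the rate stays similar, though load on the array during the operation will change it.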

So...RAID 6 conversion. Is that pretty painless? Any gotchas I should look for? I'd like to do it live once I've backed everything up.

BENN0 · Oct 27, 2011

I believe RAID6 will be preferable over RAID5 + hotspare. In case something goes wrong during a RAID5 rebuild to the hotspare (unrecoverable read error), your complete array will still drop.
With RAID6 you will have the extra parity that can save you from a single unrecoverable read error occurring during a drive rebuild.
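
To put a rough number on that risk, here is a sketch of the usual simplified model: every bit read from the surviving drives fails independently with a fixed probability. The 1e-14 per-bit rate and the 8-drive array size are assumptions for illustration only; the thread does not state the array size, and real drives often do better than this model suggests.

```python
import math

def p_ure_during_rebuild(n_drives, drive_bytes, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while reading
    every surviving drive in full during a RAID5 rebuild."""
    bits_read = (n_drives - 1) * drive_bytes * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to avoid rounding loss.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# Hypothetical array: 8 drives of 2 TB (2e12 bytes) each.
prob = p_ure_during_rebuild(8, 2e12)
print(f"{prob:.2%}")
```

With RAID6, a single URE hit while rebuilding one failed drive can still be corrected from the second parity, which is the point being made above.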

nitrobass24 · Oct 27, 2011

PhoenixC5 said:
The rebuild completed in 7 hours and 20 minutes. The state is back to normal. I guess I will go ahead and pull the drive sometime soon and test it in my desktop. If it all looks good, then I will consider throwing it back in and converting the array to RAID 6 with the other new drive.

So...RAID 6 conversion. Is that pretty painless? Any gotchas I should look for? I'd like to do it live once I've backed everything up.

Yeah, it's easy. It's gonna take some time; it's basically like doing a rebuild.
Gotchas? Don't turn the PC off, and pray a drive doesn't die or get kicked out during the conversion.

PhoenixC5 · Oct 27, 2011

I am running the HD Tune (free version) error scan on the drive right now. Here are the SMART stats at the moment:

HD Tune: Hitachi HDS723020BLA Health

ID    Attribute                  Current  Worst  Threshold  Data     Status
(01)  Raw Read Error Rate        97       97     16         393216   Ok
(02)  Throughput Performance     136      136    54         80       Ok
(03)  Spin Up Time               100      100    24         357      Ok
(04)  Start/Stop Count           100      100    0          6        Ok
(05)  Reallocated Sector Count   100      100    5          0        Ok
(07)  Seek Error Rate            100      100    67         0        Ok
(08)  Seek Time Performance      133      133    20         27       Ok
(09)  Power On Hours Count       100      100    0          1461     Ok
(0A)  Spin Retry Count           100      100    60         0        Ok
(0C)  Power Cycle Count          100      100    0          6        Ok
(C0)  Power Off Retract Count    100      100    0          9        Ok
(C1)  Load Cycle Count           100      100    0          9        Ok
(C2)  Temperature                136      136    0          1638444  Ok
(C4)  Reallocated Event Count    100      100    0          0        Ok
(C5)  Current Pending Sector     100      100    0          2        Ok
(C6)  Offline Uncorrectable      100      100    0          2        Ok
(C7)  Ultra DMA CRC Error Count  200      200    0          0        Ok

Power On Time : 1461
Health Status : Ok

I'm not entirely certain what I should be looking for as far as SMART goes...
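
One way to narrow down what to look at is to filter the table for the handful of attributes people usually watch. The sketch below parses a few rows copied from the HD Tune output above; the watch list (05, C4, C5, C6, C7) and the "any nonzero raw value is worth a second look" rule are assumptions for illustration, not a vendor-defined health test.

```python
import re

# A few rows copied from the HD Tune output above.
ROWS = """\
(01) Raw Read Error Rate 97 97 16 393216 Ok
(05) Reallocated Sector Count 100 100 5 0 Ok
(C4) Reallocated Event Count 100 100 0 0 Ok
(C5) Current Pending Sector 100 100 0 2 Ok
(C6) Offline Uncorrectable 100 100 0 2 Ok
(C7) Ultra DMA CRC Error Count 200 200 0 0 Ok
"""

# (ID) name current worst threshold raw status
ROW_RE = re.compile(
    r"^\((\w{2})\)\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w+)$"
)

# Assumption: attributes where a nonzero raw value deserves a second look.
WATCH_NONZERO_RAW = {"05", "C4", "C5", "C6", "C7"}

def parse_rows(text):
    """Turn HD Tune style lines into dicts with integer fields."""
    rows = []
    for line in text.strip().splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            attr_id, name, cur, worst, thresh, raw, status = m.groups()
            rows.append({"id": attr_id, "name": name, "current": int(cur),
                         "worst": int(worst), "threshold": int(thresh),
                         "raw": int(raw), "status": status})
    return rows

def flag(rows):
    """Keep rows at/below threshold or on the watch list with raw > 0."""
    flagged = []
    for r in rows:
        below = r["threshold"] > 0 and r["current"] <= r["threshold"]
        if below or (r["id"] in WATCH_NONZERO_RAW and r["raw"] > 0):
            flagged.append(r)
    return flagged

flagged = flag(parse_rows(ROWS))
for r in flagged:
    print(r["id"], r["name"], "raw =", r["raw"])
```

On the rows above this leaves only the two sector-related attributes with a raw value of 2, even though HD Tune reports every attribute as Ok.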

odditory said:
Don't RMA before running at least a surface scan with HDTune or Hard Disk Sentinel, if not Hitachi DFT (or whatever their latest tool is that supports 3TB drives). And look at the SMART stats (hook it up to a plain motherboard SATA port). I'd lay money that the timeout was a false positive. I get them on occasion, never anything wrong with the drive. These Hitachi drives don't really fail. In the last 3 years I've had to RMA exactly one out of over 150 x 1TB and 2TB Hitachis. My hunch is that the sensitivity is too high in the way Areca's code determines a timeout and kicks a drive out of the array; I've been bugging them about this forever. No, this is not a TLER/CCTL issue, it's something else.

Also, if you're running a separate SAS expander card like the HP SAS Expander and the expander chip gets too hot, one of the byproducts is drive timeouts in the Areca event log. I was able to reproduce it in testing, and it's why I put a 40x40x10mm fan on all my expander heatsinks.

As to hot spare versus RAID6, it's no contest: RAID6. With drives as cheap as they are, running RAID5 these days is a crime. And if you travel a lot, RAID6 + hot spare.

mwroobel · Oct 28, 2011

PhoenixC5 said:
Drive added and the rebuild went on its own. Hooray! Now...I'll let that go and finish. Then, I will pull the other drive and most likely RMA it. In the meantime, I now have another drive, so...go hot spare or RAID 6?

Without question, RAID6.

drescherjm · Oct 28, 2011

(C5) Current Pending Sector 100 100 0 2 Ok
(C6) Offline Uncorrectable 100 100 0 2 Ok

These are 2 sectors that the drive wrote but cannot read any more. Do a Google search for URE (unrecoverable read error) or bit rot. For the drive, this could be a sign of a mechanical failure, or it could just be a small defective spot in the media, or a cosmic ray that hit the drive. 2 is not a large number.

I would take the drive out and use some program to write to every sector of the disk multiple times. Then look at the SMART.

PhoenixC5 · Oct 28, 2011

This is what I got back from HD Tune (free). Obviously there is a damaged block, but I'm not sure what the impact of that is. Does that mean I should just RMA the drive? Here is the SMART after that scan:
HD Tune: Hitachi HDS723020BLA Health

ID    Attribute                  Current  Worst  Threshold  Data     Status
(01)  Raw Read Error Rate        95       95     16         393226   Ok
(02)  Throughput Performance     136      136    54         80       Ok
(03)  Spin Up Time               100      100    24         357      Ok
(04)  Start/Stop Count           100      100    0          6        Ok
(05)  Reallocated Sector Count   100      100    5          0        Ok
(07)  Seek Error Rate            100      100    67         0        Ok
(08)  Seek Time Performance      133      133    20         27       Ok
(09)  Power On Hours Count       100      100    0          1484     Ok
(0A)  Spin Retry Count           100      100    60         0        Ok
(0C)  Power Cycle Count          100      100    0          6        Ok
(C0)  Power Off Retract Count    100      100    0          9        Ok
(C1)  Load Cycle Count           100      100    0          9        Ok
(C2)  Temperature                166      166    0          1638436  Ok
(C4)  Reallocated Event Count    100      100    0          0        Ok
(C5)  Current Pending Sector     100      100    0          1        Ok
(C6)  Offline Uncorrectable      100      100    0          0        Ok
(C7)  Ultra DMA CRC Error Count  200      200    0          0        Ok

Power On Time : 1484
Health Status : Ok
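
Since two SMART snapshots have now been posted, comparing their raw (Data) values shows what changed across the scan. A minimal sketch with the values copied from the two HD Tune tables in this thread, keyed by attribute ID; it only reports differences and does not judge whether the drive should be RMA'd.

```python
# Raw "Data" values from the first snapshot (before the error scan).
before = {"01": 393216, "02": 80, "03": 357, "04": 6, "05": 0, "07": 0,
          "08": 27, "09": 1461, "0A": 0, "0C": 6, "C0": 9, "C1": 9,
          "C2": 1638444, "C4": 0, "C5": 2, "C6": 2, "C7": 0}

# Raw "Data" values from the second snapshot (after the error scan).
after = {"01": 393226, "02": 80, "03": 357, "04": 6, "05": 0, "07": 0,
         "08": 27, "09": 1484, "0A": 0, "0C": 6, "C0": 9, "C1": 9,
         "C2": 1638436, "C4": 0, "C5": 1, "C6": 0, "C7": 0}

# Keep only the attributes whose raw value moved, as after - before.
changes = {attr: after[attr] - before[attr]
           for attr in before if after[attr] != before[attr]}

for attr, delta in changes.items():
    print(f"({attr}) {before[attr]} -> {after[attr]} ({delta:+d})")
```

The reallocated counts (05, C4) stayed at 0 while Current Pending Sector went from 2 to 1 and Offline Uncorrectable from 2 to 0, so those two are the rows to keep watching after any further write test.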
