ARC-1880ix-24 Support Question - Raid5 WorstCase Recovery

hey forum
i found this place because i own an areca controller and google pointed me here.
as the title says, i am pretty desperate :(
i'll post my story here and hope that readers don't just have fun with it,
but that someone also has good advice. here we go:
the controller in question:
Code:
Controller Name         ARC-1880IX-24
Firmware Version        V1.51 2012-07-04
BOOT ROM Version        V1.51 2012-07-04
PL Firmware Version     13.0.59.0
unfortunately a worst case happened on one arc-1880ix-24 system.
a volume disappeared from one of our servers. i investigated
the error and found that the raidset 'seagate' was in a failed state:
Code:
2013-07-22 07:01:19     Enc#2 PHY#0     Device Failed            
2013-07-22 07:01:19     Enc#2 PHY#1     Device Failed            
2013-07-22 07:01:18     seagate         RaidSet Degraded                 
2013-07-22 07:01:13     seagate         Volume Failed            
2013-07-22 07:00:29     seagate         RaidSet Degraded                 
2013-07-22 07:00:29     seagate         Volume Degraded                  
2013-07-22 00:58:29     Enc#2 PHY#1     Time Out Error
as a first step i activated the failed disks.
PHY#0 became 'normal'
PHY#1 became 'free'
the volume came back online but was degraded.

the next step was to power down the server,
disconnect all other raidsets/drives,
and pull out both drives (PHY#0 and PHY#1) to check them
in another workstation with 'gsmartcontrol'.
the results:
PHY#1 had no errors.
PHY#0 had 6 similar errors:
Code:
Error 6 occurred at disk power-on lifetime: 7118 hours (296 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00   6d+14:08:10.700  READ DMA EXT
  25 00 00 ff ff ff 4f 00   6d+14:08:07.045  READ DMA EXT
  25 00 40 ff ff ff 4f 00   6d+14:08:07.031  READ DMA EXT
  b0 da 00 00 4f c2 00 00   6d+14:08:06.996  SMART RETURN STATUS
  25 00 00 ff ff ff 4f 00   6d+14:08:06.981  READ DMA EXT
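(side note: the same ATA error log can be pulled from a shell with smartctl instead of gsmartcontrol -- the device name below is just an example, check yours with `lsblk` or `smartctl --scan` first)
Code:
```shell
# example device name -- NOT necessarily yours; check with lsblk or smartctl --scan
DEV=/dev/sdb

# print the same ATA error log shown above (the UNC read errors)
smartctl -l error "$DEV"

# optionally kick off a long surface self-test and read its result later
smartctl -t long "$DEV"
smartctl -l selftest "$DEV"
```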
next i swapped PHY#0 for a new drive and powered the server on.
i made the new drive a hotspare and the rebuild started.
then, after 24 minutes, disaster struck:
Code:
2013-07-22 10:30:37     Enc#2 PHY#0     Device Failed            
2013-07-22 10:30:37     Enc#2 PHY#0     Device Removed           
2013-07-22 10:30:27     seagate         RaidSet Degraded                 
2013-07-22 10:30:27     seagate         Volume Failed            
2013-07-22 10:30:22     Enc#2 PHY#0     Time Out Error           
2013-07-22 10:27:05     seagate         Failed Volume Revived            
2013-07-22 10:27:05     Enc#2 PHY#0     Device Inserted                  
2013-07-22 10:26:56     Enc#2 PHY#1     Device Removed           
2013-07-22 10:26:56     Enc#2 PHY#0     Device Removed           
2013-07-22 10:26:56     seagate         RaidSet Degraded                 
2013-07-22 10:26:40     seagate         Volume Failed            
2013-07-22 10:26:35     Enc#2 PHY#1     Time Out Error           
2013-07-22 10:25:25     seagate         Stop Rebuilding         000:24:55        
2013-07-22 10:25:25     seagate         RaidSet Degraded                 
2013-07-22 10:25:25     seagate         Volume Failed            
2013-07-22 10:25:20     Enc#2 PHY#0     Time Out Error           
2013-07-22 10:24:13     Enc#2 PHY#0     Time Out Error           
2013-07-22 10:00:30     seagate         Start Rebuilding                 
2013-07-22 10:00:28     seagate         Rebuild RaidSet
i was quite surprised and activated the failed disks again.
Code:
2013-07-22 10:43:53     192.168.222.001         HTTP Log In              
2013-07-22 10:43:30     seagate         Failed Volume Revived            
2013-07-22 10:43:30     H/W Monitor     Raid Powered On
unfortunately, PHY#0 became 'free' and PHY#1 became 'raidset member'.
the situation is bad now:
drive PHY#0(A) was kicked out during the rebuild and is 'free' now.
Code:
PHY#0(A)        Free            2000.4GB        ST32000542AS
PHY#1(12)       seagate         2000.4GB        WDC WD2001FASS-00U0B0
PHY#2   N.A.    N.A.    N.A.
PHY#3   N.A.    N.A.    N.A.
PHY#4(B)        seagate         2000.4GB        ST32000542AS
PHY#5(D)        seagate         2000.4GB        ST32000542AS
PHY#6   N.A.    N.A.    N.A.
PHY#7   N.A.    N.A.    N.A.
PHY#8(E)        seagate         2000.4GB        ST32000542AS
PHY#9(F)        seagate         2000.4GB        ST32000542AS
PHY#10  N.A.    N.A.    N.A.
PHY#11  N.A.    N.A.    N.A.
PHY#12(10)      seagate         2000.4GB        ST32000542AS
PHY#13(11)      seagate         2000.4GB        ST32000542AS
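(note for later readers: if Areca's cli32/cli64 command-line utility is installed, a listing like the one above can also be pulled without the web interface. the command names below are my assumption from Areca's CLI -- verify against your version's `help` output)
Code:
```shell
cli64 disk info      # per-slot disk listing like the table above
cli64 rsf info       # raid sets and their states
cli64 vsf info       # volume sets and their states
cli64 event info     # controller event log
```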
as far as i can tell, i now have a corrupt raid5 volume whose rebuild was
interrupted because drive PHY#0 failed.

is there any chance to get my volume back?
what steps should i take?

edit: i also sent this report to areca's kevin wang. i'll report back with any response.
 
What exactly does your system look like?

What chassis/backplane? Are the drives internal or external?

Are all 8 of those drives in one raid5 config?
 
it is a custom-built server with 4x MB455SPF-B backplanes,
so all drives are internal.
all 8 drives are in the raid-set 'seagate', and its whole capacity is assigned to a single raid5 volume-set, also named 'seagate'.
 
I'm guessing you have no backups?

Ideally you should now buy at least 8 drives no smaller than 2 TB each and make a 1:1 copy of each of the drives in your raidset.

Next I would try putting the "failed" original phy0 back in and seeing if that will make your array come back online. I would then backup the data.
-------
If that didn't work, I would try playing around with R-Studio and the old disk to see if it can recover anything; I would spend maybe a day on this.

Next if the data was really important to me I would contact a professional recovery service.

Finally a few things to remember:

1) RAID5 is a terrible choice for large arrays. Maybe 10 years ago it was fine, but with today's drive sizes of 2 and 3 TB you should use nothing but RAID6 or better.

2) You should always backup your data.
 
indeed, the backup is quite old ...
good idea to buy another set of drives and clone the source drives.
any tips on which tool to prefer?

and thanks for the heads up, i'll do my best.
 
So you had two hard drives fail, and you were able to re-activate one of them as a raid set member?

The log you provided says your hard drive PHY#1 had time-out errors. I don't know what caused them, but my guess is sector re-allocation: when a drive hits a bad sector, it keeps retrying the read and, on success, writes the data to a newly allocated spare sector. The drive responds much more slowly than usual during this period, so the RAID controller will often report time-out errors and drop the drive from the raid set.
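One related thing worth checking (this is an assumption on my part; desktop drives like the ST32000542AS often don't support it at all) is error recovery control (ERC, also called TLER), which caps how long a drive grinds on a bad sector before reporting failure. Without that cap, the drive can stall for a minute or more and the controller times it out:
Code:
```shell
DEV=/dev/sdb   # example device name -- adjust to your system

# read the drive's current ERC read/write timeouts (if supported)
smartctl -l scterc "$DEV"

# try to cap both at 7 seconds (values are in tenths of a second);
# many desktop drives will simply reject this command
smartctl -l scterc,70,70 "$DEV"
```
Note that ERC settings set this way typically reset after a power cycle.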

Even after adding a failed hard drive back to the same raid set, I would not have attempted the RAID rebuild, because that same hard drive will likely fail again soon. Without a recent backup, I would immediately back up the existing data first.

Did you have the chance to look at the SMART attributes or temperature readings of your hard drives on your ARC-1880 controller's ArcHttp web page?

I personally have rebuilt my RAID-5 array (8 x 3TB on an ARC-1880i) twice in the last two years without any problems. Before each rebuild attempt, I checked the SMART attributes on the ArcHttp web page to make sure no other hard drive was on its last legs.
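As a rough sketch of that pre-rebuild check from a Linux shell (assuming the drives are visible as /dev/sd* devices; behind an Areca controller you may only be able to see the SMART data via the ArcHttp page instead):
Code:
```shell
# loop over example device names and print the SMART attributes that most
# often predict failure: 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector,
# 198 Offline_Uncorrectable
for DEV in /dev/sd[a-h]; do
    echo "== $DEV =="
    smartctl -A "$DEV" | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrect'
done
```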

But I don't want to worry about hard drive failures while rebuilding a RAID array so I eventually re-created a new RAID-6 array instead. Of course, having an up-to-date backup is also important.
 
after 1.5 days i managed to clone PHY#0 and PHY#1.
i used sysresccd with GNU ddrescue; it was a good choice, i think.
i even learned some linux :eek:
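for reference, a ddrescue run of the kind i did looks roughly like this (device names are examples, not my exact ones -- always triple-check with `lsblk` before writing to anything):
Code:
```shell
SRC=/dev/sdb       # failing source drive (example name!)
DST=/dev/sdc       # new target drive, same size or larger (example name!)
MAP=phy0.mapfile   # mapfile lets ddrescue resume and retry later

# pass 1: copy everything that reads cleanly, skip bad areas quickly (-n)
ddrescue -f -n "$SRC" "$DST" "$MAP"

# pass 2: go back and retry only the bad areas a few times (-r3)
ddrescue -f -r3 "$SRC" "$DST" "$MAP"
```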
however, after i cloned both 'failed drives' i put them
into the slots and fired up the machine.
all disks showed up as free, so i did a RESCUE (btw, what is the ReScUe and LeVeL2ReScUe thing?)
reboot -> controller fw timeout..
reboot -> all disks failed/free :rolleyes:
then i did some research and tweaked the settings and advanced settings of the controller.
power down, power up: raidset present, array present, rebuild started on PHY#1 :eek:
Code:
System Events Information
Time     Device     Event Type     Elapse Time     Errors
2013-07-25 06:22:00     seagate     Complete Rebuild     006:51:58      
2013-07-24 23:30:01     seagate     Start Rebuilding
i also checked some data and it seems fine so far.
now i am preparing to move the data.
 