Did I just lose all of my data?

Hey guys,

I built a NAS running Ubuntu with a 3ware 16-port controller:

http://www.hardforums.com/showthread.php?t=1449908

I followed advice and created the filesystem with:

Code:
mkfs.xfs /dev/sdb

I have since migrated the 6TB RAID5 array (using four 2TB disks) to a 10TB RAID6 array and had no problems.

Today, I tried to move the OS drive to a USB key to further reduce power consumption. So I pulled the OS drive and installed the USB drive, and I also created a bootable Ubuntu 9.10 USB key from which to install the OS. I was a bit scared to leave the 10TB storage array connected for fear of accidentally wiping out the data, so I pulled the PCIe card as well before installing the OS.

When everything was up and running, I reinstalled the 3ware PCIe controller and fired up the NAS. I was busy getting the drivers and other software running and couldn't remember how to set up Samba, so I was unconcerned that the 10TB filesystem wasn't mounted yet. While installing the latest version of the 3ware drivers, I saw that new firmware was available, so I flashed the firmware on the 3ware card. Then I noticed that the 10TB array, now /dev/sda (the device name changed with the USB drive installed and the original OS drive pulled), showed up as unallocated space.

What to do?? I'm in the process of running TestDisk, but I'm not sure which partition table type to pick.

TIA,
Chester
 
You are fine. The reason it shows up as unallocated is that you formatted the raw device without a partition table. Does mount spit out any errors? If you haven't tried mounting it yet, give it a shot and report back.
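
If you want a quick, read-only sanity check first, something like this should tell you whether an XFS signature is still visible on the raw device (assuming the array still shows up as /dev/sdb; neither command writes anything):

Code:
sudo blkid /dev/sdb     # should report TYPE="xfs" if the superblock is intact
sudo file -s /dev/sdb   # should report something like "SGI XFS filesystem data"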
 
I tried the following:

Code:
sudo mount -t xfs /dev/sdb /media/sdb

and got this:

Code:
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Oh, I plugged the original OS drive back in and removed the USB drive, just to see if I could get at the data and make sure I hadn't wiped it out.
 
Here it is (fdisk -l output):

Code:
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x2eb80a51

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       18662   149902483+  83  Linux
/dev/sda2           18663       19457     6385837+   5  Extended
/dev/sda5           18663       19457     6385806   82  Linux swap / Solaris

WARNING: The size of this disk is 10.0 TB (9999944253440 bytes).
DOS partition table format can not be used on drives for volumes
larger than (2199023255040 bytes) for 512-byte sectors. Use parted(1) and GUID 
partition table format (GPT).


Disk /dev/sdb: 9999.9 GB, 9999944253440 bytes
255 heads, 63 sectors/track, 1215757 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
 
Try the same mount command again, then dump the logs:

Code:
sudo dmesg|tail && tail -20 /var/log/messages
 
Code:
[  152.880833] XFS: bad magic number
[  152.880838] XFS: SB validate failed
[ 5363.048549] XFS: bad magic number
[ 5363.048554] XFS: SB validate failed
[ 5414.578833] XFS: bad magic number
[ 5414.578837] XFS: SB validate failed
[ 6029.616834] XFS: bad magic number
[ 6029.616838] XFS: SB validate failed
[ 7073.388789] XFS: bad magic number
[ 7073.388793] XFS: SB validate failed
Dec  8 17:36:48 thevault kernel: [   25.616552] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
Dec  8 17:36:48 thevault kernel: [   25.617872] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
Dec  8 17:36:49 thevault 3dm2: ENCL: Monitoring service started.
Dec  8 17:36:49 thevault 3dm2: ENCL: Enclosure Monitoring service is enabled.
Dec  8 17:36:59 thevault kernel: [   36.408117] XFS: bad magic number
Dec  8 17:36:59 thevault kernel: [   36.408121] XFS: SB validate failed
Dec  8 17:38:53 thevault kernel: [  150.245540] XFS: bad magic number
Dec  8 17:38:53 thevault kernel: [  150.245545] XFS: SB validate failed
Dec  8 17:38:54 thevault kernel: [  151.669042] XFS: bad magic number
Dec  8 17:38:54 thevault kernel: [  151.669047] XFS: SB validate failed
Dec  8 17:38:55 thevault kernel: [  152.880833] XFS: bad magic number
Dec  8 17:38:55 thevault kernel: [  152.880838] XFS: SB validate failed
Dec  8 19:05:45 thevault kernel: [ 5363.048549] XFS: bad magic number
Dec  8 19:05:45 thevault kernel: [ 5363.048554] XFS: SB validate failed
Dec  8 19:06:37 thevault kernel: [ 5414.578833] XFS: bad magic number
Dec  8 19:06:37 thevault kernel: [ 5414.578837] XFS: SB validate failed
Dec  8 19:16:52 thevault kernel: [ 6029.616834] XFS: bad magic number
Dec  8 19:16:52 thevault kernel: [ 6029.616838] XFS: SB validate failed
Dec  8 19:34:16 thevault kernel: [ 7073.388789] XFS: bad magic number
Dec  8 19:34:16 thevault kernel: [ 7073.388793] XFS: SB validate failed
 
If you plug your original drive back in, is everything good? Did you by chance switch between 32-bit and 64-bit during the move to USB?
 
OK, just so I understand: you tried the original boot drive, with the only difference being the updated firmware?
 
I reverted to the old firmware and tried mounting. I got the same error, and I ran the same command to check the logs:

Code:
[  354.184308] XFS: bad magic number
[  354.184313] XFS: SB validate failed
[  380.090199] XFS: bad magic number
[  380.090203] XFS: SB validate failed
[  380.908267] XFS: bad magic number
[  380.908271] XFS: SB validate failed
[  381.463919] XFS: bad magic number
[  381.463923] XFS: SB validate failed
[  403.104370] XFS: bad magic number
[  403.104375] XFS: SB validate failed
Dec  8 20:20:15 thevault kernel: [   22.913940] usplash:364 freeing invalid memtype ffffffffe8000000-ffffffffe9000000
Dec  8 20:20:17 thevault kernel: [   24.635651] ppdev: user-space parallel port driver
Dec  8 20:20:17 thevault kernel: [   25.045371] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Dec  8 20:20:17 thevault kernel: [   25.045938] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Dec  8 20:20:18 thevault kernel: [   25.369671] XFS: bad magic number
Dec  8 20:20:18 thevault kernel: [   25.369676] XFS: SB validate failed
Dec  8 20:20:18 thevault kernel: [   25.655999] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
Dec  8 20:20:18 thevault kernel: [   25.657243] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
Dec  8 20:20:19 thevault 3dm2: ENCL: Monitoring service started.
Dec  8 20:20:19 thevault 3dm2: ENCL: Enclosure Monitoring service is enabled.
Dec  8 20:25:47 thevault kernel: [  354.184308] XFS: bad magic number
Dec  8 20:25:47 thevault kernel: [  354.184313] XFS: SB validate failed
Dec  8 20:26:13 thevault kernel: [  380.090199] XFS: bad magic number
Dec  8 20:26:13 thevault kernel: [  380.090203] XFS: SB validate failed
Dec  8 20:26:14 thevault kernel: [  380.908267] XFS: bad magic number
Dec  8 20:26:14 thevault kernel: [  380.908271] XFS: SB validate failed
Dec  8 20:26:15 thevault kernel: [  381.463919] XFS: bad magic number
Dec  8 20:26:15 thevault kernel: [  381.463923] XFS: SB validate failed
Dec  8 20:26:36 thevault kernel: [  403.104370] XFS: bad magic number
Dec  8 20:26:36 thevault kernel: [  403.104375] XFS: SB validate failed
 
I ran TestDisk for most of the day. I first tried to use the "None" option here:

Code:
Please select the partition table type, press Enter when done.
[Intel  ]  Intel/PC partition
[EFI GPT]  EFI GPT partition map (Mac i386, some x86_64...)
[Mac    ]  Apple partition map
[None   ]  Non partitioned media
[Sun    ]  Sun Solaris partition
[XBox   ]  XBox partition
[Return ]  Return to disk selection

But what it started discovering was complete gibberish. I then picked the "Intel" option and saw this right before the analysis completed:

Code:
  No partition         39620   2  1 769074  98 46 11718684604
  No partition         52546   0  1 487631  46 61 6989643484

Of course, as soon as it finished, it stated that no partitions were found, which is what I would expect. Is there any way to use the information above (I'm not exactly sure how to interpret it) to fix something? I'm thinking this data is valid considering the xfs_growfs command I performed a while ago.

Thanks,
Chester
 
These errors mean something:

Code:
[  354.184308] XFS: bad magic number
[  354.184313] XFS: SB validate failed

Nothing caught my eye during a quick Google search. You'd better hit the XFS mailing list or a more technical Linux forum before going any further.
 
I'm thinking this data is valid considering the xfs_growfs command I performed a while ago.

You haven't mentioned running xfs_growfs...

I think if the filesystem can't find its metadata you're probably pretty hooped. No idea how that would happen, but it sounds to me like something in the firmware upgrade may have corrupted the data.

Try xfs_check and xfs_repair -n against it; if they can't turn anything up, you're probably going to have to resort to recovering what you can with photorec or more in-depth techniques, assuming the firmware hasn't made bit soup out of your data.
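
For reference, a minimal read-only sketch, assuming the array still shows up as /dev/sdb (the filesystem has to be unmounted, which it already is here):

Code:
sudo xfs_check /dev/sdb      # read-only consistency check
sudo xfs_repair -n /dev/sdb  # -n = no-modify mode: report problems but write nothing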

Edit:

Code:
[  354.184308] XFS: bad magic number
[  354.184313] XFS: SB validate failed

SB is the superblock: a small metadata area at a specific location on the disk that the filesystem uses to store information about itself. Normally it sits in one of the first few blocks on the device, and a few copies are usually kept elsewhere on the device for redundancy.

"Bad magic number" means that the magic, a specific value XFS uses to identify XFS filesystems, isn't present in the primary superblock where it's supposed to be, so the superblock is rendered invalid. This protects against the disaster that would happen if you tried to mount, say, an ext2 filesystem as XFS, and it usually indicates that either the block device can't be read properly or the data is corrupt. If that's the case, xfs_repair can validate the secondary superblocks and restore one of them to the primary location.
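
If you're curious, you can peek at the primary superblock yourself without writing anything. A rough sketch using xfs_db in read-only mode (again assuming /dev/sdb); a healthy XFS superblock reports the magic 0x58465342 ("XFSB"):

Code:
sudo xfs_db -r -c "sb 0" -c "print magicnum" /dev/sdb
# on an intact filesystem this prints:
# magicnum = 0x58465342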

Personally I'm wondering what these are about:
Code:
Dec  8 17:36:48 thevault kernel: [   25.616552] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
Dec  8 17:36:48 thevault kernel: [   25.617872] 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x85.
 
Thanks for the help, guys. I did run xfs_growfs when I went from four disks (in RAID5) to seven disks (in RAID6). Whether or not it did anything, I can't say, as the larger volume wasn't available until after I rebooted.
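
For reference, the grow step is just xfs_growfs run against the mounted filesystem; a rough sketch, assuming the /media/sdb mount point from my earlier mount attempt:

Code:
sudo xfs_growfs -d /media/sdb   # -d: grow the data section to the maximum the device allows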

Running xfs_check yields this:

Code:
xfs_check: /dev/sdb is not a valid XFS filesystem (unexpected SB magic number 0xeb639042)
xfs_check: WARNING - filesystem uses v1 dirs,limited functionality provided.
cache_node_purge: refcount was 1, not zero (node=0x1f37eb0)
xfs_check: cannot read root inode (117)
cache_node_purge: refcount was 1, not zero (node=0x1f59a10)
xfs_check: cannot read realtime bitmap inode (117)
xfs_check: WARNING - filesystem uses v1 dirs,limited functionality provided.
bad superblock magic number eb639042, giving up

So I think it's like you said: the metadata got hosed. I'm running xfs_repair -n; it said the primary superblock was bad, and now it's attempting to find a secondary superblock.

Chester
 
Good news! I ran xfs_repair -n as I'd just said. After about two hours of scanning, it said it had found a candidate secondary superblock, so I reran the command without the -n switch, and this is what I got:

Code:
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Note - stripe unit (0) and width (0) fields have been reset.
Please set with mount -o sunit=<value>,swidth=<value>
done

IT WORKS! Thanks guys for all your help!
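
One follow-up on the stripe unit and width note near the end of the repair output: those values can be passed at mount time, in units of 512-byte sectors. A rough sketch, assuming a 64 KiB stripe size on the 7-drive RAID6 (5 data drives); the real stripe size should be checked in the 3ware management tools:

Code:
# sunit  = 64 KiB / 512 B        = 128
# swidth = sunit * 5 data drives = 640
sudo mount -t xfs -o sunit=128,swidth=640 /dev/sdb /media/sdb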
 