VMFS Resignature Issues

DarkDiamond · Nov 10, 2012

Hi!

Hopefully someone will be kind enough to lend a hand. I had an issue with a RAID array last night that forced me to move my iSCSI LUNs to another RAID array and reconnect them to vSphere.

When I rescanned the HBAs and added the storage, it asked me whether I wanted to mount using the existing signature or resignature the datastore. I chose to resignature the datastore. Unfortunately, after I did this for each of my LUNs, it looks like I've lost many of the VMDKs in those LUNs. I've also lost the ability to register a VMX on the ESXi box through the vSphere Client: the option is grayed out when I right-click the .vmx file. When I look at some of the folders where my VMs were in the datastore browser, they look empty. When I try to cd into the datastore over SSH on my ESXi console, I get the error message "Invalid Argument". Lastly, there's one specific VMDK that I can see in the datastore browser, but I'm unable to add it manually to a VM.

Did I totally ruin the contents of my datastores? I do have backups (some of them not as recent as I'd like). Is there anything I can do to get access to my VMDKs (including the one that I see in the datastore browser but cannot add to a VM)?

Thanks!
Dark Diamond

lopoetve · Nov 10, 2012

That SO should not have happened. Resignaturing just matches the VMFS LVM signature to the array LUN signature.

Run:
vmkfstools -V

Then
ESXi 5/5.1
cat /var/log/vmkernel.log

ESXi 4.X
cat /var/log/messages | grep -i vmkernel | grep -iv aam

Post the most recent 10 lines, then cd into the datastore to trigger the "Invalid Argument" error and run the same cat again.
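
For anyone following along, grabbing just the tail of the log is easier than paging through the whole file. A minimal sketch, assuming the ESXi 5.x log path quoted above; the function name `last_log_lines` is made up for illustration:

```shell
# Print the last N lines of a log file.
# Defaults: 10 lines of /var/log/vmkernel.log (the ESXi 5.x path quoted above).
last_log_lines() {
    log_file=${1:-/var/log/vmkernel.log}
    line_count=${2:-10}
    tail -n "$line_count" "$log_file"
}

# Example: last_log_lines /var/log/vmkernel.log 10
```

Running it once after `vmkfstools -V` and once after the failing cd gives two short excerpts to compare.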

DarkDiamond · Nov 10, 2012

These are the last few lines of the kernel log. Looks ominous.

5210f2-46f2-001b216b5406 ("Servers 2") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (193) and freeResources (61) (cluster 264).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 202 (cluster 271).[type 1] Inconsistency between bitmap (192) and freeResources (202) (cluster 271).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 211 (cluster 276).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 205 (cluster 285).[type 1] Inconsistency between bitmap (186) and freeResources (205) (cluster 285).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (199) and freeResources (55) (cluster 296).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (184) and freeResources (75) (cluster 313).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 245 (cluster 328).[type 1] Inconsistency between bitmap (7) and freeResources (245) (cluster 328).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (8) and freeResources (2) (cluster 335).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 211 (cluster 340).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 218 (cluster 349).[type 1] Inconsistency between bitmap (13) and freeResources (6) (cluster 349).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (136) and freeResources (118) (cluster 360).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (193) and freeResources (60) (cluster 377).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (1) and freeResources (142) (cluster 490).2012-11-11T01:06:13.607Z cpu4:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (193) and freeResources (61) (cluster 264).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 202 (cluster 271).[type 1] Inconsistency between bitmap (192) and freeResources (202) (cluster 271).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 211 (cluster 276).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 205 (cluster 285).[type 1] Inconsistency between bitmap (186) and freeResources (205) (cluster 285).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (199) and freeResources (55) (cluster 296).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (184) and freeResources (75) (cluster 313).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid freeResources 245 (cluster 328).[type 1] Inconsistency between bitmap (7) and freeResources (245) (cluster 328).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (8) and freeResources (2) (cluster 335).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 211 (cluster 340).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Invalid nextFreeIdx 218 (cluster 349).[type 1] Inconsistency between bitmap (13) and freeResources (6) (cluster 349).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (136) and freeResources (118) (cluster 360).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (193) and freeResources (60) (cluster 377).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[type 1] Inconsistency between bitmap (1) and freeResources (142) (cluster 490).2012-11-11T01:06:13.614Z cpu6:5095)WARNING: Res3: 3249: Volume 509eb5e4-3ac894ea-808e-001b216b5406 ("Servers 1") might be damaged on the disk. Resource cluster metadata corruption has been detected.
~ #
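
The same warning repeats for every resource cluster, so it can help to boil the log down to the set of volumes it names. A rough sketch, assuming the warning text keeps the exact wording shown above; `damaged_volumes` is a hypothetical helper name:

```shell
# List each distinct volume that the vmkernel log flags as possibly damaged.
# Reads log text on stdin; prints "UUID label" once per volume.
damaged_volumes() {
    sed -n 's/.*Volume \([0-9a-f-]*\) ("\([^"]*\)") might be damaged.*/\1 \2/p' | sort -u
}

# Example: damaged_volumes < /var/log/vmkernel.log
```

Lines that do not contain the warning are ignored, so the whole log can be fed in unfiltered.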

DarkDiamond · Nov 10, 2012

This is what appears when I try to cd into one of the VM folders. I noticed something else odd: there are timeouts to an IP address that doesn't exist on my network.

2012-11-11T01:12:26.539Z cpu4:625079)Vol3: 2359: Failed to get object 28 type 3 uuid 509eb57f-2b5210f2-46f2-001b216b5406 FD 3408ac4 gen 15 :Not found
2012-11-11T01:12:26.539Z cpu4:625079)Vol3: 2359: Failed to get object 28 type 3 uuid 509eb57f-2b5210f2-46f2-001b216b5406 FD 3808ac4 gen 16 :Not found
2012-11-11T01:12:26.540Z cpu4:625079)Vol3: 2359: Failed to get object 28 type 3 uuid 509eb57f-2b5210f2-46f2-001b216b5406 FD 3808ac4 gen 16 :Not found
2012-11-11T01:13:17.894Z cpu6:4102)WARNING: Hbr: 534: Connection failed to 192.168.5.113 (groupID=GID-eaa0eda4-3af0-4ff8-8361-87bbd3282358): Timeout
2012-11-11T01:13:17.895Z cpu6:4102)WARNING: Hbr: 4322: Failed to establish connection to [192.168.5.113]:44046(groupID=GID-eaa0eda4-3af0-4ff8-8361-87bbd3282358): Timeout
2012-11-11T01:13:19.908Z cpu0:235997)Vol3: 2359: Failed to get object 28 type 2 uuid 509eb57f-2b5210f2-46f2-001b216b5406 FD 6008ac4 gen 37 :Not found

lopoetve · Nov 12, 2012

Oh fuckballs.

What's the filer?

If you're on 5.1, run:
esxcfg-scsidevs -m

This will give you the datastore to naa mapping (naa.ABJAEHADFASDFA:1)
then run:
voma -m vmfs -f check -d /vmfs/devices/disks/naa.BLAHFROMABOVE:1
(don't forget the :1)
And save the output to a file or something, and paste it here.
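
One way to do the "save the output to a file" part is plain shell redirection. A small sketch; `run_and_save` and the output path are hypothetical, and the voma arguments in the usage comment are simply the ones given above:

```shell
# Run a command and capture both stdout and stderr in a file.
run_and_save() {
    out_file=$1
    shift
    "$@" > "$out_file" 2>&1
}

# Hypothetical usage on the host; the device name is the placeholder from the post above:
# run_and_save /tmp/voma-check.txt voma -m vmfs -f check -d /vmfs/devices/disks/naa.BLAHFROMABOVE:1
```

Capturing stderr as well keeps any error messages in the same file as the report.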

NetJunkie · Nov 12, 2012

When lopoetve says "oh fuckballs" that's not good...

lopoetve · Nov 13, 2012

NetJunkie said:
When lopoetve says "oh fuckballs" that's not good...

No, no it's not

DarkDiamond · Nov 13, 2012

I'm running StarWind for an iSCSI target. Prior to all of this, my RAID adapter started dropping drives out of my RAID set. Eventually I lost two drives on the RAID 6. I copied everything from that onto a different array. Even though I had the hosts offline and the LUNs unmounted during the copy, I wonder if the failing array scrambled the data somehow.

Thanks for replying, though.

I unmounted the LUNs after I posted those log entries, so I'll run this command later today once I remount them, and I'll post the results.

Olga-SAN · Nov 13, 2012

It's a good idea to use a 3-way mirror with StarWind, with either no hardware or software RAID on the individual nodes or just RAID 0.

In that case you just need to throw in a spare drive and re-initiate the StarWind sync process.

No RAID rebuild required --> much faster

DarkDiamond · Nov 13, 2012

I ran the commands you asked me to after I remounted one of the potentially corrupted datastores (choosing Keep Existing Signature this time). This was the result:

voma -m vmfs -f check -d /vmfs/devices/disks/eui.a8dd3a0ad3343162:1

Checking if device is actively used by other hosts
Found 1 actively heartbeating hosts on device '/vmfs/devices/disks/eui.a8dd3a0ad3343162:1'
1): MAC address 00:1b:21:6b:54:06
/vmfs/volumes/509eb251-055cd3b7-1e17-001b216b5406/Media # ls -ltr
-rw------- 1 root root 2147483648000 Oct 2 07:50 Media_2-flat.vmdk
-rw------- 1 root root 606 Nov 6 01:07 Media_2.vmdk
-rw------- 1 root root 8192512 Nov 10 00:54 Media_2-ctk.vmdk
/vmfs/volumes/509eb251-055cd3b7-1e17-001b216b5406/Media #

I'm able to see three files (the ones in the snippet above). However, I cannot add this vmdk file to another VM. When I try to browse to it in the "Browse Datastores" dialog box, it doesn't appear in the list to select. Given that I can see it via the command line, is it possible this isn't totally fubar? Any way I can force mount this disk to another VM to copy the data off of it?

lopoetve · Nov 14, 2012

I keep forgetting that - I never run it live...

So, you can shut down every VM and all but one host and run it, or you can do this:

dd if=/vmfs/devices/disks/eui.a8dd3a0ad3343162:1 of=/vmfs/volumes/yourlocalstoragepleasehavesome/eui.bin bs=1M count=1200

and then run voma on the resulting eui.bin file. Dump it to local storage though, not the array.

DarkDiamond · Nov 14, 2012

How big will this bin file be? Just want to make sure I have enough storage available...

lopoetve · Nov 15, 2012

DarkDiamond said:
How big will this bin file be? Just want to make sure I have enough of storage available...

1.2G.
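
That figure follows from the dd arguments: `bs=1M` is a block size of 1 MiB and `count=1200` is the number of blocks. A quick check of the arithmetic (the variable names are just for illustration):

```shell
# Size of the image written by: dd ... bs=1M count=1200
block_bytes=$((1024 * 1024))    # bs=1M means 1 MiB per block
block_count=1200                # count=1200
total_bytes=$((block_bytes * block_count))
echo "$total_bytes bytes"                    # 1258291200 bytes
echo "$((total_bytes / 1024 / 1024)) MiB"    # 1200 MiB
```

So roughly 1.2 GB of free space on local storage is needed, plus a little headroom.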

DarkDiamond · Nov 21, 2012

Sorry my response took a few days. Got slammed at work getting a bunch of software projects wrapped up before the holiday...

I ran VOMA against the volume. For some reason I wasn't able to run it against the eui.bin file you mentioned before. Here's a sample of what I saw:

Phase 1: Checking VMFS header and resource files
Detected file system (labeled:'Media 1') with UUID:509eb251-055cd3b7-1e17-001b216b5406, Version 5:54
ON-DISK ERROR: Cluster 3904 free count 243 should be 2
ON-DISK ERROR: Cluster 10183 free count 39 should be 53
ON-DISK ERROR: Cluster 10184 free count 100 should be 125
ON-DISK ERROR: Cluster 10185 free count 100 should be 125
Found stale lock [type 10c00002 offset 16787456 v 0, hb offset 4136960
gen 25, mode 0, owner 00000000-00000000-0000-000000000000 mtime 9338380
num 0 gblnum 0 gblgen 0 gblbrk 0]
ON-DISK ERROR: Cluster 10186 free count 100 should be 125
ON-DISK ERROR: Cluster 10187 free count 100 should be 125
Found stale lock [type 10c00002 offset 16789504 v 0, hb offset 4136960
gen 25, mode 0, owner 00000000-00000000-0000-000000000000 mtime 9338376
num 0 gblnum 0 gblgen 0 gblbrk 0]
ON-DISK ERROR: Cluster 10188 free count 100 should be 125
Found stale lock [type 10c00002 offset 16790528 v 0, hb offset 4136960
gen 25, mode 0, owner 00000000-00000000-0000-000000000000 mtime 9338382
num 0 gblnum 0 gblgen 0 gblbrk 0]
ON-DISK ERROR: Cluster 10189 free count 100 should be 125
ON-DISK ERROR: Cluster 10190 free count 100 should be 125
--More--

along with lots of these:
ON-DISK ERROR: <FD c62 r11> : Invalid linkCount 245
ON-DISK ERROR: <FD c62 r11>: invalid address PB2 cow 1 cnum 475473 rnum 15
ON-DISK ERROR: <FD c62 r11>: invalid address PB2 cow 1 cnum 475783 rnum 15

and literally thousands of lines similar to these:

Phase 5: Checking resource reference counts.
ON-DISK ERROR: PB inconsistency found: (775,1) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (775,2) allocated in bitmap, but never used
ON-DISK ERROR: PB inconsistency found: (775,3) allocated in bitmap, but never used

I'm guessing FUBAR is in order?

Dark Diamond
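
With thousands of lines, a count is easier to share than the full report. A rough sketch, assuming the VOMA output was saved to a text file and that awk is available where the file is read; `voma_summary` is a hypothetical helper and the two patterns are taken from the sample above:

```shell
# Summarize a saved VOMA report read from stdin:
# total ON-DISK ERROR lines and number of stale locks reported.
voma_summary() {
    awk '
        /ON-DISK ERROR/    { errors++ }
        /Found stale lock/ { locks++ }
        END { printf "errors=%d stale_locks=%d\n", errors, locks }
    '
}

# Example (file name is a placeholder): voma_summary < voma-check.txt
```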

Diakiaoshinsama · Nov 24, 2012

I agree, when he says "Oh fuckballs" that is really bad. I was in a similar situation to this a while back. I HAD to re-signature some of my files due to a SAN upgrade. And one of them had a Snapshot in place......

1 full day and a lot of sweat later I finally rebuilt the VM. This may not help you, but what I had to do was manually pull the files off and straighten out the signatures. As it was the VMX and VMDK files were pointing to the previous iSCSI/LUN signatures. Manually editing them in Notepad then re-uploading them I was able to fix them. Even then I still had some filesystem corruption that a chkdsk had to fix.

Lesson learned, re-signature at your own risk. Or as an excellent form of payback to someone you don't like.

Best of luck to ya!
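
For reference, the manual edit described above can be scripted instead of done in Notepad. This is only a sketch under assumptions: it presumes the stale references are absolute `/vmfs/volumes/<UUID>/` paths inside the .vmx or .vmdk descriptor, the helper name `repoint_datastore` is made up, and it should only ever be run on copies of the files:

```shell
# Rewrite references to the old datastore UUID in a copy of a descriptor file.
# Usage: repoint_datastore OLD_UUID NEW_UUID < old.vmx > new.vmx
repoint_datastore() {
    old_uuid=$1
    new_uuid=$2
    sed "s#/vmfs/volumes/$old_uuid/#/vmfs/volumes/$new_uuid/#g"
}
```

Lines that do not mention the old UUID pass through unchanged, so a diff of the two files shows exactly what was rewritten.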

DarkDiamond · Nov 24, 2012

Do you think that may solve my problem? The crappy thing is that the LUN I can't access is 2 TB in size. That will take a while to download, not to mention I'll need to buy a hard drive.

lopoetve · Nov 27, 2012

Nope, his was a VMX issue (resignaturing doesn't update the links in the files).

Yours is "your volume looks like spaghetti".

Restore from backup, and see if there are any errors on the transfer - something went wrong. That's unfixable.

danswartz · Nov 27, 2012

BOHICA
