VMware vSAN - Single Disk Noncompliant

gimp

Just for funsies, and to see what this vSAN thing is all about, I decided to build a 2-node VMware vSAN cluster.
Hosts are 8th-gen NUC i7s with 32GB RAM, 1x 250GB NVMe, and 1x 2TB SSD each, booting off old USB thumb drives.

Took a bit to figure out how to get it all working, but I managed.
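For anyone trying the same thing, a quick sanity check that the 2-node (stretched) configuration and witness actually registered can be done with PowerCLI's vSAN cmdlets. A minimal sketch, assuming an existing Connect-VIServer session; the cluster name is just my lab's:

```powershell
# Minimal check of the vSAN cluster configuration; 'vSAN-Cluster' is a placeholder name.
# For a 2-node deployment the output should show vSAN enabled, stretched-cluster mode on,
# and the witness host that was deployed for it.
Get-VsanClusterConfiguration -Cluster 'vSAN-Cluster' | Format-List
```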

Only one snag, which I don't think is directly related to the 2-node setup (with witness appliance).

The VMware vSAN witness appliance has 3 disks.
Currently, 1 of those 3 disks is out of compliance with the storage policy.
That one disk is reported as RAID 0 while everything else is RAID 1, even though the policy specifies RAID 1.
All of the disks in the VCSA are properly reported as RAID 1.

I've tried creating a new storage policy and applying it, but that doesn't change anything.
I also tried the "Repair Objects Immediately" option from the vSAN Health section, under the vSAN cluster's Monitor tab.

Nothing seems to want to correct it.
I've done a whole lot of searching, but everything I find is related to all or multiple disks in a noncompliant state. In those cases, it was usually creating and applying a new policy that got things working.
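In case it helps anyone reading along, the same check-and-reapply can also be done from PowerCLI with the SPBM cmdlets. A rough sketch, where the VM and policy names are just examples from my lab:

```powershell
# List any VM disks that are not compliant with their assigned storage policy
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration |
    Where-Object { $_.ComplianceStatus -ne 'compliant' } |
    Select-Object Entity, StoragePolicy, ComplianceStatus

# Re-apply a policy to the witness appliance's disks
# ('vSAN Default Storage Policy' and 'vSAN-Witness' are placeholder names)
$policy = Get-SpbmStoragePolicy -Name 'vSAN Default Storage Policy'
Get-VM -Name 'vSAN-Witness' | Get-HardDisk |
    Get-SpbmEntityConfiguration |
    Set-SpbmEntityConfiguration -StoragePolicy $policy
```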

Any tips? What might I be doing wrong? A bug in something else?
I have already updated everything with the latest patches (2 physical hosts, witness host, and VCSA).

(attached screenshots: gs012865.png, gs012866.png, gs012867.png)
 
Well, never mind.
vCenter crashed, and now all VMs show an "unknown" status and their names are just single-digit numbers.
Browsing the vSAN datastore isn't showing the VCSA or vSAN witness appliance VMDKs on either host.
Neat...
Time to rebuild.
 
This is why I avoid VSAN like the plague.

I've heard a fair amount of good about it.
And really, I'm just using the 60-day trial to test it out because I was curious. I also haven't gotten any experience with vSphere 6.7 yet, as we are still on 6.0 due to other infrastructure compatibility pieces that we need to upgrade first.

since "hyperconverged" is the latest fad, I wanted to check it out.
and I don't have anything capable of shared storage atm; or, well, current setup/physical placement of devices makes it not very capable.
 
Gotta say, maybe it's because I'm not on 10GbE yet, running on NUCs, and it's only a 2-node stretched cluster with a nested witness host, but... none of that should be causing these issues.
It happened again this morning, and it only seems to occur after storage vMotioning a single disk from an NFS share back to the vSAN datastore. And it doesn't happen until after the storage vMotion actually completes.
The 2 nodes are connected to the same switch and I doubt I'm saturating 1Gb.
Both nodes show similar issues, while there are no issues connecting to the nodes themselves.

(attached screenshot: gs012868.png)


Not terribly impressed.
 
Why is the SCSI controller showing yellow?

I assume you are running this on 6.7 based on the screenshots. Those are nicer than 6.0 and 6.5.
 
Why is the SCSI controller showing yellow?

I assume you are running this on 6.7 based on the screenshots. Those are nicer than 6.0 and 6.5.

The controller isn't on the HCL, or at least not certified.
 
So today... I think I destroyed vSAN again, by powering both hosts off.
Per VMware, everything should get re-established:
https://blogs.vmware.com/virtualblocks/2018/07/13/understanding-ve-booting-w-vc-unavailable/
Sadly, it did not. Or has not.
Even after sitting for over 30 minutes.
Each host did eventually report vSAN datastore size and usage, but the data reported was what the individual host was providing, not the total.
VMs are showing back up with names that are just numbers and a status of invalid.
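For anyone wanting to check the same thing from the host side, the usual esxcli vSAN namespaces are reachable from PowerCLI via Get-EsxCli. A minimal sketch, with the host names being placeholders for my two NUCs:

```powershell
# Check each host's view of the vSAN cluster after the power cycle
foreach ($vmhost in Get-VMHost -Name 'nuc1.lab.local', 'nuc2.lab.local') {
    $esxcli = Get-EsxCli -VMHost $vmhost -V2

    # Cluster membership/role as this host sees it (same as 'esxcli vsan cluster get')
    $esxcli.vsan.cluster.get.Invoke()

    # Disks this host is contributing to the vSAN datastore
    # (same as 'esxcli vsan storage list')
    $esxcli.vsan.storage.list.Invoke()
}
```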

Powered my hosts off so I could validate up-to-date firmware on the Corsair M.2 and the SanDisk SSD.
Had to use the hotswap bays on my media/fileserver for the SanDisk.
Ran the "long" test with the SanDisk SSD software. Both came back clean.
The firmware is also up to date.
Firmware on the Corsair M.2 is up to date as well.

Yet on one host...
event.WARNING - VSAN device t10.ATA___SanDisk_SDSSDH32000G_______________183795801332_____ is degrading. Consider replacing it..fullFormat (WARNING - VSAN device t10.ATA___SanDisk_SDSSDH32000G_______________183795801332_____ is degrading. Consider replacing it.)

So now I'm trying to run a test on the SSD while it's in the NUC; it was a bit problematic figuring out what/how to do that, as I would prefer to use a live boot disc so I don't have to blow it all away again.

Currently running a badblocks read-only test off a GParted live image.
 
When do you call it quits, here? :)

Well, the smartctl long test passed on the SSD.
Now running the badblocks non-destructive read-write test.
If that passes, then I'd have to say the issue is compatibility-related, and I'd be calling it quits with the vSAN attempts for now.
 
Well, I just fucked up.
I forgot I had disabled the vmw_ahci module on host 1, but never rebooted it.

Oops.

vSAN is back up.
Still not sure what's going on with host 2.
Even the badblocks non-destructive read-write test came back with 0 errors, but the host is still barking about the drive "degrading".
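For reference, checking and toggling the module from PowerCLI looks roughly like this (host name is just a placeholder); the catch is that an enable/disable only takes effect at the next boot, which is what bit me:

```powershell
$esxcli = Get-EsxCli -VMHost 'nuc1.lab.local' -V2   # placeholder host name

# Current state of the module (same as 'esxcli system module list')
$esxcli.system.module.list.Invoke() | Where-Object { $_.Name -eq 'vmw_ahci' }

# Disable it (same as 'esxcli system module set -m vmw_ahci -e false');
# the host still needs a reboot before the change actually applies
$esxcli.system.module.set.Invoke(@{ module = 'vmw_ahci'; enabled = $false })
```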
 
Dude, VSAN is steaming garbage. Just quit banging your head against a wall and buy a cheapo used Synology box or build a FreeNAS box. You'll have far fewer sleepless nights. Trust me, you've only just started to find all the problems and idiosyncrasies VSAN has to offer.
 
Dude, VSAN is steaming garbage. Just quit banging your head against a wall and buy a cheapo used Synology box or build a FreeNAS box. You'll have far fewer sleepless nights. Trust me, you've only just started to find all the problems and idiosyncrasies VSAN has to offer.

Considering the host itself is reporting issues with my disk but no testing reports any issues, methinks it's not vSAN-related.
And, as I want to get some experience with it just because, well... your solution is not a solution.
 
Dude, VSAN is steaming garbage. Just quit banging your head against a wall and buy a cheapo used Synology box or build a FreeNAS box. You'll have far fewer sleepless nights. Trust me, you've only just started to find all the problems and idiosyncrasies VSAN has to offer.

You misspelled 'idiotsyncrasies' :)
 
Suggesting vSAN is garbage because it doesn't run properly on hardware it was never intended to run on does not make any sense... I have hundreds of servers running vSAN and it works out quite well. There are caveats to its use, like with any solution. This particular use case is not the one vSAN was intended for.
 
Dude, VSAN is steaming garbage. Just quit banging your head against a wall and buy a cheapo used Synology box or build a FreeNAS box. You'll have far fewer sleepless nights. Trust me, you've only just started to find all the problems and idiosyncrasies VSAN has to offer.
We have a few dozen PBs of vSAN in production, and a few hundred more TB in labs without any real issues.
 
We have a few dozen PBs of vSAN in production, and a few hundred more TB in labs without any real issues.

If you don't mind me asking, what is the maximum number of nodes per cluster you use with vSAN?
 
Related to my topic... :p

Since I destroyed the vSAN cluster, node 2 no longer complains of "degrading" health on the SSD.

Ended up piecing together some old drives and a mobo and loaded up FreeNAS until I can build myself a proper NAS.
Was hoping to hack together my old MediaSmart Server, but it looks like I'll need a proper breakout cable for video out, since the thing completely fails to boot (possibly even POST?) with a bootable thumb drive.

Gotta lab up a project I have coming up, and our prod environment is our test environment at work, so I'm breaking down the 6.7 stuff.
I'd rather go through trials and tribulations on my garbage first.
 
How long does a 32-node vSAN cluster take to upgrade?

Define upgrade? I'm actually not even sure I know, really, as our automation just rolls through when we tell it to. Next time we upgrade, if I remember to, I can see how long a single node takes and extrapolate that out, but how long a thing takes isn't a metric I particularly care to concern myself with :D
 
Define upgrade? I'm actually not even sure I know, really, as our automation just rolls through when we tell it to. Next time we upgrade, if I remember to, I can see how long a single node takes and extrapolate that out, but how long a thing takes isn't a metric I particularly care to concern myself with :D

Ahhh, automation explains it.

Like telling VUM to go and upgrade the entire cluster from 6.0 to 6.5/6.7 U1.
 
We also do not use VUM. All host updates are automated and take between 30 and 50 minutes per host. We have ~10 or so hosts upgrading at any given time.

A 32-node cluster would take an entire day to upgrade? Is this with moving data as each host goes into maintenance mode? All-flash or hybrid? What FTT? Any data services enabled? What's the baseline latency and how is it affected with hosts going offline?
 
A 32-node cluster would take an entire day to upgrade? Is this with moving data as each host goes into maintenance mode? All-flash or hybrid? What FTT? Any data services enabled? What's the baseline latency and how is it affected with hosts going offline?
Like I said, that was a gut guess, as I'm not really sure on the timing. We don't sit and watch things. If it takes an hour or a day to upgrade a cluster, it doesn't matter to us.
When we place a host into maintenance mode for an upgrade, we do a full data evacuation, not ensure accessibility.
FTT is 2. All-flash. No data services.
Latency on what? Disk I/O? If so, zero impact.
What do you use? PowerCLI?
Most of our automation is in PowerCLI, but we also use a bit of Python in places for this.
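For anyone curious what that maintenance-mode step looks like in PowerCLI, it boils down to something like the sketch below; the host name is a placeholder, and the real automation wraps a lot more around it (pre-checks, the actual update, health checks afterwards):

```powershell
# Enter maintenance mode with a full vSAN data evacuation (not just 'ensure accessibility')
$vmhost = Get-VMHost -Name 'node01.example.local'   # placeholder host name
Set-VMHost -VMHost $vmhost -State Maintenance -VsanDataMigrationMode Full

# ... apply the host update here with whatever tooling you use ...

# Bring the host back into service
Set-VMHost -VMHost $vmhost -State Connected
```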
 
Related to my topic... :p

Since I destroyed the vSAN cluster, node 2 no longer complains of "degrading" health on the SSD.

Ended up piecing together some old drives and a mobo and loaded up FreeNAS until I can build myself a proper NAS.
Was hoping to hack together my old MediaSmart Server, but it looks like I'll need a proper breakout cable for video out, since the thing completely fails to boot (possibly even POST?) with a bootable thumb drive.

Gotta lab up a project I have coming up, and our prod environment is our test environment at work, so I'm breaking down the 6.7 stuff.
I'd rather go through trials and tribulations on my garbage first.


Stupid question cuz I'm like that.. hah

I've only done ESXi 6.7 with local storage and have no other experience with ESXi, but... now the dumb question:
Is it the FreeNAS iSCSI that you're connecting to? I gotta read up on vSAN...
 
Stupid question cuz I'm like that.. hah

I've only done ESXi 6.7 with local storage and have no other experience with ESXi, but... now the dumb question:
Is it the FreeNAS iSCSI that you're connecting to? I gotta read up on vSAN...

No, just using it for NFS storage.
One of these days I want to do iSCSI boot for the ESXi hosts. But... I need a proper NAS. Not the thing I hobbled together that has garbage performance.
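In case anyone wants the recipe, mounting the FreeNAS NFS export on the hosts is basically a one-liner per host in PowerCLI; the IP, export path, and datastore name below are made up for the example:

```powershell
# Mount the same NFS export on every host so it shows up as a shared datastore
foreach ($vmhost in Get-VMHost) {
    New-Datastore -Nfs -VMHost $vmhost -Name 'freenas-nfs' `
        -NfsHost '192.168.1.50' -Path '/mnt/tank/vmstore'
}
```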
 
Not the thing I hobbled together that has garbage performance.


Define hobbled? I'm going to do a FreeNAS box and I'm curious what you consider so... I just posted a thread in Virtualized Computing on my hardware looking for thoughts on what to use for what...
 
Define hobbled? I'm going to do a FreeNAS box and I'm curious what you consider so... I just posted a thread in Virtualized Computing on my hardware looking for thoughts on what to use for what...

Spare parts I had lying around:
4 HDDs, 3 different sizes (2TB, 3TB, 4TB)
Celeron
8GB RAM
 
Got another NUC, so I'm at 3 now.
Recently-ish picked up a Ubiquiti ES-16-XG 10Gb switch.
Picked up 3x QNAP QNA-T130G1S (TB3 to SFP+) adapters, as Aquantia does have ESXi drivers that support the AQC-100 chip.
Rebuilt my homelab, using vSAN.
No issues.

Figured out how to silence the compliance pieces. There's the Ruby vSphere Console (RVC), where you can silence the health checks (along with a lot of other things).
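Roughly, from RVC on the vCenter appliance; the cluster path and check ID below are illustrative, and the status command lists the real IDs for your cluster:

```
# Log in on the VCSA with: rvc administrator@vsphere.local@localhost
# <cluster-path> is something like /localhost/<datacenter>/computers/<cluster>

# Show the health checks and which ones are currently silenced
vsan.health.silent_health_check_status <cluster-path>

# Silence a check by its ID (e.g. the controller-on-HCL check)
vsan.health.silent_health_check_configure <cluster-path> -a controlleronhcl
```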
 