Veeam/VMware VSAN backup issue

Discussion in 'Virtualized Computing' started by Kelvarr, Mar 13, 2019.

  1. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    Daily, I keep getting the following alert/warning when Veeam tries to back up a (powered off) VM located on our Horizon VDI VSAN cluster:

    "all stuck VM snapshot consolidation attempts have failed"

    I have checked the VM multiple times, and there aren't any snapshots on the VM. There were some the very first time I checked, but I deleted all of them (none were in use), and I continue to get the message.

    Does anybody have any ideas?
     
  2. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,673
    Joined:
    May 9, 2006
    I've not messed with Veeam; the only similar error I had before was because of an actual snapshot.

    Do you have more than one storage solution connected to the vm cluster? If so do a storage migration. That will clean up the file structure of the folder for you. Even migrate it right back once you're done and test again.
     
  3. Spartacus09

    Spartacus09 Gawd

    Messages:
    849
    Joined:
    Apr 21, 2018
    You may need to manually consolidate some of the VMs to see why the auto consolidation is failing.
    It could be orphaned snapshots, lack of space to consolidate, or locked by some external process.
     
  4. ND40oz

    ND40oz [H]ardForum Junkie

    Messages:
    11,234
    Joined:
    Jul 31, 2005
    Yeah, have you looked in the datastore browser to see if there are any left over delta files?
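    A rough sketch of what to look for when scanning a file listing copied out of the datastore browser. It assumes the usual vSphere naming convention where snapshot disks carry a six-digit suffix (for example `vm-000001.vmdk`, with `-delta.vmdk` or `-sesparse.vmdk` extents on VMFS). The file names below are made up, and a vSAN listing may look different, so treat the pattern as a starting point rather than a definitive check.

```python
import re

# Assumed naming convention: snapshot disks look like
# "<name>-000001.vmdk", "<name>-000001-delta.vmdk" or "<name>-000001-sesparse.vmdk".
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$", re.IGNORECASE)


def leftover_snapshot_files(filenames):
    """Return the names that look like snapshot (delta) disks."""
    return [name for name in filenames if SNAPSHOT_DISK.search(name)]


# Hypothetical listing; replace with the real contents of the VM folder.
listing = [
    "vdi-gold.vmx",
    "vdi-gold.vmdk",
    "vdi-gold-000001.vmdk",
    "vdi-gold-000002-sesparse.vmdk",
    "vdi-gold.nvram",
]

for name in leftover_snapshot_files(listing):
    print("possible leftover snapshot disk:", name)
```

    Anything this flags on a VM that shows no snapshots in Snapshot Manager is a candidate for manual consolidation or cleanup.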
     
  5. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,673
    Joined:
    May 9, 2006
    Again a storage migration of the VM will clean up the old files if you have multiple datastores available.
     
  6. k1pp3r

    k1pp3r [H]ardness Supreme

    Messages:
    7,803
    Joined:
    Jun 16, 2004
    My first question would be: Did you open a support case with Veeam and ask them?
     
  7. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    Unfortunately no. On this VSAN cluster, all space is consumed in a single datastore.

    I have. There are all the *.vmdk files left over from the snapshots, but not the snapshot files.

    No, not yet, but I may have to.

    The other option someone mentioned is cloning the VM, which should also get rid of the snapshots.
     
  8. ND40oz

    ND40oz [H]ardForum Junkie

    Messages:
    11,234
    Joined:
    Jul 31, 2005
    Have you taken another snapshot and then done a delete all snapshots yet?
     
  9. OFaceSIG

    OFaceSIG [H]ard|Gawd

    Messages:
    1,887
    Joined:
    Aug 31, 2009
    I doubt it's VSAN's fault but I'm staying away from VSAN lulz... That shit nuked Walmart's infrastructure. They are moving 100% to Nutanix for hyperconverged/software defined storage.
     
  10. Spartacus09

    Spartacus09 Gawd

    Messages:
    849
    Joined:
    Apr 21, 2018
    Nutanix/SimpliVity is basically the same thing as vSAN, just with a smaller HCL (though it does have some specialized hardware/software with secret sauce), but you pay out the ass for it: about 2x the cost of vSAN.
    I feel like all the hyper-converged stuff is way overpriced and limited in its use case. I'm sticking with traditional shared storage for now and enjoying my expandability and separation of storage and compute.
     
  11. OFaceSIG

    OFaceSIG [H]ard|Gawd

    Messages:
    1,887
    Joined:
    Aug 31, 2009
    For sure, if you can afford it, SAN or networked iSCSI storage is vastly more flexible. For sure.

    But saying Nutanix and vSAN are the same thing is an incorrect oversimplification. They are both SDS but that's where the similarity ends. vSAN is kernel level and Nutanix is a controller VM that resides on the host. Very different approaches. Both have their advantages and disadvantages.

    However, I don't know if you work for Netflix or someplace that's flush with cash but most places are moving away from SAN storage. Costs are simply too high. Software + local drives are cheaper. More complicated, but cheaper.
     
  12. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    Multiple times. The deltas go away, but the actual vmdk sticks around.
     
  13. Spartacus09

    Spartacus09 Gawd

    Messages:
    849
    Joined:
    Apr 21, 2018
    Nope, small-ish 200-person software company. I don't see the costs of traditional SAN being higher, even though each HCI node costs less, but I'm also internal IT, not hosting a product/site; our stuff is primarily demo/PS/dev.
    We just bought an 85TB Nimble array; with dedupe and compression it's about 150TB usable and yielding about 125k IOPS for our r/w usage.
    Two 48 port SFP+ switches and the array was under $100k (plus tax/ship/install), the comparable space amount for Nutanix was 330k in a 5 node cluster, AND we wouldn't be able to use our existing 5 racks of compute equipment easily (transitioning from local storage to shared for better efficiency and HA).
    Maybe you're thinking of all flash costs, but hybrid SAN is underpriced substantially compared to HCI from what I looked at and still offers 4-9s uptime capability if you need it (or even 6-9s with redundant clusters).
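    For anyone wanting to redo this comparison with their own quotes, a back-of-the-envelope sketch. It uses the figures above and makes two assumptions that are not stated exactly in the post: the array price is taken as the full $100k (it was "under" that), and the Nutanix quote is treated as covering the same 150TB of usable space.

```python
def cost_per_tb(total_cost_usd, usable_tb):
    """Cost per usable terabyte in USD."""
    return total_cost_usd / usable_tb


RAW_TB = 85        # raw capacity of the array
USABLE_TB = 150    # effective capacity after dedupe and compression

san_cost = cost_per_tb(100_000, USABLE_TB)   # upper bound, price was "under $100k"
hci_cost = cost_per_tb(330_000, USABLE_TB)   # assumes the quote covers the same 150TB

print(f"data reduction ratio: {USABLE_TB / RAW_TB:.2f}x")
print(f"SAN: ${san_cost:,.0f} per usable TB")
print(f"HCI: ${hci_cost:,.0f} per usable TB")
print(f"HCI / SAN: {hci_cost / san_cost:.1f}x")
```

    This ignores tax, shipping, install, support contracts and the compute bundled into the HCI nodes, so it only compares the storage line items.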
     
    Last edited: Mar 14, 2019
  14. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    We will be facing/exploring this issue in a couple of years.

    Simplivity is one that we have explored, as is VSAN, and yes, we are finding VSAN cheaper.

    Currently, we run a 3Par 7200 (no SSD shelf though :( ) SAN. It is insanely expensive to go to the next 3Par piece. We find that it isn't necessarily the hardware that makes it so expensive, but the mandatory service/maintenance contract. I'm not saying that I would want the contract to be anything less, but it certainly isn't cheap.
     
  15. OFaceSIG

    OFaceSIG [H]ard|Gawd

    Messages:
    1,887
    Joined:
    Aug 31, 2009
    Yes, my firm is all flash and we use dedicated fiber channel infrastructure. It was a premium play for our previous generation infrastructure.
     
  16. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,673
    Joined:
    May 9, 2006

    Yea it's stupid what we pay for support contracts. It damn near equals the cost of the storage (disk) we are loading our device with.

    We are putting in some all-flash Unity arrays currently and I'd much prefer to run diskless host nodes connected to a large SAN. Mixed VNX2s are our flavor of choice but are getting long in the tooth. In a couple of years we will be looking to replace those. At a quarter million a pop that won't be fun.
     
  17. ND40oz

    ND40oz [H]ardForum Junkie

    Messages:
    11,234
    Joined:
    Jul 31, 2005
    And it's already powered off, so you can't try shutting it down and attempting it. Do you have enough room to just clone it and delete the current one?

    As for VSAN costs, we can get NetApps and Violins cheaper, I don't understand the point unless you're going for rack density.
     
  18. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,673
    Joined:
    May 9, 2006

    It's more than rack density. (I don't like VSAN's either personally) It's power consumption as well. A dense rack of SSD's will take less overall power in a lot of cases than a dense rack of spindle. Of course the cost of ownership outside of power is much higher.

    But that's not what this thread is about! ;)
     
  19. Spartacus09

    Spartacus09 Gawd

    Messages:
    849
    Joined:
    Apr 21, 2018
    Ahhh, that's likely why. Yeah, we went the hybrid option and iSCSI with SFP rather than FC; 125k IOPS was plenty, 400+ would have been overkill.

    Something to note: Nutanix has a recurring cost to keep the device functional (not support/maintenance, that's separate), an actual opex you have to pay yearly (or 2-3 years at a time) for the server to work. Think Meraki's platform: you own the hardware but it doesn't work without paying them.

    HPE's support has been surprisingly reasonable on their switches. On storage they're a little more proud of it, but a lot more reasonable than NetApp's.
     
  20. Grimlaking

    Grimlaking 2[H]4U

    Messages:
    2,673
    Joined:
    May 9, 2006

    We use broadcom 6505's because all of our dedicated FC networks are within racks or spanning 4 racks at most. Don't need bigger or more complicated. Operating on a KISS method works well for us.
     
  21. ND40oz

    ND40oz [H]ardForum Junkie

    Messages:
    11,234
    Joined:
    Jul 31, 2005
    Yes, off topic, but I'm using NetApp AFF and Violin for VMware storage, there's no spindles involved. Once you make the jump to all flash the only thing you'll want to use spindles for is file shares and backups. VSAN licensing and paying per CPU only seems to make it even sort of cost effective in very small deployments.
     
  22. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    All you guys that don't like VSAN's, please explain why.

    As I said, or perhaps didn't say clearly enough...we actually have both a traditional SAN (for regular fileserver storage, and for our vSphere infrastructure) and VSAN (for our Horizon VDI). I have no preference as of yet. However, like I said, we will be looking to replace this in a year or two. $$$ may drive all decisions here, but at least I have them listening to my reasoning for a more expensive solution (sometimes). Plus, I really would like to know because our preferred vendor pushes VSAN hard over traditional SAN. They'll do either, but push VSAN.
     
  23. Spartacus09

    Spartacus09 Gawd

    Messages:
    849
    Joined:
    Apr 21, 2018
    VDI definitely makes more sense to do a vsan/HCI model since you can calculate your storage to compute scaling easier, make it match to the node size you purchase, and add shared GPU capability much easier than a traditional SAN.
    Personally, I would recommend getting your vendor to give you the breakdown of the differences between the vSAN route and the various HCI options (Nutanix/SimpliVity) and which one is best for your use case/environment.
     
  24. ND40oz

    ND40oz [H]ardForum Junkie

    Messages:
    11,234
    Joined:
    Jul 31, 2005
    Look at your line item costs for VSAN licensing and all the drives you're buying to support it, not to mention the overhead it uses on your VM hosts. Then determine what you can buy in traditional storage for that amount.

    Did you end up cloning the VM and deleting the original?
     
  25. Kelvarr

    Kelvarr 2[H]4U

    Messages:
    4,091
    Joined:
    Jul 19, 2001
    Not quite yet. I cloned the VM (which has gotten rid of the snapshots), but haven't deleted the original yet. A) There is a software update that is due out tomorrow for one of our important suites, so I have to upgrade and push the VDI image back out anyway. B) I was notified 19 minutes before a meeting that I had to be in a 3 hour meeting. C) Our VDI decided to take a shit and not provision any images, so our remote users were dead in the water....so I had to fix that.
     
  26. Child of Wonder

    Child of Wonder 2[H]4U

    Messages:
    3,269
    Joined:
    May 22, 2006
    I work for an array vendor so my input is biased, but HCI and VSAN in particular just isn't the panacea it claims to be. Long winded post coming up....

    1. Simplicity - I used to manage HP EVAs, EMC VNXs, and other storage products back in the day. They were a management headache and it required tons of planning, balancing, and re-balancing to get a reasonable mix of availability, capacity, and performance out of a system. But inevitably you worked yourself into a corner where the only option was to blow the whole thing up, buy new, and start over again. HCI has these same problems. You start with a nice, pristine environment of nodes and you go all in on performance by choosing RF2/FTT1 with no data services (perhaps compression and encryption if they don't affect performance), but you risk availability: you can support one node offline unexpectedly, but you can't deal with two without issues. You also have to evacuate data from a node when you put it into maintenance mode to preserve RF2/FTT1, which means maintenance takes more time and floods the network with data traffic. Choosing RF2/FTT1 also means every data object takes up double the space or, if you decide RF2/FTT1 isn't for you and choose RF3/FTT2, triple the space. Do you turn on Erasure Coding or Deduplication to save space? What's the performance impact? Maybe you have one cluster that's almost no data services with RF3/FTT2 for high performance and availability but you sacrifice capacity (maybe a little performance), another cluster with RF2/FTT1 and some data services but you sacrifice availability, and another cluster with all data services enabled plus Erasure Coding and Deduplication to save on space, but you sacrifice performance.

    Sounds to me like when I used to create RAID 10 Storage Pools for high performance workloads, RAID 5 for OK performance workloads but a little space savings, and RAID 5/6 pools with more data services for low performance and maximum space savings. Then when one Storage Pool would run out of space or performance I'd have to expand, re-balance, etc. but with HCI you're adding compute, application socket licensing, and networking as factors to consider when making storage decisions PLUS when you expand you need to take into account that as existing nodes approach 5 years old they're going to go away and be replaced so make sure the nodes you're expanding with will play nice with whatever you plan to buy to replace your original footprint of nodes when it's time to replace them.

    2. Performance - As I mentioned above, a lot of decision making has to go into maximizing performance, availability, and capacity and you'll have to sacrifice on one to improve the other two but even when performance is a focus, what is the impact of using the same compute your VMs use for storage? How is CPU Ready affected when the kernel or CVM is chewing up CPU cycles to deliver on storage performance? How is performance impacted during maintenance or a node/component outage? Even if the platform self heals by relocating data segments as fast as possible what impact does that have when you have a performance SLA to the business that requires a lot of horsepower itself? If your platform uses data locality, can you base SLAs off that and if so wouldn't that mean something as simple as a VMotion should be done during a maintenance window because it will have a storage performance impact?

    3. Cost - Now you have to begin taking application core licensing into account with HCI because you're going to license cores used for storage. Depending on how you architect your HCI, you may need more network ports. There's obviously HCI licensing and support, too. To top that off, you're also going to have to continually replace nodes every 3-5 years throwing away compute and storage at the same time. Do the original HCI software licenses transfer to the new nodes? Even if your HCI platform can migrate the data around to make retiring and adding nodes easy to do, you're still stuck in a perpetual merry go round of node replacement, every generation potentially having a new CPU family, different networking, and different storage components.

    I'm not saying HCI is a bad infrastructure decision but the marketing hype doesn't live up to the reality from the customers I've spoken to and there's much, much more thought that has to go into properly architecting an HCI solution, not to mention managing it, and it rarely proves to be cost effective. Be careful choosing to go down the HCI path and make sure it's because it truly suits your business and technical requirements better than the alternative and not just because you think HCI is cool.
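    To put numbers on the capacity trade-off from point 1 (RF2/FTT1 doubles the footprint of every object, RF3/FTT2 triples it), here is a minimal sketch. It assumes plain mirroring only and ignores slack space, metadata, erasure coding, and dedupe/compression; the 100TB raw figure is a made-up example.

```python
def usable_capacity_tb(raw_tb, copies):
    """Usable capacity when every object is stored `copies` times (simple mirroring)."""
    return raw_tb / copies


RAW_TB = 100  # hypothetical raw capacity across the cluster

# Number of full copies kept under each policy, per the description in point 1.
POLICIES = {"RF2/FTT1": 2, "RF3/FTT2": 3}

for name, copies in POLICIES.items():
    usable = usable_capacity_tb(RAW_TB, copies)
    print(f"{name}: {copies} copies -> about {usable:.1f} TB usable of {RAW_TB} TB raw")
```

    Real clusters will land below these figures once rebuild headroom and metadata are accounted for, which is part of why the sizing exercise takes more thought than the marketing suggests.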
     
  27. schizrade

    schizrade [H]ardness Supreme

    Messages:
    4,668
    Joined:
    Feb 15, 2003
    OP, does vSAN report any "other" objects in your capacity usage? If so, those objects need to be manually deleted. I had the same issue where Veeam got flipped around somehow and left a few TB of disassociated garbage in the datastore. Something to look at. Cleaning them out isn't too challenging and is a pretty good learning exercise.

    [Attached image: vsan usage.PNG]
     
  28. Easius

    Easius Limp Gawd

    Messages:
    345
    Joined:
    Jan 1, 2009
    I had a weird snapshot cleanup issue with Veeam, and after working with support, oddly enough, removing the attached virtual CD/DVD drive resolved the issue on the VMs. I just ran a final consolidation and it has been good since. It was a template I was seeing the same failures with across VMs.