Thoughts on VMware Virtual SAN

peanuthead · Feb 8, 2014

Got it. Thanks for clarifying.

Thuleman · Feb 8, 2014

KapsZ28 said:
There are multiple 1GbE connections and storage, production, and management are all separate. But as far as storage, each ESXi host is only utilizing a 1GbE back to the NetApp. So as far as read/write and IOPS, we are a bit lacking due to that bottleneck.

Let's just disregard the obvious lack of network redundancy to your storage array, but are you actually saturating that GbE link? I assume that your vMotion traffic does have its own GbE as well or is that running on the same link as your iSCSI traffic?

KapsZ28 · Feb 8, 2014

Thuleman said:
Let's just disregard the obvious lack of network redundancy to your storage array, but are you actually saturating that GbE link? I assume that your vMotion traffic does have its own GbE as well or is that running on the same link as your iSCSI traffic?

There is redundancy. What I mean is there is only one VIF on the NetApp for storage and we are using NFS. So if you look at the traffic from the ESXi host, almost all storage traffic goes through 1 NIC and not the other. vMotion is on the same network as storage, but DRS is only partially automated and we don't move many VM's around during the day. No iSCSI on this setup.

I am not a network guy, but looking at the graphs I don't think there is any saturation.

This is a 1 month graph of throughput from a single 1GbE NIC on one of the ESXi hosts.

No more than 250Mbps of utilization.

And this is a 1 month graph of the VIF on the NetApp.

Last month it seemed to average about 300Mbps.
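
For a rough sense of headroom, the two figures above can be expressed as a share of link capacity. This is only a minimal Python sketch: the function name is made up for illustration, and it assumes both the host NIC and the NetApp VIF sit on a single 1GbE (1000 Mbps) link, which the post only states for the host side. Monthly averages also hide short bursts, so a low number here does not rule out brief saturation.

```python
def link_utilization_pct(throughput_mbps: float, link_mbps: float = 1000.0) -> float:
    """Return throughput as a percentage of the link's nominal rate."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    return 100.0 * throughput_mbps / link_mbps

# Figures quoted in the post: about 250 Mbps on one host NIC,
# about 300 Mbps average on the NetApp VIF (assumed here to be 1GbE as well).
host_nic = link_utilization_pct(250)
netapp_vif = link_utilization_pct(300)
print(f"host NIC: {host_nic:.0f}%, NetApp VIF: {netapp_vif:.0f}%")
```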

I don't know, maybe it is a NetApp issue. We have two FAS3240 heads setup in HA. There are three shelves. One shelf is 22 - 15k drives setup in a single aggregate with dual parity. The second shelf is 22 - 7,200 RPM drives setup in a single aggregate with dual parity. And the third shelf is also 22 - 7,200 RPM drives setup in a single aggregate with dual parity. With two spares per shelf of course.

There are 15 ESXi hosts and about 220 VM's. From what I was told, this NetApp should handle 500 VM's without a problem. But hey, I am not a storage expert.

I do find it interesting that when we do vMotion to another datastore it is somewhat slow, especially if you stay within the same aggregate. Which makes me wonder: if you are vMotioning a VM to another volume on the same shelf, does it still go through the network, or is it done internally on the NetApp?

I also find it interesting how long it takes to install an OS like Windows Server 2008 R2. I can install an OS faster using a USB 2.0 thumb drive on a physical server than on one of our hosts. My personal ESXi server uses local storage, which is so much faster than the NetApp performance we are seeing now.

So would you think we have a networking issue, NetApp issue, or just lots of issues? Would really suck to dump all this money into 10GbE and not see any performance gains.

Vader · Feb 9, 2014

Typically we find that bandwidth is rarely the problem. It's usually a misconfiguration somewhere or, simply put, your disk can't keep up with the workload demand.

Whoever told you how many VMs you can run on your current storage without knowing the workload requirements should find a new occupation. Is that harsh? Probably, but think about the risk that statement creates. How the hell can anyone say that without knowing the workload? I've seen environments that require tens of thousands of IOPS on a three tier application with just three VMs, and your storage couldn't even come close to delivering that. It all depends on the workload. This is why a thorough assessment should be done.

Let's move on to your network. For Ethernet-based storage in vSphere on 1Gb you should have at least 6 NICs, preferably 8, but 6 will allow you to dedicate uplinks to specific traffic. In a 6 NIC setup on standard vSwitches:

vSwitch 0: Mgmt and vMotion, 1 NIC active and 1 NIC standby for each, on opposite vmnics

vSwitch 1: NFS. We can talk more about "balancing" load, but at least have two NICs to accommodate failure

vSwitch 2: VM port groups
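
The layout above can be written down as data and sanity-checked. This is a sketch only: the vmnic names are hypothetical, and the even split of two uplinks per vSwitch is an assumption inferred from "6 NICs" across three vSwitches rather than something stated outright.

```python
# Hypothetical uplink names (vmnic0..vmnic5); only the three-vSwitch split comes from the post.
LAYOUT = {
    "vSwitch0": {"traffic": ["Mgmt", "vMotion"], "uplinks": ["vmnic0", "vmnic1"]},
    "vSwitch1": {"traffic": ["NFS"], "uplinks": ["vmnic2", "vmnic3"]},
    "vSwitch2": {"traffic": ["VM port groups"], "uplinks": ["vmnic4", "vmnic5"]},
}

# Mgmt and vMotion share vSwitch0 but prefer opposite NICs, each standing by for the other.
TEAMING_VSWITCH0 = {
    "Mgmt": {"active": "vmnic0", "standby": "vmnic1"},
    "vMotion": {"active": "vmnic1", "standby": "vmnic0"},
}

def check_layout(layout):
    """Return a list of problems: vSwitches with fewer than two uplinks, or reused NICs."""
    problems = []
    seen = set()
    for name, cfg in layout.items():
        uplinks = cfg["uplinks"]
        if len(uplinks) < 2:
            problems.append(f"{name}: fewer than two uplinks, no failover")
        for nic in uplinks:
            if nic in seen:
                problems.append(f"{name}: {nic} is already used by another vSwitch")
            seen.add(nic)
    return problems

print(check_layout(LAYOUT))
```

A host that pushes nearly all storage traffic through one NIC would still pass this check; it only verifies that a failover path exists, not that load is balanced.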

You need someone to come in to do an assessment in your environment with industry standard tools AND to understand your business growth patterns and requirements.

KapsZ28 · Feb 9, 2014

Vader said:
Typically we find that bandwidth is rarely the problem. It's usually a misconfiguration somewhere or, simply put, your disk can't keep up with the workload demand.

Whoever told you how many VMs you can run on your current storage without knowing the workload requirements should find a new occupation. Is that harsh? Probably, but think about the risk that statement creates. How the hell can anyone say that without knowing the workload? I've seen environments that require tens of thousands of IOPS on a three tier application with just three VMs, and your storage couldn't even come close to delivering that. It all depends on the workload. This is why a thorough assessment should be done.

Let's move on to your network. For Ethernet-based storage in vSphere on 1Gb you should have at least 6 NICs, preferably 8, but 6 will allow you to dedicate uplinks to specific traffic. In a 6 NIC setup on standard vSwitches:

vSwitch 0: Mgmt and vMotion, 1 NIC active and 1 NIC standby for each, on opposite vmnics

vSwitch 1: NFS. We can talk more about "balancing" load, but at least have two NICs to accommodate failure

vSwitch 2: VM port groups

You need someone to come in to do an assessment in your environment with industry standard tools AND to understand your business growth patterns and requirements.

True. I don't think anything was ever really assessed other than the CTO picking what he thought was best and I don't believe he had any experience with NetApp prior to this. Unfortunately they are all about new technology that we are not familiar with and we usually just implement and figure it out on our own.

As far as workload, we host just about anything in our environment. I would say most VM's are Linux. We host a few big websites that have quite a few DB servers running on the NetApp.

Probably our biggest complaints about performance come from a company that has a RDS Farm hosted through us. They are basically using it as a desktop replacement, so when logged into the servers they will be running Outlook (which is hosted by Office365) web browsing, etc. Possibly even trying to watch a video through the RDS server. Besides having a dedicated connection broker and gateway server, there are 5 RDS Farm members each with 8 vCPU and 32 GB of memory. A typical load would be about 25 users per server.

How can I really tell if the issue is with the drives being over utilized? Here are a couple of monitoring samples of one of the main volumes.

Read and write latency.

IOPS.

Vader · Feb 9, 2014

Can you adjust the graphs to show just a 24 hour period, specifically around January 12th and 20th?

Latency spikes are pretty common as long as they are just short bursts. Typically when you have some latency in the 30+ ms range over a longer period of time you could experience poor performance. This can be an indicator that for that time frame, the workload is requiring more than the disk can provide.
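
To make the "short burst versus sustained" distinction concrete, here is a minimal Python sketch that scans a series of latency samples for runs at or above 30 ms. The function name, the sample values, and the choice of six consecutive samples as "a longer period" are all assumptions for illustration; the right run length depends on how often your monitoring tool samples.

```python
def sustained_high_latency(samples_ms, threshold_ms=30.0, min_consecutive=6):
    """Return (start, end) index pairs of runs where latency stays at or above
    threshold_ms for at least min_consecutive samples in a row. end is exclusive."""
    runs = []
    start = None
    for i, value in enumerate(samples_ms):
        if value >= threshold_ms:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i))
            start = None
    # Close a run that is still open when the samples end.
    if start is not None and len(samples_ms) - start >= min_consecutive:
        runs.append((start, len(samples_ms)))
    return runs

# Made-up samples: one isolated spike (ignored) and one six-sample plateau (reported).
samples = [5, 45, 6, 7, 35, 40, 38, 33, 31, 36, 8, 4]
print(sustained_high_latency(samples))
```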

Also..this is one datastore you are saying?..Not really seeing the big picture...this is why you need a tool that can provide that. Something like DPACK from Dell for a quick and dirty look..but there are better tools to assess periods longer than 7 days.

One more thing..this is going way off topic from the original post..don't want to derail from the OP's original discussion. Maybe start a new post?

KapsZ28 · Feb 9, 2014

Here is a bit more detail on those dates. Yes, this is a single datastore. There are 5 total datastores for this aggregate and this is the aggregate using the 15k RPM drive shelf. Other than the load balancing of NFS, is there any benefit to having multiple datastores on a single aggregate for 1 shelf? I mean, they are all using the same hard drives anyway. I would prefer the single datastore to benefit more with dedupe. Once we go with 10GbE, I don't think the load balancing of NFS datastores will be as important.

Jan 12

Jan 20

KapsZ28 · Feb 9, 2014

I am the OP of this thread. And it is only sort of off topic. This is one of the reason why I am asking about vSAN. If part of our performance problem is with the NetApp, I am wondering if local storage with vSAN may increase performance.

Vader · Feb 9, 2014

Oh yeah...lol..

The latency spikes appear to be bursts..but it's over a 4 hour period it seems...that would concern me. I would want to know what was causing that.

From a vSphere perspective, you can get realtime perf stats via esxtop as well as the Performance tabs in the thick client or Web Client. You should be able to drill down into the specific workload that's causing the latency when it occurs. Do you have vCOPS installed? Foundation is free..you can at least leverage it for environment health.

Again...get a full assessment to get a bigger picture.

I wouldn't worry about looking at additional technologies without knowing what you have, what you can deliver, and what you need. You have invested in NetApp. At the end of the day, it may make more sense to keep that investment and add whatever is required.

Going from a Frame Based Solution to VSAN is a complete architectural change, so be prepared to either re-purpose the NetApp for something else or get rid of it altogether, and this includes compute. I recall that you have also invested in Dell Blades. This means that you would be moving from a Semi-Converged Architecture where you scale up, meaning you add to what you have (disk shelves, blades, etc.), to a Hyper-converged Architecture where you scale out by adding additional "nodes", like compute and storage Legos.

KapsZ28 · Feb 9, 2014

I will see what I can get. The Dell blades are in a different datacenter using a Dell PowerVault SAN. That is 10GbE, but nowhere near the load. Performance is much higher at the moment.

Part of the decision to stick with NetApp or switch will also affect what new servers we purchase. So we are kind of trying to figure out what is the next step. Plans for upgrading to 10GbE are almost definite, but like I said, I would hate for us to blow all that money and not see much of a gain.

Child of Wonder · Feb 9, 2014

Looks like latency spikes around noon and midnight each day. Anything running during those times? AV scans? Backups? Replication? Batch jobs?

Also, check to make sure none of your VMs are swapping. That puts a good deal of extra load on your datastores they don't need. If you don't know how to check, download RVTools and run it against your vCenter server, go to the vRAM tab, and check the Swap Used and Limit fields. All your VMs should list 0 swap used and -1 for limit.
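
If you export that tab and want to filter it in a script, the check boils down to something like the Python sketch below. The dictionary keys are hypothetical stand-ins, not RVTools' actual column headers, and the sample rows are made up; only the rule itself (0 swap used, -1 limit) comes from the advice above.

```python
def vms_needing_attention(rows):
    """Return names of VMs that are swapping or have a memory limit set.
    Each row is a dict with hypothetical keys: name, swap_used_mb, limit_mb."""
    flagged = []
    for row in rows:
        swapping = row["swap_used_mb"] > 0
        limited = row["limit_mb"] != -1  # -1 means no limit is configured
        if swapping or limited:
            flagged.append(row["name"])
    return flagged

# Made-up sample rows, not real data from this environment.
rows = [
    {"name": "web01", "swap_used_mb": 0, "limit_mb": -1},
    {"name": "db01", "swap_used_mb": 512, "limit_mb": -1},
    {"name": "rds03", "swap_used_mb": 0, "limit_mb": 4096},
]
print(vms_needing_attention(rows))
```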

Thuleman · Feb 9, 2014

KapsZ28 said:
Other than the load balancing of NFS, is there any benefit to having multiple datastores on a single aggregate for 1 shelf? I mean, they are all using the same hard drives anyway.

One definite benefit is that if VMFS becomes corrupted, or if a run-away snapshot fills up your entire datastore, you only crash so many VMs. If all of your VMs run on the same datastore then you crash all of them.

Permissions could matter if you run VMs for 3rd parties that have a login to your vCenter to provision their own drives, you want to keep those parties separate from your main environment so they don't monkey with storage that doesn't "belong" to them.

Depending on what product(s) you are using to run your VMs and whether this is applicable in your environment, but you may want to look into running some of your VMs as linked clones which may eliminate the pressing need for dedup on the storage side.

KapsZ28 · Feb 9, 2014

Child of Wonder said:
Looks like latency spikes around noon and midnight each day. Anything running during those times? AV scans? Backups? Replication? Batch jobs?

Also, check to make sure none of your VMs are swapping. That puts a good deal of extra load on your datastores they don't need. If you don't know how to check, download RVTools and run it against your vCenter server, go to the vRAM tab, and check the Swap Used and Limit fields. All your VMs should list 0 swap used and -1 for limit.

We use CDP backups which usually run every 4 hours. So there are definitely backups during the day, but maybe only about 20 VM's in this environment are being backed up. We just purchased Veeam Backup and Replication and will be switching once we have more storage to back up to.

I'll definitely check out the RVTools. I've used a trial version of SolarWinds Virtualization Manager in the past and don't remember seeing too much swapping, but we have added quite a few more VM's since then.

KapsZ28 · Feb 9, 2014

Thuleman said:
One definite benefit is that if VMFS becomes corrupted, or if a run-away snapshot fills up your entire datastore, you only crash so many VMs. If all of your VMs run on the same datastore then you crash all of them.

Permissions could matter if you run VMs for 3rd parties that have a login to your vCenter to provision their own drives, you want to keep those parties separate from your main environment so they don't monkey with storage that doesn't "belong" to them.

Depending on what product(s) you are using to run your VMs and whether this is applicable in your environment, but you may want to look into running some of your VMs as linked clones which may eliminate the pressing need for dedup on the storage side.

Really? An entire VMFS can become corrupt? That would be bad for us. We are not doing any volume snapshots or SAN replication. Just backups on certain VM's.

KapsZ28 · Feb 9, 2014

Child of Wonder said:
Looks like latency spikes around noon and midnight each day. Anything running during those times? AV scans? Backups? Replication? Batch jobs?

Also, check to make sure none of your VMs are swapping. That puts a good deal of extra load on your datastores they don't need. If you don't know how to check, download RVTools and run it against your vCenter server, go to the vRAM tab, and check the Swap Used and Limit fields. All your VMs should list 0 swap used and -1 for limit.

So, are VM's like this impacting performance?

Child of Wonder · Feb 9, 2014

KapsZ28 said:
So, are VM's like this impacting performance?

lol

YES

Your hosts are running out of RAM, have tried ballooning to save memory, but ultimately are having to resort to swapping to satisfy VM memory demands. That will kick the crap out of your storage because your datastores are being used as swap space.

You either need to lower the amount of memory assigned to VMs (which may not even help since they're already ballooning quite a bit), increase the RAM in your hosts, or add more hosts.

KapsZ28 · Feb 9, 2014

Child of Wonder said:
lol

YES

Your hosts are running out of RAM, have tried ballooning to save memory, but ultimately are having to resort to swapping to satisfy VM memory demands. That will kick the crap out of your storage because your datastores are being used as swap space.

You either need to lower the amount of memory assigned to VMs (which may not even help since they're already ballooning quite a bit), increase the RAM in your hosts, or add more hosts.

Yeah, we are supposed to get more hosts. I think I will push for that before going to 10GbE. Right now there are 15 hosts in a single cluster and it is a mess. For one, not all servers are the same. Memory from 64 GB to 144 GB. Since some CPUs are different gens, the EVC is set to Penryn. Not to mention they aren't all using the same amount of NICs.

I want to create more clusters, increase the EVC to where it should be, and try to better match up clusters with the type of VM's that are running. This way the memory hogs are in a cluster with plenty of memory. Would like at least two new hosts with either 256 GB or 384 GB of RAM and two Intel Xeon E5-2650 v2 or better.

Vader · Feb 9, 2014

Yousers! Good catch CoW...btw..an assessment will find this...like the VMware Health Analyzer..among many other things...hint, hint....don't worry..I won't mention it again...lol.

KapsZ28 · Feb 9, 2014

Keep in mind, resources are much lower mid day Sunday. I will need to see what it looks like tomorrow to get a better picture.

peanuthead · Feb 9, 2014

KapsZ28 said:
Yeah, we are supposed to get more hosts. I think I will push for that before going to 10GbE. Right now there are 15 hosts in a single cluster and it is a mess. For one, not all servers are the same. Memory from 64 GB to 144 GB. Since some CPUs are different gens, the EVC is set to Penryn. Not to mention they aren't all using the same amount of NICs. I want to create more clusters, increase the EVC to where it should be, and try to better match up clusters with the type of VM's that are running. This way the memory hogs are in a cluster with plenty of memory. Would like at least two new hosts with either 256 GB or 384 GB of RAM and two Intel Xeon E5-2650 v2 or better.

I'm nowhere near the level of knowledge of Vader or CoW, so my responses won't be as good.

That is some crazy swapping going on. In reading about your host hardware I was reminded about vCenter/DRS. If memory serves me correctly when you have multiple hosts with such a drastic config doesn't vCenter/DRS have a difficult time knowing when to run and not run? (Sort of an open question.) I know you said you don't run it that often during the day, however I'm just throwing that out there.

KapsZ28 · Feb 9, 2014

peanuthead said:
I'm nowhere near the level of knowledge of Vader or CoW, so my responses won't be as good.

That is some crazy swapping going on. In reading about your host hardware I was reminded about vCenter/DRS. If memory serves me correctly when you have multiple hosts with such a drastic config doesn't vCenter/DRS have a difficult time knowing when to run and not run? (Sort of an open question.) I know you said you don't run it that often during the day, however I'm just throwing that out there.

I am also no expert on swapping, but I think there is only about 10 GB of memory being swapped total on a shelf with 8.95 TB of storage. Is that really a lot of swapping?

As for the DRS, I would like to get the clusters to a point where there isn't much DRS required. Especially since some VM's like our RDS Farm just can't be migrated without crashing.

Child of Wonder · Feb 9, 2014

KapsZ28 said:
I am also no expert on swapping, but I think there is only about 10 GB of memory being swapped total on a shelf with 8.95 TB of storage. Is that really a lot of swapping?

As for the DRS, I would like to get the clusters to a point where there isn't much DRS required. Especially since some VM's like our RDS Farm just can't be migrated without crashing.

It's not the amount of swap space being used, it's the unnecessary IOPs hitting your storage. Think about how many reads and writes go in and out of the RAM on your computer and how you have a dozen+ VMs all using your storage array as RAM. Spinning disks, no matter how many of them are in your Netapp, weren't made to handle IO designed for RAM.

In an earlier graph you showed your Netapp averaging 500+ IOPs, sometimes spiking over 1,000. I'd be willing to bet that number drops a good deal once you eliminate all the VM swapping and you'd also see your average latency drop a good deal as well.

peanuthead · Feb 9, 2014

When you get to swapping you're in a bad area. Generally you don't want to get to that area. I believe the memory reclaim order is TPS, ballooning, host RAM swapping then memory compression. Crazy; perhaps not. Remember, the swapping is not based on the storage shelf but based on the hosts' RAM usage versus capacity. The bottom line is that storage is being unnecessarily taxed.

Child of Wonder · Feb 9, 2014

peanuthead said:
I'm nowhere near the level of knowledge of Vader or CoW, so my responses won't be as good.

That is some crazy swapping going on. In reading about your host hardware I was reminded about vCenter/DRS. If memory serves me correctly when you have multiple hosts with such a drastic config doesn't vCenter/DRS have a difficult time knowing when to run and not run? (Sort of an open question.) I know you said you don't run it that often during the day, however I'm just throwing that out there.

DRS moves VMs based on a combination of load balancing the hosts and estimated benefit. If all the hosts are near 100% utilization, nothing will benefit from moving VMs around. I don't think DRS cares whether your host cluster is homogeneous or heterogeneous. HA, on the other hand, does, and having hosts of different CPU speeds and RAM sizes can result in fewer available resources than you'd expect when admission control is calculated.

Child of Wonder · Feb 9, 2014

peanuthead said:
When you get to swapping you're in a bad area. Generally you don't want to get to that area. I believe the memory reclaim order is TPS, ballooning, host RAM swapping then memory compression. Crazy; perhaps not. Remember, the swapping is not based on the storage shelf but based on the hosts' RAM usage versus capacity. The bottom line is that storage is being unnecessarily taxed.

Compression comes before swapping, but otherwise you're right.

Yes, swapping is very, very bad.
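
As a quick reference, the corrected order can be written as an ordered enum. This is only a sketch of the sequence as stated in this thread (TPS, ballooning, compression, swapping); the class and function names are made up, and how a real host moves between these techniques depends on its memory state, which is not modeled here.

```python
from enum import IntEnum

class Reclaim(IntEnum):
    """Reclaim techniques in the order given in this thread, least disruptive first."""
    TPS = 1          # transparent page sharing
    BALLOONING = 2
    COMPRESSION = 3
    SWAPPING = 4     # last resort; this is the one that lands on the datastore

def worse_than_ballooning(technique: Reclaim) -> bool:
    """True once the host has gone past ballooning."""
    return technique > Reclaim.BALLOONING

print([t.name for t in sorted(Reclaim)])
```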

KapsZ28 · Feb 9, 2014

Child of Wonder said:
It's not the amount of swap space being used, it's the unnecessary IOPs hitting your storage. Think about how many reads and writes go in and out of the RAM on your computer and how you have a dozen+ VMs all using your storage array as RAM. Spinning disks, no matter how many of them are in your Netapp, weren't made to handle IO designed for RAM.

In an earlier graph you showed your Netapp averaging 500+ IOPs, sometimes spiking over 1,000. I'd be willing to bet that number drops a good deal once you eliminate all the VM swapping and you'd also see your average latency drop a good deal as well.

Cool, thanks for the info. Is increasing the amount of memory a VM needs the only way to avoid swapping?

peanuthead · Feb 9, 2014

CoW, perhaps I was thinking of HA. I just vaguely remember that if the hosts are just a mix of cpu and various RAM sizes (max RAM) then ESXi has a hard time understanding when to move something and not to move something. Perhaps I am thinking of shares and the total numbers of shares being very disproportionate across the hosts.

peanuthead · Feb 9, 2014

KapsZ28 said:
Cool, thanks for the info. Is increasing the amount of memory a VM needs the only way to avoid swapping?

NO! Don't increase the RAM on the VM. You will cause a bigger issue. Either add RAM to the host(s) OR lower the RAM assigned to VMs. Since you are beyond ballooning, as said before, I doubt the lowering of RAM on the VMs will help. If you can reduce the RAM assigned to a VM (since it's not being used) then do so. This is just best practice; assign the RAM that it's using not what someone wants the value to be.

KapsZ28 · Feb 9, 2014

peanuthead said:
NO! Don't increase the RAM on the VM. You will cause a bigger issue. Either add RAM to the host(s) OR lower the RAM assigned to VMs. Since you are beyond ballooning, as said before, I doubt the lowering of RAM on the VMs will help. If you can reduce the RAM assigned to a VM (since it's not being used) then do so. This is just best practice; assign the RAM that it's using not what someone wants the value to be.

Ah, gotcha. I need to do more reading on this stuff. My proposed two new ESXi hosts will increase our overall memory capacity by 42%. If they go for it, hopefully that will eliminate these memory problems.

peanuthead · Feb 9, 2014

KapsZ28 said:
Ah, gotcha. I need to do more reading on this stuff. My proposed two new ESXi hosts will increase our overall memory capacity by 42%. If they go for it, hopefully that will eliminate these memory problems.

It will certainly help. As best practice, don't over-assign resources to a VM if the VM doesn't need them. It could never hurt to go through periodically and inventory the VMs that you are running. This helps keep things nice and tidy.

KapsZ28 · Feb 9, 2014

peanuthead said:
It will certainly help. As best practice, don't over-assign resources to a VM if the VM doesn't need them. It could never hurt to go through periodically and inventory the VMs that you are running. This helps keep things nice and tidy.

I agree. But some of these VM's are paid for by customers that wanted a specific amount of memory allocated to their VM. Being that they are paying for that much memory, not sure how to go back to them and say hey, we are going to decrease your memory allocation because you are not using it.

peanuthead · Feb 9, 2014

KapsZ28 said:
I agree. But some of these VM's are paid for by customers that wanted a specific amount of memory allocated to their VM. Being that they are paying for that much memory, not sure how to go back to them and say hey, we are going to decrease your memory allocation because you are not using it.

If you have a SLA agreement with them stating specifics then that is a totally different story.

Thuleman · Feb 9, 2014

KapsZ28 said:
I am also no expert on swapping, but I think there is only about 10 GB of memory being swapped total on a shelf with 8.95 TB of storage. Is that really a lot of swapping?

From a performance perspective any swapping is a lot of swapping.

KapsZ28 said:
Cool, thanks for the info. Is increasing the amount of memory a VM needs the only way to avoid swapping?

Like others have said already, you want to lower the amount of memory assigned to VMs or increase the amount of memory per host.

I seem to recall that you run a bunch of Linux VMs. Linux can easily use up all your RAM even though that memory isn't actually needed by the apps. See: https://atomicorp.com/company/blogs/259-why-does-linux-use-so-much-memory.html

Likewise database servers on any OS will use all of the VMs memory for caching even though that much memory isn't necessarily needed for proper operation of the OS+apps.

KapsZ28 said:
I agree. But some of these VM's are paid for by customers that wanted a specific amount of memory allocated to their VM. Being that they are paying for that much memory, not sure how to go back to them and say hey, we are going to decrease your memory allocation because you are not using it.

I find the above highly problematic. Your customers paid for memory that you aren't providing to them. If I pay my host for 1 GB of RAM then I expect that 1 GB to be available to me and not just 700 MB of physical RAM and 300 MB of RAM that resides on a swap disk. I am trying really hard to stay polite here but companies like the one you work for are basically ripping people off. It's at least unethical, and depending on your terms of service it may be in legally questionable territory.

KapsZ28 · Feb 9, 2014

Thuleman said:
I find the above highly problematic. Your customers paid for memory that you aren't providing to them. If I pay my host for 1 GB of RAM then I expect that 1 GB to be available to me and not just 700 MB of physical RAM and 300 MB of RAM that resides on a swap disk. I am trying really hard to stay polite here but companies like the one you work for are basically ripping people off. It's at least unethical, and depending on your terms of service it may be in legally questionable territory.

I am not going to argue with you there, but it is not always easy to have all resources available when customers make requests on the fly. Buying more servers is a plan for the immediate future, but to give you an idea, last week a customer all of a sudden wanted 30 VMs spun up as Linux DB servers with 2 GB to 4 GB of memory each. So figure about 90 GB of extra memory from their order. They have always used dedicated servers, not our virtual environment.

Another client has requested 4 VMs with 32 GB of memory in the past month.

And another client that has access to their own resource pool apparently gave themselves quite a few extra GB of memory they aren't even paying for. I believe they are paying for 64 GB of memory, have VM's totaling 164 GB and actively using 121 GB of memory. So in that case we are losing money.
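
Spelled out, the gap for that resource pool looks like this. A minimal Python sketch using only the three figures quoted above; the function and key names are made up for illustration.

```python
def pool_usage(paid_gb, allocated_gb, active_gb):
    """Summarize how far a resource pool exceeds what the customer pays for."""
    return {
        "allocated_over_paid_gb": allocated_gb - paid_gb,
        "active_over_paid_gb": active_gb - paid_gb,
        "allocated_ratio": allocated_gb / paid_gb,
    }

# Figures from the post: paying for 64 GB, VMs totaling 164 GB, 121 GB active.
summary = pool_usage(paid_gb=64, allocated_gb=164, active_gb=121)
print(summary)
```

So the pool has 100 GB more allocated than paid for, and 57 GB of that overage is in active use.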

In the past couple of weeks I have increased memory capacity to this cluster by 160 GB by moving around a couple of servers.

All I need is management to sign off on two new servers and I will have 768 GB more memory to play with.
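
As a cross-check on those numbers: if 768 GB is the 42% increase mentioned earlier, the current cluster total can be backed out. This is a rough Python sketch that assumes both figures refer to the same 15-host cluster and that the 42% was measured against the capacity before the new hosts; neither is stated explicitly.

```python
def implied_current_capacity_gb(added_gb, increase_fraction):
    """Back out current capacity from an added amount and the fractional increase it represents."""
    return added_gb / increase_fraction

current = implied_current_capacity_gb(768, 0.42)
per_host = current / 15  # 15 hosts, per the earlier post
print(round(current), round(per_host))
```

That works out to roughly 1,829 GB today, or about 122 GB per host on average, which sits inside the 64 GB to 144 GB range given for the existing servers.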

I think part of the problem is our guys don't like to tell customers no or they need to wait until we procure more resources. They try to make everyone happy and then give me a huge headache.

QHalo · Feb 9, 2014

The hard truth: Your business is trying to act like a service provider and doesn't appear to know what it takes to operate as one.

Thuleman · Feb 9, 2014

If nothing else this illustrates that because the entry level barrier for "cloud hosting" is so low there are all kinds of issues that happen at smaller companies.

KapsZ28 said:
I am not going to argue with you there, but it is not always easy to have all resources available when customers make requests on the fly.

And another client that has access to their own resource pool apparently gave themselves quite a few extra GB of memory they aren't even paying for. I believe they are paying for 64 GB of memory, have VM's totaling 164 GB and actively using 121 GB of memory.

There are resource pools for three departments on the equipment I run and there's no way that they can increase the memory/cpu allocations of their resource pools. In addition Expandable Reservations are disabled so they are locked into the resources they have and that's that. If the client was able to allocate memory to his/her resource pool and that is not intended then the access permissions are set incorrectly.

KapsZ28 said:
I think part of the problem is our guys don't like to tell customers no or they need to wait until we procure more resources. They try to make everyone happy and then give me a huge headache.

This all sounds terrible. I'd get out of there if you have that option.
Not adding more memory to servers (overnight from newegg/amazon) when it's needed, oversubscribing resources, is all of this running on a single SAN? What happens when that SAN fails?

Why don't you guys make use of the VMware Hybrid cloud providers to add some capacity while you wait on parts.

QHalo · Feb 9, 2014

I would think bursting into the hybrid cloud would require permission from the customers as well.

peanuthead · Feb 9, 2014

I agree with Thuleman. Any customer that has access to the vmware client, etc. is a bad thing unless there is something in a contract stating the other. All VM management and new VM creation is done by a team at where I work, period. None of our clients have access to the environment other than their specific servers. Heck, they don't even have SQL server access.

Adding more RAM is going to help correct the over-allocation issue, but it won't correct the process. If a customer needs you to be that elastic and expand that quickly then using Amazon or VMware hybrid clouds would be the best option for the moment. My other thought is that if they require the VMs for a momentary use then shut them down afterwards and use the hybrid option. (i.e. - a CPA office would need more resources for the first few months of the year, then they are done with them.) If they need those resources permanently then their hosting fee should increase accordingly so the new hardware purchase(s) can be paid for. Hopefully there is something being put aside from the current customer base to handle such an up front cost.

Thuleman I think they had 3 SANs; 15K SAS, 7200rpm SATA, 7200rpm SATA. Unless there is only one SAN broken into 3 arrays. I'm not a Netapp guru. If it is one SAN then what if it goes down? That rebuild would take forever but I guess that is why they are running dual parity....

QHalo, why would they need to contact the customer on that? My understanding is that the hybrid cloud is just a temporary extension of one's DC.

KapsZ28 · Feb 9, 2014

No, a customer shouldn't have access to modify their resources, but for some reason this resource pool was set to unlimited. Probably an oversight as the permissions were correct.

But personally I think some of the comments here are uncalled for. Not all businesses have the money to pay for constant upgrades. Until the past year this company has been running in the red. The two co-founders don't even get paychecks and are very dedicated to their customers.

As of this year we have been approved for a loan and are no longer operating in red. We just barely skimmed into the black.

This is why we are looking to upgrade and fix an infrastructure that has been outgrown.

I am simply looking for help and opinions on how to increase performance and get our infrastructure where it needs to be. Not opinions about whether or not we are a good service provider.

peanuthead · Feb 9, 2014

I'm not attacking the service level your employer provides. I'm simply trying to help guide where I can in correcting the issue immediately and long term. I completely understand the situation and it's a difficult one to be in. On one side you don't have the funds to correct the issue by adding hardware. On the other side, if you offer slower than promised service you have people leave that are paying something. Something I just thought of that may be beneficial is to sell the lower end hosts (44GB RAM ones, etc.) and purchase a couple of used C1100s w/ 128GB RAM from eBay. While not ideal or elegant, if you can sell the old hosts for 50-75% of the price of a C1100 and get 3x the RAM and performance then that could be a win-win situation.
