Curious how other environments do this... DEV to Prod comparison

Grimlaking
I need to be clear about where I work: the specific area I work in is considered very critical to my company's business.

One of the things we do to help us maintain uptime is that if we have a piece of hardware or software we need to run in production, we have the exact same thing in Development. We even did this with our VM infrastructure.

Before starting, we got three small 20-core systems with 128 GB of RAM each and a 12 TB Dell storage setup to test on. Once it was decided, we bought 3 hosts (going with 5 per VM cluster in Prod) of 38-core, 512 GB systems connected to an independent VNX5400 (each cluster using its own hardware for redundancy's sake).

I'm getting the impression I'm rather lucky that we have a group that does this for our servers and infrastructure, and that most places are not allowed the freedom/budget to do this sort of thing.

Or am I way off base here?
 
Not off base, not sure of your capacity needs though. Personally, I would have gone with 3 separate storage systems per cluster (choosing something cheaper than the expensive VNX5400). But that's me.
 
Cjcox, out of curiosity, why would you want 3 different storage solutions per cluster? Our VNX is set up with FAST Cache (for our pool), some SSD, and the rest 15K RPM SAS drives.
 
You are right. At my old job our main hardware was UCS B200 blades with various configs due to different generations. Storage was an all-flash NetApp array with some SAS. Our test, or dev if you could call it that, was old HP blade chassis with mostly old SATA and some SAS. You guys did it right, but not everyone has the budget to do it right.
 
It depends on your business. Mine is a software company, and our dev is all local. We're using Kubernetes/Docker for our applications running on Linux VMs (on ESXi hypervisors).
We then push all the polished apps to GCP and AWS (also containerized) for our UAT environment, followed by production environments after final checks.
Our product has both on-premise and SaaS-based offerings though.
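
Roughly, that promotion flow can be sketched like this (the image, registry, deployment, and kubectl context names below are hypothetical; it assumes the Docker CLI and kubectl are configured for the UAT and prod clusters):

```python
# Hypothetical promotion sketch: the image built and tested on the local dev VMs
# is pushed to a registry and then rolled out to UAT, then prod.
import subprocess

IMAGE = "registry.example.com/myapp:1.4.2"  # hypothetical image tag


def run(cmd):
    """Run a CLI command, echo it, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def promote(kube_context):
    """Point the app's deployment in the given cluster at the tested image."""
    run(["kubectl", "--context", kube_context,
         "set", "image", "deployment/myapp", f"myapp={IMAGE}"])
    run(["kubectl", "--context", kube_context,
         "rollout", "status", "deployment/myapp"])


if __name__ == "__main__":
    run(["docker", "push", IMAGE])  # assumes the image was already built and tagged
    promote("uat-gcp")              # UAT in the cloud first...
    promote("prod-aws")             # ...then production after final checks
```

The point is that the same container image that passed testing in dev is what lands in UAT and then production, rather than rebuilding per environment.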
 
Well, my employer is not so good about that. We're running prod on a pair of Xeon E5-2620 v2 machines, and our dev/QA VMs are running on the old prod stack, a pair of Dell 1950s with 2 dual-core Xeon 5400-series chips. The policy before I joined was that prod gets upgraded every 5 years and Dev/QA gets the old prod machines.

Well, that was before we got bought out and had a full management changeover. Now they're looking to replace our Dev/QA machines with a single-socket AMD Epyc host, with no redundancy except in storage. They don't even want to spring for dual power supplies. They've also said they want to replace our prod hosts by 2020, but most of our prod purposes will be pushed out to "cloud hosted". We'll just have file servers and AD authentication locally, with all our Prod and QA web services hosted with AWS, Azure, Packet.net, and Rackspace. Dev will stay local, but that's about it.
 
Dgineri, that sucks. I wouldn't want to work with or do business with someone who has my information in the cloud (not saying that's what your company is doing, mind you). I work for a private security company and we are VERY concerned with redundancy: every server has a mirror, and every server has a redundant network and connection to the VNX. And each side (each site has two sides) has its mirror in an opposite server on identical hardware, but completely segregated out to the network core 9600s. Knock on wood, but that's why we haven't had an outage in over 30 years.
 
Not off base, and it sounds like a decent process.

Our place is like yours, but on a fairly large scale: almost completely virtualized on-prem, with multiple cloud residencies. Our non-prod hardware is pretty close in spec to our prod, with slight variances in clock speed being the main difference. We have a lot of UCS B2x0 M3/M4 with some older M2, backed by a mix of storage arrays including 3PAR and NetApp. The M2s and older M3s we are decommissioning and replacing with Nutanix as stuff falls off lease.

Our most recent (in progress) is one app that will have:
Non-prod: 16 NTX nodes in 3xxx-series chassis (2x 10-core per node, 512 GB per node, 8 TB total per chassis)
Prod 1: 14 UCS M4-series blades (2x 10-core per blade, 512 GB per blade, SVC storage)
Prod 2: 64 NTX nodes in 3xxx-series chassis (2x 12-core per node, 512 GB per node, 8 TB total per chassis)

Generally our non-prod and prod infra is scaled close, but I want to say non-prod is actually bigger, because each app team insists they need the exact same as prod for non-prod... 2 dev, 2 QA, 3 stress... etc. WHY!?! Oh, and now we have to have a thing they made up called "pre-prod". What a joke. And they all must be bloated to the same spec too... 8+ vCPU, 24+ GB... it goes on. Being on the capacity team, it's enough to make me want to drink on the job.
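
For a rough sense of the scale involved, here is a back-of-the-envelope tally of the environments listed above (raw physical cores and RAM only; it ignores hyperthreading, overcommit, and any N+1 reserve):

```python
# Back-of-the-envelope tally of the environments listed above for the one app:
# raw physical cores and RAM, ignoring hyperthreading, overcommit, and N+1 reserve.
envs = {
    # name:     (hosts, sockets, cores_per_socket, ram_gb_per_host)
    "Non-prod": (16, 2, 10, 512),  # NTX nodes in 3xxx-series chassis
    "Prod 1":   (14, 2, 10, 512),  # UCS M4-series blades
    "Prod 2":   (64, 2, 12, 512),  # NTX nodes in 3xxx-series chassis
}

for name, (hosts, sockets, cores, ram_gb) in envs.items():
    total_cores = hosts * sockets * cores
    total_ram_tb = hosts * ram_gb / 1024
    print(f"{name}: {total_cores} cores, {total_ram_tb:.0f} TB RAM")

# Prints roughly:
#   Non-prod: 320 cores, 8 TB RAM
#   Prod 1:   280 cores, 7 TB RAM
#   Prod 2:   1536 cores, 32 TB RAM
```

Even for this one app, non-prod already edges out Prod 1 in raw cores and RAM.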
 
Dgineri, that sucks. I wouldn't want to work with or do business with someone who has my information in the cloud.
We don't handle a whole lot of sensitive data. Our business model lends itself to cloud use a lot better than most things, but I can't really say what it is. Our management is very sensitive to public image, so we're discouraged from associating ourselves with the company publicly. Any public association must be tempered with "being a representative of the company", which leaves out a lot of activities and just makes online life a lot less fun.
 
Generally our non-prod and prod infra is scaled close, but I want to say non-prod is actually bigger, because each app team insists they need the exact same as prod for non-prod...

Wow, yeah, that's overkill in and of itself.

I've heard some horror stories about drive failures in the UCS solution being just way out of hand. We went with Dell 730s when we did our refresh, running dual 18-cores with 512 GB of RAM and one as a failover on each side (we are dual-site with dual sides). This allows us, in effect, to have quadruple redundancy, where in a worst-case scenario we can actually run the entire critical portion of our business on one of those silos.

What we don't do is have the exact same size in Dev. Currently our Dev is a bit overbuilt, but we have an exact version-for-version match in hardware, so that if we do real testing on new versions and such, we know for a fact how it will translate to behavior in production.

We didn't go with the UCS solution because, with how we want to silo and with our footprint, we would have wound up with UCS chassis with one blade in them, and that is a huge expense for hardware not being used fully.
 
I can't say if it's the solution or not, just because I honestly don't know. My hunch is the less-than-desired environment our UCS C240s are in (a 75°F+ room, and somewhat dusty). We run 2x C240 with 12(?) SCSI drives. After about 3 years, they start dropping like flies. We also run StorMagic; the support has been fine and the product works as advertised, but the complexity as a whole makes it a little frustrating. System down? Time to WebEx Cisco, VMware, and StorMagic... grrr. I think the main problem with this complexity is that the drives don't want to fail alone and usually take at least one more with them, causing VM corruption. It turns into a whole new beast.

Makes sense with the redundancies. We are similar with multi-site, but only some apps have the extra silo within the site, like the app I posted above. Sometimes it's worth the extra cost for uptime. For the dev sizing, if you can actually test, and I mean stress test like a prod env would see, I'm all for rightsizing. The problem is when they say it's for that and really it's only dev/functional testing, and the 'stress' test really happens in production, but they insist they need multiple stress environments... booo lol. I wish I could just send some of their bloat to my basement for a small lab haha.

I agree on the 8-blade chassis: they are nice if you need the capacity, but can be a large waste if not. Most of ours are 6-8 blades per chassis, filling a 42U cabinet minus the top-of-rack switch(es). Looks cool from the front until you have to crab-walk behind them and sweat your lunch off haha.
 
Cjcox, out of curiosity, why would you want 3 different storage solutions per cluster? Our VNX is set up with FAST Cache (for our pool), some SSD, and the rest 15K RPM SAS drives.

For flexibility I always want a +1 scenario. I don't always get what I want, though. A single piece of dense storage in a SAN is just not going to be as effective as 2-3 independent storage systems in a SAN. So... flexibility, speed, scale, reliability.
 
OK, I'd love to discuss this more with you. How do you figure a SAN with tiered storage is going to perform worse than three independent SANs? All of your migrations then take additional time, as the storage generally would need to be migrated as well if each host in the three-host dev environment we are talking about is used. Also, you no longer have the benefit of tiered storage and FAST Cache to help with performance overall. (You can turn off storage tiering for your 'slow' storage if you need to.) I'd really like to understand the benefit of having three independent storage solutions over one highly reliable one (short of the HA heartbeat).
 
Normally, even in a tiered scenario, there aren't independent pathways (not talking multipath, but the concept of independent storage pathways). In fact, I'm actually very anti-tiering... well, actually, there are only two tiers anymore: all SSD/flash, and disk. The scale difference is simply too great. If you really need IOPS, flash. Otherwise, it simply doesn't matter anymore. I'm done with rust tiering... get an array of high-density, slow-RPM drives. Much more reliable. Again, the high-IOPS tier is all flash.
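
As a rough illustration of that scale difference, using rule-of-thumb per-device IOPS figures (assumed ballparks, not measurements; real numbers vary with workload, block size, and queue depth):

```python
# Rule-of-thumb random-IOPS figures per device (assumed ballparks, not measurements).
IOPS = {
    "7.2K NL-SAS HDD": 80,
    "15K SAS HDD": 180,
    "enterprise SSD": 50_000,  # deliberately conservative for a single SSD
}

ssd_iops = IOPS["enterprise SSD"]
for device, iops in IOPS.items():
    drives_needed = -(-ssd_iops // iops)  # ceiling division
    print(f"~{drives_needed} x {device} to match one SSD's random IOPS")

# ~625 x 7.2K NL-SAS, ~278 x 15K SAS, 1 x SSD: hence two tiers, flash for IOPS
# and dense, slow spinning disk for capacity.
```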

So, why 3? N+1. You should be able to run on two if you have to. This allows you to easily migrate and update your storage regardless of vendor (yes, there are vendors with proprietary solutions that help, but they are vendor proprietary and usually work via "magic"). Having two allows you to create interesting higher-speed scenarios that cross two units. Doing that with one storage platform usually means effectively funneling everything through one pipe (which doesn't make sense). Of course, again, the whole scenario might not be interesting anymore... unless you have no flash-only "tier". Why not just N, and add as needed? Because there is a benefit, from a scaling point of view, to the extra pathway (talking front to back, from network to controller, etc.).

I do the same for hypervisors. N+1 (or better of course). Just need a way to easily upgrade. Why not just N? Same reason. Being able to run on N is not as "nice" as running on N+1, though you need to make sure the system can operate on N if it has to. And you can do better than +1 if you like. It's just my minimum.

I've worked in a lot of shops with just "N" style setups, and it always makes things very painful and often times much more expensive.

Oh... and I never said the storage units would be "unreliable", just cheaper. You can get high performance active-active controller storage arrays for a lot less (in fact you might get all 3 for the price of that one VNX).
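
A minimal sketch of the N+1 sizing rule described above, for either hosts or storage units (the capacity numbers and the 85% headroom ceiling are hypothetical):

```python
# Sketch of the N+1 rule: the remaining N-1 units must carry the whole load
# with some headroom. Capacities and the 85% ceiling below are hypothetical.
def survives_one_failure(total_load, units, per_unit_capacity, max_util=0.85):
    """True if N-1 units can absorb the full load without exceeding max_util."""
    remaining_capacity = (units - 1) * per_unit_capacity
    return total_load <= remaining_capacity * max_util

# Three storage units of ~40k IOPS each, steady-state load of ~55k IOPS:
print(survives_one_failure(55_000, units=3, per_unit_capacity=40_000))  # True
# The same load on only two units no longer survives losing one:
print(survives_one_failure(55_000, units=2, per_unit_capacity=40_000))  # False
```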
 
Our VNXs, the way they are set up, are two-speed: SSD and 15K, that's it, with FAST Cache on top of that. It's all good; I understand what you are getting at now. More but smaller units, then some sort of JBOD software solution with the LUNs as you see fit, no mixed LUNs at all. I understand your perspective now. Thank you!
 