Geo Clustering & Risk Assessment - aka the Billion Dollar Question

aitoribarra · Nov 25, 2010

This isn't really the right forum for this question, but I know that a lot of the people hanging around here have knowledge and experience that can help me. I find a higher concentration of sensible IT answers here than anywhere else. I'm sorry if I'm reminding too many people about the less pleasant aspects of their day jobs!

I need a sensible way of calculating the relative disaster / outage risk of different distances and locations between two or more datacenters for geo clustering. The reason being that the distance between sites is a major cost factor - this will need dark fiber to enable 40GbE/100GbE and beyond. As this gets more and more expensive with greater distance, I need a way to quantify how much safer I'm going to be for that extra spend. Basically I'm looking for some sort of well tested, maybe peer reviewed methodology, ideally something I can feed numbers into until I can find an acceptable balance. I don't feel comfortable making guesses (even educated ones) and would like to use the same kind of methodology that, say, a risk assessor for an insurance company would use to come up with premium pricing.

Geo clustering: I'm talking about connecting two DCs for general networking, storage (iSCSI replication with failover/failback), failover clustering of application workloads, VMs, live migration of VMs - all of this needs a big fat pipe.

It's not enough to just get the best I can afford; even if I can't afford the best possible (probably about 25km before latency becomes an issue?) I need to have some numbers on the probabilities.

Here's a pair of example options:

Option A: cheap but riskier
Two London Docklands Tier 3/4 datacenters that are only about 50m apart
Separated by a street and two office buildings
Connected by diversely routed dark fiber
Both are at risk from a local grid outage (a few years ago, there was a power outage of several hours which took out pretty much the whole of London south of the Thames, but both these DCs have their own backup gensets with enough fuel to survive several days without grid power)
At the same altitude (close to water level of river Thames, so if there was a tidal flood both DCs would flood)
A large explosion / plane crash / alien attack / earthquake would take out both DCs

Option B: expensive but safer
Docklands DC + North West London DC, approx 15 radial km apart
Connected by diversely routed dark fiber
Different altitudes, with NW DC at much lower flood risk
Only a massive disaster (thermonuclear explosion or massive earthquake) could take out both DCs simultaneously
Failure of entire London power grid is still a possibility

I want to compare the cost & risk of the two options for the following scenarios:

1) Availability risk - the probability that both DCs suffer an outage at the same time
2) Disaster risk - the probability that both DCs suffer a disaster so bad that a restore from backup is necessary (which raises the question of backup variables like distance of backup from the two DCs, frequency of backup, recovery time, lead times on replacement hardware)

My feeble mind tells me:

Availability risk = look at min, max and average of a * b , c * d and e
where
a = risk of independent failure of 1st DC
b = risk of independent failure of 2nd DC
c = risk of independent failure of first fiber
d = risk of independent failure of second fiber
e = risk of single disaster that takes out both DCs at the same time

It's how e varies with distance that is key, I think.
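A minimal sketch of one way the terms above might be combined, assuming the three routes to a cluster-wide outage (both DCs failing independently, both fibers failing independently, one shared disaster) are independent of each other and that every input is a probability over the same period. The function name and all the sample numbers are hypothetical, picked only to show how a change in e moves the result:

```python
def simultaneous_outage_probability(a, b, c, d, e):
    """Probability that at least one of the three outage routes occurs.

    a, b: independent failure probability of the 1st and 2nd DC
    c, d: independent failure probability of the 1st and 2nd fiber
    e:    probability of a single disaster taking out both DCs
    Assumption: the three routes are independent of one another.
    """
    both_dcs = a * b
    both_fibers = c * d
    p_no_outage = (1 - both_dcs) * (1 - both_fibers) * (1 - e)
    return 1 - p_no_outage


# Hypothetical annual probabilities, for illustration only.
near = simultaneous_outage_probability(a=0.01, b=0.01, c=0.02, d=0.02, e=0.001)
far = simultaneous_outage_probability(a=0.01, b=0.01, c=0.02, d=0.02, e=0.0001)
print(f"near: {near:.6f}  far: {far:.6f}")
```

With inputs like these, e dominates the total as long as it is larger than a * b and c * d, which fits the hunch that e is the term to pin down.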

My Google-fu has proved wanting; so far the best I could find is this:

An Empirical Assessment of IT Disaster Risk

This is based on an empirical study of real-life disasters, but it's quite old (1981 - 2000, 429 disasters), and doesn't really give me the information I'm after. Still, some really interesting stuff in it:

Most frequent disasters:

Disruptive act: Worker strikes and other intentional human acts, such as bombs or civil unrest, designed to interrupt the normal processes of organizations.

Fire: Electrical or natural fires.

IT failure: Hardware, software, or network problems such as DASD failures or bad controller cards.

IT move/upgrade: Data center moves and CPU upgrades undertaken by the company that cause disruption.

Natural event: Earthquakes, hurricanes, severe weather (for example, heavy rains or snowfall) or other events that lack dependence on human activity.

Power outage: Loss of power.

Water leakage: Unintended loss of contained water (for example, pipe leaks, main breaks).

Relative risk:

Category         Relative risk
Natural Event    79.1
IT Failure       69.7
Disruptive Act   32.9
Power Outage     14.2
Fire             4.5
IT Move/Upgrade  2.1
Water Leakage    0.22
Miscellaneous    2.8
Environmental    2.4
Theft            0.0
Flood            0.0
IT Capacity      0.0
IT User Error    0.0

(Frankly I think the numbers are skewed somewhat by people lying about how many times they've screwed up - "IT User Error" should be a much higher number... Also most of the DCs must have been located in places a lot more dangerous than London!)

And here is some mathematics (used for the relative risk numbers above):

The formula used to calculate relative risk is:
c^2 * p - (c * p)^2
where c = claim amount; p = probability of claim
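Reading the 2s in that formula as exponents (an assumption, since the superscripts appear to have been lost in the paste), it is the variance of a loss that equals c with probability p and 0 otherwise, which simplifies to c^2 * p * (1 - p). A small sketch with made-up numbers; the function name is hypothetical and not taken from the paper:

```python
def relative_risk(c, p):
    """Variance of a loss of size c that occurs with probability p.

    Assumes the formula reads c^2 * p - (c * p)^2.
    """
    return c**2 * p - (c * p) ** 2


# Made-up inputs: a claim of 10 units with a 50% chance of occurring.
claim, prob = 10, 0.5
print(relative_risk(claim, prob))

# The same value via the simplified form c^2 * p * (1 - p).
print(claim**2 * prob * (1 - prob))
```

Because of the (1 - p) factor, this measure rewards large, uncertain losses rather than frequent ones, which may be part of why categories like "IT User Error" score so low.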

I'm sure some of you had to worry about this stuff - how did you come up with your answers?

Many thanks,

Aitor

GeorgeHR · Nov 25, 2010

A thermonuclear bomb or EMP is going to drive your risk of failure to infinity - user computers are not going to function.

It appears that you want some type of real-time risk control (some sort of bound on downtime) rather than traditional insurance (compensation for a loss). That is a different problem.

I suspect the risk probabilities are a function of expected down time or expected response time increase.

aitoribarra · Nov 25, 2010

Thanks GeorgeHR,

I was joking about the bombs and aliens... And obviously a targeted attack as opposed to a rare random event is much harder to protect against. Luckily, I don't think anyone outside of fiction has ever used an EMP...

I think you're right - I'm after the probability of simultaneous outage, and looking at how that changes with distance. Expressed most simply - I know that it costs x to have my DCs practically next to each other, and 100x to 1000x to have them far apart. Let's pretend that having them next to each other and costing x gets me to a 99.99% probability that I won't suffer simultaneous outage. If it costs me 100x to separate them further, does that get me to 99.999% (which is pretty bad value for money) or 99.9999% (which is pretty good)?
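One rough way to put numbers on that value-for-money comparison is to count how many "nines" each option buys and how much the residual risk shrinks. A minimal sketch using the pretend figures from the paragraph above; the function names are hypothetical:

```python
import math


def nines(availability):
    """Number of nines in an availability figure, e.g. 0.9999 is about 4."""
    return -math.log10(1 - availability)


def nines_gained(before, after):
    """Extra nines bought by moving from one availability to another."""
    return nines(after) - nines(before)


baseline = 0.9999  # pretend figure for DCs next to each other at cost x
for target in (0.99999, 0.999999):
    gained = nines_gained(baseline, target)
    reduction = (1 - baseline) / (1 - target)
    print(f"{target:.4%}: +{gained:.1f} nines, "
          f"residual risk cut about {reduction:.0f}x for 100x the cost")
```

On this scale the first target cuts the residual risk by roughly a factor of ten and the second by roughly a hundred, for the same 100x spend.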

Maybe I'm looking at this the wrong way and instead of probability:distance I should just think of risk categories. Spend x and I'm at risk from any event which can take out more than one building; spend 100x and I'm at risk from anything that can take out a whole city.
