Google Outage - Network Gurus: check out how it happened

First: Google had major outages. This is still very new, so I do not know how widespread it was, but I know it affected a number of states and seemed to primarily affect AT&T's network and Qwest's network (unfortunately our two ISPs at work), but not McLeod (McLeodUSA).

I was fortunate to capture a bunch of debug output while trying to track down the problem, and I wanted to share it for everyone to discuss. I believe the problem was that an ISP (ntt.net) did not filter a customer BGP advertisement from someone who ended up advertising a best route for Google, which several major ISPs picked up. Below are my logs and a brief analysis.

0. My Gmail stops working. I switch machines; still not working. Check the router:
#ping gmail.google.com rep 100

Translating "gmail.google.com"...domain server ( ) [OK]

Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 72.14.205.101, timeout is 2 seconds:
...!.!.....
Success rate is 18 percent (2/11), round-trip min/avg/max = 420/432/444 ms
Ah crap! See which ISP it is coming from, and trace it:
sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 50236222
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Not advertised to any peer
7018 2914 15169, (received & used)
12.86.168.x from 12.86.168.x (12.122.83.148)
Origin IGP, localpref 100, valid, external, best
-----------------------
#traceroute 72.14.205.101

Type escape sequence to abort.
Tracing the route to www3.l.google.com (72.14.205.101)

1
2 cr2.sl9mo.ip.att.net (12.122.142.142) [AS 7018] 20 msec 20 msec 16 msec
3 cr2.cgcil.ip.att.net (12.122.2.21) [AS 7018] 56 msec 16 msec 16 msec
4 ggr6.cgcil.ip.att.net (12.122.132.185) [AS 7018] 20 msec 16 msec 20 msec
5 192.205.35.78 [AS 7018] 16 msec 28 msec 28 msec
6 ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.4.112) [AS 2914] 20 msec 20 msec
ae-1.r21.chcgil09.us.bb.gin.ntt.net (129.250.3.8) [AS 2914] 20 msec
7 p64-7-0-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.5.20) [AS 2914] 72 msec
p64-2-1-0.r20.sttlwa01.us.bb.gin.ntt.net (129.250.5.28) [AS 2914] 68 msec 64 msec
8 as-2.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.2.35) [AS 2914] 188 msec
as-3.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.4.190) [AS 2914] 184 msec
as-2.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.2.35) [AS 2914] 192 msec
9 xe-2-0-0.a20.tokyjp01.jp.ra.gin.ntt.net (61.213.162.110) [AS 2914] 184 msec 160 msec
xe-1-0-0.a21.tokyjp01.jp.ra.gin.ntt.net (61.213.162.230) [AS 2914] 156 msec
10 xe-2-1.a17.tokyjp01.jp.ra.gin.ntt.net (61.213.169.70) [AS 2914] 180 msec 192 msec 172 msec
11 * * *
12 209.85.241.90 [AS 15169] 288 msec
Ok WTF, Tokyo?!?! That's fucked. Kill our AT&T connection and fail over to Qwest: still having problems. Try from home: Google is fine. Try from my iPhone: Google is down. (Ok, I calmed down a bit once I realized all of AT&T's network is jacked getting to Google, not just ours.)

My home router (out via McLeod) is fine:
2 216-43-69-9.ip.mcleodusa.net (216.43.69.9) 60 msec 64 msec 68 msec
3 CHCGILWUH53JC04-GE3-2-0-0.mcleodusa.net (64.198.100.206) 100 msec
209-252-156-2.ip.mcleodusa.net (209.252.156.2) 76 msec 76 msec
4 core1-2-2-0.ord.net.google.com (206.223.119.21) 76 msec 80 msec 72 msec
5 216.239.48.154 76 msec 76 msec
209.85.250.237 104 msec
6 209.85.250.110 108 msec 108 msec 112 msec
7 66.249.94.92 108 msec 124 msec 112 msec
8 72.14.236.142 120 msec 116 msec 116 msec
9 www3.l.google.com (72.14.205.101) 92 msec 96 msec 92 msec

I call an ISP down the street that I know, and they are not having any problems out via McLeod or their other ISP. I check the Internet Health Report: problems between AT&T and Internap, and between Internap and Cogent (remember Cogent?), with packet losses close to 4%.
[Screenshot: Internet Health Report showing the AT&T/Internap and Internap/Cogent packet loss]
Ok, checking looking glasses... most of which I cannot reach, but I get into route-views.oregon-ix.net (the University of Oregon's Route Views server), and later I parse the output down to this important tidbit:
route-views.oregon-ix.net>sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 2029656
Paths: (34 available, best #25, table Default-IP-Routing-Table)
Not advertised to any peer
3333 286 15169
193.0.0.56 from 193.0.0.56 (193.0.0.56)
Origin IGP, localpref 100, valid, external
Community: 286:85 286:800 286:3031 286:4001

2914 15169
129.250.0.171 from 129.250.0.171 (129.250.0.79)
Origin IGP, metric 336, localpref 100, valid, external
Community: 2914:410 2914:2401 2914:3400
Dampinfo: penalty 507, flapped 3 times in 00:24:37

2914 15169
129.250.0.11 from 129.250.0.11 (129.250.0.51)
Origin IGP, metric 261, localpref 100, valid, external
Community: 2914:410 2914:2401 2914:3400
Dampinfo: penalty 511, flapped 3 times in 00:24:47

7018 2914 15169
12.0.1.63 from 12.0.1.63 (12.0.1.63)
Origin IGP, localpref 100, valid, external
Community: 7018:5000
Dampinfo: penalty 502, flapped 3 times in 00:24:48
Remember AS 2914. AS 15169 is Google. AS 7018 is AT&T, where I get my BGP routes from. So why the F is Google getting advertised by ntt.net as a one-AS-hop route via Tokyo to a vast majority of the internet? I have to guess that it is the same deal as the YouTube black-hole problem from last year, in which someone decided to advertise YouTube's prefix as a null route to block it, but their upstream ISP did not have basic filtering of customer-advertised routes in place.
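For the non-BGP folks reading along, this is roughly what that kind of customer-session filtering looks like on a Cisco box; the prefixes, AS numbers, and neighbor addresses below are made up purely for illustration, not anyone's real config:
Code:
! Hypothetical provider-side filters on a customer eBGP session.
! Accept only the prefixes this customer is authorized to originate...
ip prefix-list CUST-ALLOWED seq 5 permit 198.51.100.0/24
! ...and only paths that actually originate in the customer's AS (64500 here).
ip as-path access-list 10 permit ^64500$
!
router bgp 64496
 neighbor 192.0.2.1 remote-as 64500
 neighbor 192.0.2.1 prefix-list CUST-ALLOWED in
 neighbor 192.0.2.1 filter-list 10 in
That kind of inbound policy is what keeps a single misbehaving session from dragging half the internet somewhere it shouldn't go.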

So my google.com traffic is getting routed through AS 2914, which is NTT Communications' global IP backbone (NTT America), which happens to have a presence in Centennial, Colorado, and, I guess, major peering with Qwest and AT&T.

1. Call into AT&T, and after a long time on hold, get a tech: 'yeah, our network is [fucked] having problems', 'we have a master ticket number you can have', and 'other ISPs are down too.'

2. Chat with Xphile3 and learn this is a major problem for him too, in a different state.

3. Things get fixed. Now, what changed?
traceroute 72.14.205.101

Type escape sequence to abort.
Tracing the route to qb-in-f101.google.com (72.14.205.101)

1 1
2 cr2.sl9mo.ip.att.net (12.122.142.142) [AS 7018] 16 msec 16 msec 16 msec
3 cr2.cgcil.ip.att.net (12.122.2.21) [AS 7018] 20 msec 16 msec 16 msec
4 ggr2.cgcil.ip.att.net (12.122.132.137) [AS 7018] 16 msec 16 msec 16 msec
5 192.205.33.210 [AS 7018] 16 msec 16 msec 16 msec
6 ae-31-51.ebr1.chicago1.level3.net (4.68.101.30) [AS 3356] 24 msec 28 msec 20 msec
7 ae-2-2.ebr2.newyork2.level3.net (4.69.132.66) [AS 3356] 36 msec 40 msec 36 msec
8 ae-6-6.ebr4.newyork1.level3.net (4.69.141.21) [AS 3356] 48 msec 44 msec 36 msec
9 ae-94-94.csw4.newyork1.level3.net (4.69.134.126) [AS 3356] 52 msec
ae-64-64.csw1.newyork1.level3.net (4.69.134.114) [AS 3356] 36 msec
ae-74-74.csw2.newyork1.level3.net (4.69.134.118) [AS 3356] 48 msec
10 ae-3-89.edge1.newyork1.level3.net (4.68.16.142) [AS 3356] 36 msec
ae-4-99.edge1.newyork1.level3.net (4.68.16.206) [AS 3356] 40 msec
ae-1-69.edge1.newyork1.level3.net (4.68.16.14) [AS 3356] 40 msec
11 google-inc.edge1.newyork1.level3.net (4.71.172.82) [AS 3356] 36 msec
google-inc.edge1.newyork1.level3.net (4.71.172.86) [AS 3356] 40 msec 40 msec
12 209.85.255.68 [AS 15169] 40 msec
72.14.238.232 [AS 15169] 40 msec
209.85.255.68 [AS 15169] 48 msec
13 72.14.233.113 [AS 15169] 52 msec 52 msec
216.239.43.146 [AS 15169] 60 msec
14 66.249.94.90 [AS 15169] 56 msec 56 msec
72.14.236.183 [AS 15169] 60 msec
15 72.14.232.62 [AS 15169] 56 msec 60 msec 68 msec
16 qb-in-f101.google.com (72.14.205.101) [AS 15169] 68 msec 60 msec 56 msec

Ok, we go from AS 7018 (AT&T) to AS 3356 (Level 3). When it was broken, we went:
5 192.205.35.78 [AS 7018] 16 msec 28 msec 28 msec
6 ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.4.112) [AS 2914] 20 msec 20 msec
from 7018 (AT&T) to 2914 (NTT.NET)
When it was fixed we now go:
fixed said:
5 192.205.33.210 [AS 7018] 16 msec 16 msec 16 msec
6 ae-31-51.ebr1.chicago1.level3.net (4.68.101.30) [AS 3356] 24 msec 28 msec 20 msec


Let's take a peek at Level 3's network status page and see if there is anything interesting going on... Holy hell, look at Cogent and Global Crossing! http://www.esupport24.com/netstatus/
[Screenshot: Level 3 network status page]

-------------------------
edit: I am going to keep editing this to add information / analysis. I just noticed that the advertised route from ntt.net was:
my view during outage said:
sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 50236222

Google's allocation is actually a /18, not a /23, so another issue at play was longest-prefix match in BGP: the more-specific /23 wins over the /18 no matter how good the /18's path is (ARIN record below; a couple of show commands for checking this follow it):
arin said:
OrgName: Google Inc.
OrgID: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US

NetRange: 72.14.192.0 - 72.14.255.255
CIDR: 72.14.192.0/18
NetName: GOOGLE
NetHandle: NET-72-14-192-0-1
Parent: NET-72-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.GOOGLE.COM
NameServer: NS2.GOOGLE.COM
NameServer: NS3.GOOGLE.COM
NameServer: NS4.GOOGLE.COM
Comment:
RegDate: 2004-11-10
Updated: 2007-04-10

RTechHandle: ZG39-ARIN
RTechName: Google Inc.
RTechPhone: +1-650-318-0200
RTechEmail: [email protected]

OrgTechHandle: ZG39-ARIN
OrgTechName: Google Inc.
OrgTechPhone: +1-650-318-0200
OrgTechEmail: [email protected]
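The practical upshot of the /23 vs. the /18: forwarding always follows the most specific route, so even where the path for 72.14.192.0/18 was perfectly sane, traffic to 72.14.205.101 followed whatever path the /23 took. If you want to check which prefix your traffic is actually following, these standard IOS show commands will tell you (output omitted here):
Code:
! Which BGP entry matches this destination? Returns the longest match (the /23 during the outage).
sh ip bgp 72.14.205.101
! Which route is actually installed and used for forwarding?
sh ip route 72.14.205.101
sh ip cef 72.14.205.101 detail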
 
Ok, I'll be honest, a lot of that is way the hell over my head. Cool analysis though! My layman's understanding is that somebody basically advertised a funky route?
 
Again, excellent find, Archival. A bit different from what many of us were experiencing, but I think it contributed to the overall problem. One thing to consider: Google might have a direct peering relationship with NTT; if you can get to a looking glass of theirs, you will be able to verify this. I also checked many looking glasses and did not find results similar to yours during the second phase of the "attack".

If Google does in fact have a peering with NTT, and NTT clearly has a peering with AT&T, then NTT became a transit AS for all of the customers under the AT&T umbrella, along with AT&T peers that might have a worse path now. Either way, bad job on their part. With some regex and basic BGP experience this could have been avoided. Remember, kids: ip as-path access-list 20 permit ^$
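For anyone following along at home, here is roughly how that filter gets hung on the upstream sessions so a multihomed edge network can never become transit between its providers; the ASNs and neighbor IPs are placeholders:
Code:
! ^$ matches an empty AS path, i.e. only routes this AS originates itself.
ip as-path access-list 20 permit ^$
!
router bgp 64496
 ! upstream A
 neighbor 203.0.113.1 remote-as 64501
 neighbor 203.0.113.1 filter-list 20 out
 ! upstream B
 neighbor 203.0.113.9 remote-as 64502
 neighbor 203.0.113.9 filter-list 20 out
(Whether a carrier like NTT could run a filter that strict is another story, as discussed further down.)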

Now, on to what myself and others experienced.

Code:
C:\Documents and Settings\marc>tracert www.google.com

Tracing route to www.l.google.com [72.14.213.104]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  172.17.3.100
  2     5 ms     4 ms     4 ms  L100.WASHDC-VFTTP-74.verizon-gni.net [72.66.118.
1]
  3     3 ms     4 ms     4 ms  G3-1-574.WASHDC-LCR-05.verizon-gni.net [130.81.1
05.252]
  4     5 ms     7 ms     4 ms  so-4-2-0-0.LCC1-RES-BB-RTR1-RE1.verizon-gni.net
[130.81.28.144]
  5     8 ms     7 ms     7 ms  0.so-1-2-0.XL3.IAD8.ALTER.NET [152.63.37.117]
  6     6 ms     7 ms     7 ms  0.ge-6-1-0.BR2.IAD8.ALTER.NET [152.63.41.153]
  7     6 ms     7 ms     7 ms  te-11-3-0.edge1.Washington4.level3.net [4.68.63.
169]
  8     8 ms    17 ms    19 ms  vlan79.csw2.Washington1.Level3.net [4.68.17.126]

  9    10 ms     9 ms     9 ms  ae-72-72.ebr2.Washington1.Level3.net [4.69.134.1
49]
 10    27 ms    27 ms    27 ms  ae-2-2.ebr2.Chicago2.Level3.net [4.69.132.69]
 11    26 ms    27 ms    27 ms  ae-1-100.ebr1.Chicago2.Level3.net [4.69.132.113]

 12    55 ms    54 ms    54 ms  ae-3.ebr2.Denver1.Level3.net [4.69.132.61]
 13   100 ms    92 ms    92 ms  ae-2.ebr2.Seattle1.Level3.net [4.69.132.53]
 14   113 ms   119 ms   114 ms  ae-21-52.car1.Seattle1.Level3.net [4.68.105.34]

 15     *        *        *     Request timed out.
 16     *      322 ms     *     209.85.249.32
 17     *        *      360 ms  216.239.46.208
 18     *      331 ms   364 ms  64.233.174.127
 19   333 ms   329 ms   344 ms  209.85.253.14
 20   363 ms     *        *     pv-in-f104.google.com [72.14.213.104]
 21     *      341 ms     *     pv-in-f104.google.com [72.14.213.104]
 22     *      365 ms     *     pv-in-f104.google.com [72.14.213.105]
 23   331 ms   357 ms     *     pv-in-f104.google.com [72.14.213.104]
 24     *        *      363 ms  pv-in-f104.google.com [72.14.213.104]

Trace complete.
A nice capture of major latency / potential routing issues WITHIN the Google network. It looks like Verizon was unaffected by this route injection from ntt.net, as we are getting best paths through L3. Also, notice the last five hops... it looks like a loop, but it never exits that final router.

Here is the trace AFTER the fix:
Code:
C:\Documents and Settings\marc>tracert www.google.com

Tracing route to www.l.google.com [72.14.213.104]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  172.17.3.100
  2     4 ms     4 ms     4 ms  L100.WASHDC-VFTTP-74.verizon-gni.net [72.66.118.
1]
  3     5 ms     4 ms     4 ms  G3-1-674.WASHDC-LCR-06.verizon-gni.net [130.81.1
07.92]
  4     5 ms     4 ms     4 ms  so-4-2-0-0.RES-BB-RTR2.verizon-gni.net [130.81.2
8.146]
  5     5 ms     4 ms     7 ms  0.so-6-1-0.XL4.IAD8.ALTER.NET [152.63.36.237]
  6     6 ms     7 ms     7 ms  0.ge-7-1-0.BR2.IAD8.ALTER.NET [152.63.41.161]
  7     6 ms     7 ms     7 ms  te-11-3-0.edge1.Washington4.level3.net [4.68.63.
169]
  8     8 ms    14 ms    17 ms  vlan69.csw1.Washington1.Level3.net [4.68.17.62]

  9     7 ms     7 ms     7 ms  ae-62-62.ebr2.Washington1.Level3.net [4.69.134.1
45]
 10    26 ms    27 ms    27 ms  ae-2-2.ebr2.Chicago2.Level3.net [4.69.132.69]
 11    28 ms    27 ms    27 ms  ae-1-100.ebr1.Chicago2.Level3.net [4.69.132.113]

 12    55 ms    67 ms    54 ms  ae-3.ebr2.Denver1.Level3.net [4.69.132.61]
 13    97 ms    92 ms    92 ms  ae-2.ebr2.Seattle1.Level3.net [4.69.132.53]
 14    91 ms    92 ms    92 ms  ae-21-52.car1.Seattle1.Level3.net [4.68.105.34]

 15    85 ms   157 ms    87 ms  GOOGLE-INC.car1.Seattle1.Level3.net [4.79.104.74
]
 16    88 ms    89 ms    87 ms  209.85.249.32
 17    98 ms    97 ms   102 ms  216.239.46.208
 18    98 ms    99 ms    99 ms  216.239.48.141
 19   112 ms   104 ms   104 ms  209.85.253.6
 20   100 ms    99 ms    99 ms  pv-in-f104.google.com [72.14.213.104]

Trace complete.
As you can see, latency is down but still in the 100 ms range... a clear indication of the problem still being worked on. Also notice how Google's demarc (GOOGLE-INC.car1.Seattle1.Level3.net) now resolves and responds with semi-decent latency. Anyways... anyone else got anything?
 
I should also add that a lot of websites were down for me this morning, not just google. Google just makes an easy and high profile target to debug with.
 
I should also add that a lot of websites were down for me this morning, not just google. Google just makes an easy and high profile target to debug with.

A lot of websites and forums run stuff from Google; ajax.googleapis.com was hanging at the bottom of the browser trying to load... and didn't... preventing many forums from loading.
 
Good findings man. Unfortunately, I was in vendor meetings for much of my morning, so I missed the routing hiccup. Anyway, here's what I've got.

Remember AS 2914. AS 15169 is Google. AS 7018 is AT&T, where I get my BGP routes from. So why the F is Google getting advertised by ntt.net as a one-AS-hop route via Tokyo to a vast majority of the internet? I have to guess that it is the same deal as the YouTube black-hole problem from last year, in which someone decided to advertise YouTube's prefix as a null route to block it, but their upstream ISP did not have basic filtering of customer-advertised routes in place.
It's either a router bug (an IOS bug triggered by oversized AS-path prepending affected much of the internet not too long ago), or Google peers with NTT, judging from the AS path.
Here's more info about the bug: http://blog.ioshints.info/2009/02/root-cause-analysis-oversized-as-paths.html

edit: I am going to keep editing this to add information / analysis. I just noticed that the advertised route from ntt.net was:

Google's network is actually a /18 not /23 so another issue at play was longest match in BGP.
Yeah, judging from the NLRI, it looks like a simple way to load balance on their prefixes. On further inspection from a few of my external routers...
Code:
#sh ip bgp 72.14.204.0
BGP routing table entry for 72.14.204.0/23, version 31027303
Paths: (2 available, best #2, table Default-IP-Routing-Table)
Multipath: eBGP
  Advertised to update-groups:
     5
  701 3356 15169, (received & used)
    x.x.x.x (metric 976) from x.x.x.x (x.x.x.x)
      Origin IGP, metric 0, localpref 100, valid, internal
  7018 3356 15169, (received & used)
    x.x.x.x from x.x.x.x (x.x.x.x)
      Origin IGP, localpref 100, valid, external, best
sh ip bgp 72.14.192.0
BGP routing table entry for 72.14.192.0/18, version 28750893
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Multipath: eBGP
  Advertised to update-groups:
     5
  7018 15169, (received & used)
    x.x.x.x from x.x.x.x (x.x.x.x)
      Origin IGP, localpref 100, valid, external, best


Code:
#sh ip bgp 72.14.204.0
BGP routing table entry for 72.14.204.0/23, version 105320533
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Multipath: eBGP
  Advertised to update-groups:
     3
  701 3356 15169
    x.x.x.x from x.x.x.x (x.x.x.x)
      Origin IGP, localpref 100, valid, external, best
  7018 3356 15169
    x.x.x.x (metric 976) from x.x.x.x (x.x.x.x)
      Origin IGP, metric 0, localpref 100, valid, internal
#sh ip bgp 72.14.192.0
BGP routing table entry for 72.14.192.0/18, version 95732966
Paths: (2 available, best #2, table Default-IP-Routing-Table)
Multipath: eBGP
  Not advertised to any peer
  701 7018 15169
    x.x.x.x from x.x.x.x (x.x.x.x)
      Origin IGP, localpref 100, valid, external
  7018 15169
    x.x.x.x (metric 976) from x.x.x.x (x.x.x.x)
      Origin IGP, metric 0, localpref 100, valid, internal, best
They only advertise their /18 to AT&T, not the /23, as shown above. The /23 seems to be advertised to AS 3356. The only thing I can think of is that they might have their search engines on the /23 and want to do some simple load balancing across the prefixes (a rough sketch of that kind of selective advertisement is below).
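If that's what they're doing, the mechanics are simple: per-neighbor prefix lists decide who hears only the aggregate and who also hears the more-specific, and the more-specific pulls the traffic. A rough sketch of the idea; the ASNs, neighbor addresses, and policy names are hypothetical, not Google's actual config:
Code:
! Hypothetical origin-AS config: send only the /18 to one upstream,
! send the /23 (plus the /18) to another so it attracts that traffic.
ip prefix-list AGG-ONLY seq 5 permit 72.14.192.0/18
ip prefix-list WITH-MORE-SPECIFIC seq 5 permit 72.14.204.0/23
ip prefix-list WITH-MORE-SPECIFIC seq 10 permit 72.14.192.0/18
!
router bgp 64500
 ! assumes matching routes for both prefixes exist locally
 network 72.14.192.0 mask 255.255.192.0
 network 72.14.204.0 mask 255.255.254.0
 ! upstream that should only carry the aggregate
 neighbor 198.51.100.1 remote-as 64501
 neighbor 198.51.100.1 prefix-list AGG-ONLY out
 ! upstream that also gets the more-specific
 neighbor 198.51.100.5 remote-as 64502
 neighbor 198.51.100.5 prefix-list WITH-MORE-SPECIFIC out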

If Google does in fact have a peering with NTT, and NTT clearly has a peering with AT&T, then NTT became a transit AS for all of the customers under the AT&T umbrella, along with AT&T peers that might have a worse path now. Either way, bad job on their part. With some regex and basic BGP experience this could have been avoided. Remember, kids: ip as-path access-list 20 permit ^$
As you can see below, they look to be a rather large ISP, so they would want to be a transit network for a bunch of their customers; putting that ^$ filter in place locally would be an issue for them. Now, I'd argue that they should prepend US NLRI toward their US neighbors, to retain redundancy while keeping traffic from traversing the world (a prepend sketch follows the route listing below).
Code:
#sh ip bgp regex _2914_
BGP table version is 31039291, local router ID is x.x.x.x
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i3.0.0.0          x.x.x.x           0    100      0 701 2914 9304 80 i
*>                  x.x.x.x                          0 7018 2914 9304 80 i
*> 8.12.24.0/24     x.x.x.x                          0 7018 2914 22886 i
* i                 x.x.x.x           0    100      0 701 2914 22886 i
*> 8.18.144.0/24    x.x.x.x                          0 7018 2914 16509 i
* i8.18.145.0/24    x.x.x.x           0    100      0 701 2914 16509 ?
*>                  x.x.x.x                          0 7018 2914 16509 ?
*>i12.96.21.0/24    x.x.x.x           0    100      0 701 2914 35847 22960 i
* i13.10.0.0/17     x.x.x.x           0    100      0 701 2914 7385 26662 i
*> 15.133.0.0/20    x.x.x.x                          0 7018 2914 8220 2129 i
* i15.248.0.0/24    x.x.x.x           0    100      0 701 2914 7385 71 i
* i24.89.0.0/19     x.x.x.x           0    100      0 701 2914 4922 14291 i
* i24.121.0.0/23    x.x.x.x           0    100      0 701 2914 7385 25994 25994 25994 25994 i
* i24.121.242.0/24  x.x.x.x           0    100      0 701 2914 7385 25994 25994 25994 25994 i
* i24.153.112.0/20  x.x.x.x           0    100      0 701 2914 4922 14291 i
*>i32.96.86.0/24    x.x.x.x           0    100      0 701 2914 12180 18440 i
* i38.100.72.0/24   x.x.x.x           0    100      0 701 2914 46785 i
*>                  x.x.x.x                          0 7018 2914 46785 i
* i38.102.144.0/24  x.x.x.x           0    100      0 701 2914 16991 i
*>                  x.x.x.x                          0 7018 2914 16991 i
*> 38.117.11.0/24   x.x.x.x                          0 7018 2914 36383 i
* i38.118.195.0/24  x.x.x.x           0    100      0 701 2914 30212 i
*>                  x.x.x.x                          0 7018 2914 30212 i
*> 38.118.199.0/24  x.x.x.x                          0 7018 2914 30212 i
* i                 x.x.x.x           0    100      0 701 2914 30212 i
*> 40.252.48.0/24   x.x.x.x                          0 7018 2914 8220 i
* i41.152.0.0/18    x.x.x.x           0    100      0 701 2914 15412 36992 i
* i41.152.0.0/16    x.x.x.x           0    100      0 701 2914 8966 8961 36992 i
many many more...
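Here is the prepend idea from above in config form; a minimal sketch with invented ASNs, addresses, and prefix names, just to show the mechanism:
Code:
! Hypothetical: when advertising these routes to a US peer, prepend our own
! AS a few times so the path stays usable for redundancy but loses the
! AS-path-length comparison against shorter domestic paths.
ip prefix-list US-ROUTES seq 5 permit 198.51.100.0/24
!
route-map PREPEND-TO-US-PEER permit 10
 match ip address prefix-list US-ROUTES
 set as-path prepend 64500 64500 64500
route-map PREPEND-TO-US-PEER permit 20
!
router bgp 64500
 neighbor 192.0.2.1 remote-as 64510
 neighbor 192.0.2.1 route-map PREPEND-TO-US-PEER out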


So, let this be a message to all of you thinking about doing networking -- IT'S COOL! While a lot of this stuff we're discussing is a bit complex, it is worth trying to learn.
 
As you can see below, they look to be a rather large ISP, so they would want to be a transit network for a bunch of their customers. Putting that local filter in place would be an issue. Now, I argue that they should prepend USA NLRI to their USA neighbors to retain redundancy but prevent traversing the world.
Something I noticed after posting (while checking whether they have a direct peering with Google): there are pages upon pages of routes they are sending to AT&T, so yes... that as-path filter could indeed break functionality they rely on. :D
 
As punishment, AT&T should apply a ^2914$ to them. That'll teach 'em.

Those of you who are learning networking, read this:
http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431.shtml

Shortest AS path is (after weight and local preference) basically the top of the decision process. It doesn't care how congested the links are or how many routers sit inside any single AS on the path. Days like today are why BGP should only be run by people who know how to use it. But, from time to time, someone always manages to screw it up and blow up the world.
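One more note for the learners: because weight and local preference sit above AS-path length, an ISP (or any multihomed network) can refuse a short-but-awful path by policy. A minimal, hypothetical example of pinning traffic to one upstream regardless of path length:
Code:
! Hypothetical: prefer everything learned from upstream B. Local preference
! 200 beats the default of 100, and local-pref is compared before AS-path length.
route-map PREFER-UPSTREAM-B permit 10
 set local-preference 200
!
router bgp 64496
 neighbor 203.0.113.9 remote-as 64502
 neighbor 203.0.113.9 route-map PREFER-UPSTREAM-B in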
 
An official update from google: http://googleblog.blogspot.com/2009/05/this-is-your-pilot-speaking-now-about.html

google.com said:
This is your pilot speaking. Now, about that holding pattern...
5/14/2009 12:15:00 PM
Imagine if you were trying to fly from New York to San Francisco, but your plane was routed through an airport in Asia. And a bunch of other planes were sent that way too, so your flight was backed up and your journey took much longer than expected. That's basically what happened to some of our users today for about an hour, starting at 7:48 am Pacific time.

An error in one of our systems caused us to direct some of our web traffic through Asia, which created a traffic jam. As a result, about 14% of our users experienced slow services or even interruptions. We've been working hard to make our services ultrafast and "always on," so it's especially embarrassing when a glitch like this one happens. We're very sorry that it happened, and you can be sure that we'll be working even harder to make sure that a similar problem won't happen again. All planes are back on schedule now.

Posted by Urs Hoelzle, SVP, Operations

I don't buy it completely. They make it sound like it was purely their own routing issue. I wonder if they are just afraid to give everyone on the internet the idea that Google traffic could be disrupted worldwide with one bad routing announcement and an ISP that doesn't filter properly. It could have been Google's mistake, but until I see the technical details, I will hold out.

edit: I may have to hold out indefinitely... I don't think anyone will tell the truth here:
http://blogs.zdnet.com/BTL/?p=18064
article said:
Update: In the stray email department, I was informed that it's an AT&T routing issue. Anything that touches Google via AT&T is down. Trying to confirm that now so take it for what it's worth.

Here's the traceroute pointing to AT&T:
Update 6: CNet News' Shankland has two interesting nuggets. First the Google statement:

"We're aware some users are having trouble accessing some Google services. We're looking into it, and we'll update everyone soon."

Gmail is reportedly all clear. Meanwhile, Keynote is showing packet losses at NTT and Qwest. That fact means it's more than just AT&T at work behind the Google issues.
Update 9: AT&T says via Twitter that it's not responsible for the Google outage.

The telecom giant also issues the following statement:
After receiving speculative reports in the media that Google experienced an outage related to the AT&T network, we looked into the matter. We have not identified any specific problems in our network that could have caused the reported outage.
Update 10: McAfee argues that the Google outage was due to an IPv6 upgrade, reports CNet News' Tom Krazit. Google isn't elaborating in its blog post.
 
Alright. Considering they dumbed it down to a "planes" analogy, I'd take that statement with a grain of salt. I heard that it was more than just Google being affected. Maybe that's because of the dominance of Google Ads on other websites, though?

Google obviously peers via BGP with AS 2914. Maybe they were testing failover but forgot to negotiate prepending or they were supposed to advertise a different network altogether. Since 2914 isn't in the routing table anymore for that NLRI, they're no longer advertising it there at all.

Unfortunately, I doubt we'll ever know the technical details.
 
I decided to take a look at a public looking glass this morning, and look how NTT's BGP announcement for Google changed from yesterday morning:

2914 3356 15169
129.250.0.171 from 129.250.0.171 (129.250.0.79)
Origin IGP, metric 5, localpref 100, valid, external
Community: 2914:420 2914:2000 2914:3000 65504:3356

2914 3356 15169
129.250.0.11 from 129.250.0.11 (129.250.0.51)
Origin IGP, metric 10, localpref 100, valid, external
Community: 2914:420 2914:2000 2914:3000 65504:3356

They are still advertising the network, but importantly they are not originating it. The route has changed from 2914->15169 to 2914->3356->15169, with 3356 being Level 3, which matches my traceroutes once things were fixed yesterday.

Also, the route that AT&T is advertising today (which had been going via NTT when things were broken) is:
7018 3356 15169
12.0.1.63 from 12.0.1.63 (12.0.1.63)
Origin IGP, localpref 100, valid, external
Community: 7018:5000

Now it goes to Level 3, not NTT. I still think NTT has some explaining to do.
 
Now the real question, who cares? Except of course Google...

Those of us who understand how the internet works, how fragile it really is, and wish to see it keep working are the ones that care. It is not just about Google; this underscores how fragile the stability of the internet as a whole really is. It can all go away very quickly if people choose not to care.
 
Dang, I missed the entire thing. I was on site all day mucking with old skool 56k modems and dial-backup.
 
Ok, I found a really neat resource called BGPlay that allows you to replay routing announcements over a given time range.

Here is what happened yesterday (visually). Before google went down:
[BGPlay screenshot: routing before the outage]

While Google was down and routing to Asia:
[BGPlay screenshot: routing during the outage, with paths pulled toward NTT]

After things were fixed:
[BGPlay screenshot: routing after the fix]
 
I decided to take a look at a public looking glass this morning, and look how NTT's BGP announcement for Google changed from yesterday morning:
They are still advertising the network, but importantly they are not originating it. The route has changed from 2914->15169 to 2914->3356->15169, with 3356 being Level 3, which matches my traceroutes once things were fixed yesterday.

If they were originating it, you would have seen only 2914 in the AS path, no? Google was always originating it; what changed is who Google was sending the update to, or perhaps peering with.

Yesterday, at one point, they sent it to NTT, hence the [2914 15169] path. Now they're no longer sending it to NTT (or it's being filtered), but they are sending it to L3. NTT peers directly with L3, hence the [2914 3356 15169] path. The obvious reason we're not still having issues is that other US ISPs peer directly with L3, so there's no need to go through Asia anymore. Although, I am also seeing some prefixes going from AT&T directly to Google, so it appears they're making a bunch of changes.

Now, I heard Google was testing IPv6 and they had some issues with their new AS being identified. Maybe when their main BGP sessions go down, that's when they advertise to their backup ISP NTT.

But who knows...

(BTW, sweet topology diagrams haha)
 
If they were originating it, you would have only seen 2914 in the AS path, no? Google was always originating it, what changed is who Google was sending the update to, or perhaps peering with.

Yesterday, at one point, they sent it to NTT, hence the [2914 15169] path. Now, they're no longer sending it to NTT (or it's filtered), but they're sending it to L3. NTT peers directly with L3, hence the [2914 3356 15169] path. The obvious reason why we're not still having issues is because other US ISPs are directly peered with L3 so there's no need to go through Asia anymore. Although, I am also seeing some prefixes going from AT&T directly to Google, so it appears like they're making a bunch of changes.

Now, I heard Google was testing IPv6 and they had some issues with their new AS being identified. Maybe when their main BGP sessions go down, that's when they advertise to their backup ISP NTT.

But who knows...

(BTW, sweet topology diagrams haha)

The only question I would ask is why use NTT as a failover or peer at all; there are plenty of other global tier-1 carriers with better-behaved routing and large domestic networks.

I don't think we'll ever find the real reason. Whoever made the screw-up is more than likely standing in the unemployment line under some gag order.
 
Now the real question, who cares? Except of course Google...

A lot of us that support end users care. My cell phone was ringing off the hook with clients who thought their internet was down... because Google.com was their home page, or they immediately went to search for something on Google. Nothing budged, so they assumed the internet was down... so call The Catman and rack up his cell phone minutes! Hell, I was behind walls 'n up ladders yesterday fixing some ethernet wiring that some monkey at their phone company dorked up by not installing it right. I didn't have time for 113 phone calls while doing that.
 
If they were originating it, you would have only seen 2914 in the AS path, no? Google was always originating it, what changed is who Google was sending the update to, or perhaps peering with.

(BTW, sweet topology diagrams haha)

Great analysis, as always... I was mixing up my verbs and not being technically correct. Yes, Google (we assume, since they do have a valid peering with NTT) was originating the announcement (assuming it was not rogue). It would seem odd to use Asia as a backup route for the entire US block of addresses, which, as we saw yesterday, didn't even have the capacity to work correctly if it was indeed a valid backup route.

Perhaps in the midst of working on IPv6 they cut their peering with Level 3, and that's why NTT appeared to be the next best route; however, in the BGPlay graphs, Level 3 still shows up as a peer during the outage, and as the best path for a number of other ISPs. Still feels odd.
 
Now the real question, who cares? Except of course Google...
...

Anyway, thanks for posting this and all the detail. As soon as I get to the BGP section in BSCI I'll come back and read this again so it doesn't read like a foreign language. I get the gist of it just from your discussion, but it further reinforces the fact that I have a long way to go. Exciting, actually.
 
A lot of us that support end users. My cell phone was ringing off the hook by clients who thought their internet was down....because Google.com was their home page, or they immediately went to search for something on Google. Nothing budged, they assumed internet down...so call The Catman and rack up his cell phone minutes! Hell I was behind walls 'n up ladders yesterday fixing some ethernet wiring that some monkey at their phone company dorked up by not installing it right. I didn't have time for 113 phone calls while doing that.

No, I realize that; I said the same thing in another thread about this. Our receptionist called saying her internet was down, and I wasted 5 minutes troubleshooting the problem because, of course, I was troubleshooting by pinging Google.

I just mean: who cares why it was down? There's no reason to sit here wasting time troubleshooting a problem that isn't yours, unless of course you have nothing better to do...
 
Just because the mistake and the problem weren't yours doesn't mean you can't be intrigued and actually learn from someone else's mistake. Even if the problem is not yours, some managers still want to know why. And I'm sure most of us don't say no to the people who sign the checks.
 
No I realize that, I said the same thing in another thread about this, our receptionist called saying her internet was down and I wasted 5 minutes troubleshooting the problem because, of course, I was troubleshooting by pinging google.

I just mean who cares why it was down, no reason to sit here wasting time troubleshooting a problem that isn't yours, unless of course you have nothing better to do...

Guys like you seem like you're only around to get a paycheck. The other people in this thread are actually fascinated by technology and have a thirst for knowledge, even for things that don't apply to them or their current positions. In my experience, people like us are great at our jobs and advance more quickly because we're not just in this to get paid; we are actually interested in it and constantly striving to learn more.

So, in other words, we're awesome and you aren't.:D
 
Now the real question, who cares? Except of course Google...
:rolleyes: The people that commented here.


Dude, very nice, Archival... it's been a few years since I used BGPlay. I'm impressed.
The problem looks very clear to me now; let me know what you guys think.

I think it's a combination of issues here. I believe that Google COULD in fact have attempted an IPv6 pilot, and which region of the world has the largest IPv6 footprint? Obviously Asia. Japan has one of the oldest IPv6 deployments (going back to the 6bone), so this seems like a plausible explanation for what happened and why Google mysteriously peered with NTT (which they didn't appear to have a neighborship with previously).

Now, on to BGP. I didn't have the same problems that Archival had, but only because my outbound traffic path was through VZ (who also peers with L3 and has them as a best path); check out my traces on the first page. The RETURN traffic was most likely asymmetric (possibly going through Japan), along with that apparent loop inside the Google network.

What most likely happened for all of the AT&T customers was that NTT inadvertently became a transit AS for all of AT&T for the Google prefixes. Look at those BGPlay diagrams again: from AT&T's perspective (and their customers'), Google is only 2 AS hops away through NTT (which in reality is probably ~20 actual router hops and halfway across the world). Now, this is where things get fishy... understand the BGP path-decision process: both paths, through L3 and through NTT, have the same AS-path length from AT&T's perspective, 2 hops to the prefix (through NTT and into Google). NTT could have sent out other attributes to influence AT&T to come through their AS to get to Google, OR Google didn't set the right communities, so NTT went ahead and propagated the route to its other AS peers. Whatever they did, it's obvious that NTT became a transit for the Google network, which they SHOULD NOT HAVE. It could have been Google's fault, but I solely blame that ISP.
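On the communities angle: if the NTT session was only ever meant as a private or backup link, the announcements could have been tagged so NTT would use them itself but not re-advertise them to its other peers. Here is a minimal sketch of that mechanism using the well-known no-export community; the ASNs and addresses are invented, and nobody outside Google/NTT knows what was actually configured. In practice you would more likely use the provider's own communities for finer-grained control, but the idea is the same:
Code:
! Hypothetical: tag everything sent to this peer with no-export so the peer
! keeps the routes for itself and does not pass them to its own eBGP peers.
route-map BACKUP-PEER-OUT permit 10
 set community no-export
!
router bgp 64500
 neighbor 192.0.2.10 remote-as 64510
 neighbor 192.0.2.10 send-community
 neighbor 192.0.2.10 route-map BACKUP-PEER-OUT out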
 
I just mean who cares why it was down, no reason to sit here wasting time troubleshooting a problem that isn't yours, unless of course you have nothing better to do...


There is a saying about wise people learning from the mistakes of others...
 
I think its a culmination of issues here. I believe that google COULD infact have attempted an IPv6 pilot, and which region of the world has the largest footprint? Obviously Asia. Japan has the oldest 6bone right now

Ahh, now that makes sense. Finally I think the pieces are falling into place. I keep forgetting that IPv6 is trying to take over the world, since everyone in the US has done an outstanding job of dragging their feet, kicking and screaming, on implementing v6. I have this notion that v6 is still more 'academic' than real life. Time to set up a dual stack and force myself to start using it. :cool:
 
No I realize that, I said the same thing in another thread about this, our receptionist called saying her internet was down and I wasted 5 minutes troubleshooting the problem because, of course, I was troubleshooting by pinging google.

I just mean who cares why it was down, no reason to sit here wasting time troubleshooting a problem that isn't yours, unless of course you have nothing better to do...

Seriously... I wouldn't expect an analyst to understand; you're out of your league here. Let the engineers care, I guess.
 
Thanks to those of you who took the time to post the in-depth information and excellent analysis. I'm one of those people who will likely never touch BGP at that level, but what happens at those levels affects everyone who relies on the Internet to do their job. I knew just enough about what was being discussed to follow along, and I appreciate the opportunity.

Thanks!
 
Excuse me? What does my job title have to do with it? I understood enough to follow, and I'm a techie just like you guys; I have just as much interest in technology and networking as anyone else on here. I'm not just in IT for the paycheck, or whatever was said. I just don't see myself wasting time troubleshooting someone else's problem that they've already solved; you're not accomplishing anything.
 
In terms of solving the issue, no, we're not accomplishing anything. Google is working again regardless of what was said here.

In terms of finding out what the real issue was, why things failed, what changed, better design strategies to prevent this in the future... that's where this thread helps. If you want to be ignorant, that's fine by us, but believe it or not everyone here has learned a lot from this. I have no doubt that learning from this experience will help most of us in the future.

Those who fail to learn history are doomed to repeat it. I just hope you're never in the position to make this mistake.
 
Alright, so all that flew straight over my head... anyone mind simplifying what really happened?

I've never dealt with BGP so I've got no idea how this all works.
 
Quick summary: BGP is the routing protocol the core of the internet uses. It works because every BGP router carrying full tables knows a path (usually several) to every network on the internet; the router then picks the best path to send data out. (If you only have one connection to the internet, you don't need BGP, because you only have one choice; BGP is used by ISPs and other organizations with multiple Internet connections to, usually, multiple providers.)

Somebody messed up something at Google, and the "best path" for some networks suddenly went via Japan; all the data funneling that way made accessing Google slow (or impossible) for some people.
 
I'll add a little bit to Fint's overview. Between his post and my post, it should be enough to make you understand the thread.

Unlike traditional routing protocols that compute routes based on each "router hop", BGP computes routes (called paths in BGP) based on each "AS hop".

What's an AS? It's an Autonomous System. Essentially, BGP doesn't care how you route within an AS; it's only concerned with which networks and which other ASes you can reach. Each BGP route carries an "AS path", which lists the ASes the announcement has passed through (and thus, roughly, the path your traffic will take in reverse). BGP sends updates to its configured neighbors as NLRIs (Network Layer Reachability Information). For the Google example in this thread, [7018 2914 15169] means traffic goes through the AT&T AS (7018), then NTT's AS (2914), then the Google AS (15169) to reach Google.

You would think BGP would be smart enough to avoid high-latency detours like Japan, but that's the downside of not caring about what happens inside each AS: it leans on the AS path way too much. That's why you get interviewed before running BGP -- you need to know what you're doing, because you affect the world. Then again, there's no other practical way to do it when there are 288,000+ BGP routes today. Just imagine if every link flap inside your AS had to propagate out to the entire Internet... For the same reason, BGP converges slowly and commonly dampens (temporarily suppresses) routes that flap frequently.
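Side note on that dampening, since you can see it in action in the looking-glass output earlier in the thread ("Dampinfo: penalty 507, flapped 3 times..."): it's just a per-router BGP knob. A minimal example of enabling it, with the usual IOS default parameters written out explicitly (shown for clarity, not as a recommendation):
Code:
router bgp 64496
 ! half-life 15 min, reuse threshold 750, suppress threshold 2000,
 ! maximum suppress time 60 min (the common IOS defaults)
 bgp dampening 15 750 2000 60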

More info on the BGP decision process (there can never be a tie):
http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431.shtml
 