archivalbackup
First - Google had major outages. This is still very new, so I do not know how widespread it was, but I know it affected a number of states and seemed to primarily affect AT&T's and Qwest's networks (unfortunately our two ISPs at work) but not McLeod.
I was fortunate to capture a bunch of debugs while tracking down the problem, and I wanted to share them for everyone to discuss. I believe the problem was that an ISP (ntt.net) did not filter a BGP advertisement from a customer, who decided to advertise a best route for Google, which several major ISPs picked up. Below are my logs and brief analysis.
0. My gmail stops working. Switch machines, not working. Check the router:
Ah crap! See which ISP it is coming from, and trace it:
#ping gmail.google.com rep 100
Translating "gmail.google.com"...domain server ( ) [OK]
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 72.14.205.101, timeout is 2 seconds:
...!.!.....
Success rate is 18 percent (2/11), round-trip min/avg/max = 420/432/444 ms
Ok WTF! Tokyo?!?!?! That's fucked. Kill our AT&T connection and fail over to Qwest, still having problems. Try from home - Google is fine. Try from my iPhone, Google is down (OK, calmed down a bit now that I know all of AT&T's network is jacked getting to Google).
sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 50236222
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Not advertised to any peer
7018 2914 15169, (received & used)
12.86.168.x from 12.86.168.x (12.122.83.148)
Origin IGP, localpref 100, valid, external, best
-----------------------
#traceroute 72.14.205.101
Type escape sequence to abort.
Tracing the route to www3.l.google.com (72.14.205.101)
1
2 cr2.sl9mo.ip.att.net (12.122.142.142) [AS 7018] 20 msec 20 msec 16 msec
3 cr2.cgcil.ip.att.net (12.122.2.21) [AS 7018] 56 msec 16 msec 16 msec
4 ggr6.cgcil.ip.att.net (12.122.132.185) [AS 7018] 20 msec 16 msec 20 msec
5 192.205.35.78 [AS 7018] 16 msec 28 msec 28 msec
6 ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.4.112) [AS 2914] 20 msec 20 msec
ae-1.r21.chcgil09.us.bb.gin.ntt.net (129.250.3.8) [AS 2914] 20 msec
7 p64-7-0-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.5.20) [AS 2914] 72 msec
p64-2-1-0.r20.sttlwa01.us.bb.gin.ntt.net (129.250.5.28) [AS 2914] 68 msec 64 msec
8 as-2.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.2.35) [AS 2914] 188 msec
as-3.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.4.190) [AS 2914] 184 msec
as-2.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.2.35) [AS 2914] 192 msec
9 xe-2-0-0.a20.tokyjp01.jp.ra.gin.ntt.net (61.213.162.110) [AS 2914] 184 msec 160 msec
xe-1-0-0.a21.tokyjp01.jp.ra.gin.ntt.net (61.213.162.230) [AS 2914] 156 msec
10 xe-2-1.a17.tokyjp01.jp.ra.gin.ntt.net (61.213.169.70) [AS 2914] 180 msec 192 msec 172 msec
11 * * *
12 209.85.241.90 [AS 15169] 288 msec
My home router is fine
2 216-43-69-9.ip.mcleodusa.net (216.43.69.9) 60 msec 64 msec 68 msec
3 CHCGILWUH53JC04-GE3-2-0-0.mcleodusa.net (64.198.100.206) 100 msec
209-252-156-2.ip.mcleodusa.net (209.252.156.2) 76 msec 76 msec
4 core1-2-2-0.ord.net.google.com (206.223.119.21) 76 msec 80 msec 72 msec
5 216.239.48.154 76 msec 76 msec
209.85.250.237 104 msec
6 209.85.250.110 108 msec 108 msec 112 msec
7 66.249.94.92 108 msec 124 msec 112 msec
8 72.14.236.142 120 msec 116 msec 116 msec
9 www3.l.google.com (72.14.205.101) 92 msec 96 msec 92 msec
Call an ISP down the street that I know, and they are not having any problems out via McLeod or their other ISP. Check the Internet Health Report: problems with AT&T to Internap and Internap to Cogent (remember Cogent), with packet losses close to 4%.
Remember AS 2914. AS 15169 is Google. AS 7018 is AT&T, where I get my BGP route from. So why the F is Google getting advertised by NTT.net as a 1-hop route via Tokyo to a vast majority of the internet? I have to guess that it is the same deal as the YouTube black hole problem that happened a few months back, in which some ass decided to advertise YouTube as a null route to block it, but their upstream ISP did not have basic blocking or filtering of customer-advertised routes in place.
Ok, checking looking glasses ... most of which I cannot reach, but I get into oregon-ix.net (an .edu), and later I parse the output down to this important tidbit:
route-views.oregon-ix.net>sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 2029656
Paths: (34 available, best #25, table Default-IP-Routing-Table)
Not advertised to any peer
3333 286 15169
193.0.0.56 from 193.0.0.56 (193.0.0.56)
Origin IGP, localpref 100, valid, external
Community: 286:85 286:800 286:3031 286:4001
2914 15169
129.250.0.171 from 129.250.0.171 (129.250.0.79)
Origin IGP, metric 336, localpref 100, valid, external
Community: 2914:410 2914:2401 2914:3400
Dampinfo: penalty 507, flapped 3 times in 00:24:37
2914 15169
129.250.0.11 from 129.250.0.11 (129.250.0.51)
Origin IGP, metric 261, localpref 100, valid, external
Community: 2914:410 2914:2401 2914:3400
Dampinfo: penalty 511, flapped 3 times in 00:24:47
7018 2914 15169
12.0.1.63 from 12.0.1.63 (12.0.1.63)
Origin IGP, localpref 100, valid, external
Community: 7018:5000
Dampinfo: penalty 502, flapped 3 times in 00:24:48
So my google.com traffic is getting routed through AS 2914, which is some Japanese ISP that happens to have a presence in Centennial, Colorado and, I guess, major peering with Qwest and AT&T.
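For anyone who wants to slice looking glass output the same way, here is a minimal Python sketch that takes the AS paths pasted above and picks out the ones that transit AS 2914. The helper names (`parse_path`, `origin_as`, `paths_via`, `pretty`) are made up for illustration, and the AS names are just the ones used in this post. Keep in mind an AS path only shows what was announced, not whether the announcement was legitimate.

```python
# AS paths copied from the looking glass output above.
PATHS = ["3333 286 15169", "2914 15169", "7018 2914 15169"]

# Names as used in this post; anything else falls back to "AS<number>".
AS_NAMES = {7018: "AT&T", 2914: "NTT", 15169: "Google"}


def parse_path(as_path):
    """Split a space-separated AS path into a list of integers."""
    return [int(asn) for asn in as_path.split()]


def origin_as(as_path):
    """The origin AS is the last (rightmost) entry in the path."""
    return parse_path(as_path)[-1]


def paths_via(paths, transit_asn):
    """Return the paths that cross transit_asn before reaching the origin."""
    return [p for p in paths if transit_asn in parse_path(p)[:-1]]


def pretty(as_path):
    """Render a path with names where we know them."""
    return " -> ".join(AS_NAMES.get(a, f"AS{a}") for a in parse_path(as_path))


if __name__ == "__main__":
    for p in paths_via(PATHS, 2914):
        print(pretty(p), "| origin AS:", origin_as(p))
```

Two of the three paths quoted here go through 2914, which matches what the traceroute showed.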
1. Call into AT&T and, after a long time on hold, get a tech who says 'yeah, our network is [fucked] having problems, and we have a master ticket number you can have, but other ISPs are down too.'
2. Chat with Xphile3 (different state) and learn this is a major problem.
3. Things get fixed. Now, what changed:
traceroute 72.14.205.101
Type escape sequence to abort.
Tracing the route to qb-in-f101.google.com (72.14.205.101)
1 1
2 cr2.sl9mo.ip.att.net (12.122.142.142) [AS 7018] 16 msec 16 msec 16 msec
3 cr2.cgcil.ip.att.net (12.122.2.21) [AS 7018] 20 msec 16 msec 16 msec
4 ggr2.cgcil.ip.att.net (12.122.132.137) [AS 7018] 16 msec 16 msec 16 msec
5 192.205.33.210 [AS 7018] 16 msec 16 msec 16 msec
6 ae-31-51.ebr1.chicago1.level3.net (4.68.101.30) [AS 3356] 24 msec 28 msec 20 msec
7 ae-2-2.ebr2.newyork2.level3.net (4.69.132.66) [AS 3356] 36 msec 40 msec 36 msec
8 ae-6-6.ebr4.newyork1.level3.net (4.69.141.21) [AS 3356] 48 msec 44 msec 36 msec
9 ae-94-94.csw4.newyork1.level3.net (4.69.134.126) [AS 3356] 52 msec
ae-64-64.csw1.newyork1.level3.net (4.69.134.114) [AS 3356] 36 msec
ae-74-74.csw2.newyork1.level3.net (4.69.134.118) [AS 3356] 48 msec
10 ae-3-89.edge1.newyork1.level3.net (4.68.16.142) [AS 3356] 36 msec
ae-4-99.edge1.newyork1.level3.net (4.68.16.206) [AS 3356] 40 msec
ae-1-69.edge1.newyork1.level3.net (4.68.16.14) [AS 3356] 40 msec
11 google-inc.edge1.newyork1.level3.net (4.71.172.82) [AS 3356] 36 msec
google-inc.edge1.newyork1.level3.net (4.71.172.86) [AS 3356] 40 msec 40 msec
12 209.85.255.68 [AS 15169] 40 msec
72.14.238.232 [AS 15169] 40 msec
209.85.255.68 [AS 15169] 48 msec
13 72.14.233.113 [AS 15169] 52 msec 52 msec
216.239.43.146 [AS 15169] 60 msec
14 66.249.94.90 [AS 15169] 56 msec 56 msec
72.14.236.183 [AS 15169] 60 msec
15 72.14.232.62 [AS 15169] 56 msec 60 msec 68 msec
16 qb-in-f101.google.com (72.14.205.101) [AS 15169] 68 msec 60 msec 56 msec
Ok, we now go from AS 7018 (AT&T) to AS 3356 (Level 3). When it was broken we went from 7018 (AT&T) to 2914 (NTT.NET):
5 192.205.35.78 [AS 7018] 16 msec 28 msec 28 msec
6 ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.4.112) [AS 2914] 20 msec 20 msec
When it was fixed we now go:
5 192.205.33.210 [AS 7018] 16 msec 16 msec 16 msec
6 ae-31-51.ebr1.chicago1.level3.net (4.68.101.30) [AS 3356] 24 msec 28 msec 20 msec
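The same before/after comparison can be pulled out of the traceroute text automatically. This is a rough sketch, assuming the Cisco-style `[AS nnnn]` tags shown in the output above; `as_level_path` is a hypothetical helper name, and the sample lines are copied from the two traces.

```python
import re

# Matches the "[AS 7018]" tag printed on each hop in the traces above.
AS_TAG = re.compile(r"\[AS (\d+)\]")


def as_level_path(traceroute_lines):
    """Collapse per-hop AS tags into the ordered list of ASes crossed."""
    path = []
    for line in traceroute_lines:
        for match in AS_TAG.findall(line):
            asn = int(match)
            # Only record an AS when it differs from the previous hop's AS.
            if not path or path[-1] != asn:
                path.append(asn)
    return path


BROKEN = [
    "5 192.205.35.78 [AS 7018] 16 msec 28 msec 28 msec",
    "6 ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.4.112) [AS 2914] 20 msec 20 msec",
]
FIXED = [
    "5 192.205.33.210 [AS 7018] 16 msec 16 msec 16 msec",
    "6 ae-31-51.ebr1.chicago1.level3.net (4.68.101.30) [AS 3356] 24 msec 28 msec 20 msec",
]

if __name__ == "__main__":
    print("broken:", as_level_path(BROKEN))
    print("fixed: ", as_level_path(FIXED))
```

Feeding it a whole trace instead of two hops gives the full AS-level path, which is easier to eyeball than sixteen lines of hostnames.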
Let's take a peek at Level 3's network status page and see if there is anything interesting going on ... Holy hell, look at Cogent and Global Crossing! http://www.esupport24.com/netstatus/
-------------------------
edit: I am going to keep editing this to add information / analysis. I just noticed that the advertised route from ntt.net was:
my view during outage:
sh ip bgp 72.14.205.101
BGP routing table entry for 72.14.204.0/23, version 50236222
Google's network is actually a /18, not a /23, so another issue at play was longest match in BGP (the more specific /23 wins over the covering /18, whatever the AS path looks like):
arin:
OrgName: Google Inc.
OrgID: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US
NetRange: 72.14.192.0 - 72.14.255.255
CIDR: 72.14.192.0/18
NetName: GOOGLE
NetHandle: NET-72-14-192-0-1
Parent: NET-72-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.GOOGLE.COM
NameServer: NS2.GOOGLE.COM
NameServer: NS3.GOOGLE.COM
NameServer: NS4.GOOGLE.COM
Comment:
RegDate: 2004-11-10
Updated: 2007-04-10
RTechHandle: ZG39-ARIN
RTechName: Google Inc.
RTechPhone: +1-650-318-0200
RTechEmail: [email protected]
OrgTechHandle: ZG39-ARIN
OrgTechName: Google Inc.
OrgTechPhone: +1-650-318-0200
OrgTechEmail: [email protected]
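To make the longest match point concrete, here is a small Python sketch using the standard `ipaddress` module. It assumes a routing table holding both the /18 from the ARIN record and the /23 seen during the outage; `longest_match` is an illustrative helper, not how any particular router implements its lookup.

```python
import ipaddress


def longest_match(dest, prefixes):
    """Return the most specific prefix that contains dest, or None."""
    addr = ipaddress.ip_address(dest)
    networks = [ipaddress.ip_network(p) for p in prefixes]
    matching = [n for n in networks if addr in n]
    # The prefix with the most network bits (largest prefixlen) wins.
    return max(matching, key=lambda n: n.prefixlen) if matching else None


# The /18 is from the ARIN record, the /23 from the "sh ip bgp" output.
TABLE = ["72.14.192.0/18", "72.14.204.0/23"]

if __name__ == "__main__":
    print(longest_match("72.14.205.101", TABLE))  # 72.14.204.0/23
    print(longest_match("72.14.200.1", TABLE))    # 72.14.192.0/18
```

So traffic for 72.14.205.101 follows whoever is announcing the /23, even if a perfectly good path for the /18 is sitting in the table.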