Malfunctioning Networking Card Led to Huge CenturyLink Outage

AlphaAtlas

According to a recent report from GeekWire, a single faulty network management card was behind a two-day, "widespread" outage in CenturyLink’s cloud network. On Twitter, Brian Krebs posted a screenshot of a notice that was allegedly sent out by CenturyLink, which blames a single card for "propagating invalid frame packets across devices." The incident reportedly impacted 911 services in several states, among other things. The FCC has opened an investigation into the outage, while affected states are considering investigations of their own.

CenturyLink representatives did not immediately respond to a request to verify the notice. By the standards of modern cloud service providers, a two-day outage is an eternity, and it’s not clear how a single piece of equipment could cause an outage of such magnitude given the layers of redundancy that cloud providers build into their systems. An FCC investigation might turn up some answers, unless CenturyLink is willing to publish a more detailed post-mortem first, something that is becoming a standard part of incident response.
 
Single point of failure? Were the latest firmware and drivers installed? Why was this not seen in QA/testing? This sort of behavior is seen in testing (it's what I currently do), and I doubt this was the actual issue. I hope to see a detailed analysis so this does not happen in the future.
 
Not to defend them, but a hard failure is usually easy to troubleshoot.
It's problems like a network card corrupting or dropping random packets that can be a real pain to find.
 
Not to defend them, but a hard failure is usually easy to troubleshoot.
It's problems like a network card corrupting or dropping random packets that can be a real pain to find.
This.
I've also seen properly redundant system designs go down to something that *shouldn't* be a SPoF. It had redundant hardware / links, and still went down like it was a SPoF.

Should, Would, and Do are 3 very different things.
 
Wild-ass speculation, but it almost sounds like whatever gizmo was supposed to detect and isolate bad packets and switch over to the good traffic instead picked the bad packets as 'good' and dropped the good ones. Hopefully the final incident report will fill in the missing details.
 
LOL, CenturyLink techs. Good one. Someone needs to learn how to use Wireshark.

Using Wireshark to see red packets is easy. Getting the tap in the right spot to catch bad frames is hard, especially at large providers where the engineers do not have physical access and have to work with thumbsy DC techs. I have a strong feeling they aren't sharing the whole story, because a single card causing this outage doesn't add up unless their infrastructure is a complete joke (possible).
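
As a rough illustration of what fishing for bad frames off a tap can look like, here is a minimal Python/scapy sketch that tallies suspicious frames per source MAC; the interface name and the heuristics are placeholders, not anything from CenturyLink's notice.

```python
# A minimal sketch (not CenturyLink's tooling): sniff a tap/SPAN port and
# tally "suspicious" frames per source MAC. The interface name "tap0" and
# the heuristics are placeholders.
from collections import Counter
from scapy.all import sniff, Ether, Raw

suspects = Counter()

def inspect(pkt):
    if Ether not in pkt:
        return
    payload = pkt[Ether].payload
    # Frames scapy can only treat as raw bytes, or runts shorter than the
    # 60-byte Ethernet minimum, are worth a closer look.
    if isinstance(payload, Raw) or len(pkt) < 60:
        suspects[pkt[Ether].src] += 1

# Capture for five minutes without keeping packets in memory.
sniff(iface="tap0", prn=inspect, store=False, timeout=300)

for mac, count in suspects.most_common(10):
    print(f"{mac}: {count} suspicious frames")
```

As the post above notes, the hard part is placement: frames with a bad FCS are usually dropped by the capturing NIC before software ever sees them, so the tap has to sit where the damage is still visible.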
 
Death by SNMP?

I'm intrigued as to what management server this faulty NIC was in. Unless they are running OpenFlow, I can't see how this could break anything.

It probably wasn't in any server; it was probably a NIC blade from a large switch like the one below. You can swap out the different cards for your needs: fiber, copper, etc.


[attached image: line card from a modular chassis switch]
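
On the "Death by SNMP?" quip above: the unglamorous way a card spewing garbage usually surfaces is in interface error counters polled over SNMP. Here is a minimal sketch using the classic pysnmp synchronous hlapi; the host, community string, and interface index are placeholders.

```python
# A minimal sketch, assuming the classic pysnmp synchronous hlapi.
# Host address, community string, and interface index are placeholders.
import time
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

def if_in_errors(host, community, if_index):
    """Return the current IF-MIB::ifInErrors value for one interface."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),                 # SNMPv2c
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifInErrors', if_index)),
    ))
    if error_indication or error_status:
        raise RuntimeError(str(error_indication or error_status))
    return int(var_binds[0][1])

# Poll once a minute and flag any growth in the error counter.
last = if_in_errors('192.0.2.10', 'public', 3)
while True:
    time.sleep(60)
    current = if_in_errors('192.0.2.10', 'public', 3)
    if current > last:
        print(f"ifInErrors grew by {current - last} in the last minute")
    last = current
```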
 
From what I understand, a lot of the ISPs use their own home grown monitoring solutions. Sounds like a major SRE team fail.
 
Not to defend them, but a hard failure is usually easy to trouble shoot.
It's problems like a network card corrupting or dropping random packets that can a real pain to find.

Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.
 
For backbone-level, carrier-class gear (Infinera, Ciena, "some" Cisco), this will not be a mere "network card" like people are mentioning. Networking for an ISP vs. an enterprise is pretty different once you reach a certain level of bandwidth.
 
This.
I've also seen properly redundant system designs go down to something that *shouldn't* be a SPoF. It had redundant hardware / links, and still went down like it was a SPoF.

Should, Would, and Do are 3 very different things.

Yup. If I had a dollar for every time our analysts said "failover didn't fail over" I wouldn't be rich, but I'd have a nice steak dinner....
 
From what I understand, a lot of the ISPs use their own home grown monitoring solutions. Sounds like a major SRE team fail.

This is not correct. It's more like every day is a struggle to find a single pane of glass that supports every type of layer 2/3 and TDM configuration. The truth of the matter is that they all use a multitude of NMS systems based on the vendor of the gear they use. I support at least five different vendor-specific systems at the ISP I am with.
 
There's more containerizing, contextualizing, and virtualizing being used to create faux redundancy than people realize. I've seen networks brought down when two separate data planes are served by a single control plane and that control plane has an issue, or one physical device is virtualized into several virtual nodes and the physical device or the core OS has an issue.

I'll bet it comes out that multiple individual nodes in their network served the data, control, and management planes, and that this wasn't a physically redundant network.
 
It was caused by that spy chip from supermicro if anyone still remembers
 
Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.

It's more that some people just know CenturyLink. They are lacking techs in many areas, won't fix known issues...

Yes, running a nationwide network is hard. Even more so when you refuse to fix issues or don't have the manpower to, like you stated. Offices around me have trouble and they ignore it for days or weeks because they don't have the manpower or the willingness to resolve the issue. I had one of their CO techs working for me for a while; he was impressed that we could actually get work done here vs. how it was during his years at CenturyLink.

I'm pretty certain you won't find HP switches in any carrier networks.

Well, the RFO stated "management NIC," and there is no such thing in a carrier-grade networking device unless they are referring to the console/out-of-band management port... The most likely scenario is that the explanation got "lost in translation" between the NOC and the senior manager producing the RFO. They probably talked about a Cisco Supervisor, Nokia CPM, or Juniper RE. Once one of those goes wrong, the most spectacular outages occur, especially if they only have a partial failure.

I don't think they actually meant that CenturyLink would have an HP; he was just using that as an example.
 
It probably wasn't in any server; it was probably a NIC blade from a large switch like the one below. You can swap out the different cards for your needs: fiber, copper, etc.


*snips*

Agreed, long-time MSO engineer here. This sounds like it came out of a faulty route processor card. Shit happens. Even in redundant systems, if one card doesn't send proper notifies due to some sort of fault, then the network can't kill it and re-establish. I've personally seen an ASR flip out and kill a network. Somebody literally has to get to the hub/colo and power off the system. Sometimes the network slows down so badly you can't even get to a nearby system and block it. It's even worse if a dumbass MSO puts their OOB management network on in-band routing.
 
Single point of failure? Were the latest firmware and drivers installed? Why was this not seen in QA/testing? This sort of behavior is seen in testing (it's what I currently do), and I doubt this was the actual issue. I hope to see a detailed analysis so this does not happen in the future.

No, a single network card taking down a whole bunch of things because it is flooding everything with bad packets would not be considered a single point of failure.

I have personally seen the same exact type of thing happen twice.

In both instances a network card on a desktop computer malfunctioned and was flooding everything in the entire office with bad packets.

The first time it took me a couple hours to figure out what was going on. The second time it took me about 30-45 minutes.

When that exact problem happens it can be really frustrating to figure out what is going on.
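
For anyone wondering how to narrow that down faster: one crude trick is to count frames per source MAC on a mirror port for a few seconds and see which host dwarfs everyone else. A minimal, Linux-only Python sketch (AF_PACKET raw socket; the interface name is a placeholder):

```python
# A minimal, Linux-only sketch: count raw Ethernet frames per source MAC on
# a mirror port for ten seconds. The interface name "eth0" is a placeholder.
import socket
import time
from collections import Counter

ETH_P_ALL = 0x0003  # capture every Ethertype
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind(("eth0", 0))
sock.settimeout(0.5)

counts = Counter()
deadline = time.time() + 10
while time.time() < deadline:
    try:
        frame = sock.recv(65535)
    except socket.timeout:
        continue
    src_mac = frame[6:12].hex(":")  # bytes 6-11 of the Ethernet header
    counts[src_mac] += 1

for mac, n in counts.most_common(5):
    print(f"{mac}  ~{n / 10:.0f} frames/s")
```

A flooding NIC will usually sit at the top of that list by a wide margin, which turns hours of hunting into minutes.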
 
Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.

Or all the tech support is in India or somewhere like that. Some contractor support comes on site in the rare event something needs to be moved. Said contractors are most likely college kids who do it part time and have had about 20 hours of training, 19.5 of which were on inclusion, sexual harassment training, and other HR-mandated topics.
 
Those issues are very hard to find. Think about it like this: when you have a platform with multiple layers of redundancy for failover, it becomes incredibly hard to isolate where the problem is. Why? You have multiple layers of redundancy all trying to resolve a partially failing piece of gear. Things start flopping around. I've been in IT professionally for 19 years, and one thing I've learned is that gear works great when it has no problems. Problems are easy to find when a piece of gear flat out dies. But shoot me in the face when I'm trying to find a piece of gear flopping up and down between OK and failed. I've personally had a single GBIC cause me pain across my whole environment. I only discovered it after finding a single entry in an obscure log partially mentioning it. When CL techs were troubleshooting that issue, I guarantee no one slept, and no one stopped trying to find root cause, because of what was being impacted.
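
That "single entry in an obscure log" experience is common enough that a quick log sweep is worth sketching: the snippet below counts link up/down transitions per interface in a syslog export so a quietly flapping port or GBIC stands out. The file path and message pattern are assumptions; real gear words these messages differently by vendor.

```python
# A minimal sketch: count link up/down transitions per interface in a syslog
# export so a quietly flapping port (or GBIC) stands out. The file path and
# the message pattern are assumptions; vendors word these logs differently.
import re
from collections import Counter

FLAP_RE = re.compile(r"Interface (\S+), changed state to (up|down)", re.IGNORECASE)

flaps = Counter()
with open("syslog-export.txt", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = FLAP_RE.search(line)
        if match:
            flaps[match.group(1)] += 1

for interface, transitions in flaps.most_common(10):
    print(f"{interface}: {transitions} state changes")
```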
 
Admin down and quarantine the port. Problem solved. Thank me later.
 
In the telecom industry, cards like this are replaced every day. Typically you fail over to a backup, take a hit, replace the bad card, then switch back to the original card that was replaced. Sometimes the failed-over/backup card is also bad, and thus both cards need to be replaced. There are systems that have racks of backup cards; to me it doesn't sound like the other slots were populated.
 