Malfunctioning Networking Card Led to Huge CenturyLink Outage

AlphaAtlas

According to a recent report from GeekWire, a single faulty network management card was behind a two-day, "widespread" outage in CenturyLink’s cloud network. On Twitter, Brian Krebs posted a screenshot of a notice that was allegedly sent out by CenturyLink, which blames a single card for "propagating invalid frame packets across devices." The incident reportedly impacted 911 services in several states, among other things. The FCC has opened an investigation into the outage, while affected states are considering investigations of their own.

CenturyLink representatives did not immediately respond to a request to verify the notice. By the standards of modern cloud service providers, a two-day outage is an eternity, and it’s not clear how a single piece of equipment could cause an outage of such magnitude given the layers of redundancy that cloud providers build into their systems. An FCC investigation might turn up some answers, unless CenturyLink is willing to publish a more detailed post-mortem first, something that is becoming a standard part of incident response.
 
Single point of failure? Were the latest firmware and drivers installed? Why was this not seen in QA/testing? This sort of behavior is seen in testing (it's what I currently do), and I doubt this was the actual issue. I hope to see a detailed analysis so this does not happen in the future.
 
Not to defend them, but a hard failure is usually easy to troubleshoot.
It's problems like a network card corrupting or dropping random packets that can be a real pain to find.
 
Not to defend them, but a hard failure is usually easy to troubleshoot.
It's problems like a network card corrupting or dropping random packets that can be a real pain to find.
This.
I've also seen properly redundant system designs go down to something that *shouldn't* be a SPoF. It had redundant hardware / links, and still went down like it was a SPoF.

Should, Would, and Do are 3 very different things.
 
Wild-ass speculation, but it almost sounds like whatever gizmo was supposed to detect and isolate bad packets and switch over to the good traffic instead picked the bad packets as 'good' and dropped the good ones. Hopefully the final incident report will fill in the missing details.
 
LOL, CenturyLink techs. Good one. Someone needs to learn how to use Wireshark.

Using Wireshark to see red packets is easy. Getting the tap in the right spot to catch bad frames is hard, especially at large providers where the engineers do not have physical access and have to work with thumbsy DC techs. I have a strong feeling they aren't sharing the whole story, because a single card causing this outage doesn't add up unless their infrastructure is a complete joke (possible).
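
As a rough illustration of what fishing for bad frames off a tap can look like, here is a minimal Python/scapy sketch that tallies suspicious frames per source MAC; the interface name and the heuristics are placeholders, not anything from CenturyLink's notice.

```python
# A minimal sketch (not CenturyLink's tooling): sniff a tap/SPAN port and
# tally "suspicious" frames per source MAC. The interface name "tap0" and
# the heuristics are placeholders.
from collections import Counter
from scapy.all import sniff, Ether, Raw

suspects = Counter()

def inspect(pkt):
    if Ether not in pkt:
        return
    payload = pkt[Ether].payload
    # Frames scapy can only treat as raw bytes, or runts shorter than the
    # 60-byte Ethernet minimum, are worth a closer look.
    if isinstance(payload, Raw) or len(pkt) < 60:
        suspects[pkt[Ether].src] += 1

# Capture for five minutes without keeping packets in memory.
sniff(iface="tap0", prn=inspect, store=False, timeout=300)

for mac, count in suspects.most_common(10):
    print(f"{mac}: {count} suspicious frames")
```

As the post above notes, the hard part is placement: frames with a bad FCS are usually dropped by the capturing NIC before software ever sees them, so the tap has to sit where the damage is still visible.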
 
Death by SNMP?

I'm intrigued as to what management server this faulty NIC was in. Unless they are running OpenFlow, I can't see how this could break anything.

It probably wasn't in any server; it was probably a NIC blade from a large switch like the one below. You can swap out the different cards for your needs: fiber, copper, etc.


[attached image: line card from a modular chassis switch]
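
On the "Death by SNMP?" quip above: the unglamorous way a card spewing garbage usually surfaces is in interface error counters polled over SNMP. Here is a minimal sketch using the classic pysnmp synchronous hlapi; the host, community string, and interface index are placeholders.

```python
# A minimal sketch, assuming the classic pysnmp synchronous hlapi.
# Host address, community string, and interface index are placeholders.
import time
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

def if_in_errors(host, community, if_index):
    """Return the current IF-MIB::ifInErrors value for one interface."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),                 # SNMPv2c
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifInErrors', if_index)),
    ))
    if error_indication or error_status:
        raise RuntimeError(str(error_indication or error_status))
    return int(var_binds[0][1])

# Poll once a minute and flag any growth in the error counter.
last = if_in_errors('192.0.2.10', 'public', 3)
while True:
    time.sleep(60)
    current = if_in_errors('192.0.2.10', 'public', 3)
    if current > last:
        print(f"ifInErrors grew by {current - last} in the last minute")
    last = current
```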
 
From what I understand, a lot of the ISPs use their own home grown monitoring solutions. Sounds like a major SRE team fail.
 
Not to defend them, but a hard failure is usually easy to trouble shoot.
It's problems like a network card corrupting or dropping random packets that can a real pain to find.

Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.
 
For backbone-level, carrier-class gear (Infinera, Ciena, "some" Cisco), this will not be a mere "network card" like people are mentioning. Networking for an ISP vs. an enterprise is pretty different once you reach a certain level of bandwidth.
 
This.
I've also seen properly redundant system designs go down to something that *shouldn't* be a SPoF. It had redundant hardware / links, and still went down like it was a SPoF.

Should, Would, and Do are 3 very different things.

Yup. If I had a dollar for every time our analysts said "failover didn't fail over" I wouldn't be rich, but I'd have a nice steak dinner....
 
From what I understand, a lot of the ISPs use their own home grown monitoring solutions. Sounds like a major SRE team fail.

This is not correct. It's more like every day is a struggle to find a single pane of glass that supports every type of layer 2/3 and TDM configuration. The truth of the matter is that they all use a multitude of NMS systems based on the vendor of the gear they use. I support at least five different vendor-specific systems at the ISP I am with.
 
There's more containerizing, contextualizing, and virtualizing being used to create faux redundancy than people realize. I've seen networks brought down when two separate data planes are served by a single control plane and that control plane has an issue, or one physical device is virtualized into several virtual nodes and the physical device or the core OS has an issue.

I'll bet it comes out that multiple individual nodes in their network served the data, control, and management planes, and that this wasn't a physically redundant network.
 
It was caused by that spy chip from supermicro if anyone still remembers
 
Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.

It's more that some people just know CenturyLink. They are lacking techs in many areas, won't fix known issues...

Yes, running a nationwide network is hard. Even more so when you refuse to fix issues or don't have the manpower to, like you stated. Offices around me have trouble and they ignore it for days or weeks because they don't have the manpower or the willingness to resolve the issue. I had one of their CO techs working for me for a while; he was impressed that we could actually get work done here vs. how it was during his years at CenturyLink.

I'm pretty certain you won't find HP switches in any carrier networks.

Well, the RFO stated "management NIC," and there is no such thing in a carrier-grade networking device unless they are referring to the console/out-of-band management port... The most likely scenario is that the explanation got "lost in translation" between the NOC and the senior manager producing the RFO. They probably talked about a Cisco Supervisor, Nokia CPM, or Juniper RE. Once one of those goes wrong, the most spectacular outages occur, especially if they only have a partial failure.

I don't think they actually meant that CenturyLink would have an HP; he was just using that as an example.
 
It probably wasn't in any server; it was probably a NIC blade from a large switch like the one below. You can swap out the different cards for your needs: fiber, copper, etc.


*snips*

Agreed, long-time MSO engineer here. This sounds like it came out of a faulty route processor card. Shit happens. Even in redundant systems, if one card doesn't send proper notifies due to some sort of fault, then the network can't kill it and re-establish. I've personally seen an ASR flip out and kill a network. Somebody literally has to get to the hub/colo and power off the system. Sometimes the network slows down so badly you can't even get to a nearby system and block it. It's even worse if a dumbass MSO puts their OOB management network on in-band routing.
 
Single point of failure? Were the latest firmware and drivers installed? Why was this not seen in QA/testing? This sort of behavior is seen in testing (it's what I currently do), and I doubt this was the actual issue. I hope to see a detailed analysis so this does not happen in the future.

No, a single network card taking down a whole bunch of things because it is flooding everything with bad packets would not be considered a single point of failure.

I have personally seen the same exact type of thing happen twice.

In both instances a network card on a desktop computer malfunctioned and was flooding everything in the entire office with bad packets.

The first time it took me a couple hours to figure out what was going on. The second time it took me about 30-45 minutes.

When that exact problem happens it can be really frustrating to figure out what is going on.
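
For anyone wondering how to narrow that down faster: one crude trick is to count frames per source MAC on a mirror port for a few seconds and see which host dwarfs everyone else. A minimal, Linux-only Python sketch (AF_PACKET raw socket; the interface name is a placeholder):

```python
# A minimal, Linux-only sketch: count raw Ethernet frames per source MAC on
# a mirror port for ten seconds. The interface name "eth0" is a placeholder.
import socket
import time
from collections import Counter

ETH_P_ALL = 0x0003  # capture every Ethertype
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind(("eth0", 0))
sock.settimeout(0.5)

counts = Counter()
deadline = time.time() + 10
while time.time() < deadline:
    try:
        frame = sock.recv(65535)
    except socket.timeout:
        continue
    src_mac = frame[6:12].hex(":")  # bytes 6-11 of the Ethernet header
    counts[src_mac] += 1

for mac, n in counts.most_common(5):
    print(f"{mac}  ~{n / 10:.0f} frames/s")
```

A flooding NIC will usually sit at the top of that list by a wide margin, which turns hours of hunting into minutes.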
 
Very much this. I support 911 and was impacted by this event, and I still sympathize with what their techs probably had to go through. Let's remember, the techs/engineers probably made concerns like this known and leadership ignored them or decided not to fund a fix. Techs and engineers generally want their lives safer and easier, but it takes money and permission to do that.

Still, it's funny seeing all the people calling BS, "company full of morons," "poor monitoring," and all the rest of it without knowing what it's like to monitor a global network. A global network constantly being attacked by outside forces like power loss, fiber cuts, environmental damage, and everything else on top of hardware and software.

Or all the tech support is in India or somewhere like that. Some contractor support comes on site in the rare event something needs to be moved. Said contractors are most likely college kids who do it part time and have had about 20 hours of training, 19.5 of which were on inclusion, sexual harassment training, and other HR-mandated topics.
 
Those issues are very hard to find. Think about it like this: when you have a platform with multiple layers of redundancy for failover, it becomes incredibly hard to isolate where the problem is. Why? You have multiple layers of redundancy all trying to resolve a partially failing piece of gear. Things start flopping around. I've been in IT professionally for 19 years, and one thing I've learned is that gear works great when it has no problems. Problems are easy to find when a piece of gear flat out dies. But shoot me in the face when I'm trying to find a piece of gear flopping up and down between OK and failed. I've personally had a single GBIC cause me pain across my whole environment. I only discovered it after finding a single entry in an obscure log partially mentioning it. When CL techs were troubleshooting that issue, I guarantee no one slept, and no one stopped trying to find root cause, because of what was being impacted.
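
That "single entry in an obscure log" experience is common enough that a quick log sweep is worth sketching: the snippet below counts link up/down transitions per interface in a syslog export so a quietly flapping port or GBIC stands out. The file path and message pattern are assumptions; real gear words these messages differently by vendor.

```python
# A minimal sketch: count link up/down transitions per interface in a syslog
# export so a quietly flapping port (or GBIC) stands out. The file path and
# the message pattern are assumptions; vendors word these logs differently.
import re
from collections import Counter

FLAP_RE = re.compile(r"Interface (\S+), changed state to (up|down)", re.IGNORECASE)

flaps = Counter()
with open("syslog-export.txt", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = FLAP_RE.search(line)
        if match:
            flaps[match.group(1)] += 1

for interface, transitions in flaps.most_common(10):
    print(f"{interface}: {transitions} state changes")
```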
 
Admin down and quarantine the port. Problem solved. Thank me later.
 
In the telecom industry, cards like this are replaced every day. Typically you fail over to a backup, take a hit, replace the bad card, then switch back to the original card that was replaced. Sometimes the failed-over/backup card is also bad, and thus both cards need to be replaced. There are systems that have racks of backup cards; to me it doesn't sound like the other slots were populated.
 