Amazon Blames a Typo for Tuesday’s Outage

Megalith

The massive outage that occurred earlier this week (Tuesday) is being blamed on a sloppy employee who entered an incorrect command (fat fingers?) as he was trying to remove a small number of servers. As everyone knows, a lot more than that ended up going offline. I take it that the AWS department has a new opening? Not sure I would be able to handle looking Bezos in his creepy eyes, though...

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected," the company noted. "At [12:37 p.m. ET], an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the message added.
 
A man makes a mistake. Shit happens. If your system is not designed to handle human error, then the system is what needs to be fixed.
 
Somewhat ironic that a company that touts the reliability of its cloud service doesn't have a backup for its billing systems.

Wonder if they found the cause of the original slowness problem?
 
You know, I can't wait to see what he is like at 70 or 80. All of these younger CEOs who think they are all that don't think they will get old and slow. rofl.
 
A man makes a mistake. Shit happens. If your system is not designed to handle human error, then the system is what needs to be fixed.

Yeah, this seems like a critical design flaw.

There is no such system in the world that is 100% automated that can't be brought down by accidental human intervention so there is no system fix for that.

True, but you can attempt to design some logic into it.

Or you could add to the playbook in big red letters "CAUTION: This command may have ..".

Obviously without knowing the command it's hard to tell.
 
There is no such system in the world that is 100% automated that can't be brought down by accidental human intervention so there is no system fix for that.

Yeah of course, but you can harden your system to human error and make it easier to recover. If someone can accidentally delete a single system critical file and there are no backups, then that was poor planning.

My main point is, no one needs to be fired. The person and team responsible for the error have learned a very valuable lesson, a lesson that Amazon itself will hopefully learn from (since they paid for it, after all) if they are wise.
 
Yeah of course, but you can harden your system to human error and make it easier to recover. If someone can accidentally delete a single system critical file and there are no backups, then that was poor planning.

My main point is, no one needs to be fired. The person and team responsible for the error have learned a very valuable lesson, a lesson that Amazon itself will hopefully learn from (since they paid for it, after all) if they are wise.
They have already designed a change to the command so that it looks up the dependencies' requirements and refuses to remove resources if doing so would drop capacity below those requirements. They're also looking into what other commands should have similar checks.
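
For what it's worth, a check like the one described above could look roughly like this. It is only a sketch in Python, not Amazon's actual tooling: the Subsystem class, the min_required floor, and the remove_servers function are all made-up names for illustration, and it assumes the minimum capacity each subsystem needs is already known.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem below its minimum."""


@dataclass
class Subsystem:
    name: str            # hypothetical subsystem label
    active_servers: int  # servers currently in service
    min_required: int    # assumed floor needed by dependent services


def remove_servers(subsystem: Subsystem, count: int) -> int:
    """Remove `count` servers, refusing if the result drops below the floor.

    Returns the number of servers still active after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        # Refuse outright instead of asking "are you sure?".
        raise CapacityError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )
    subsystem.active_servers = remaining
    return remaining
```

With a subsystem of 100 servers and a floor of 80, removing 5 goes through, while a mistyped 50 raises CapacityError and leaves the server count untouched.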
 
Whoa. One command? I thought there would be some type of redundant check against typing something by accident.
 
Reminds me of what happened at my work about a year ago. A database admin "accidentally" deleted one of our databases. It took 3 days just to bring it back up (3 days of downtime for about a dozen employees) and another month to get everything that could be, restored. People fuck up. It happens.
 
They learned from it, and are applying it to other systems and processes... Clearly not ideal, but at least it's being made into something useful...
 
If you're a network administrator or support and you haven't brought down an entire network yet, you're still a networking virgin. That said, this guy may have effectively broken the ice on Olivia Wilde.
 
What? No "Are you sure you want to do this?" prompt?

The issue is that most people just click on yes and ignore that. I work for a telephone company. The company that makes our switch brought up how they get a lot of requests for ways to undo removing customers from the switch, because even when you get the "Are you sure you want to remove this business customer and all their lines?" prompt, people will just click yes and then realize afterward that they picked the wrong account or didn't actually mean to delete it.

That said, I am guilty of the same type of action. I have accidentally shut down an entire server when only meaning to turn off a single customer's connection, as I was a level too high in the command line when issuing the command to deactivate. I saw the prompt saying this action would stop all traffic and hit yes before it registered in my head that I was saying yes to a full system shutdown and not just a customer connection shutdown.
 
What? No "Are you sure you want to do this?" prompt?
Those are useless. Clicking yes on them is muscle memory. The only worthwhile confirmation prompts are the ones that require input you actually have to think about.
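
A rough Python sketch of what I mean, with made-up names (confirm_by_typing and its parameters are hypothetical, not from any real tool). Instead of accepting "yes", it makes the operator retype the target and the number of servers, so a mistyped count has to be mistyped twice:

```python
from typing import Callable


def confirm_by_typing(
    target: str,
    count: int,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the operator retypes the target name and count.

    `ask` defaults to the built-in input(); it is a parameter so the
    prompt can be exercised without a terminal.
    """
    expected = f"{target} {count}"
    answer = ask(
        f"This will remove {count} server(s) from '{target}'. "
        f"Type '{expected}' to proceed: "
    )
    # A bare "yes" or "y" never matches, so muscle memory does not help.
    return answer.strip() == expected
```

Something like confirm_by_typing("billing", 3) would only return True if the operator types "billing 3" exactly.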
 
I guess the quick fix is to ensure the command to take down all servers differs by more than a single letter from the command to take down one or two, e.g. killall vs kill.

EDIT: I'm obviously making assumptions here.
 
Why does he always have the look of someone with bodies buried in the back yard?

[Attached image: 1488487976LNhJ87sUrZ_1_1.jpg]
 
A man makes a mistake. Shit happens. If your system is not designed to handle human error, then the system is what needs to be fixed.
I'm surprised one typo could do so much damage. Assuming it was one character or so, that sounds like bad programming if that's all it took to do that much damage.
 
I guess the quick fix is to ensure the command to take down all servers differs by more than a single letter from the command to take down one or two, e.g. killall vs kill.

EDIT: I'm obviously making assumptions here.
You beat me to it lol
 
Obviously your one, single use case applies to the entire industry :)
That's not to say I never caused a system-wide outage, just not from a typo! :D

How many do you manage?

That's the point. It's always been in-house systems, not managing multiple systems for a diversity of clients. The point is to support my contention that it is not necessarily wise to rely on a remotely managed system that is maintained exclusively by a 3rd party (cloud services) and that by maintaining your own system, you are responsible for the safeguards in place, and can make sure they are in place. In one of the systems I administered, I was responsible for writing up a disaster recovery plan. It did not involve "call up the server provider and hope they can figure out quickly what is happening."
 