Amazon Blames a Typo for Tuesday’s Outage

Discussion in 'HardForum Tech News' started by Megalith, Mar 2, 2017.

  1. Megalith

    Megalith 24-bit/48kHz Staff Member

    Messages:
    13,004
    Joined:
    Aug 20, 2006
    The massive outage that occurred earlier this week (Tuesday) is being blamed on a sloppy employee who entered an incorrect command (fat fingers?) as he was trying to remove a small number of servers. As everyone knows, a lot more than that ended up going offline. I take it that the AWS department has a new opening? Not sure I would be able to handle looking Bezos in his creepy eyes, though...

    "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected," the company noted. "At [12:37 p.m. ET], an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process." "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the message added.
     
  2. When not having that cup of coffee first actually gets you fired.
     
  3. trick_m0nkey

    trick_m0nkey Moderator Staff Member

    Messages:
    4,061
    Joined:
    Oct 11, 2005
    A man makes a mistake. Shit happens. If your system is not designed to handle human error, then the system is what needs to be fixed.
     
    serious likes this.
  4. Bowman15

    Bowman15 [H]ard|Gawd

    Messages:
    1,248
    Joined:
    Apr 7, 2015
    No system in the world is so fully automated that it can't be brought down by accidental human intervention, so there is no system fix for that.
     
    cybrnook, captaindiptoad and Darunion like this.
  5. Dead Parrot

    Dead Parrot 2[H]4U

    Messages:
    2,567
    Joined:
    Mar 4, 2013
    Somewhat ironic that a company that touts the reliability of its cloud service doesn't have a backup for its billing systems.

    Wonder if they found the reason for the original running slow problem?
     
  6. Nukester

    Nukester [H]ard|Gawd

    Messages:
    1,428
    Joined:
    Mar 21, 2016
    You know, I can't wait to see what he is like at 70 or 80. All of these younger CEOs who think they are all that don't think they will get old and slow. rofl.
     
  7. Spidey329

    Spidey329 [H]ardForum Junkie

    Messages:
    8,676
    Joined:
    Dec 15, 2003
    Yeah, this seems like a critical design flaw.

    True, but you can attempt to design some logic into it.

    Or you could add to the playbook in big red letters "CAUTION: This command may have ..".

    Obviously without knowing the command it's hard to tell.
     
  8. trick_m0nkey

    trick_m0nkey Moderator Staff Member

    Messages:
    4,061
    Joined:
    Oct 11, 2005
    Yeah of course, but you can harden your system to human error and make it easier to recover. If someone can accidentally delete a single system critical file and there are no backups, then that was poor planning.

    My main point is, no one needs to be fired. The person and team responsible for the error have learned a very valuable lesson, a lesson that Amazon itself will hopefully learn from (since they paid for it, after all) if they are wise.
     
  9. Ducman69

    Ducman69 [H]ardForum Junkie

    Messages:
    10,445
    Joined:
    Jul 12, 2007
    And a woman will keep bringing it up even years later, anytime she's losing an argument.
     
    AK0tA, Initium, alxnet7227 and 5 others like this.
  10. MavericK

    MavericK Zero Cool

    Messages:
    29,011
    Joined:
    Sep 2, 2004
    So he typed "remove server *" instead of "remove server 8"? :p
     
    sirgallium, alxnet7227 and Spartacus like this.
  11. Spartacus

    Spartacus [H]ard|Gawd

    Messages:
    1,924
    Joined:
    Apr 29, 2005
    Lol.... wildcards are both awesome and they also totally suck at the wrong times.

    .
     
  12. DocSavage

    DocSavage 2[H]4U

    Messages:
    2,409
    Joined:
    Dec 18, 2002
    They have already designed a change to the command to get the dependencies' requirements and refuse to remove resources if it would go below those requirements. They're also looking into what other commands should have similar checks.
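A minimal sketch of the kind of check described above, assuming a removal tool that knows each subsystem's minimum required capacity. Every name here (`Subsystem`, `min_required`, `remove_servers`, `CapacityError`) is hypothetical and made up for illustration; Amazon has not published how its actual tooling works.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem under its minimum."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor declared by dependent services


def remove_servers(subsystem: Subsystem, count: int) -> int:
    """Remove `count` servers, refusing if that would go below the floor."""
    if count <= 0:
        raise ValueError("count must be positive")
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        # Refuse outright instead of asking for confirmation.
        raise CapacityError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_required}"
        )
    subsystem.active_servers = remaining
    return remaining
```

With a subsystem of 100 servers and a floor of 80, asking to remove 8 goes through, while a mistyped 80 raises `CapacityError` and leaves the server count unchanged.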
     
  13. Whach

    Whach [H]ard|Gawd

    Messages:
    1,030
    Joined:
    Dec 22, 2011
    Whoa. One command? I thought there would be some type of redundant check to guard against typing something by accident.
     
  14. iamjanco

    iamjanco Limp Gawd

    Messages:
    441
    Joined:
    Jul 8, 2016
    Do Hindi keyboards even have an asterisk key?
     
  15. [21CW]killerofall

    [21CW]killerofall Aliens...

    Messages:
    3,095
    Joined:
    Mar 16, 2006
    Reminds me of what happened at my work about a year ago. A database admin "accidentally" deleted one of our databases. It took 3 days just to bring it back up (3 days of downtime for about a dozen employees) and another month to get everything that could be, restored. People fuck up. It happens.
     
    GoldenTiger likes this.
  16. MikeTrike

    MikeTrike [H]ardForum Junkie

    Messages:
    8,202
    Joined:
    Nov 16, 2005
    They learned from it, and are applying it to other systems and processes... Clearly not ideal, but at least it's being made into something useful...
     
  17. EODetroit

    EODetroit [H]ard|Gawd

    Messages:
    1,485
    Joined:
    Oct 20, 2004
    If you're a network administrator or support and you haven't brought down an entire network yet, you're still a networking virgin. That said, this guy may have effectively broken the ice on Olivia Wilde.
     
  18. amddragonpc

    amddragonpc [H]ard|Gawd

    Messages:
    1,996
    Joined:
    Sep 20, 2012
    What? No "Are you sure you want to do this?" prompt?
     
  19. Exavior

    Exavior [H]ardForum Junkie

    Messages:
    9,660
    Joined:
    Dec 13, 2005
    The issue is that most people just click on yes and ignore that. I work for a telephone company. The company that makes our switch brought up how they get a lot of requests for ways to undo removing customers from the switch. Even when people get the "Are you sure you want to remove this business customer and all their lines?" prompt, they just click yes and only realize afterwards that they picked the wrong account or didn't actually mean to delete it.

    That said, I am guilty of the same type of action. I have accidentally shut down an entire server when I only meant to turn off a single customer's connection, because I was a level too high in the command line when I issued the command to deactivate. I saw the prompt saying this action would stop all traffic and hit yes before it registered in my head that I was saying yes to a full system shutdown and not just a customer connection shutdown.
     
  20. griff30

    griff30 I Lower the Boom!

    Messages:
    5,419
    Joined:
    Jul 15, 2000
    I figured they would try to blame it on Russia.
     
  21. M76

    M76 [H]ardForum Junkie

    Messages:
    9,690
    Joined:
    Jun 12, 2012
    Those are useless. Clicking yes on them is muscle memory. The only worthwhile confirmation prompts are those that require input you need to think about.
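One way to picture a prompt that needs thought, as opposed to a yes/no that gets answered by reflex: make the operator retype the target and the count. This is only a sketch; `confirm_by_typing` and its parameters are hypothetical names, not taken from any real tool.

```python
from typing import Callable


def confirm_by_typing(target: str, count: int,
                      read: Callable[[str], str] = input) -> bool:
    """Return True only if the operator retypes the exact target and count."""
    expected = f"{target} {count}"
    # `read` defaults to the built-in input(); it is a parameter so the
    # prompt can be exercised without a terminal.
    answer = read(f"Type '{expected}' to confirm removal: ")
    return answer.strip() == expected
```

A reflexive "y" or "yes" fails here, and so does a wrong count, since the answer has to match what the command is actually about to do.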
     
    drescherjm likes this.
  22. spugm1r3

    spugm1r3 [H]ard|Gawd

    Messages:
    1,153
    Joined:
    Sep 28, 2012
    I guess the quick fix is to ensure the command to take down all servers has more than a single letter difference from the command to take down one or two, i.e. killall vs kill.

    EDIT: I'm obviously making assumptions here.
     
  23. jardows

    jardows [H]ard|Gawd

    Messages:
    1,739
    Joined:
    Jun 10, 2015
    A good reason to not rely on "the cloud."
     
  24. trick_m0nkey

    trick_m0nkey Moderator Staff Member

    Messages:
    4,061
    Joined:
    Oct 11, 2005
    The cloud is here to stay. Running your own stack isn't any more reliable.
     
  25. sirgallium

    sirgallium Limp Gawd

    Messages:
    336
    Joined:
    May 30, 2006
    *hundreds of servers stream up the screen going offline*

    shit shit what the f is happening

    quick unplug it

    God damn. Do you think they're gonna notice?
     
    GoldenTiger likes this.
  26. Azphira

    Azphira [H]ard|Gawd

    Messages:
    1,821
    Joined:
    Aug 18, 2003
    Why does he always have the look of someone with bodies buried in the back yard?

     
    sirgallium and MikeTrike like this.
  27. MikeTrike

    MikeTrike [H]ardForum Junkie

    Messages:
    8,202
    Joined:
    Nov 16, 2005
    Sounds like someone who can get shit done...
     
  28. jardows

    jardows [H]ard|Gawd

    Messages:
    1,739
    Joined:
    Jun 10, 2015
    I never caused a system-wide outage lasting for hours because of a typo, let alone one that affected thousands of clients!
     
  29. jmilcher

    jmilcher [H]ardness Supreme

    Messages:
    4,342
    Joined:
    Feb 3, 2008
    I'm surprised one typo could do so much damage. Assuming it was one character or so, that sounds like bad programming if that's all it took to do that much damage.
     
  30. jmilcher

    jmilcher [H]ardness Supreme

    Messages:
    4,342
    Joined:
    Feb 3, 2008
    You beat me to it lol
     
    spugm1r3 likes this.
  31. trick_m0nkey

    trick_m0nkey Moderator Staff Member

    Messages:
    4,061
    Joined:
    Oct 11, 2005
    Obviously your one, single use case applies to the entire industry :)
     
    MikeTrike likes this.
  32. MikeTrike

    MikeTrike [H]ardForum Junkie

    Messages:
    8,202
    Joined:
    Nov 16, 2005
    How many do you manage?

     
  33. jardows

    jardows [H]ard|Gawd

    Messages:
    1,739
    Joined:
    Jun 10, 2015
    That's not to say I never caused a system-wide outage, just not from a typo! :D

    That's the point. It's always been in-house systems, not managing multiple systems for a diversity of clients. The point is to support my contention that it is not necessarily wise to rely on a remotely managed system that is maintained exclusively by a 3rd party (cloud services) and that by maintaining your own system, you are responsible for the safeguards in place, and can make sure they are in place. In one of the systems I administered, I was responsible for writing up a disaster recovery plan. It did not involve "call up the server provider and hope they can figure out quickly what is happening."