Internal Documents Show How Amazon Scrambled to Fix Prime Day Glitches

Discussion in '[H]ard|OCP Front Page News' started by Montu, Jul 20, 2018.

  1. Montu

    Montu [H]ard DCOTM x4

    Messages:
    8,005
    Joined:
    Apr 25, 2001
    Amazon's Prime Day got off to a rough start earlier this week, but they still managed to make a boatload of money. The trouble began with the front page of their website showing a bunch of dog pictures with an error code. They got things straightened out after a few hours and then went on to make hundreds of millions of dollars. The behind-the-scenes story, according to internal documents, is that they just didn't have enough servers online to handle the load. Kind of surprising for a company that sells server time and space to others. As the glitches grew, they killed international traffic and added servers manually, since their auto-scaling feature appeared to fail. Just goes to show even the mighty fail once in a while. However, Amazon tends to learn from their mistakes, so next year will probably be pretty smooth.

    Caesar says the root cause of the problem may have to do with a failure in Amazon’s auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
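
For readers unfamiliar with the idea, the quoted description of auto-scaling (detect traffic fluctuations, adjust server capacity accordingly) can be sketched roughly as below. This is only an illustration and not Amazon's implementation; the function names, the per-server request target, and the minimum and maximum limits are all assumptions made up for the example.

```python
import math


def desired_servers(requests_per_second, target_rps_per_server,
                    min_servers=2, max_servers=500):
    """Return the server count that keeps each server near its target load.

    All parameters are hypothetical; real systems track many more signals.
    """
    if target_rps_per_server <= 0:
        raise ValueError("target_rps_per_server must be positive")
    # Round up so that capacity is never below the observed demand.
    needed = math.ceil(requests_per_second / target_rps_per_server)
    # Clamp to the configured floor and ceiling.
    return max(min_servers, min(max_servers, needed))


def scaling_decisions(traffic_samples, target_rps_per_server, start_servers):
    """Walk through traffic samples and record (rps, servers_before, servers_after)."""
    servers = start_servers
    decisions = []
    for rps in traffic_samples:
        wanted = desired_servers(rps, target_rps_per_server)
        decisions.append((rps, servers, wanted))
        servers = wanted
    return decisions
```

In this sketch a tenfold traffic spike simply produces a tenfold larger server count on the next sample. When that loop stalls or is capped too low, operators are left with the manual options the article describes: shedding traffic and adding servers by hand.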
     
  2. ryan_975

    ryan_975 [H]ardForum Junkie

    Messages:
    14,628
    Joined:
    Feb 6, 2006
    I wonder if some folks got fired over that. I just had a recruiter send me an unsolicited email about a position on the team that owns the infrastructure behind amazon.com. It was casually mentioned that customers can tangibly see the impact of the work I'd be doing.

    Probably not, but funny to think about.
     
  3. Paul_Johnson

    Paul_Johnson [H] PSU Editor & Admin Staff Member

    Messages:
    15,048
    Joined:
    Aug 29, 2004
    Badly, that is how they scrambled to fix it.
     
  4. BloodyIron

    BloodyIron 2[H]4U

    Messages:
    3,119
    Joined:
    Jul 11, 2005
    It might also be that they identified a need for more human resources?

     
  5. dgingeri

    dgingeri 2[H]4U

    Messages:
    2,806
    Joined:
    Dec 5, 2004
    That's what they get for taking over the world.
     
  6. ryan_975

    ryan_975 [H]ardForum Junkie

    Messages:
    14,628
    Joined:
    Feb 6, 2006
    As I said, probably not. I just found it funny that they're looking for new software engineers to work on infrastructure scaling projects a few days after a very public failure of their infrastructure scaling solutions.
     
  7. prne10

    prne10 Limp Gawd

    Messages:
    209
    Joined:
    Oct 26, 2005
    At the very least they got screamed at. Amazon is legendary for having a very toxic culture.
     
  8. Verge

    Verge [H]ardness Supreme

    Messages:
    5,932
    Joined:
    May 27, 2001
    Ugggh it's amazon. I can assure you people were fired.
     
  9. ryan_975

    ryan_975 [H]ardForum Junkie

    Messages:
    14,628
    Joined:
    Feb 6, 2006
    That's why I politely declined. Though I've heard from people that it's getting better as long as you're not at a location other than Seattle.
     
  10. jardows

    jardows [H]ard|Gawd

    Messages:
    1,291
    Joined:
    Jun 10, 2015
    I thought it funny, I got an email from another vendor that I've signed up to receive "specials" from, and the title started this way - "Tired of looking at dog pictures? Check out our deals!"
     
  11. trentchau

    trentchau [H]ard|Gawd

    Messages:
    1,523
    Joined:
    Apr 17, 2015
    It's funny how people's perception of an issue is related to how much they love or hate the company :) .

    I dislike Amazon, but I don't think we should judge based on just the issues, but on the response to the issues. Good on the team for fixing it.
     
  12. harbingerofdoom

    harbingerofdoom Gawd

    Messages:
    623
    Joined:
    Apr 17, 2007
    i have several internal documents that i could leak which show how i scrambled to not buy a damn thing this time on fakesaleday

    uh... i mean prime day.
     
  13. harbingerofdoom

    harbingerofdoom Gawd

    Messages:
    623
    Joined:
    Apr 17, 2007
    so.... only seattle?
     
  14. Ultima99

    Ultima99 [H]ardness Supreme

    Messages:
    4,899
    Joined:
    Jul 31, 2004
    Too big to fail eh?
     
  15. dyzophoria

    dyzophoria Gawd

    Messages:
    943
    Joined:
    Jan 17, 2006
    I'm thinking the other way around. I'm guessing the traffic was huge this time and, since there hasn't been anything like this before, managing to add enough resources in a few hours rather than shutting everything down is a success for the team. On a separate topic, it is toxic there, so a little screaming here and there was probably the first reaction. lol
     
  16. Paul_Johnson

    Paul_Johnson [H] PSU Editor & Admin Staff Member

    Messages:
    15,048
    Joined:
    Aug 29, 2004
    You mean, like, last Prime Day? Or the one before?
     
  17. RealBeast

    RealBeast Gawd

    Messages:
    576
    Joined:
    Aug 4, 2010
    Perhaps they should re-evaluate using AWS and consider Azure. :rolleyes:
     
  18. TwiceOver

    TwiceOver 2[H]4U

    Messages:
    2,392
    Joined:
    Jan 14, 2003
    I personally didn't have any troubles. My only issue is my HD10 didn't ship. I assume they ran out of them at $100.
     
  19. James Costello

    James Costello n00bie

    Messages:
    5
    Joined:
    Apr 24, 2016
    I work for Amazon IT in a fulfillment center. To those even thinking of working at Amazon: don't. It's a good place to start and get your chops, but set a goal of 5 years and out. That way you cash in on the benefits, which are amazing, and take the money and run. People like to hire ex-Amazonians because they know you've been through the fire.

    As for Prime Day, first of all, its goal is to empty the warehouses of stuff that hasn't sold. I cruise the deals and I see pictures of the stuff that seems to stay on our shelves. In the weeks leading up to Prime Day I could see the push items as they came in by the pallet load, where usually we just have a smattering of them in the building. The Instant Pot is one of them. We sell more of those than anything else at Amazon. As an aside, I have one and it's better than sliced bread!

    As part of IT I can view the trouble tickets. I've read the one for these problems and it's a whopper! It was a war room ticket until the fan blades started turning brown, then it was get out of the way. The fact that we recovered as fast as we did is a testament to the great IT people we have. I'm not going into specifics because I don't relish getting fired. When we find the person who leaked the internal memo and a copy of the trouble ticket to CNBC, and we will, I'd hate to even be on this planet. That person needs to be looking for an island somewhere.

    Most everything we do is databases: huge, globally linked databases that use distributed computing to sync up around the world. When capacity fell on the order side, the warehouse side had to follow. We could not pick or pack in ours for close to 6 hours. When you scan something to pick it, you're poking a database. When you put it in a box and send it down the conveyor, you're poking a database, and on and on. 99% of our computing is done on VMs, hence the capacity deal. Yes, it seems the automated system flopped, but watching people add capacity was amazing. Different systems would request an additional percentage of horsepower, and within 5 to 20 minutes they had new VMs spun up and online. That side worked.

    We will do a deep dive, we will find the true root cause, and we will fix it. We use very few off-the-shelf products. Most of our software, and the goal is for it to be all, is written in house. We learn from our mistakes and will learn from this one. Some heads are gonna roll, but that's the corporate way.
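
The manual step described in this post, where a system requests an additional percentage of capacity and new VMs come online, boils down to simple arithmetic. A rough sketch follows; the post deliberately gives no specifics, so the function name, parameters, and rounding rule here are all assumptions for illustration only.

```python
import math


def vms_after_bump(current_vms, extra_percent):
    """Return the VM count after requesting `extra_percent` more capacity.

    Hypothetical helper: rounds the extra VMs up so that any non-zero
    request on a non-empty fleet adds at least one machine.
    """
    if current_vms < 0 or extra_percent < 0:
        raise ValueError("current_vms and extra_percent must be non-negative")
    extra_vms = math.ceil(current_vms * extra_percent / 100)
    return current_vms + extra_vms
```

For example, a system running 200 VMs that asks for 15% more would end up with 230 under these assumptions.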
     
  20. wizdum

    wizdum [H]ard|Gawd

    Messages:
    1,914
    Joined:
    Sep 22, 2010
    I prefer SparkFun's old method of load testing - https://www.sparkfun.com/news/305 Just tell people that if an item is under $100, and they can make it to the "order completed" page, then the order is free.
     
  21. Draax

    Draax [H]ardness Supreme

    Messages:
    5,044
    Joined:
    Aug 11, 2005
    True
     

  22. Iratus

    Iratus [H]ard|Gawd

    Messages:
    1,179
    Joined:
    Jan 16, 2003
    Well they’ve certainly got capacity...

    Tbh I think it’s all just manufactured, just have it fall over. Look how busy we are. Get a bigger stock bump than you would if you made some extra sales.

    I’ve had that conversation with a marketing director at a previous company when people wanted to go mental with capacity for a new product launch. The alternative view is that being overrun is great for building hype. Same coin as artificial shortages.