A single mistyped server command, entered during routine debugging of the billing system in Amazon Web Services' Northern Virginia data center, turned out to be the cause of a huge outage on February 28 that knocked out an estimated 150,000 websites and business services for about half the day. The affected service, Amazon's Simple Storage Service (S3), holds literally trillions of data items, known to programmers as "objects".
Although S3's subsystems are designed to keep working "with little or no customer impact" even when significant capacity fails or is removed, Amazon said it had not "completely restarted the index subsystem or the placement subsystem in our larger regions for many years". AWS spelled out the timeline of the event this way: "At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was meant to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process".
According to Amazon, the trouble began Tuesday morning when an employee conducting routine maintenance mistakenly entered the wrong command while trying to take "a small number of servers" offline, following an established Amazon playbook. That error started a domino effect that took down one server subsystem after another. "Removing a significant portion of the capacity caused each of these systems to require a full restart", the company said. Amazon says it has since added safeguards to the removal tool: "This will prevent an incorrect input from triggering a similar event in the future".
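The change Amazon describes amounts to putting guard rails on the capacity-removal tool. Below is a minimal sketch of that idea in Python, assuming a hypothetical tool with an invented minimum-capacity floor; none of the names reflect AWS's actual internal systems.

    # Hypothetical sketch of a guarded removal command: refuse any request
    # that would leave a fleet below a minimum capacity floor, however broad
    # the operator's input happens to match. All names are illustrative.

    class CapacityGuardError(Exception):
        """Raised when a removal would breach the capacity floor."""

    def remove_servers(active_servers, requested, min_capacity):
        """Return the servers that may safely be removed.

        Rejects the whole request if honoring it would leave fewer than
        min_capacity servers in service.
        """
        requested_set = set(requested) & set(active_servers)
        remaining = len(active_servers) - len(requested_set)
        if remaining < min_capacity:
            raise CapacityGuardError(
                "refusing removal: %d servers would remain, minimum is %d"
                % (remaining, min_capacity)
            )
        return sorted(requested_set)

    if __name__ == "__main__":
        fleet = ["server-%03d" % i for i in range(100)]
        mistyped = fleet[:80]   # a fat-fingered pattern matching far too many hosts
        try:
            remove_servers(fleet, mistyped, min_capacity=60)
        except CapacityGuardError as err:
            print(err)          # the guard blocks the over-broad removal

The point of such a guard is that the blast radius of a typo is capped by policy rather than by the operator's accuracy.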
AWS says that its system is designed to allow the removal of big chunks of its components "with little or no customer impact". However, AWS admitted that it hadn't actually restarted those subsystems in years, and S3 has grown considerably in the meantime. While the restart was underway, S3 couldn't handle requests for objects; it was effectively turned off for the websites that depended on it. S3 also houses the data that underpins a wide array of other AWS services, including some computing functions. In plainer terms, some poor engineer, call him Joe, was tasked with entering a command to take some storage subsystems offline, and mistyped it.
Amazon explains it in much more technical terms, but suffice it to say that the error had a cascading impact on S3 storage in the Northern Virginia data center.
In the years since those subsystems were last restarted, S3 had experienced "massive" growth, so restarting them, and running all the safety checks to make sure files hadn't been corrupted in the process, took longer than Amazon expected.
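In general terms, those safety checks are integrity verification: the metadata an index holds has to be re-validated against the stored data before the index can be trusted again, and the cost of that pass grows with the number of objects. The sketch below is a purely hypothetical illustration of that kind of check in Python; it does not reflect S3's internal design.

    # Hypothetical restart-time integrity check: re-verify a stored checksum
    # for every indexed object before declaring the index healthy.
    import hashlib

    def find_corrupted(index, object_store):
        """Return keys whose stored bytes no longer match the indexed checksum."""
        corrupted = []
        for key, expected_md5 in index.items():
            data = object_store.get(key)
            if data is None or hashlib.md5(data).hexdigest() != expected_md5:
                corrupted.append(key)
        return corrupted

    if __name__ == "__main__":
        store = {"photos/cat.jpg": b"cat-bytes", "logs/app.log": b"ok"}
        index = {k: hashlib.md5(v).hexdigest() for k, v in store.items()}
        store["logs/app.log"] = b"corrupted"      # simulate on-disk damage
        print(find_corrupted(index, store))       # ['logs/app.log']

A pass like this scales with the number of objects, which is why years of "massive" growth translate directly into a longer, slower restart.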