AWS Outage Caused by Rare Interaction Between Automated Systems
Quick Read
- AWS suffered a major outage in its Northern Virginia (US-EAST-1) region because of a rare software bug.
- The root cause was a latent race condition in the DynamoDB DNS management system.
- The bug was triggered by an unlikely interaction between automated components.
- Human intervention was required to fix the problem.
- AWS has temporarily disabled the affected automation worldwide.
- The incident affected several AWS services that depend on DynamoDB.
AWS Outage Analysis
The recent Amazon Web Services (AWS) outage in Northern Virginia has been traced to a software defect in an automated DNS management system. The defect allowed an unexpected interaction in which one automated component unintentionally undid the work of another.
Understanding the Root Cause
In its post-incident report, AWS said the outage stemmed from a “latent race condition” in the DynamoDB DNS management system. The race produced an incorrect, empty DNS record for the service’s regional endpoint, and the automation failed to repair it.
The Role of the DNS Planner and Enactors
The DNS management system has two key components: the DNS Planner and the DNS Enactors. The Planner generates new DNS plans, and the Enactors apply those plans to the service's endpoints. Normally, the two work together without incident to keep DNS state current.
Chain of Events
During the incident, one DNS Enactor ran unusually slowly and needed several attempts to update DNS endpoints. Meanwhile, a second Enactor applied a more recent plan, which set up the race condition. The delayed Enactor then overwrote the newer plan with its older one, and the endpoint's IP addresses were ultimately removed.
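The sequence above can be illustrated with a small, deterministic toy model. It is only a sketch under stated assumptions: the `Endpoint` class, the `apply_plan` and `clean_up` helpers, the plan version numbers, and the IP addresses are all hypothetical, and the cleanup step that deletes old plans is an assumption added to show how an overwrite could end in an empty record. None of it represents AWS's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical stand-in for a regional endpoint's DNS record."""
    plan_version: int = 0
    ips: list = field(default_factory=list)


def apply_plan(endpoint, plans, version):
    """Unchecked apply: the last writer wins, even if its plan is older."""
    endpoint.plan_version = version
    endpoint.ips = list(plans[version])


def clean_up(endpoint, plans, keep_from):
    """Assumed cleanup step: delete every plan older than keep_from.

    If the endpoint still points at a deleted plan, its addresses go too.
    """
    for version in [v for v in plans if v < keep_from]:
        del plans[version]
    if endpoint.plan_version not in plans:
        endpoint.ips = []


# Two generations of plans; the addresses are made up for illustration.
plans = {1: ["10.0.0.1", "10.0.0.2"], 2: ["10.0.0.3", "10.0.0.4"]}
endpoint = Endpoint()

apply_plan(endpoint, plans, 2)          # fast Enactor applies the newer plan
apply_plan(endpoint, plans, 1)          # delayed Enactor finishes late and overwrites it
clean_up(endpoint, plans, keep_from=2)  # older plans are then treated as stale

print(endpoint)
```

In this model the endpoint ends up pointing at a plan that no longer exists, so its address list is empty. The key weakness is that `apply_plan` never checks whether the plan it is writing is older than the one already in place.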
Manual Intervention and Preventive Steps
In the end, human intervention was needed to resolve the issue. AWS has disabled the DNS Planner and DNS Enactor automation worldwide. Before re-enabling it, AWS plans to fix the race condition and add further safeguards to prevent a recurrence.
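The article does not describe the specific safeguards AWS has in mind. As a general illustration only, one common defense against a stale writer is a version-checked update, where a plan is applied only if it is newer than the one already in place. The sketch below assumes a hypothetical `GuardedEndpoint` class with an `apply_if_newer` method; the names, version numbers, and addresses are invented for the example and are not AWS's implementation.

```python
import threading


class GuardedEndpoint:
    """Hypothetical endpoint record that rejects stale or empty plans."""

    def __init__(self):
        self._lock = threading.Lock()
        self.plan_version = 0
        self.ips = []

    def apply_if_newer(self, version, ips):
        """Apply a plan only if it is newer than the current one and non-empty."""
        with self._lock:  # make the check and the write a single atomic step
            if not ips:
                return False  # refuse to publish an empty record
            if version <= self.plan_version:
                return False  # a newer (or identical) plan is already applied
            self.plan_version = version
            self.ips = list(ips)
            return True


guarded = GuardedEndpoint()
newer_applied = guarded.apply_if_newer(2, ["10.0.0.3", "10.0.0.4"])
stale_applied = guarded.apply_if_newer(1, ["10.0.0.1", "10.0.0.2"])

print(newer_applied, stale_applied, guarded.plan_version)
```

Here the delayed writer's older plan is rejected, so the newer plan stays in effect. A real distributed system would need the comparison enforced by the shared data store itself rather than an in-process lock, which is why this is a sketch and not a drop-in fix.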
Impact on Dependent Services
The outage in the US-EAST-1 region also affected other AWS services that depend on DynamoDB, including EC2. Subsystems could no longer connect to DynamoDB, and the resulting failures cascaded across the AWS ecosystem.
Summary
The AWS outage in Northern Virginia highlights the complexity of automated cloud management systems and their potential weaknesses. AWS is working to fix the underlying problems and prevent similar incidents. The event is a reminder of the importance of robust system architecture and of human oversight in critical scenarios.