Tale of 'metadpata': the revenge of the supertools

One day in November 2022, we brought down our shop with a single character. This post recaps the lessons we learned from that incident.

Bartosz Ocytko

Executive Principal Engineer

Diego Roccia

Engineering Manager

Posted on Jan 23, 2024


The perfect storm

In the midst of Cyber Week preparation in November 2022, a colleague DMed me asking to quickly join a call. To my surprise, as I was anticipating a 1:1 call, I was greeted by a message indicating that 60+ others were in the call as well. It turned out that I was about to join an incident response call for what later became known internally as the "metadpata" incident.

In the call, a group of colleagues was trying to put the jigsaw pieces together, analyzing why a large number of DNS entries across our AWS accounts had suddenly been removed, causing our shop to effectively go offline for our customers. Additionally, everyone except the cloud infrastructure team was locked out of AWS accounts and internal tools due to the missing DNS entries, which made the incident response difficult. In short: the classic DNS incident that you may be familiar with from other write-ups. Some helpful and lucky souls hastily copied their cached DNS entries before they expired. It was an all-hands-on-deck situation, with everyone focused on the single goal of restoring service for our customers as soon as possible. What followed in the incident call was a controlled disaster recovery, with colleagues manually restoring DNS entries starting with essential tooling, followed by core infrastructure and the services powering our on-site experience.

How was it possible that DNS entries across multiple accounts suddenly disappeared? The Pull Request that triggered the event was aimed at adjusting YAML configuration for our infrastructure. However, apart from changing the configuration for a test account, it also introduced a stray "p" into a configuration field called "metadata", transforming it into "metadpata". Yet why was this single character so powerful and destructive?

Enter supertools

We coined the term supertools while working on the Post Mortem for the incident. These are applications or scripts that can execute large-scale changes across the infrastructure. Initially well-intentioned daemons that automate the creation of resources and the various stages of their lifecycle, they also perform cleanup operations that remove resources. The latter, typically used to clean up resources that are to be decommissioned, easily becomes the subject of cost optimization. As part of cost-saving measures, the pacing of deletion operations had been sped up.

The tool that processed the configuration with the unfortunate typo is responsible for setting up AWS accounts. It is a background job that parses the configuration and computes the operations to be executed on each affected account. It uses the metadata object to calculate which accounts to work on. The typo caused the configuration to be interpreted as "no accounts", which in turn was treated as equivalent to all accounts being due for decommissioning. The deletion process was triggered and managed to delete hosted zones containing DNS entries, which caused the incident. Luckily, the deletion process ran into an error while performing the deletion operations, reducing the scope of the incident and the disaster recovery required.
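To illustrate the failure mode, here is a minimal, hypothetical sketch in Python (not our actual tooling; field names such as managed_accounts are made up) of how a mistyped "metadata" key silently becomes an empty account list, which is then conflated with "decommission everything":

import yaml

CONFIG = """
accounts:
  metadpata:          # typo: should be "metadata"
    managed_accounts:
      - "111111111111"
      - "222222222222"
"""

def accounts_to_keep(config: dict) -> set[str]:
    # The typo means "metadata" is absent, so this silently yields an empty set.
    metadata = config.get("accounts", {}).get("metadata", {})
    return set(metadata.get("managed_accounts", []))

def accounts_to_decommission(all_accounts: set[str], keep: set[str]) -> set[str]:
    # Dangerous conflation: an empty "keep" set is indistinguishable from
    # "every account should be decommissioned".
    return all_accounts - keep

all_accounts = {"111111111111", "222222222222", "333333333333"}
keep = accounts_to_keep(yaml.safe_load(CONFIG))
print(accounts_to_decommission(all_accounts, keep))
# -> all three accounts scheduled for deletion, none of which should be touched

A guard that refuses to act on an empty account set, or schema validation that rejects unknown keys, would break this chain before any deletion is computed; both show up in the action items below.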

Incident response

While our incident response culture is well established, this incident tested it to its full extent. In an all-hands-on-deck situation, the cloud infrastructure team focused on disaster recovery, organized via an incident call. Through an incident chat room, our colleagues reported the impact they were still observing and the progress of recovery in their clusters. The Incident Commanders focused on determining the approach and priority of the recovery efforts, as well as on facilitating communication between the chat room and the incident call. Throughout the incident response we switched Incident Commanders according to their areas of expertise, which kept the response focused and efficient.

Post Mortem

Thanks to great collaboration across teams to recover the needed DNS entries and restore service for our customers, we were back online within a few hours. As the first incident of its kind, and one with large-scale customer impact, it received high attention across the organization. Predictably, this ran into the Google Docs limit on concurrent editors for the document in which the Post Mortem was being written. To reduce the likelihood of this happening again, we've changed all links to Post Mortem documents shared with big audiences to use the /preview URL by default.

Being close to the start of Cyber Week, the team's focus was to complete the Post Mortem analysis and decide upon immediate actions to prevent a similar incident from happening. This included pausing changes to the configuration, a review of all supertools in place, and temporary deactivation of the relevant deletion processes. We also wrote a one-page summary of the incident and proactively shared it with the whole organization to keep everyone informed about the short- and mid-term action items agreed during the Incident Review.

Infrastructure changes

An important, and often vigorously discussed, part of Post Mortems is the set of action items aimed at preventing a recurrence of the incident. In our case, we analyzed how infrastructure changes are reviewed and rolled out a number of improvements aimed at better validation and a smaller blast radius for infrastructure changes that go wrong. Below we focus on the most impactful changes we implemented.

Account lifecycle management changes

We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs no longer resolve. The account is left in this state for one week before we proceed with the real decommissioning. This acts as a final "scream test" to make sure there are no remaining dependencies on the account.
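As a rough illustration of the "simulate deletion" step, here is a sketch using boto3 (the region and the apply-to-every-NACL loop are assumptions; our actual decommissioning tooling is more involved). Deny-all Network ACL entries block traffic in the account while leaving every resource in place:

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

def deny_all_traffic(network_acl_id: str) -> None:
    # Low rule numbers are evaluated first, so these deny rules take
    # precedence over any existing allow rules.
    for egress in (False, True):
        ec2.create_network_acl_entry(
            NetworkAclId=network_acl_id,
            RuleNumber=1,
            Protocol="-1",          # all protocols
            RuleAction="deny",
            Egress=egress,
            CidrBlock="0.0.0.0/0",
        )

# Apply the "simulated deletion" to every Network ACL in the account.
for acl in ec2.describe_network_acls()["NetworkAcls"]:
    deny_all_traffic(acl["NetworkAclId"])

Because rules are added rather than resources removed, rolling back after a noisy scream test only requires deleting those two entries per Network ACL.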

Having assessed the trade-offs and risks of deleting resources, we have additionally decided to be more careful with deletions whose cost-saving potential is low compared to the impact a wrong deletion could have. These deletions are now done manually and take longer to complete, an acceptable trade-off we're willing to make to reduce the risk. To mitigate the potential cost increase, we monitor each account's costs over the previous 7 days. If they exceed a certain threshold, we look into deleting the resources manually.
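The cost check itself can be sketched with the AWS Cost Explorer API via boto3; the threshold value and the grouping by linked account below are illustrative assumptions:

import boto3
from datetime import date, timedelta

COST_THRESHOLD_USD = 100.0  # hypothetical threshold

ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=7)

response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
)

# Sum the last 7 days per account and flag accounts above the threshold.
totals: dict[str, float] = {}
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        account_id = group["Keys"][0]
        totals[account_id] = totals.get(account_id, 0.0) + float(
            group["Metrics"]["UnblendedCost"]["Amount"]
        )

for account_id, total in totals.items():
    if total > COST_THRESHOLD_USD:
        print(f"Account {account_id}: {total:.2f} USD over 7 days, review for manual cleanup")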

Change validation

We've introduced a series of validation steps, for example stringent checks for the presence of mandatory keys and validation of all stack templates with the AWS CloudFormation Linter (cfn-lint) before they get deployed.

We have also set up jsonschema validation for all our configuration files. All these checks run both locally (via pre-commit hooks) and in the CI/CD pipelines. We also made some small quality-of-life improvements to enable autocompletion and schema validation in our local IDEs, which mitigates the possibility of typos and errors and is simple to set up:

# yaml-language-server: $schema=schema/config_schema.json
(your config)
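On the CI side, a minimal sketch of such a check, assuming Python with the jsonschema and PyYAML packages and reusing the schema path from the IDE hint above (our actual schema and wrapper script differ):

import json
import sys

import yaml
from jsonschema import ValidationError, validate

# A schema with "required": ["metadata"] and "additionalProperties": false
# rejects a typo such as "metadpata" instead of silently ignoring it.
with open("schema/config_schema.json") as f:
    schema = json.load(f)

with open(sys.argv[1]) as f:
    config = yaml.safe_load(f)

try:
    validate(instance=config, schema=schema)
except ValidationError as err:
    sys.exit(f"Configuration invalid: {err.message}")

Hooked into pre-commit, the same check rejects the change on the engineer's machine before it ever reaches a Pull Request.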

Additionally, for the creation and decommissioning of critical resources, we have introduced several automated quality checks which ensure that the change corresponds to the user request and the pull request description. These checks also require additional approval from the respective account or cost center owners and validation from the respective managers. The checks are implemented as a GitHub bot that comments on the Pull Request and blocks the merge until all the checks have passed.
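Mechanically, blocking a merge comes down to reporting a commit status that branch protection treats as required. Here is a sketch against the GitHub Statuses API, with a hypothetical repository, commit SHA, and status context (an actual bot would derive these from the webhook payload):

import os

import requests

OWNER, REPO, SHA = "example-org", "infrastructure-config", "abc123"  # hypothetical

# Report a failing status; with branch protection requiring this context,
# GitHub blocks the merge until the state is switched to "success".
requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/statuses/{SHA}",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    json={
        "state": "failure",
        "context": "account-lifecycle/owner-approval",
        "description": "Waiting for approval from the account owner",
    },
    timeout=10,
).raise_for_status()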

Change previews

We have implemented automated change previews in Pull Request comments. This feature leverages AWS CloudFormation ChangeSets. When an updated stack template is provided to the CloudFormation CreateChangeSet endpoint, CloudFormation generates a JSON preview of the changes, which can then be executed or rejected. We read this ChangeSet from each account in our AWS Organization and merge them to create a human-readable preview of the changes in a PR comment. After the preview is created, the ChangeSet is dropped.
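A sketch of the per-account step, assuming boto3 and a hypothetical stack name and template path (merging the results from all accounts into a single PR comment is omitted):

import boto3

STACK_NAME = "account-baseline"   # hypothetical
CHANGE_SET_NAME = "pr-preview"

cfn = boto3.client("cloudformation")

with open("templates/account-baseline.yaml") as f:
    template_body = f.read()

# Ask CloudFormation to compute the diff without applying it.
cfn.create_change_set(
    StackName=STACK_NAME,
    ChangeSetName=CHANGE_SET_NAME,
    TemplateBody=template_body,
    ChangeSetType="UPDATE",
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("change_set_create_complete").wait(
    StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME
)

# Summarize the planned changes for the PR comment ...
for change in cfn.describe_change_set(
    StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME
)["Changes"]:
    rc = change["ResourceChange"]
    print(f'{rc["Action"]}: {rc["LogicalResourceId"]} ({rc["ResourceType"]})')

# ... and drop the ChangeSet afterwards; it was only needed for the preview.
cfn.delete_change_set(StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME)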

Preview of changes in Pull Requests


Phased rollout

Our Kubernetes cluster rollout already included a phased rollout to different groups of clusters. We extended this idea to our AWS infrastructure. The rollout process adopted by our tooling now includes a gradual rollout through different release channels, each associated with a few AWS account categories (e.g. playground, test, infra). All changes must pass through all release channels before reaching production. This approach allows us to deploy changes gradually to different accounts, ensuring a more controlled propagation that catches errors early with a limited blast radius. The trade-off, of course, is that the rollout takes longer.
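Conceptually, the channel ordering reduces to a list that the rollout job walks through, promoting a change only once the previous channel looks healthy; the channel names and helper callbacks below are illustrative:

from typing import Callable

# Illustrative channel order; each channel maps to a few AWS account categories.
RELEASE_CHANNELS = ["playground", "test", "infra", "production"]

def roll_out(change_id: str,
             deploy: Callable[[str, str], None],
             is_healthy: Callable[[str], bool]) -> None:
    # Deploy a change channel by channel, stopping at the first unhealthy one
    # so that later channels (including production) stay untouched.
    for channel in RELEASE_CHANNELS:
        deploy(change_id, channel)
        if not is_healthy(channel):
            raise RuntimeError(f"Rollout of {change_id} halted: {channel} unhealthy")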

Summary

Supertools never sleep (unless you program them otherwise!). They're powerful, yet often misjudged in review processes, as they're expected to only act within the scope of the expected changes. As our story shows, this depends heavily on the implementation, and it's important to build additional safety nets into the processes and tooling. We hope that the examples of changes we've implemented in our infrastructure help you reflect on and improve the mechanisms in your own context.


We're hiring! Do you like working in an ever evolving organization such as Zalando? Consider joining our teams as a Backend Engineer!


