OpenTelemetry for JavaScript Observability at Zalando
How Zalando improved observability for Node.js and web applications using OpenTelemetry Read more...
2024
Node.js and the tale of worker threads
Join me on a Friday night on-call investigation into a rogue Node.js service. Read more...
2024
End-to-end test probes with Playwright
Learn how we set up reliable automated end-to-end test probes for our Zalando website using Playwright Read more...
2024
Failing to Auto Scale Elasticsearch in Kubernetes
A story of operational failure in a large-scale Elasticsearch installation, including the root cause analysis and... Read more...
2024
12 Golden Signals To Discover Anomalies And Performance Issues on Your AWS RDS Fleet
Automate anomaly detection for AWS RDS at scale. Read more...
2024
Tale of 'metadpata': the revenge of the supertools
One day in November 2022, we brought down our shop with a single character. This post recaps on the lessons we... Read more...
2024
All you need to know about timeouts
How to set a reasonable timeout for your microservices to achieve maximum performance and resilience. Read more...
2023
How we manage our 1200 incident playbooks
We consolidated our incident playbooks in September 2019. 1200 playbooks later... Read more...
2023
Operation-Based SLOs
Zalando developed a new type of SLOs to monitor the critical aspects of its business which is based on Operations.... Read more...
2022
Tracing SRE's journey in Zalando - Part III
Follow Zalando's journey to adopt SRE in its tech organization. Read more...
2021