Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster
How a proxy server helped us keep our Kubernetes clusters stable while scaling our ingress controller.
Introduction
At Zalando, we faced a critical challenge: our ingress controller was threatening to overload our Kubernetes cluster. We needed a solution that could handle the increasing traffic and scale efficiently. This is the story of how we implemented a Route Server to manage control plane traffic more effectively and ensure a stable cluster.
Skipper: Our Ingress Controller
We use Skipper, our HTTP reverse proxy for service composition, to implement the control plane and data plane of Kubernetes Ingresses and RouteGroups. Creating an Ingress or RouteGroup results in an AWS load balancer [1] with TLS termination targeting Skipper (via kube-ingress-aws-controller), HTTP routes in Skipper, and a DNS name pointing to the load balancer (via external-dns).
To understand the deployment context, this is the scale we operate at:
- 15,000 Ingresses and 5,000 RouteGroups.
- Traffic of up to 2,000,000 requests per second.
- 80-90% of our traffic is authenticated service-to-service calls, with daily volumes between 500,000 and 1,000,000 rps across our entire service fleet.
- 200 Kubernetes clusters.
The Challenge
Scaling Pain Points
Skipper instances were fetching Ingresses and RouteGroups from the Kubernetes API, which worked well initially. But the rapid growth in Skipper instances, reaching approximately 180 per cluster, began to overwhelm our etcd infrastructure.
This overload cascaded into severe Kubernetes API CPU throttling issues. These performance bottlenecks led to critical control plane stability risks, manifesting in two primary ways: our clusters lost the ability to schedule new pods effectively, and existing pod management operations began to fail. This combination of issues threatened the overall stability and reliability of our Kubernetes infrastructure.
Implementing Route Server
Before Route Server, Skipper was responsible for:
- Polling Kubernetes API for Ingresses and RouteGroups.
- Parsing and processing the resources into the Eskip [2] format.
- Validating generated Eskip format.
- Updating the routing table.
We introduced Route Server, a custom proxy layer that handles the control plane traffic more efficiently and acts as a proxy with an HTTP ETag cache layer between Skipper and the Kubernetes API server.
Now Route Server handles the polling and parsing operations, reducing Skipper's computational overhead while implementing a clear separation of concerns.
Cache Layer
Route Server polls the Kubernetes API at 3-second intervals to fetch the latest Ingresses and RouteGroups. It then generates both a routing table and a corresponding ETag value. When Skipper requests updates from Route Server, it includes its current ETag. If this matches Route Server's current ETag, indicating no changes, Route Server responds with an HTTP 304 (Not Modified) status. If the ETags differ, Route Server sends the updated routing table to Skipper, which then updates its local configuration and stored ETag.
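To illustrate the mechanism, here is a minimal sketch in Go of how such an ETag cache layer can work. The routeCache type, the SHA-256 hashing and the sample eskip route are illustrative assumptions for this post, not Route Server's actual implementation:

package main

import (
	"crypto/sha256"
	"fmt"
	"log"
	"net/http"
	"sync"
)

// routeCache holds the latest generated routing table (in eskip format)
// together with an ETag derived from its content.
type routeCache struct {
	mu    sync.RWMutex
	eskip []byte
	etag  string
}

// update stores a freshly generated routing table and recomputes the ETag.
func (c *routeCache) update(eskip []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.eskip = eskip
	c.etag = fmt.Sprintf("%q", fmt.Sprintf("%x", sha256.Sum256(eskip)))
}

// ServeHTTP answers Skipper's polling requests: HTTP 304 when the client's
// If-None-Match header matches the current ETag, otherwise the full table.
func (c *routeCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	eskip, etag := c.eskip, c.etag
	c.mu.RUnlock()

	w.Header().Set("ETag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Write(eskip)
}

func main() {
	cache := &routeCache{}
	// A background loop would poll the Kubernetes API every 3 seconds and
	// call cache.update with the regenerated routing table; here we just
	// seed it with a single sample eskip route.
	cache.update([]byte(`healthz: Path("/healthz") -> status(200) -> <shunt>;`))

	http.Handle("/routes", cache)
	log.Fatal(http.ListenAndServe(":9090", nil))
}

With this in place, Skipper only pays the cost of transferring the routing table when it has actually changed; in the common case the poll is a cheap 304 round trip.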
Route Server Not Available
While Route Server significantly improved our system's efficiency, we also had to consider potential failure scenarios. There are two possibilities when Route Server is not available:
- Skipper doesn't have an initial routing table.
- Skipper has an initial routing table but Route Server is not available to update it.
In the first case, the Skipper container won't start if the -wait-first-route-load flag is enabled. In the second case, Skipper continues to work with the last known routing table. This is a trade-off between availability and consistency.
In both cases we get an alert and decide whether to fix Route Server or to disable it and let Skipper work without it. Currently, we don't have an automatic fallback mechanism to the old approach.
The final flow with Route Server integrated is as follows:
Rollout Strategy
Rolling out Route Server wasn't a simple task. A single mistake could break the connection between Kubernetes API and Skipper, potentially impacting our sales and gross merchandise volume (GMV). We needed to be extremely cautious and follow a well-structured rollout strategy.
We planned to roll out Route Server in a controlled manner, starting with test clusters. Production clusters were categorized into tiers, with Route Server deployed tier by tier, each monitored before proceeding to the next.
To do this, we defined different setup modes for rolling out the Route Server:
- Mode: False - Disabled mode
- Mode: Pre - Pre-processing mode
- Mode: Exec - Execution mode
These modes are controlled via a configuration item. The default mode is false, which means Route Server is disabled and we use the regular control plane traffic.
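As a rough sketch of how such a mode switch can be wired up (the flag name and mode handling below are illustrative assumptions, not our actual configuration mechanism):

package main

import (
	"flag"
	"log"
)

// Mode controls how Route Server participates in the control plane.
type Mode string

const (
	ModeFalse Mode = "false" // disabled: Skipper talks to the Kubernetes API directly
	ModePre   Mode = "pre"   // pre-processing: Route Server runs alongside Skipper for comparison
	ModeExec  Mode = "exec"  // execution: Skipper fetches its routing table from Route Server
)

func main() {
	mode := flag.String("routesrv-mode", string(ModeFalse), "false, pre or exec")
	flag.Parse()

	switch Mode(*mode) {
	case ModeFalse:
		log.Println("Route Server disabled, Skipper polls the Kubernetes API itself")
	case ModePre:
		log.Println("Route Server pre-processes routes for comparison only")
	case ModeExec:
		log.Println("Skipper polls Route Server instead of the Kubernetes API")
	default:
		log.Fatalf("unknown mode %q", *mode)
	}
}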
Pre-Processing Mode
In this mode, Route Server works alongside Skipper: it fetches the Ingress and RouteGroup resources from the Kubernetes API and preprocesses them. This mode is useful for testing and debugging, which was a key factor in our rollout strategy.
We were able to fetch the routing tables from both Skipper and Route Server and compare them to ensure Route Server was working as expected. Remember, if our routing table breaks for any reason, we have downtime. That's why we had to be extra cautious and check even the smallest difference in the routing tables across all clusters.
# very big limit to get all routes from Skipper
$ curl -i http://127.0.0.1:9911/routes\?limit\=10000000000000\&nopretty > skipper_routes.eskip
# get all routes from Route Server; we decided not to use pagination to reduce the number of requests, and Skipper is currently the only consumer
$ curl -i http://127.0.0.1:9090/routes > routesrv_routes.eskip
$ git diff --no-index -- skipper_routes.eskip routesrv_routes.eskip
Execution Mode
In this mode, Route Server acts as a proxy between Skipper and the Kubernetes API. Skipper sends requests to the Route Server, which then forwards them to the Kubernetes API. The Route Server caches the responses and sends them back to Skipper. This mode is the final setup for production.
Production Rollout
After thorough testing, including load tests, we rolled out Route Server to production in a controlled manner:
- Rolled out to all test clusters and monitored for 2 weeks.
- Deployed to production clusters tier by tier, monitoring each tier before proceeding.
Alternative Solutions
We considered using Kubernetes Informers to watch for changes in the Kubernetes API. However, this approach would still require the Kubernetes API to send updates to all Skipper instances, which could lead to the same issues we faced: a sudden increase in traffic that HPA cannot catch up with by scaling the Kubernetes API and etcd.
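For comparison, the informer-based alternative would have looked roughly like the sketch below (illustrative only; shown here for Ingresses, while RouteGroups are a CRD and would need their own informer). Every Skipper replica running this keeps its own watch open against the API server, so the control plane load still grows with the number of replicas:

package main

import (
	"log"
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster configuration, as each Skipper pod would use.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Each replica keeps its own watch open against the API server.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	ingressInformer := factory.Networking().V1().Ingresses().Informer()

	ingressInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ing := obj.(*networkingv1.Ingress)
			log.Printf("ingress added: %s/%s", ing.Namespace, ing.Name)
			// Here the routing table would be regenerated.
		},
		UpdateFunc: func(_, newObj interface{}) {
			ing := newObj.(*networkingv1.Ingress)
			log.Printf("ingress updated: %s/%s", ing.Namespace, ing.Name)
		},
		DeleteFunc: func(obj interface{}) {
			log.Println("ingress deleted, routing table would be regenerated")
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever
}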
Future Improvements
- Automatic Fallback: Implement a fallback mechanism to ensure Skipper can continue to operate if Route Server is unavailable.
Summary
- Achieved zero downtime and no gross merchandise volume (GMV) loss during rollout.
- Extended Skipper HPA to 300 pods.
- One RouteSRV deployment can handle up to 100 RPS with no issues, which corresponds to roughly 300 Skipper pods each polling every 3 seconds.
- Route Server is now a core component of our platform.
[1] The AWS load balancer can be an ALB or NLB, depending on the kube-ingress-aws-controller annotation.
[2] Eskip implements an in-memory representation of Skipper routes and a DSL for describing Skipper route expressions, route definitions and complete routing tables.