Platform Engineering and Third Generation Microservices in Dublin
Closing the gap between systems of record and intelligence is a team vision.
The Zalando Dublin Technology Hub formed in 2015 in part to research and develop products and services using data science. There’s ongoing collaboration with our colleagues in Germany and Finland. Recently, over twenty data science researchers and engineers from Dublin attended a two-day internal conference in Berlin, where they presented on topics ranging from multilingual analysis of fashion content and image tagging with TensorFlow to infrastructure for model hierarchies. Dublin Zalandos are also involved in local meetups and talks. As a result, we’re now best known in the Dublin community for our data science work and culture.
The part we talk about less is that we are building a growing number of critical elements of the Zalando Platform. Platform is central to the company’s thinking: recently Zalando hosted Vizions, the first conference of its kind in Europe to focus on platform business models. Just as our work on machine learning focuses on systems of engagement and intelligence, our platform engineering work focuses on the systems of record.
Our platform work started small. Initially, a few engineers began looking at fashion article data models and services. Today that has blossomed into a set of teams developing for a broad range of customer needs and we’re continuing to grow our platform teams.
Platform Engineering
Building a platform involves more than engineering and scaling. Platforms are also about providing leverage to others. We want teams and customers to work at higher levels and innovate on problems. In Dublin this has led us to take a considered approach to what we call platform engineering.
Platform teams are autonomous and own their impact, purpose, and technical direction. They also establish a technology vision to support customer and product goals. Teams thus concentrate on building what matters, but also think long term to provide a sustainable technical runway. Quality is not a surface element: what we have to build, we have to build well and be proud of. So while we iterate and learn on design details, we can’t always justify 80/20 thinking. In some cases what we need to build is almost a given: functions like products, articles, customers and categories are not speculative.
We’re still learning how best to organise the platform and data science teams. An approach that’s working so far is establishing broad product areas that many teams contribute to. In combination with a microservices architecture, it’s proving a viable way to spin up new teams to solve new problems and maintain consistent team sizes. It lets us strike a balance between exploring new technology options that give us leverage over problems and using what we know works well. Cells of teams gossip knowledge and experiences faster, which increases our learning rate. Finally, it means individuals don’t need to work on one thing for too long. Technical depth is critical, but new challenges are important for personal development and maintaining energy.
The mix of engineering for scale and performance targets, while moving quickly to enable future growth, is rewarding and challenging work. Arguably it’s unusual — large companies that set up remote offices often work at an impressive scale, but sometimes with a narrow scope for ownership and impact.
Third Generation Microservices
Building a new suite of services is an opportunity to learn from the past and examine things from first principles. We’re fortunate to have engineers who have already worked on microservices, giving us a strong knowledge base to draw from.
An observation we’ve made is that synchronous request-response systems require a raft of supporting subsystems. They tend to need specific incident management processes to handle aspects like fanout latency or partial failure modes. They can also be tricky to compose. We’re finding that a more asynchronous, event-driven, message-passing style works well in many cases. Services don’t always need to call each other. In fact, they often don’t need to know about each other at all. Instead, they can work on incoming events and data handed off by other services in a functional manner. This is based on another observation: the kinds of problems we need to solve are dominated by access to and processing of data, especially data about what’s changing.
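As a rough illustration of that style, here is a minimal Python sketch, not our production code. It assumes a hypothetical `ArticleEvent` type and an in-memory list standing in for a real event stream. Two independent "services" are written as pure fold functions over the same events, and neither one calls or knows about the other.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Dict, Iterable, TypeVar

S = TypeVar("S")


@dataclass(frozen=True)
class ArticleEvent:
    """Hypothetical change event; the field names are illustrative only."""
    article_id: str
    kind: str  # "created", "price_changed" or "deleted"
    price: float = 0.0


def apply_price(state: Dict[str, float], event: ArticleEvent) -> Dict[str, float]:
    """Pure step for a price view: returns a new dict instead of mutating."""
    if event.kind == "deleted":
        return {k: v for k, v in state.items() if k != event.article_id}
    return {**state, event.article_id: event.price}


def count_kinds(state: Dict[str, int], event: ArticleEvent) -> Dict[str, int]:
    """Pure step for an unrelated consumer that counts events by kind."""
    return {**state, event.kind: state.get(event.kind, 0) + 1}


def replay(step: Callable[[S, ArticleEvent], S],
           events: Iterable[ArticleEvent],
           initial: S) -> S:
    """Fold a stream of events into a consumer's current state."""
    return reduce(step, events, initial)


# An in-memory list stands in for the event stream (an assumption of this sketch).
stream = [
    ArticleEvent("a1", "created", 20.0),
    ArticleEvent("a2", "created", 35.0),
    ArticleEvent("a1", "price_changed", 18.0),
    ArticleEvent("a2", "deleted"),
]

price_view = replay(apply_price, stream, {})
kind_counts = replay(count_kinds, stream, {})
```

Because each consumer is just a function from state and event to new state, adding a third consumer does not require changing the first two or the producer of the events.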
And so today, most of our teams are working with data streams and have adopted functional programming. In this approach, data streaming tools and techniques move from the edge of the system, where they have become popular for data integration and for capture into systems like data lakes and warehouses, to become a central part of the service fabric.
You might call this approach third generation microservices. While it’s not exactly a new approach, it’s not widespread: the industry state of the art is still centered around request-response systems that focus on resources and entities rather than on what’s happening and changing in the system. First generation services were coarse grained, often tied to a business unit or tier. Second generation services are fine grained, arranged around request/response calls, and optimised for organisational velocity. They are what many would consider the “microservices” state of the art today, but we think an industry transition will soon occur.
For us, it’s early days — we have a lot to learn about doing this at the scale and speed we want. But unifying the worlds of microservices and data streams to achieve our goals is something we’re excited to make progress on.
Closing the Loop
Closing the gap between systems of record and systems of intelligence is a vision of the teams in Dublin.
For research engineers and data scientists, data streams allow the results of machine learning experiments and models to integrate more easily with platform datasets. Making platform data available via well defined event streams, and moving away from batch/periodic data dumps, also improves data accessibility for research scientists. Event streams also enable data scientists to build their own service APIs and user interfaces on top of the data. Ultimately, this is why we take a platform mindset: data and research scientists can work faster and establish value at higher levels if they can rely on well engineered platform primitives that give them access to the information they need.
For platform engineering, we see value in using typed functional techniques and data streams as a complement to our microservices architecture, technology radar, and API guidelines. These approaches simplify systems operations, enable closer to real time data processing, and allow easier service composition.
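To make the composition point concrete, here is a small sketch in Python with type hints. It is an illustration under assumptions, not a description of our services: `StockEvent`, `in_stock`, `to_label` and `compose` are hypothetical names, and the stream is a plain in-memory list. Each stage is a typed function from one stream to another, so stages can be chained without knowing how the others are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class StockEvent:
    """Hypothetical typed event; the fields are illustrative only."""
    sku: str
    quantity: int


def compose(first: Callable[[Iterable[A]], Iterator[B]],
            second: Callable[[Iterable[B]], Iterator[C]]
            ) -> Callable[[Iterable[A]], Iterator[C]]:
    """Chain two stream stages; the type hints document that the output
    element type of `first` must match the input element type of `second`."""
    def pipeline(events: Iterable[A]) -> Iterator[C]:
        return second(first(events))
    return pipeline


def in_stock(events: Iterable[StockEvent]) -> Iterator[StockEvent]:
    """Stage 1: lazily keep only events with a positive quantity."""
    return (e for e in events if e.quantity > 0)


def to_label(events: Iterable[StockEvent]) -> Iterator[str]:
    """Stage 2: lazily turn each event into a display label."""
    return (f"{e.sku} x{e.quantity}" for e in events)


available_labels = compose(in_stock, to_label)

labels: List[str] = list(available_labels([
    StockEvent("sku-1", 3),
    StockEvent("sku-2", 0),
    StockEvent("sku-3", 7),
]))
```

The stages are generators, so nothing is processed until the result is consumed, and a static type checker can flag a pipeline whose stages do not line up.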
Perhaps the best outcome of organising around data and event streams has been greater knowledge exchange between software engineering and data science. These disciplines often work in isolation in the industry, but they have much to learn from each other.
Conclusion
Zalando is a multi-billion dollar business with the fastest-growing technology engineering group in Europe. In Dublin it’s been a challenge and a delight to work on larger problems and make broader contributions. We’re just getting started, and we’re excited about what’s next.
We're hiring! Do you like working in an ever-evolving organization such as Zalando? Consider joining our teams as an Applied Scientist!