For several years, we have been building otto.de in independent teams that continuously put their changes live without having to coordinate with each other. A wide range of different tests helps us to deploy these changes quickly and without fear of errors. However, one class of tests has not yet gained the traction it deserves in the industry: consumer-driven contracts, or CDCs. So I'd like to use our recent findings to write about them here.
For all types of data (product data, user information, purchases, discount promotions, etc.), we usually have one system that has sovereignty over it. If other systems also want to use this data, they request a copy asynchronously in the background from a provided interface and transfer it automatically into their own database. This way we avoid long request cascades between systems, which is good for our response times to customer requests and very helpful in keeping the overall architecture resilient.
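The replication idea can be sketched in a few lines. All names here are hypothetical stand-ins: a background job pulls the records it needs from the providing system's interface and upserts them into its own store, so customer-facing requests never have to call the other system.

```python
# Sketch of asynchronous data replication; names are illustrative only.
def replicate(fetch_feed, local_store):
    """Copy the provider's current records into our own database (here: a dict)."""
    for record in fetch_feed():
        local_store[record["id"]] = record  # upsert by primary key
    return local_store

# Stand-in for the provider's interface; in reality this would be an HTTP call.
def fake_price_feed():
    return [
        {"id": "p1", "price": 1999},
        {"id": "p2", "price": 4950},
    ]

store = {}
replicate(fake_price_feed, store)
print(store["p1"]["price"])  # -> 1999
```

In production this job would run on a schedule, decoupling the consumer's availability from the provider's.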
Up to now, the servers at our company have communicated almost exclusively via REST-like HTTP interfaces. So there is one system that provides data (the server) and one or more systems that are supposed to retrieve the data from there (the clients).
The server provides an HTTP endpoint - for example, for current product prices - and a client can make HTTP requests against it. Because the endpoints are generally protected against unauthorized access, keystores and credentials still belong in the picture, but essentially that's it:
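The setup above can be sketched from the client's side. The endpoint URL and token below are purely hypothetical; the point is that the client builds an authenticated HTTP request against the server's interface. We only construct the request here rather than sending it.

```python
import urllib.request

# Hypothetical price endpoint and placeholder token, for illustration only.
request = urllib.request.Request(
    "https://prices.example.internal/products/p1/price",
    headers={
        "Authorization": "Bearer <token-from-keystore>",
        "Accept": "application/json",
    },
)

print(request.get_method())          # -> GET
print(request.get_header("Accept"))  # -> application/json
```

In the real pipeline the token would come from the team's keystore, never from the code.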
The server's interface usually has some kind of documentation or specification, and test coverage deemed appropriate by the server team. Nevertheless, dependencies on implementation details still creep into the client: teams rely on JSON elements always appearing in a certain order, or they can't handle new data fields. Naturally, everyone involved resolves to be extra attentive. Or they are convinced that these kinds of mistakes only happen to others. Yet it happens again and again that something that worked before suddenly doesn't work anymore: a bug. This kind of glitch is also inconspicuous enough that it can sit in the live system for quite some time before anyone notices it.
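A contract test that guards against exactly these brittle dependencies pins down only what the client actually relies on. Here is a minimal sketch in the tolerant-reader style, with a hypothetical price payload: it asserts the fields the client uses and deliberately ignores element order and unknown fields.

```python
import json

# Fields this client actually depends on (hypothetical example contract).
REQUIRED_FIELDS = {"productId": str, "price": int, "currency": str}

def check_contract(payload: str) -> dict:
    """Fail only if a field the client needs is missing or has the wrong type."""
    data = json.loads(payload)
    for field, expected_type in REQUIRED_FIELDS.items():
        assert field in data, f"missing field: {field}"
        assert isinstance(data[field], expected_type), f"wrong type: {field}"
    return data

# The server may reorder elements or add new fields without breaking us:
response = '{"currency": "EUR", "productId": "p1", "price": 1999, "newField": true}'
data = check_contract(response)
print(data["price"])  # -> 1999
```

A test like this documents the client's real expectations far more precisely than a prose specification can.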
So I wish we would at least notice it quickly:
However, this test only runs when the client team's pipeline is triggered, for example by a code change. If the team is currently working on another service, an incompatible change on the server side could still go live unnoticed.
So my wish list is actually a bit longer:
When the client team's test runs in the server team's pipeline, it's called a CDC test. The concept was popularized by an article published on martinfowler.com in 2006, but then seems to have been somewhat forgotten again.
This is a shame, because it effectively prevents a problematic interface change in the server from going live at all:
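The effect described here can be sketched as a gate in the server pipeline. The test runners below are hypothetical stand-ins: every consumer contributes one test, and a single failure blocks the deployment before the incompatible change can go live.

```python
# Sketch of a CDC gate in the server pipeline; names are illustrative only.
def may_deploy(cdc_tests) -> bool:
    """Deploy only if every consumer's contract test passes."""
    return all(test() for test in cdc_tests)

# Stand-ins for two client teams' tests:
search_team_test = lambda: True     # contract still satisfied
checkout_team_test = lambda: False  # would break on the new payload

print(may_deploy([search_team_test]))                      # -> True
print(may_deploy([search_team_test, checkout_team_test]))  # -> False
```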
In the context of otto.de, such a test was initially placed in the central Artifactory, e.g. as a jar file, fetched by the server pipeline, and executed there. Some teams provided shell scripts that, for example, determine and download the correct version of the jar.
A disadvantage of this approach is that the runtime dependencies of the test must be present in the pipeline. When we introduced Java 8, for example, it bothered us that some server pipelines were still running on Java 6. Other teams' tests, in turn, rely on a Chrome binary or X11 libraries, which the server team then has to provide in its pipeline. In response, some teams wrapped their jar files in Docker images. This reduced the problem somewhat, although the server team now of course has to contend with the age-old Docker-in-Docker problem, and Docker version changes also tend to bring incompatibilities of their own. For the same reason, an experiment with Pact was over fairly quickly when one team was keen to introduce a newer version that didn't work with the other team's version.
It's also annoying that the server team has to keep the client test's credentials available in its pipeline, because credentials are of course not allowed to live in the code. So these credentials exist in parallel to the ones that the client team and the server team have to keep in their environments anyway in order to authenticate and validate requests.
So while some wishes were fulfilled, my wish list got longer almost as fast:
Despite the shortcomings, and in complete disregard of my wish list, this remained the state of our CDC tests for almost six years. It wasn't really good, but it was good enough.
So, in order for the CDC tests to keep working, the server team also had to drill holes in its firewalls. In addition, with infrastructure as code, the network path between client and server is now subject to potentially continuous change: a new way to accidentally break clients.
So the server team now already has to ...
That was a good time to rethink the distribution of tasks.
The client team now deploys the test as a Lambda function, as an EC2 instance, or as part of its system. An API gateway allows the server team to launch the test with an HTTP request. Since the test is deployed together with the production code, it always has the correct version. The server team doesn't have to download anything, and the only technical dependency is the ability to send an HTTP request from the pipeline - which, thanks to curl or wget, is no challenge for any team. The credentials for the test are in the client's keystore anyway, and now finally all the wishes on the list have come true: