How we used a simple trick to save USD 500,000 in data transfer costs
The Cost Optimization Service Team (COST) was created as part of OTTO’s continuous development and optimization. The team aims to strengthen the culture of cost-consciousness within the company and to significantly improve the efficiency of our cloud usage. One measure we identified and analyzed proved particularly effective: the platform-wide migration to dual-stack IPv6, which enabled us to significantly reduce the costs incurred by the AWS NAT gateway. This article gives a detailed account of how this measure was derived and implemented.
COST was founded at the beginning of 2023 with the aim of improving the culture of cost-consciousness within OTTO GmbH and increasing the efficiency of our cloud usage. Our three team members came together to tackle the challenges outlined below.
We tackle these challenges above all with a bottom-up approach and by working directly with the teams. A cost-conscious culture grows and becomes day-to-day practice when costs are discussed more often and in greater depth. The development teams are directly involved in these discussions and informed about possible changes to the infrastructure, with a clear focus on generating savings.
The teams work according to the principle “You build it, you run it.” Accordingly, they are responsible not only for operation, further development and security, but also for costs. Cost considerations should therefore be part of the decision-making process from the outset when opting for a particular architecture.
Systems that have been running for several years, and operations in general, reveal traces of the past in the form of technical debt. If the focus is solely on building new capabilities, there is not enough time to identify and eliminate this debt: inefficient solutions accumulate and generate costs that could otherwise be avoided. Last but not least, some cloud services have been in use for years without it ever being clearly communicated how they work, why they are needed, and how they generate costs. This knowledge gap can be remedied through education.
As COST, we analyze cloud services in a data-driven way and bring the necessary knowledge back to the teams. In the process, we identify technical optimization options that the teams can implement with minimum effort, and we classify each measure in terms of expected savings and effort.
As a final, yet important, service, we support the development teams both in building up a knowledge base for cloud services and in identifying and eliminating technical debt. We test the technical measures beforehand in our own team environments or together with the teams in question, and we create step-by-step guides that show the development teams how to generate savings. We adopt a hands-on mentality here, which encourages the teams to exchange ideas and to actively work on the topic at hand.
In this way, COST plays a decisive role in supporting OTTO on its way to working even more efficiently and cost-consciously.
The teams within OTTO GmbH work autonomously and generally operate their (micro)services within a VPC. Because several availability zones are in use, each VPC consists of multiple public and private subnets and has so far been based purely on IPv4. The teams’ services are located in the private subnets, i.e. they do not receive public IPv4 addresses and must therefore communicate via the AWS NAT gateway (see Figure).
This approach has worked well over the years and has allowed teams to work and scale separately and autonomously. However, it has also come at a price: because the teams communicate with each other from VPC to VPC, and use the NAT gateways provided by AWS to do so, high costs result.
Two types of costs are relevant for the use of NAT gateways: first, each gateway is billed on an hourly basis; second, and far more critical, there are the variable traffic costs, which are charged for each GB transmitted through the gateway. These costs can be found in AWS Cost Explorer under the “Usage Type” NatGateway-Bytes.
Currently (as of May 2024), these costs amount to USD 0.052/GB in the Frankfurt region. In 2023 alone, the department that operates otto.de incurred costs of approx. USD 79,000/month for this cost item! How do we get these costs under control? The secret is IPv6.
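As a rough plausibility check, the quoted figures can be put together in a few lines. This is only a sketch: the per-GB rate comes from the article, while the hourly rate and the gateway count are assumptions for illustration, not official AWS pricing logic.

```python
# Back-of-the-envelope NAT gateway cost model (Frankfurt, May 2024).
# GB_RATE_USD is the NatGateway-Bytes rate from the article; the hourly rate
# and the number of gateways below are assumptions for illustration.
GB_RATE_USD = 0.052      # USD per GB processed by the NAT gateway
HOURLY_RATE_USD = 0.052  # assumed USD per NAT gateway per hour

def monthly_nat_cost(gb_per_month: float, gateways: int = 3, hours: float = 730) -> float:
    """Hourly charge for the gateways plus the variable per-GB traffic charge."""
    return gateways * hours * HOURLY_RATE_USD + gb_per_month * GB_RATE_USD

# A NatGateway-Bytes bill of ~USD 79,000/month implies roughly this traffic volume:
implied_gb = 79_000 / GB_RATE_USD
print(f"{implied_gb / 1_000_000:.1f} million GB per month")  # roughly 1.5 PB/month
```

The striking point is that the variable per-GB charge, not the hourly charge, dominates at this traffic volume.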
First, let’s take a step back and recap why we need NAT gateways in the first place. A NAT (Network Address Translation) gateway acts as an intermediary between a private network (VPC) and the internet. It allows hosts within the VPC to access the internet without each being assigned a public IP address. Each host in the VPC’s private subnets has a private IPv4 address that cannot be routed on the internet.
A NAT gateway, on the other hand, has a unique public IP address. If a host in a private subnet wishes to establish a connection to the internet, it sends its request to the NAT gateway. The gateway accepts the request and replaces the host’s private IPv4 address with its own public IPv4 address. When the remote server responds, it sends the response to the public IP address of the NAT gateway, which then forwards it to the corresponding private IPv4 address in the internal network.
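The translation step described above can be sketched as a simple lookup table. This is a toy model for illustration only, not AWS’s implementation; the public IP is a placeholder from the documentation range.

```python
# Toy model of the NAT translation described above: outbound packets leave with
# the gateway's public IP and a fresh port; replies are mapped back via the table.
NAT_PUBLIC_IP = "203.0.113.10"  # placeholder public IP (TEST-NET-3 range)

class NatGateway:
    def __init__(self):
        self._table = {}      # public port -> (private ip, private port)
        self._next_port = 1024

    def outbound(self, private_ip: str, private_port: int):
        """Rewrite the source of an outgoing packet and remember the mapping."""
        public_port = self._next_port
        self._next_port += 1
        self._table[public_port] = (private_ip, private_port)
        return NAT_PUBLIC_IP, public_port

    def inbound(self, public_port: int):
        """Forward a reply to the private host that opened the connection."""
        return self._table[public_port]

nat = NatGateway()
ip, port = nat.outbound("10.0.1.17", 4711)   # packet leaves with the public IP
assert nat.inbound(port) == ("10.0.1.17", 4711)
```

Note that an inbound packet is only forwarded if a matching table entry exists, which is why hosts behind the gateway are not reachable from outside.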
Due to security concerns, simply placing all hosts in public subnets and allocating a public IPv4 address to each of them is out of the question, as this would make every host potentially accessible from the internet. Furthermore, allocating public IPv4 addresses is itself costly, especially when you consider that our systems scale horizontally, often to dozens, if not hundreds, of instances – and that per team!
In addition to the AWS NAT gateway, there is also the option of hosting NAT gateways yourself, which our proxy team has in fact configured. However, this incurs other expenses, such as instance and maintenance costs, as well as significantly higher operating effort.
NAT was created to address the problem that IPv4 only allows the distribution of a limited number of IP addresses – namely 2³². Due to the rapid growth of the internet and the increasing number of connected devices, this number has become too small. Wouldn’t it be nice if there were no differentiation between public and private IP addresses?
IPv6, in contrast, uses 128-bit addresses and therefore offers a much larger address space: there are 2¹²⁸ possible IPv6 addresses in total – far more than the number of stars in the observable universe! With this larger address space, it is no longer necessary to differentiate between public and private IP addresses, which means that NAT is no longer required.
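For scale, the two address spaces compare as follows:

```python
# IPv4 vs. IPv6 address space, as discussed above
ipv4 = 2**32   # all possible IPv4 addresses
ipv6 = 2**128  # all possible IPv6 addresses

print(f"{ipv4:,}")        # 4,294,967,296
# Every single IPv4 address could be replaced by 2^96 IPv6 addresses:
print(ipv6 // ipv4 == 2**96)  # True
```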
So much for the theory. But what does an implementation look like in practice? The IPv6 migration is notorious for being a long, protracted process. Although the protocol is already over 20 years old, it has by no means become established everywhere.
Fortunately, setting it up in AWS is very simple. AWS makes it possible to set up VPCs in dual-stack mode. This means that both IPv6 and IPv4 are supported. And this works as follows:
First, a globally unique IPv6 CIDR block (::/56) is assigned to the VPC (1). As with IPv4-only, this block is subdivided into n subnets (private and public). Since IPv6 does not differentiate between public and private IP addresses, aren’t all subnets ultimately public? No, because an instance can only be reached if there is a default route between the instance and the internet gateway (3). The abstraction of the private subnets is reconstructed by attaching an “egress-only” internet gateway to the VPC and setting it as the default route for the private subnets (2). This gateway guarantees that only outgoing connections are possible and thus encapsulates our services as intended. As can be seen in (6), IPv6 requests go directly to the egress-only internet gateway.
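The steps above can be sketched with boto3. This is a hedged sketch, not our exact setup: all resource IDs are placeholders, and the calls are wrapped in a function so that nothing is executed without AWS credentials.

```python
def enable_dual_stack(vpc_id: str, private_route_table_id: str, subnet_id: str):
    """Sketch of the dual-stack setup described above (all IDs are placeholders).

    (1) attach an Amazon-provided IPv6 CIDR (::/56) to the VPC,
    (2) create an egress-only internet gateway and route ::/0 through it,
    (3) let instances in the private subnet receive IPv6 addresses.
    """
    import boto3  # imported lazily so the sketch can be read without boto3 installed
    ec2 = boto3.client("ec2")

    # (1) Amazon assigns a globally unique ::/56 block to the VPC
    ec2.associate_vpc_cidr_block(VpcId=vpc_id, AmazonProvidedIpv6CidrBlock=True)

    # (2) egress-only internet gateway: outgoing IPv6 only, no unsolicited inbound
    eigw = ec2.create_egress_only_internet_gateway(VpcId=vpc_id)
    eigw_id = eigw["EgressOnlyInternetGateway"]["EgressOnlyInternetGatewayId"]
    ec2.create_route(
        RouteTableId=private_route_table_id,
        DestinationIpv6CidrBlock="::/0",
        EgressOnlyInternetGatewayId=eigw_id,
    )

    # (3) each subnet also needs a /64 from the VPC block (associate_subnet_cidr_block,
    # omitted here); then new instances pick up IPv6 addresses automatically:
    ec2.modify_subnet_attribute(
        SubnetId=subnet_id, AssignIpv6AddressOnCreation={"Value": True}
    )
```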
Unfortunately, setting this up on the client side alone is not sufficient to use IPv6, as the servers being called must also support it.
Fortunately, setting up dual-stack on the server side turns out to be trivial, as the AWS Application Load Balancer supports dual-stack natively. The VPC only needs to be assigned an IPv6 address space, the ALB switched to dual-stack, and an AAAA record created; IPv6 communication is then up and running. This means that our inter-team traffic, which makes up the majority of our traffic volume, is now IPv6-based. A positive side effect of the migration is that the NAT gateways are also bypassed for OTTO-external IPv6-ready servers and APIs. And thanks to the dual-stack approach, we can still fall back to IPv4 if a remote host does not yet support IPv6.
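The server-side steps can be sketched in the same hedged way; ARNs, zone IDs and names below are placeholders, and the function is only defined, not run.

```python
def make_alb_dual_stack(alb_arn: str, hosted_zone_id: str, record_name: str,
                        alb_dns_name: str, alb_zone_id: str):
    """Sketch of the server-side steps: switch the ALB to dual-stack and publish
    an AAAA alias record (all parameters are placeholders)."""
    import boto3  # lazy import: nothing runs without AWS credentials

    # Switch the Application Load Balancer from "ipv4" to "dualstack"
    boto3.client("elbv2").set_ip_address_type(
        LoadBalancerArn=alb_arn, IpAddressType="dualstack"
    )

    # Publish an AAAA alias record so clients resolve an IPv6 address for the ALB
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "AAAA",
                    "AliasTarget": {
                        "HostedZoneId": alb_zone_id,  # the ALB's canonical hosted zone
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }]
        },
    )
```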
According to our estimates, the monetary potential of migrating to dual-stack IPv6 is at least USD 500,000 per year. To realize these savings, all of our cross-functional teams need to make the switch. Within eight weeks, 75% of our teams were able to migrate completely, reducing costs by almost USD 1,400 per day. We will therefore exceed the anticipated potential.
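The numbers are consistent with each other; the full-migration extrapolation at the end is our own assumption that the remaining teams save proportionally.

```python
# Annualizing the observed daily savings from the article
daily_saving_usd = 1_400
annualized = daily_saving_usd * 365
print(annualized)  # 511000 -> already above the estimated USD 500,000/year

# If the remaining 25% of teams save proportionally (an assumption, not a figure
# from the article), the full migration would land around:
full_migration = annualized / 0.75  # roughly USD 681,000/year
```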
Our experience has shown that most teams were able to complete the migration with only one or two developers within one to two days. We have also observed that a number of third-party services, such as Confluent Kafka, do not yet speak IPv6. Since these services account for a large part of our remaining IPv4 traffic, there is still untapped potential. We therefore need to make our service providers aware of it and ask them to prioritize IPv6 support on their roadmaps. As soon as they incorporate these capabilities, we can expect further savings.
Despite the simple integration process, it is important to monitor the traffic flow after the migration. In some cases, an AAAA record was not set correctly, or application-specific configuration (e.g. the -Djava.net.preferIPv6Addresses=true JVM flag) had not yet been applied. Another frequent error was forgetting to switch Docker to dual-stack as well, e.g. with --network host.
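A quick post-migration sanity check is to verify that a service name resolves to an IPv6 address at all. A minimal helper using only the standard library; the hostname in the example is a placeholder:

```python
import socket

def has_aaaa(host: str) -> bool:
    """Return True if `host` resolves to at least one IPv6 address (AAAA/v6 entry)."""
    try:
        return len(socket.getaddrinfo(host, None, socket.AF_INET6)) > 0
    except socket.gaierror:
        # Name does not resolve to any IPv6 address
        return False

# Example (placeholder hostname):
# print(has_aaaa("service.example.com"))
```

This only confirms DNS; whether traffic actually flows via IPv6 still needs to be checked in the traffic metrics, as described above.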
The NAT gateway architecture provides a certain level of security against external attacks, so when considering the switch to dual-stack IPv6, we carried out a security analysis at a very early stage. Thanks to the egress-only internet gateway, no unsolicited incoming traffic can reach our hosts. This provides the same security as the NAT architecture, even though the hosts now carry globally unique IPv6 addresses.
In many cases, traffic is one of the largest cost items in cloud usage, and it is a difficult area to optimize: the communication is technically necessary, so traffic cannot simply be switched off, nor can its costs be reduced by provisioning more powerful infrastructure.
The fact that the IPv4 address space has been exhausted also highlights the need to communicate via IPv6. It was therefore all the better that we were able to shift the majority of our communication to IPv6 within a few weeks.
Our dependency on services that cannot yet speak IPv6 should also diminish over time, so there is further potential to reduce traffic costs. Apart from a few edge cases, implementing the migration in AWS required little effort, and thanks to the exchange between teams, many of those edge cases could be resolved as well.
Due to the server/client dependency, a positive effect on costs is not visible for every individual team, but it definitely is organization-wide and on our AWS bill: we are already saving at least USD 500,000 per year, and the trend is rising. This shows how effective collaborative efforts of this kind, and the standardization of technology they bring, can be.
We’re already looking for the next simple trick. Maybe you know one?