Nearly three years ago, we published our first post on how we handle security in the cloud based on our framework. Since then, this framework has guided us through scaling product lines and the resulting growth in the complexity of our cloud infrastructure. We have learned what worked for us and what did not, and in this blog post we would like to share our experience so far.
At Otto Group data.works we develop and maintain one of the largest retail user data pools in the German-speaking area. Additionally, our product teams develop machine-learning-based recommendation and personalization services. Otto Group’s many online retailers — such as otto.de and baur.de — integrate our services into their shops to enhance the customer experience.
Our interdisciplinary product teams — consisting of data engineers, cloud engineers, BI engineers, and data scientists — work autonomously and are supported by two centralized infrastructure and data management teams. One of the main reasons we’re able to develop new products so quickly is the autonomy our teams enjoy, especially when it comes to dependencies on central infrastructure teams. This means, for example, that when we start developing a new product, a GCP project is bootstrapped following basic security policies and configurations and handed over to the product team within minutes. After that, team members get full access to work freely in their GCP environment, including any Google services they need to securely access our unique data pool.
This freedom comes with responsibility for the security of the projects. To support and enable product teams with their dual — and often dueling — goals of developing quickly and staying secure, we have established a security framework based on five pillars, which we presented in detail in our last blog post on Medium. In the following sections, we dig deeper into the challenges we have faced and what we have learned in overcoming them, embedded in our golden triangle of people, processes & technology.
Engineers face many challenges when working in a cloud-native environment. Some of them are well known, others are new. Here we want to name the most prominent challenges we have faced.
Complex Environments & Ladder of Abstraction
Cloud technology can be seen as an enabler for fast time to market. At the same time, products built at different points in time can vary vastly in their cloud architecture, either because the cloud vendor rapidly releases new managed services or because of the ever-changing technology landscape. Even though the technology or the architecture can be totally different between applications, the security requirements mostly stay the same. Thus, engineers must find a way to manage their security controls across many different environments, with the inherent context switches leading to a high operational workload. This becomes even more challenging due to the “ladder of abstraction” effect described in detail in the next section.
Disparate Security Products & Alert Fatigue
Trying to manage a growing, complex environment can lead to a common pitfall: many different security products, each for its own unique use case. Having a variety of security products introduces another set of challenges. For starters, every security product integration has its specific cost, be it technical onboarding or the education of people. Next, not every product’s business model is cloud-ready, so costs either increase rapidly with the growing cloud environment footprint, or data retention must be reduced to contain costs, with the downside of limited observability. And finally, each product has its own alerting system and corresponding notification policies, which leads to alert fatigue due to the sheer number of notifications combined with a high proportion of false positives. In the end, too many different security products can have the exact opposite of the effect they were intended to have.
High Mean-Time-To-Recovery & Talent Shortage
Another challenge arises from the mismatch between the speed at which an attack happens and the speed at which engineers can remediate it. The total time of this process can be measured with the mean-time-to-recovery (MTTR) metric. It is a combination of the time it takes to detect an attack, identify the affected systems, fix them, and validate the remediation. A high MTTR can have several causes. It can be due to engineers lacking the security expertise to efficiently skim through a high volume of data for indicators of compromise in order to detect an attack. It can also be due to time-consuming processes in which engineers must manually jump between different security products to identify affected systems. Or it can be a consequence of the alert fatigue mentioned above. To reduce the MTTR, engineers need to strengthen their security incident response muscle by improving their security investigation skills. This takes precious time, and there is no shortcut for it, because the amount of security expertise that can be allocated to each product team by hiring security engineers is limited by the shortage of security talent on the market.
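To make the metric concrete, here is a minimal sketch of how MTTR could be computed from incident timestamps; the record structure and field names are hypothetical and not taken from our tooling.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: each holds the timestamps of the phases that
# make up the recovery process (detection, identification, fix, validation).
incidents = [
    {
        "attack_start": datetime(2022, 7, 1, 9, 0),
        "detected": datetime(2022, 7, 1, 11, 30),
        "systems_identified": datetime(2022, 7, 1, 14, 0),
        "fixed": datetime(2022, 7, 2, 10, 0),
        "validated": datetime(2022, 7, 2, 12, 0),
    },
    # ... more incidents
]

def time_to_recovery(incident: dict) -> timedelta:
    """Total time from the start of the attack until remediation is validated."""
    return incident["validated"] - incident["attack_start"]

recovery_times = [time_to_recovery(i).total_seconds() for i in incidents]
mttr_hours = mean(recovery_times) / 3600
print(f"MTTR: {mttr_hours:.1f} hours")
```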
Security Awareness
Finally, it is a perpetual effort to track the value generated by SecOps actions (e.g., security awareness trainings) in your company. This sometimes makes it hard to justify necessary expenses, because in the end you cannot show what did not happen, leading to an infinite loop of educating people and generating security awareness to retain the budget for cybersecurity.
Ladder of Abstraction
As described so far, going down the path of a cloud journey raises manifold challenges. They are caused by the paradigm shift from an on-premises environment to the cloud. The basis of this paradigm shift is arguably the departure from a centralized IT department towards autonomous DevOps teams, which leads to more responsibility and task diversity for engineers. The effect of this paradigm shift on engineers can be summarized by the term “ladder of abstraction”.
In a cloud environment, the ladder of abstraction has many rungs, starting with low-level technologies like virtual machines, Kubernetes, and networking (VPC, load balancing & routing tables, firewalls, VPN), moving through mid-level technologies like PaaS and serverless, and ending with high-level technologies like no-code application development and AI/ML services. To the cloud vendor’s credit, each rung on its own has a low barrier to entry. The consequence is that engineers need to go up and down the ladder of abstraction continuously. Normally, the higher an engineer goes, the more automated the system gets and the less freedom remains for individual configuration. But as mentioned in the beginning, security requirements mostly stay the same. For example, data uploaded to a recommendation service needs to comply with the same data security controls as data processed on a compute engine accessed via SSH. Engineers need to deal with security requirements regardless of the level of abstraction.
The consequence for engineers is that they need a strong T-shaped technology skillset and must continuously work on both their vertical depth and their horizontal breadth.
In the following section we dig deeper into solutions for these challenges.
The golden triangle also holds true in our case: the three most important aspects for an organization are People, Processes, and Technology. In the following, we step into each aspect and share what we have learned during our cloud journey.
Everything begins with and comes back to people, which is arguably the aspect that requires the most effort. Building digital security hygiene and competence takes constant education. It is not enough to share the responsibility for information security within your organization; it is important to enable people to accept their obligations and to encourage self-responsible action. Contrary to security knowledge, a security mindset cannot simply be taught; it is built by empowering people to take on their security obligations and enabling them to act mindfully in their everyday work.
Contrary to security knowledge, a security mindset can’t simply be taught; it is a cultural change which takes time
This shift in mindset also requires a shift in how to establish profound security awareness. Instructor-led or computer-based security trainings have mostly proven inefficient at having a long-lasting impact on reducing digital carelessness. Contemporary security awareness trainings therefore rely partly on gamification aspects like serious games, capture-the-flag hackathons, security dojo leaderboards, evil-user-story threat modeling, blue/red team challenges, and more.
In essence, security needs to be treated as an engineering problem and not an assembly line problem, because it is more like a journey than a destination. Security is then naturally included early in a product development cycle and not as an afterthought, and it eventually becomes an equal element of a DevSecOps culture. With such a cultural shift, the feasibility of an architectural evolution towards a zero-trust architecture gets much higher. A practical example is the decommissioning of the corporate VPN tunnel, which is getting more popular due to the increasingly remote and distributed workforce. Such a migration project is a good indicator of the security maturity level of a company: applications that were previously “protected” by a network security product become accessible outside of the network fortress, and application security therefore needs to withstand real-life scenarios and threats.
We have learned that in a cloud environment it comes naturally to leave the comfort zone of, for example, a VPN, and to think about the security of an application early on during architectural decisions (documenting them in architecture decision records for later), which has a much more essential impact on security. So it is common practice for our engineers to think, for example, about the IAM and network perimeter of their application, how data protection controls are accomplished, and how they will monitor the security of their application. We have experienced that this thought process is one of the reasons why a cloud environment acts as a catalyst for thinking about the security of a system early and taking ownership of it. We have also learned that it is important to start this mindset shift as soon as possible in your cloud journey, because like any other cultural change it will take time to establish.
In our experience, continuously measuring the security maturity level of products and of our whole organization over time is important. A mandatory prerequisite for this is the detection of cloud resources to build a cloud resource inventory. Based on this configuration management database (short: CMDB), we can build a continuous maturity testing process which gives us the opportunity for data-driven decision making. It can, for example, identify which teams currently need more security support, which systems have not been patched for a while, or where loopholes exist in the roles and rights concept (e.g., creative engineers exploiting service account impersonation to access resources they are not meant to access). These are examples of detecting security debt so that product teams can tackle it by establishing processes around it.
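To illustrate the kind of question such a maturity testing process can answer, here is a minimal sketch of a check for unpatched systems against a resource inventory; the data model and field names are invented for illustration and do not reflect our actual CMDB.

```python
from datetime import datetime, timedelta

# Hypothetical excerpt of a cloud resource inventory (CMDB); the fields are
# invented for illustration and do not reflect the real data model.
inventory = [
    {"type": "compute_instance", "name": "vm-etl-1", "team": "reco",
     "last_patched": datetime(2022, 1, 10)},
    {"type": "compute_instance", "name": "vm-api-2", "team": "personalization",
     "last_patched": datetime(2022, 6, 1)},
]

def unpatched_systems(resources, max_age_days=90, now=None):
    """Return resources whose last patch date exceeds the allowed age."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in resources
            if r["type"] == "compute_instance" and r["last_patched"] < cutoff]

for resource in unpatched_systems(inventory):
    print(f"{resource['team']}: {resource['name']} needs patching")
```

In practice, a check like this would run continuously against the real inventory and feed the maturity reporting described above.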
We use processes to set strong security guardrails wherever possible and, as a result, to pave a golden path for product teams. We combine this golden path with a set of minimal security requirements which are monitored continuously. Product teams can choose to stay on the golden path or to deviate from it if their current use case urges them to do so. But regardless of their choice, they can rely on processes being in place to give them continuous feedback about their current security posture. It turns out that these processes also help new engineers quickly gain confidence in the cloud environment without being overcautious when working in it.
Security feedback needs to be shifted left in the software supply chain to reduce the effort of resolving security issues before they hit the cloud
We have also learned that it is better and more efficient to give feedback at the beginning of the software supply chain. This practice, called “shift left”, refers to efforts to ensure application security controls are met at the earliest stages of the development lifecycle. Building processes at this stage can have a tremendous benefit for people owning the security of their applications, because it reduces the effort of resolving security issues before they hit their production cloud environment. The reasoning is straightforward: it is easy to build security into a system while it is being built, but much more complicated to attach it afterwards, because that consumes more time and effort.
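As a sketch of what such early feedback can look like, the following hypothetical CI gate reads a vulnerability report produced earlier in the pipeline and fails the build before anything reaches the cloud; the report format and severity levels are assumptions for illustration, not a specific scanner’s output.

```python
import json
import sys

# Minimal shift-left gate run in CI: read a vulnerability report produced by a
# scanner earlier in the pipeline (report format is hypothetical) and fail the
# build before anything is deployed to the cloud.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}

def blocking_findings(report_path: str) -> list:
    with open(report_path) as f:
        report = json.load(f)
    return [v for v in report.get("vulnerabilities", [])
            if v.get("severity") in BLOCKING_SEVERITIES]

if __name__ == "__main__":
    findings = blocking_findings(sys.argv[1])
    for finding in findings:
        print(f"{finding['id']}: {finding['package']} ({finding['severity']})")
    sys.exit(1 if findings else 0)  # non-zero exit fails the CI job
```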
Finally, there are tools. As mentioned in the beginning, having too many tools can have the exact opposite of the effect they were intended to have. Tools should be introduced when a clear need becomes obvious, while being mindful of the fact that each tool brings an additional learning curve and additional alerts for engineers.
Tools should help people spend less time finding relevant information and taking the necessary actions
Tools should assist people in automating recurring tasks. They should help people spend less time finding relevant information and taking important actions. One of the most important lessons we have learned is that the number of issues found is less important than the actual fidelity of a security issue. We strive to keep the signal-to-noise ratio as high as possible and try not to bother developers with contextless, generic CVE noise. To achieve this, we have learned that security monitoring tools need to take a holistic view of the supply chain. They need to be able to observe applications from source code to artifact to runtime deployment, including their endpoints and finally their access configuration. For example, when Log4Shell came up, it took us only about a week to identify vulnerable applications and mitigate the risk. We did this by leveraging our CMDB to identify affected source code repositories, tracking them to their cloud runtime and thus giving product teams a clear assignment of what to do, while simultaneously validating the status with our CMDB.
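The following sketch illustrates the idea of tracing a finding through such relationships, using invented data and a deliberately simplified repo-to-artifact-to-runtime mapping; it is not our actual CMDB model.

```python
# Hypothetical, simplified view of the relationships a CMDB can model:
# source repositories -> build artifacts -> runtime deployments.
repo_to_artifact = {
    "repo/reco-service": "artifact/reco-service:1.4.2",
    "repo/shop-api": "artifact/shop-api:2.0.0",
}
artifact_to_runtime = {
    "artifact/reco-service:1.4.2": "gke/prod/reco-service",
    "artifact/shop-api:2.0.0": "cloudrun/prod/shop-api",
}
# Repositories flagged by dependency scanning as using a vulnerable log4j version.
vulnerable_repos = {"repo/reco-service"}

def affected_runtimes(repos):
    """Follow the repo -> artifact -> runtime edges to find affected deployments."""
    runtimes = []
    for repo in repos:
        artifact = repo_to_artifact.get(repo)
        runtime = artifact_to_runtime.get(artifact)
        if runtime:
            runtimes.append((repo, runtime))
    return runtimes

for repo, runtime in affected_runtimes(vulnerable_repos):
    print(f"{repo} is deployed at {runtime} and needs remediation")
```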
As with cloud-native applications, security monitoring products need to be cloud-native too. They need to scale effortlessly with the cloud environment footprint while remaining economically efficient. In the best case, they use the event-driven nature of a cloud environment to ingest changes in real time. This capability gives engineers confidence in the tool thanks to the fast feedback loop described previously. These tools should not fix problems of the past but support engineers in moving to the next cloud era, in which infrastructure automation with infrastructure as code (IaC) is seamlessly integrated with security policy enforcement through policy as code (PaC) in a continuous integration pipeline, giving engineers fast feedback on their cloud security posture before any vulnerability hits the cloud. Engineers get security suggestions on the spot which can be adjusted or applied right away to mitigate vulnerabilities. Engineers should feel encouraged and empowered by their tool belt to build secure products for the future.
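As an illustration of combining IaC with PaC in a pipeline, here is a minimal policy check against a Terraform plan exported as JSON (via `terraform show -json plan.out > plan.json`); the specific policy, rejecting firewall rules open to the whole internet, is our own simplified example and not a description of any particular PaC product.

```python
import json
import sys

# Minimal policy-as-code check run in CI against a Terraform plan in JSON form.
# Example policy: reject GCP firewall rules that allow ingress from 0.0.0.0/0.
# (Simplified for illustration: ports and protocols are ignored.)
def violations(plan_path: str) -> list:
    with open(plan_path) as f:
        plan = json.load(f)
    findings = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "google_compute_firewall":
            continue
        after = (change.get("change") or {}).get("after") or {}
        if "0.0.0.0/0" in (after.get("source_ranges") or []):
            findings.append(change.get("address", "<unknown>"))
    return findings

if __name__ == "__main__":
    bad = violations(sys.argv[1])
    for address in bad:
        print(f"Policy violation: {address} allows ingress from 0.0.0.0/0")
    sys.exit(1 if bad else 0)  # fail the pipeline before the plan is applied
```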
We have come a long way in our cloud journey, though it feels like we still have a road ahead of us in our cloud security journey. But there is one thing we are especially proud of: our own in-house developed cloud security monitoring solution. In this last section, we give an overview of what we have built into it, how it supports our golden triangle, and how it helps us overcome our challenges.
The decision to invest in a cloud-native security monitoring solution came up as our cloud working mode became clearer and the growth of our cloud footprint leveled off. By then we had learned a few things from our previous security monitoring tool which we wanted to have in our new solution.
The five pillars presented in our last post on Medium are still the core principles we rely on. But in addition to having an asset inventory in the form of a CMDB, we also want continuous cloud compliance testing (CCCT) with cloud security posture management (CSPM) built on top of it. These capabilities enable us to have a complete representation of our cloud environment and thereby generate security issues that really matter. This allows engineers to concentrate on the most important issues without being annoyed by false-positive alerts. Over time we created more than 130 security controls which build the basis for our CSPM. They act on our graph-enabled CMDB, which models several resource relationships as depicted below.
Equipped with these ideas, we got to work building an event-based cloud architecture which harvests all our cloud resource configurations via cloud audit logs as well as batch ingestion. These cloud resource configurations are then unified into a coherent data model which can be used for security issue analysis. The issues are then recorded and used to build a security posture view for product teams. The by-product of this event-based cloud architecture is that we have built a CCCT which reacts to changes in our cloud in real time.
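To give an idea of the normalization step, here is a simplified sketch that maps a Cloud Audit Log entry onto a unified resource-change record; the selected fields and the target data model are reduced for illustration and do not represent our full schema.

```python
# Map a Cloud Audit Log entry (only the handful of fields needed here) into a
# unified resource-change record for security issue analysis. Batch ingestion
# would produce records of the same shape with a different "source" value.
def normalize_audit_log(entry: dict) -> dict:
    payload = entry.get("protoPayload", {})
    return {
        "resource": payload.get("resourceName"),
        "service": payload.get("serviceName"),
        "action": payload.get("methodName"),
        "actor": payload.get("authenticationInfo", {}).get("principalEmail"),
        "timestamp": entry.get("timestamp"),
        "source": "audit_log",
    }

example_entry = {
    "timestamp": "2022-11-07T10:15:00Z",
    "protoPayload": {
        "serviceName": "storage.googleapis.com",
        "methodName": "storage.setIamPermissions",
        "resourceName": "projects/_/buckets/example-bucket",
        "authenticationInfo": {"principalEmail": "engineer@example.com"},
    },
}
print(normalize_audit_log(example_entry))
```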
Our security posture view currently has 13 compliance controls divided into 4 dimensions: data security, IAM security, network security, and endpoint security. Each dimension consists of several issue classes combined into a representation of the cloud security posture of that dimension. Even though there might be more security issues present, product teams can be sure that if they fix all 13 compliance controls, they have a good cloud security posture. But this is no reason to get lazy: their posture can change at any time and is historically aggregated into a security score, so they continuously need to maintain a high level, and more controls are about to be added.
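As a rough illustration of how such an aggregation could work, the following sketch combines the pass/fail state of controls per dimension into a single score; the numbers and the scoring formula are invented for illustration and are not our actual scoring scheme.

```python
# Hypothetical aggregation of compliance controls into a security score.
# The four dimensions match the ones described above; counts are made up.
posture = {
    "data_security":     {"passed": 3, "total": 4},
    "iam_security":      {"passed": 4, "total": 4},
    "network_security":  {"passed": 2, "total": 3},
    "endpoint_security": {"passed": 2, "total": 2},
}

def security_score(posture: dict) -> float:
    """Percentage of passed controls across all dimensions."""
    passed = sum(d["passed"] for d in posture.values())
    total = sum(d["total"] for d in posture.values())
    return 100.0 * passed / total

for dimension, result in posture.items():
    print(f"{dimension}: {result['passed']}/{result['total']} controls passed")
print(f"Overall security score: {security_score(posture):.0f}%")
```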
This concludes our story and what we have learned so far. In the next post we want to dig deeper into the technical details and the challenges we have overcome while developing our cloud-native security monitoring solution.
This article was originally published on Medium and can be read there as well.
This talk might also be interesting for you: here we present Pantheon, our newly developed real-time cloud-native security solution for GCP, at code.talks 2022.
Read more about our cloud infrastructure.