At OTTO, we faced several challenges in operating AWS CloudFormation StackSets at scale. We must govern several hundred AWS accounts for our product teams while balancing the need for agility and control.
At this scale, operations can take a lot of time, because several operational tasks keep recurring: AWS accounts leave the AWS Organization or teams nuke their AWS account; StackSet instances drift, because not all resources required for compliance can be protected (SCP limitations); and existing AWS accounts join the AWS Organization, so all mandatory StackSets need to be deployed to them. Manual steps should be reduced to a minimum. Furthermore, the service itself offers no feature to get an overview of drifted instances or of the general health and compliance of your StackSets.
The cloud competence center at OTTO IT, also known as the Governance at Scale (GAS) team, developed a self-healing solution for StackSets that is integrated into the OTTO tooling ecosystem with Confluence and Microsoft Teams.
OTTO worked with globaldatanet to set up its Landing Zone. globaldatanet is an award-winning AWS Advanced Consulting Partner and longtime Cloud Solution Provider for OTTO, supporting the team in cloud security and GAS. With its focus on building cloud-native serverless solutions, it has supported over 100 companies within 5 years in developing and innovating products and services in the cloud.
In this post, we’ll demonstrate how to implement fully automated, enterprise-scale self-healing for StackSets using AWS Step Functions, and how to create a dashboard that gives you an overview of your StackSet health and compliance and reduces operational time.
The solution workflow includes the following steps:
Let’s see how this works.
The following architecture shows the whole solution of the Self Healing StackSets:
Figure 1: Architecture of fully-automated Self Healing Solution with integration to Confluence
Key | Value | Result | Example |
---|---|---|---|
antidependson | StackSet Name | Marks StackSets that collide with each other. | MYSTACKSET |
dependson | [List of StackSet Names] | List of StackSets that must be rolled out before this StackSet is deployed (e.g. enable Config before activating Config rules). NOTE: Please use only one dependson StackSet for now; form "chains" for multiple dependencies. | MY-STACKSET1:MYSTACKSET2 |
mandatory | true or false | The StackSet instances must be present on all AWS accounts. | true |
selfhealing | true or false | The StackSet can be healed via delete and redeploy (exceptions are, e.g., IDP roles). Parameter overwrites will be cached. | true |
region | [Regions] | List of Regions in which the StackSet instances are to be deployed. | eu-west-1:eu-central-1:us-east-1 |
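
To illustrate how the tag values in the table could be turned into a configuration, here is a minimal sketch. It assumes the tags arrive in the `Key`/`Value` list shape that CloudFormation uses for StackSet tags; the function name `parse_stackset_tags` and the resulting dictionary layout are hypothetical and not taken from our implementation.

```python
import json

# Assumption: these keys hold colon-separated lists, as in the table examples.
LIST_KEYS = {"dependson", "region"}
# Assumption: these keys hold the strings "true" or "false".
BOOL_KEYS = {"mandatory", "selfhealing"}


def parse_stackset_tags(tags):
    """Turn a StackSet tag list into a plain configuration dict (hypothetical layout)."""
    config = {}
    for tag in tags:
        key = tag["Key"].lower()
        value = tag["Value"].strip()
        if key in LIST_KEYS:
            # "eu-west-1:eu-central-1" -> ["eu-west-1", "eu-central-1"]
            config[key] = [item for item in value.split(":") if item]
        elif key in BOOL_KEYS:
            config[key] = value.lower() == "true"
        elif key == "antidependson":
            config[key] = value
    return config


example_tags = [
    {"Key": "dependson", "Value": "MY-STACKSET1"},
    {"Key": "mandatory", "Value": "true"},
    {"Key": "selfhealing", "Value": "false"},
    {"Key": "region", "Value": "eu-west-1:eu-central-1:us-east-1"},
]

# The JSON string is what could then be written to a Parameter Store entry.
print(json.dumps(parse_stackset_tags(example_tags), indent=2))
```

Tags that are not listed in the table are ignored in this sketch, so unrelated tags on a StackSet do not end up in the configuration.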
The automated generation of the StackSet configuration as JSON inside the Parameter Store is a multi-purpose utility:
The Lambda function responsible for this task is invoked via an event rule every time a StackSet operation completes with the status "succeeded".
This works because the tags of a StackSet are part of the StackSet itself, not separate items describing it, so a change to the tags always results in a StackSet update operation.
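A rule like this could be expressed roughly as follows. This is a sketch only: the pattern and the field names (`detail-type`, `status-details`, `stack-set-arn`) reflect our understanding of the CloudFormation events delivered to Amazon EventBridge and should be checked against the current AWS documentation. The handler and its return value are hypothetical.

```python
# Assumed EventBridge event pattern for successfully completed StackSet operations.
STACKSET_OPERATION_SUCCEEDED = {
    "source": ["aws.cloudformation"],
    "detail-type": ["CloudFormation StackSet Operation Status Change"],
    "detail": {"status-details": {"status": ["SUCCEEDED"]}},
}


def handler(event, context=None):
    """Hypothetical Lambda entry point: decide whether the config must be regenerated."""
    detail = event.get("detail", {})
    status = detail.get("status-details", {}).get("status")
    # Defensive re-check in code, in case the rule is ever widened.
    if event.get("source") != "aws.cloudformation" or status != "SUCCEEDED":
        return {"regenerate": False}
    return {"regenerate": True, "stack_set_arn": detail.get("stack-set-arn")}
```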
In terms of computer science, the Lambda function is quite interesting: the primary problem was to build an unweighted tree from the "dependson" and "antidependson" tags and then flatten it into an ordered one-dimensional list, which is essentially a topological sort of the dependencies.
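The ordering step can be sketched with a topological sort from the Python standard library. This is an illustration, not our Lambda code: the function name `deployment_order` and the shape of `configs` are assumptions, and only "dependson" is handled, because the exact ordering semantics of "antidependson" are not spelled out here.

```python
from graphlib import CycleError, TopologicalSorter


def deployment_order(configs):
    """Return StackSet names so that every StackSet comes after its dependencies.

    `configs` maps a StackSet name to its parsed tags (hypothetical shape),
    e.g. {"CONFIG-RULES": {"dependson": ["ENABLE-CONFIG"]}}.
    """
    # TopologicalSorter expects: node -> set of nodes that must come first.
    graph = {name: set(cfg.get("dependson", [])) for name, cfg in configs.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependson chain: {err.args[1]}") from err


order = deployment_order(
    {
        "CONFIG-RULES": {"dependson": ["ENABLE-CONFIG"]},
        "ENABLE-CONFIG": {},
        "IAM-BASELINE": {},
    }
)
print(order)
```

The relative position of independent StackSets is not fixed by this approach; only the "dependencies first" guarantee holds.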
Figure 2: StepFunction Workflow
ƛ Serverless Functions
?!Decisions
While developing the solution, we ran into several limitations. Here are our findings and how we solved them.
🚨 StackSets instance operations: The maximum number of stack instances, across all stack sets, that you can run operations on at the same time in each Region per administrator account is limited to 10,000.
✅ We implemented a counter that tracks the StackSets operations currently in progress. In addition, we catch the exception from CloudFormation, wait a few seconds, and retry the operation.
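The catch-wait-retry part can be sketched as a small helper. It assumes botocore-style exceptions that carry the AWS error code in `exc.response["Error"]["Code"]`; the set of retryable codes, the helper names, and the doubling delay are our assumptions for illustration and should be adjusted to the errors you actually observe.

```python
import time

# Assumption: error codes worth retrying; adjust to what CloudFormation returns for you.
RETRYABLE_CODES = {"OperationInProgressException", "LimitExceededException", "Throttling"}


def error_code(exc):
    """Read the AWS error code from a botocore-style ClientError, if present."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def call_with_retry(operation, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `operation()`; on a retryable error wait and try again.

    The delay doubles with each attempt (2s, 4s, 8s, ...). `sleep` is injectable
    so the helper can be tested without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if error_code(exc) not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `call_with_retry(lambda: client.create_stack_instances(...))` would then wrap the actual StackSets operation, where `client` is a CloudFormation client.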
🚨 Parameter Overwrites Caching: Whenever you remove a drifted StackSet instance that has parameter overwrites, you lose the individual parameters of that instance.
✅ Before deleting the drifted StackSet instance, we cache the parameter overwrites, and after successful deletion, we re-deploy the StackSet instance with the cached parameter overwrites.
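A minimal sketch of this cache-delete-redeploy sequence is shown below. It assumes a boto3 CloudFormation client passed in as `cfn` and a StackSet with self-managed permissions (service-managed StackSets use `DeploymentTargets` instead of `Accounts`). The function names are hypothetical, and the blocking wait loop only stands in for what a state machine would do with wait states.

```python
import time


def wait_for_operation(cfn, stack_set_name, operation_id, poll_seconds=10):
    """Poll a StackSet operation until it succeeds; raise if it fails or is stopped."""
    while True:
        status = cfn.describe_stack_set_operation(
            StackSetName=stack_set_name, OperationId=operation_id
        )["StackSetOperation"]["Status"]
        if status == "SUCCEEDED":
            return
        if status in ("FAILED", "STOPPED"):
            raise RuntimeError(f"operation {operation_id} ended with {status}")
        time.sleep(poll_seconds)


def heal_instance(cfn, stack_set_name, account, region):
    """Delete a drifted instance and redeploy it with its cached parameter overwrites."""
    instance = cfn.describe_stack_instance(
        StackSetName=stack_set_name,
        StackInstanceAccount=account,
        StackInstanceRegion=region,
    )["StackInstance"]
    # Cache the overwrites BEFORE deleting, otherwise they are lost.
    cached_overrides = instance.get("ParameterOverrides", [])

    delete_op = cfn.delete_stack_instances(
        StackSetName=stack_set_name,
        Accounts=[account],
        Regions=[region],
        RetainStacks=False,
    )["OperationId"]
    wait_for_operation(cfn, stack_set_name, delete_op)

    create_op = cfn.create_stack_instances(
        StackSetName=stack_set_name,
        Accounts=[account],
        Regions=[region],
        ParameterOverrides=cached_overrides,
    )["OperationId"]
    wait_for_operation(cfn, stack_set_name, create_op)
    return cached_overrides
```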
🚨 AWS Step Functions payload size: AWS Step Functions supports payload sizes of up to 256 KB. For our solution, we need to pass larger payloads between states, especially when we want to pass our log or the concurrent parameter overwrites per StackSet.
✅ We store the state in an S3 bucket and pass it between states from there. At the end of the execution, we delete the state from S3 so that the next Step Functions execution is not influenced by stale state.
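The idea of handing a small pointer instead of the full payload from state to state can be sketched like this. The helpers assume a boto3 S3 client passed in as `s3`; the key layout `state/<execution id>.json` and the pointer field names are hypothetical.

```python
import json


def save_state(s3, bucket, execution_id, state):
    """Write the full state to S3 and return a small pointer to pass between states."""
    key = f"state/{execution_id}.json"  # hypothetical key layout
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(state).encode("utf-8"))
    return {"state_bucket": bucket, "state_key": key}


def load_state(s3, pointer):
    """Read the full state back from the location the pointer refers to."""
    obj = s3.get_object(Bucket=pointer["state_bucket"], Key=pointer["state_key"])
    return json.loads(obj["Body"].read())


def delete_state(s3, pointer):
    """Remove the stored state at the end of an execution."""
    s3.delete_object(Bucket=pointer["state_bucket"], Key=pointer["state_key"])
```

Each state then receives and returns only the pointer, which stays far below the payload limit regardless of how large the log or the cached overwrites become.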
After each execution of the StackSet Health StepFunction, we aim to notify our GAS team about the actions taken during the previous run. Therefore, we have implemented a Teams notification that includes a status update, a link to the generated dashboard, and a link to the log file.
The following screenshot illustrates an example of a Teams notification. It provides a summary report and directs you to the dashboard and log file for further details.
Figure 3: Status updates via Teams
Our StackSet Health Dashboard is a simple HTML file generated by a Lambda function, stored in S3, and distributed via CloudFront. You can integrate this dashboard into your Confluence or any other internal wiki. The dashboard is secured via a CloudFront function. In addition, you can add a firewall to restrict access to a specific CIDR range or geographic region and prevent third-party access. The screenshot below shows an example of the overall StackSet Health status information for an entire AWS organization.
Figure 4: Dashboard
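Generating such a static page can be as simple as the following sketch. The input shape (one dictionary per StackSet instance with `stack_set`, `account`, `region` and `status`) and the function name `render_dashboard` are assumptions for illustration; the real dashboard contains more information than this table.

```python
from html import escape


def render_dashboard(instances):
    """Render a minimal static HTML status table for StackSet instances."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            # Escape every value so unexpected characters cannot break the page.
            escape(item["stack_set"]),
            escape(item["account"]),
            escape(item["region"]),
            escape(item["status"]),
        )
        for item in instances
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>StackSet Health</title></head><body>\n"
        "<table>\n"
        "<tr><th>StackSet</th><th>Account</th><th>Region</th><th>Status</th></tr>\n"
        f"{rows}\n"
        "</table>\n</body></html>"
    )


page = render_dashboard(
    [
        {"stack_set": "MYSTACKSET", "account": "111111111111",
         "region": "eu-central-1", "status": "CURRENT"},
        {"stack_set": "MYSTACKSET", "account": "222222222222",
         "region": "eu-west-1", "status": "DRIFTED"},
    ]
)
```

The resulting string could then be uploaded to the S3 bucket behind the distribution with a content type of `text/html`.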
In this post, we demonstrated a solution to automatically heal AWS CloudFormation StackSets at scale. By implementing this solution in our organization, we were able to reduce manual StackSet healing efforts by 4 hours per week, improve the overall reliability of our StackSets, increase compliance in our organization, and gain daily visibility into all StackSet instances using the dashboards. In summary, the self-healing CloudFormation StackSets solution combines automation, monitoring, and self-healing capabilities to provide a robust and resilient system for StackSets.