Implementing defense-in-depth for your AWS network
Some facts
According to the Q4 2023 Cloudflare DDoS threat report, a 117% increase in network-layer DDoS attacks can be observed year-over-year.
A number of cloud customers have been targeted with a novel HTTP/2-based DDoS attack which peaked in August 2023. These attacks were significantly larger than any previously-reported Layer 7 attacks, with the largest attack surpassing 398 million requests per second (see CVE-2023-44487 for details).
A 2022 report by F5 found that over 50% of all MITM attacks involve the interception of sensitive information such as login credentials and banking data.
Introduction
Be it a grudge over an online game, a politically-motivated action under hybrid warfare, or outright nihilism, the reasons for conducting cyberattacks on your AWS workload can be numerous, as are the methods used to carry them out. Many of today's attack vectors exploit the limits and weaknesses of your network (e.g. DDoS, MITM, Desync), and just as many can be mitigated using effective network-based security controls. AWS provides you with an array of tools to stand up these controls and implement defense-in-depth, but the breadth of knobs and levers the platform puts at your disposal can be daunting. In this post we will take a stab at summarizing the major AWS network security mechanisms, so that you as a customer can make a more informed design decision based on risk, effort and cost. Along the way we will categorize security controls according to the standard taxonomy below:
Preventative controls – These controls are designed to prevent an event from occurring.
Detective controls – These controls are designed to detect, log, and alert after an event has occurred.
Responsive controls – These controls are designed to drive remediation of adverse events or deviations from your security baseline.
Considering and implementing all three types of controls can help you create a defense-in-depth strategy for your AWS network. Preventative controls in such a case act as the first security measure to prevent unauthorized access or unwanted changes to your resources. Detective and responsive controls in turn help you to understand what events occur in your network environment and define remediation actions to those events. Now let’s see how this could work in practice.
Meet the Unicorn Tycoon team
In order to more effectively demonstrate the approach to layering network security controls we will use a hypothetical scenario of Unicorn Tycoon, a fictitious multi-player session-based game run on AWS and available for players worldwide (selecting a use case from the gaming industry is intentional, taking into account the major surge in cyberattack activity visible in this space in recent years). The game infrastructure is running in a single region and can be accessed through AWS edge services (Amazon CloudFront, AWS Global Accelerator), as outlined in the below diagram:
The architecture can be divided into two separate segments, the control plane and the data plane:
The control plane provides an HTTP-based service for player registration and login, match-making and leader boards. It uses CloudFront to provide low latency access to both static and dynamic content. The backend application is deployed as a set of Amazon Elastic Kubernetes Service (Amazon EKS) services, running on top of Amazon Elastic Compute Cloud (Amazon EC2) instances (that require access to the internet). The services are fronted by an Application Load Balancer (ALB). All data is stored in an Amazon Aurora database.
The data plane is responsible for serving the game itself through a proprietary protocol on top of UDP. It uses EKS to spin up a new container hosting a game server for each new multi-player game session and tear it down, once the session completes. Routing players to appropriate game servers, while ensuring low network latency, is achieved through using AWS Global Accelerator Custom Routing.
Players use a dedicated game client to connect to both the control and data plane. The hosting and distribution of the game client to players is outside the focus of this blog post.
In recent months Unicorn Tycoon has gained massive popularity and a large player following. As a result, it has started attracting malicious actors who have been launching various attacks on both the infrastructure and the application. How can the engineers make sure they are protected from well-known attack vectors and minimize the risk stemming from zero-day threats?
Preventative controls
In general, you should consider enforcing as many preventative controls as possible at the edge, mitigating malicious traffic before it reaches your regional resources. On the control plane side the Unicorn Tycoon team used to leverage CloudFront for caching static content only, but after learning about the various security benefits of CloudFront they decided to proxy all traffic through their distribution, including dynamic content. Listed below are only some of these benefits:
CloudFront lets you manage the security policies governing HTTPS connections, helping you avoid common attacks like BEAST or POODLE related to outdated SSL/TLS ciphers, and has built-in mitigation against multiple common attacks, including slow-reading (e.g. Slowloris) and HTTP Desync threats.
Through its network of PoPs, CloudFront is able to absorb large DDoS attacks at the edge. Enjoying first-class integration with AWS Shield Standard, it offers always-on mitigation of L3/L4 DDoS attacks and protection from SYN floods via a stateless SYN proxy. It is worth emphasizing that, unlike CloudFront, regional resources like ALB do not offer inline mitigation capabilities for L3/L4 (there is usually a delay of roughly 5-10 minutes between the detection and mitigation of malicious traffic).
CloudFront seamlessly integrates with AWS WAF, allowing for out-of-the-box protection against applicable risks from the OWASP Top 10 list, as well as bot management and fraud prevention at the edge. Depending on the sophistication of bots targeting your website (e.g. bots using browser-automation tools like Puppeteer Stealth or hiding behind residential network or legitimate proxies), AWS WAF allows you to tune the level of protection in exchange for a higher cost. It’s also worth noting that, since recently, AWS WAF allows for the filtering of incoming requests based on the TLS fingerprint (JA3), which has proven to be quite an effective way of fending off recurring attackers.
A CloudFront distribution can be protected by AWS Shield Advanced (which we will elaborate on when discussing responsive controls).
Similarly to what CloudFront does for the control plane, Global Accelerator provides both improved performance and security for the data plane. The anycast network used by Global Accelerator can spread the surface area of any DDoS attack and has baked-in support for instantly mitigating SYN flood, ACK flood and UDP reflection attacks.
As we enforce security at various layers of our architecture, we should also make sure that specific layers are not easy for attackers to circumvent. As the Maginot Line proved during WW2, there is little point in building complex fortifications if the enemy army can simply bypass them with little effort. In the context of our architecture we must, for instance, make sure that our ALB is only accessible from our CloudFront distribution. To achieve this on the network layer, we can make use of the AWS-managed prefix list for CloudFront, which can be referenced by the ALB security group. But what if the attacker is using another CloudFront distribution? We can address this on the application level by having CloudFront include a secret value in a custom HTTP header that our downstream ALB (or WAF) can validate. This can be a hard-coded value or, better yet, a value supplied by AWS Secrets Manager with automated rotation implemented. Needless to say, you also need to make sure HTTPS is enforced when CloudFront connects to the origin, otherwise the secret header travels in clear text. “Origin cloaking” is even easier with Global Accelerator, as it offers built-in support for securely connecting to VPC resources deployed in a private subnet, a feature the Unicorn Tycoon team made use of.
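The shared-secret check described above can be sketched in a few lines of Python. This is only an illustration of the validation logic, not how ALB or AWS WAF implement it: the header name `x-origin-verify` and the function name are assumptions made for this example, and accepting a list of secrets is one possible way to keep requests flowing while a rotated value propagates.

```python
import hmac

# Hypothetical header name; CloudFront would be configured to add it to origin requests.
ORIGIN_HEADER = "x-origin-verify"


def is_from_trusted_distribution(headers: dict, accepted_secrets: list) -> bool:
    """Return True if the request carries one of the currently accepted secrets."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    presented = normalized.get(ORIGIN_HEADER)
    if presented is None:
        return False
    # Compare against every accepted secret (e.g. current and previous value
    # during rotation) using a constant-time comparison.
    matched = False
    for secret in accepted_secrets:
        if hmac.compare_digest(presented.encode(), secret.encode()):
            matched = True
    return matched
```

A request arriving through a different CloudFront distribution would pass the prefix-list check on the security group but fail this check, because it cannot know the secret.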
Depending on your security requirements, you can choose to encrypt the HTTP traffic end-to-end or perform TLS offloading at the ELB in order to save compute resources assigned to the application. The Unicorn Tycoon team in particular decided to encrypt control plane traffic end-to-end, so it cannot be intercepted by an adversary who manages to gain access to their infrastructure. For this purpose, the team leveraged AWS Private CA and cert-manager, a popular Kubernetes add-on to distribute, renew, and revoke certificates (as described here), and also configured TLS encryption towards the database. For traffic between EC2 hosts the team also relied on the underlying hardware-based encryption provided by the Nitro system. Similar precautions around encryption have been taken on the data plane side as well, where the team adopted the Datagram Transport Layer Security (DTLS) protocol.
If a threat actor has already gained initial access to your internal network, their ability to extend their reach to more resources (lateral movement) will also depend on network visibility. To reduce the scope of impact, it is therefore best practice to segment your VPC infrastructure into private and public subnets. Traffic between resources in these subnets can be controlled by means of network ACLs, as well as security groups. The Unicorn Tycoon team followed these best practices. To manage traffic at the pod level, they also installed the Calico network policy engine add-on on both EKS clusters. Moreover, by using security groups for pods they were able to enforce granular control over which pods have network access to the Amazon Aurora database.
Knowing that security groups are stateful and that connection tracking is prone to resource exhaustion, the team also made sure the Application Load Balancer security group is configured so that it does not use connection tracking, improving DDoS resilience. On the data plane, on the other hand, the team made use of Calico’s XDP capabilities in order to block DDoS traffic explicitly and in a way that is much more efficient than via iptables. It’s worth noting that this blocking is possible because Global Accelerator preserves the source IPs of the incoming traffic.
On the DNS side, the team quickly abandoned the idea of standing up their own DNS server and relies on Amazon Route 53 for serving DNS queries. Route 53 is the only AWS service offering a 100% availability SLA, has inline L3/L4 protection provided by AWS Shield out-of-the-box and boasts built-in protections against DNS laundering attacks (cache busting). It also provides support for configuring DNSSEC, reducing the risk of domain hijacking, cache poisoning, and other malicious activities that can compromise the integrity and confidentiality of DNS queries and responses.
As the EC2 instances on the control plane require outbound internet access, the Unicorn Tycoon team needed to make sure that egress traffic originating from these instances is scanned for malicious activity and blocked if such activity is detected. This could include scenarios like data exfiltration or a compromised instance downloading a malicious payload from C2 servers. For that purpose the team utilized both the Route 53 Resolver DNS Firewall and AWS Network Firewall. The former lets you block FQDN to IP resolution for domains listed as malicious, while the latter evaluates each outgoing network packet against a set of predefined rules. AWS Network Firewall’s intrusion prevention system (IPS) in particular provides active traffic flow inspection so you can identify and block vulnerability exploits using signature-based detection (since recently this also applies to TLS-encrypted egress traffic). The Unicorn Tycoon team has leveraged AWS managed rule groups in its firewall policy configuration, while also applying a number of custom Suricata rules based on best practices.
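To make the idea of blocking resolution for listed domains more concrete, here is a minimal Python sketch of matching a queried name against a deny list. It is a simplified model rather than the DNS Firewall implementation: the function name and the sample domains are made up, and the sketch assumes that a `*.` entry matches subdomains only, not the parent domain itself.

```python
def is_blocked(fqdn: str, deny_list: set) -> bool:
    """Return True if the queried name matches an exact or wildcard deny-list entry."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    name = fqdn.rstrip(".").lower()
    if name in deny_list:
        return True
    labels = name.split(".")
    # Check wildcard entries such as "*.c2.example" against every parent domain.
    for i in range(1, len(labels)):
        if "*." + ".".join(labels[i:]) in deny_list:
            return True
    return False
```

With a deny list of `{"malware.example", "*.c2.example"}`, a query for `beacon.c2.example` would be blocked, while `aws.amazon.com` would resolve normally.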
Detective controls
There are multiple sources of logs that can be used to detect anomalous activity within the Unicorn Tycoon network and at its edge. This includes, but is not limited to:
CloudFront Access Logs - CloudFront logs contain information about requests made to your distribution and can be used to identify patterns of suspicious activity, as well as security misconfigurations or SSL/TLS certificate issues.
WAF Logs - Logs generated by AWS WAF allow you to identify Layer 7 attacks on your application and to tune your WAF rules to minimize false positives.
VPC Flow Logs - VPC flow logs enable you to capture information about the IP traffic going to and from network interfaces in your VPC, allowing you to look for suspicious or unusual activity between resources within the VPC, including EKS pods.
ALB Access Logs - Access logs capture detailed information about requests sent to your load balancer. Each log contains information about target server responses (including the HTTP status code), which can be used as a factor for pinpointing malicious activity.
Global Accelerator Flow Logs - These logs capture metadata about the traffic flowing through Global Accelerator, which you can use to identify potential security vulnerabilities.
DNS Logs - If you use AWS DNS resolvers within your VPC, DNS logs can help pinpoint DNS queries towards malicious websites originating from your resources.
Network Firewall logs – For the stateful rule engine, logging can provide detailed information about scanned network packets, and any stateful rule action taken against these packets.
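As an illustration of how such logs can feed a detective control, the following Python sketch parses VPC Flow Logs records in the default (version 2) format and counts rejected connections per source address. The function names and the `min_rejects` threshold are assumptions made for this example; in practice the same aggregation would more likely be expressed as an Athena or Logs Insights query.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> Optional[Dict[str, str]]:
    """Split one space-separated record; return None if the field count is off."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def rejected_sources(lines: Iterable[str], min_rejects: int = 2) -> Dict[str, int]:
    """Count REJECT records per source IP and keep those at or above the threshold."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record is not None and record["action"] == "REJECT":
            counts[record["srcaddr"]] += 1
    return {ip: n for ip, n in counts.items() if n >= min_rejects}
```

A source address that repeatedly shows up with REJECT records, for example against port 22 on many interfaces, is a typical candidate for closer inspection.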
To detect whether the EC2 instances backing the EKS clusters are compromised, the Unicorn Tycoon team leverages Amazon GuardDuty, which uses its own copies of VPC Flow Logs and DNS Logs to identify security threats based on observed network activity. Moreover, all log sources mentioned above can be written to Amazon S3, where the data can be queried by means of Amazon Athena to detect suspicious patterns in network traffic. Alternatively, you can forward logs to Amazon CloudWatch Logs and make use of its advanced features like Logs Insights or metric filters. Speaking of metrics, they can be valuable sources for detective controls as well, for instance:
CloudFront Metrics - Metrics emitted by CloudFront to Amazon CloudWatch can be used to configure alarms notifying users of potential threats (examples: 403 error rate, origin latency).
WAF and Shield Advanced Metrics - WAF metrics help you assess the performance of your WAF rules (false positives, false negatives), while Shield metrics allow you to detect ongoing DDoS attacks on your resources.
ALB Metrics - Selected load balancer metrics can be leveraged from a threat detection perspective. For example, a sudden spike in the TargetResponseTime metric can be one of the indicators of the target application being overwhelmed by a DDoS attack.
Unicorn Tycoon uses these metrics to configure CloudWatch alarms that notify its security team whenever a potential security threat is detected.
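An alarm of this kind could be described as follows. The Python sketch below only assembles the parameters for a CloudWatch alarm on the CloudFront 403 error rate; the function name, the alarm name, the 5% threshold and the evaluation window are illustrative assumptions, not values used by the team.

```python
def cloudfront_403_alarm(distribution_id: str, topic_arn: str,
                         threshold_percent: float = 5.0) -> dict:
    """Build alarm parameters for a sustained rise in the 403 error rate."""
    return {
        "AlarmName": f"cloudfront-403-rate-{distribution_id}",
        "Namespace": "AWS/CloudFront",
        # 403ErrorRate is only emitted when additional metrics are enabled
        # on the distribution.
        "MetricName": "403ErrorRate",
        "Dimensions": [
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        "Statistic": "Average",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 5,      # five consecutive breaching minutes
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic for the security team
    }
```

The resulting dictionary could be passed as keyword arguments to the `put_metric_alarm` call of a boto3 CloudWatch client created in us-east-1, which is where CloudFront publishes its metrics.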
Apart from analyzing network activity, network configuration needs to be continuously inspected as well in order to reduce the attack surface made available to a malicious actor and identify possible backdoors. The Unicorn Tycoon team makes use of Amazon Inspector to ensure that EC2 instances on both the control and data plane are reachable only via designated network paths. They also leverage AWS Firewall Manager to define policies for identifying security group misconfigurations.
Responsive controls
On the control plane the Unicorn Tycoon team uses the Security Automation for AWS WAF solution, which continuously analyzes log data from WAF, CloudFront and ALB to identify HTTP flood attacks, as well as scanners and probes, and automatically adjusts WAF rules to mitigate identified network threats. The CloudFront distribution is additionally protected by AWS Shield Advanced, and automatic application layer DDoS mitigation has been turned on. In case the automatic mitigation proves insufficient, proactive support from the Shield Response Team (SRT) has been enabled on the protected resource and a calculated health check based on CloudFront and ALB metrics has been configured.
On the data plane the team has decided to employ their own automation which analyzes Global Accelerator flow logs in order to update an IP deny list maintained as a Calico GlobalNetworkSet and enforced by XDP.
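The team's automation is not described in detail, so the following Python sketch only shows one way such a pipeline could look. It assumes the flow logs have already been reduced to (source IP, packet count) pairs, flags sources above a packet threshold, and renders the result as a Calico GlobalNetworkSet manifest. The function name, the resource name `ddos-deny-list`, the `role` label and the threshold are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Iterable, Tuple


def build_deny_set(samples: Iterable[Tuple[str, int]], packet_threshold: int,
                   name: str = "ddos-deny-list") -> dict:
    """Aggregate packets per source IP and emit a GlobalNetworkSet of offenders."""
    totals = Counter()
    for src_ip, packets in samples:
        totals[src_ip] += packets
    offenders = sorted(ip for ip, total in totals.items() if total > packet_threshold)
    # ip_network validates each address and renders it in CIDR notation (/32 for IPv4).
    nets = [str(ipaddress.ip_network(ip)) for ip in offenders]
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkSet",
        # A deny policy would select this set via its label (label is hypothetical).
        "metadata": {"name": name, "labels": {"role": "deny-list"}},
        "spec": {"nets": nets},
    }
```

The manifest would then be applied to the cluster, where a Calico policy selecting the set by label drops matching traffic. A production version would also need expiry of old entries and safeguards against blocking legitimate players behind shared IP addresses.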
For clarity, the key security controls in the networking domain adopted by Unicorn Tycoon have been outlined in the below diagram:
Additional Considerations
For the Game Server (data plane) part of the architecture, a path often taken by customers does not involve Global Accelerator, but relies on exposing EC2 instances directly to the internet by means of Elastic IPs. This kind of architecture does not make use of the security controls applied at the AWS edge, but has the upside of supporting AWS Shield Advanced protection and the ability to insert AWS Network Firewall into the ingress path, if required.
For protecting traffic across EKS pods, as well as cluster ingress and egress traffic, a holistic approach based on implementing a service mesh can also be considered. AWS offers a managed service in this space, AWS App Mesh. It’s also worth noting that since late 2023 the VPC CNI plugin has added support for the Network Policy API and can be considered as a replacement for Calico in this regard.
While stateless architecture components (ALB, EKS Hosts, Application services) can easily scale out to absorb DDoS traffic, provided appropriate IP space is there, automatically scaling database resources is much more difficult. The Unicorn Tycoon team is considering various approaches for protecting database resources from becoming exhausted, including the use of Amazon RDS Proxy for connection pooling, as well as an Auto Scaling ElastiCache for Redis cluster as a read-through cache sitting in front of the database.
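The read-through pattern mentioned above can be illustrated with a short Python sketch. An in-memory dictionary stands in for the ElastiCache cluster and a plain callable stands in for the database query; the class name, the `loader` and `clock` parameters and the TTL are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Serve reads from the cache and fall back to the loader only on a miss."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # stands in for the database query
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: the database is not touched
        value = self._loader(key)      # miss or expired: one database read
        self._entries[key] = (now + self._ttl, value)
        return value
```

During a request flood for the same leader board, only the first read within each TTL window would reach the database; the rest would be served from the cache.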
A full list of best practices around DDoS resiliency can be found in the AWS Best Practice Whitepaper. Moreover, AWS has recently made available a new AWS Systems Manager runbook (AWSPremiumSupport-DDoSResiliencyAssessment). This automation runbook helps AWS Enterprise Support and Business Support customers quickly assess the resiliency of their AWS account resources against DDoS attacks and pinpoint common misconfigurations.
Conclusion
Before we leave the Unicorn Tycoon team to their work, we want to highlight that it is important for you to tailor security controls to your specific requirements. Security should not be a “copy & paste” exercise but the actual measures taken should consider the specific context that you operate in. The blog post listed a variety of possibilities to apply security controls across different layers of the AWS network, but the controls chosen should ultimately be a function of threat analysis, your budget as well as regulatory implications.
Want us to have a look at your current AWS configuration from a security perspective? Get in touch with us to find out about the next steps. More about our cloud advisory offering can be found here.