Backend topology in AWS
Overview
Each region the backend is deployed in is structured as follows:
Detailed flow
Cloudflare
- All incoming traffic to our servers is routed through Cloudflare
- Cloudflare serves as the DNS provider for our domain (legionsecurity.ai); a sketch of how a subdomain record might be created is shown after this list
- In addition, we use Cloudflare Zero Trust access policies to restrict access to the session viewer ({subdomain}.legionsecurity.ai/session) to Legion internal users only
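To make the DNS piece concrete, here is a minimal sketch of creating a proxied record for one subdomain through the Cloudflare v4 API. The zone ID, API token, subdomain, and ALB hostname are placeholders, and the real records may well be managed through the dashboard or infrastructure-as-code rather than a script like this.

```typescript
// Illustrative sketch only -- zone ID, token, and hostnames below are placeholders.
const CF_API = "https://api.cloudflare.com/client/v4";

async function createProxiedSubdomainRecord(zoneId: string, apiToken: string): Promise<void> {
  const res = await fetch(`${CF_API}/zones/${zoneId}/dns_records`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type: "CNAME",
      name: "tenant-a.legionsecurity.ai",                            // a customer subdomain (placeholder)
      content: "backend-alb-1234567890.us-east-1.elb.amazonaws.com", // the regional ALB's DNS name (placeholder)
      proxied: true,                                                 // keep traffic flowing through Cloudflare
    }),
  });
  if (!res.ok) {
    throw new Error(`Cloudflare API error: ${res.status} ${await res.text()}`);
  }
}
```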
Application load balancer (ALB)
- Each of our subdomains ({subdomain}.legionsecurity.ai) has a DNS record in Cloudflare routing incoming requests from the internet to the relevant region's ALB in our VPC
- Incoming traffic to the ALB is restricted to traffic coming from Cloudflare (defined by the list of official Cloudflare IP ranges)
- The ALB spans multiple AZs (Availability Zones), with a node in each of the public subnets in our VPC
- The ALB has listener rules which define how it behaves when an incoming API request is received:
  - If the request is HTTP (port 80) > redirect it to HTTPS (port 443)
  - If the request is HTTPS (port 443) > present a certificate matching the subdomain's name and forward the request internally on port 80 to the target group in which the running instances of our service register themselves. The target group performs health checks to track which instances of our service are running and can receive incoming calls
  - All other requests are dropped (see the sketch after this list)
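The sketch below shows roughly how this listener setup could be expressed in AWS CDK (TypeScript): a security group limited to Cloudflare's IP ranges, an HTTP-to-HTTPS redirect, and an HTTPS listener that terminates TLS and forwards to a target group with health checks. The construct names, health-check path, certificate ARN, and the truncated CIDR list are assumptions, not the actual configuration.

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";
import { Construct } from "constructs";

// Truncated for readability -- the real security group uses Cloudflare's full published list.
const CLOUDFLARE_CIDRS = ["173.245.48.0/20", "103.21.244.0/22"];

export class EdgeStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 3 }); // stands in for the real VPC

    // Only Cloudflare's IP ranges may reach the ALB.
    const albSg = new ec2.SecurityGroup(this, "AlbSg", { vpc, allowAllOutbound: true });
    for (const cidr of CLOUDFLARE_CIDRS) {
      albSg.addIngressRule(ec2.Peer.ipv4(cidr), ec2.Port.tcp(80), "HTTP from Cloudflare (redirected)");
      albSg.addIngressRule(ec2.Peer.ipv4(cidr), ec2.Port.tcp(443), "HTTPS from Cloudflare");
    }

    // Internet-facing ALB with a node in each public subnet (one per AZ).
    const alb = new elbv2.ApplicationLoadBalancer(this, "Alb", {
      vpc,
      internetFacing: true,
      securityGroup: albSg,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    });

    // Target group the backend tasks register into; health checks track which instances can receive traffic.
    const backendTg = new elbv2.ApplicationTargetGroup(this, "BackendTg", {
      vpc,
      port: 80,
      protocol: elbv2.ApplicationProtocol.HTTP,
      targetType: elbv2.TargetType.IP,
      healthCheck: { path: "/health", interval: cdk.Duration.seconds(30) }, // assumed path and interval
    });

    // HTTP (80) -> redirect to HTTPS (443). `open: false` keeps the Cloudflare-only ingress rules above.
    alb.addListener("Http", {
      port: 80,
      open: false,
      defaultAction: elbv2.ListenerAction.redirect({ protocol: "HTTPS", port: "443", permanent: true }),
    });

    // HTTPS (443) -> present the subdomain's certificate, forward internally on port 80 to the target group.
    alb.addListener("Https", {
      port: 443,
      open: false,
      certificates: [elbv2.ListenerCertificate.fromArn("arn:aws:acm:region:account:certificate/placeholder")],
      defaultTargetGroups: [backendTg],
    });
  }
}
```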
ECS
- ECS Fargate is the serverless infrastructure used to run our containers
- Within ECS we have a task definition for the backend service and a task definition for the worker; these define how to run our services: which image to run, which environment variables and secrets to pass to the service when it boots, and the autoscaling rules for the backend
- The ECS cluster is connected to all our private subnets and spreads the running instances of our services evenly across them
- We use two dedicated IAM roles for each ECS task definition: a task execution role, which contains the permissions the ECS cluster needs to boot our service (e.g. read secrets from Secrets Manager, pull the image from ECR), and a task role, which contains the permissions our service needs to call other AWS services at runtime (e.g. read/write objects in our S3 buckets). A sketch of this setup follows below
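A rough CDK (TypeScript) sketch of the backend task definition, its two IAM roles, the Fargate service in the private subnets, and a CPU-based autoscaling rule is shown below. The CPU/memory sizes, secret and repository names, container port, and scaling thresholds are illustrative assumptions.

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecr from "aws-cdk-lib/aws-ecr";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as iam from "aws-cdk-lib/aws-iam";
import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";

export class BackendServiceStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 3 }); // stands in for the real VPC
    const cluster = new ecs.Cluster(this, "Cluster", { vpc });

    // Execution role: what ECS itself needs to boot the task (pull the image, read secrets).
    const executionRole = new iam.Role(this, "BackendExecutionRole", {
      assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
    });
    // Task role: what the running service needs (e.g. read/write objects in S3).
    const taskRole = new iam.Role(this, "BackendTaskRole", {
      assumedBy: new iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
    });

    // Placeholder secret and image repository names.
    const appSecret = secretsmanager.Secret.fromSecretNameV2(this, "AppSecret", "backend/app-config");
    const repo = ecr.Repository.fromRepositoryName(this, "Repo", "backend");

    const taskDef = new ecs.FargateTaskDefinition(this, "BackendTaskDef", {
      cpu: 512,
      memoryLimitMiB: 1024,
      executionRole,
      taskRole,
    });
    taskDef.addContainer("backend", {
      image: ecs.ContainerImage.fromEcrRepository(repo),
      environment: { NODE_ENV: "production" },                           // plain environment variables
      secrets: { APP_CONFIG: ecs.Secret.fromSecretsManager(appSecret) }, // injected at boot
      portMappings: [{ containerPort: 80 }],
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: "backend" }),
    });

    // Run the tasks in the private subnets; ECS spreads them across the subnets' AZs.
    const service = new ecs.FargateService(this, "BackendService", {
      cluster,
      taskDefinition: taskDef,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    });

    // Autoscaling rules for the backend (thresholds are illustrative).
    const scaling = service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
    scaling.scaleOnCpuUtilization("CpuScaling", { targetUtilizationPercent: 60 });
  }
}
```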
S3 + MongoDB Atlas
- Both MongoDB Atlas and the various AWS services we use (S3, SES, SQS, etc.) are hosted on AWS, which allows us to use private endpoints so that communication to them from our ECS cluster stays inside the private subnets and does not traverse the public internet (sketched below)
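As an illustration, the CDK (TypeScript) sketch below adds a gateway endpoint for S3, an interface endpoint for SQS, and a PrivateLink interface endpoint toward MongoDB Atlas; the other AWS services we use would get similar interface endpoints. The Atlas endpoint-service name and port are placeholders based only on the general PrivateLink pattern.

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { Construct } from "constructs";

export class PrivateEndpointsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 3 }); // stands in for the real VPC

    // Gateway endpoint for S3: keeps S3 traffic on the AWS network.
    vpc.addGatewayEndpoint("S3Endpoint", {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    // Interface endpoint for SQS, placed in the private subnets.
    vpc.addInterfaceEndpoint("SqsEndpoint", {
      service: ec2.InterfaceVpcEndpointAwsService.SQS,
      subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    });

    // PrivateLink endpoint toward MongoDB Atlas; the endpoint-service name and port
    // come from the Atlas private-endpoint setup and are placeholders here.
    new ec2.InterfaceVpcEndpoint(this, "AtlasEndpoint", {
      vpc,
      service: new ec2.InterfaceVpcEndpointService("com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0", 1024),
      subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    });
  }
}
```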
OpenAI
- All outgoing calls from our service to the internet, including to OpenAI, go from our task (in the private subnet) to the NAT Gateway (in the public subnet), which sends them to the internet; each NAT Gateway uses a single public IP address (see the sketch below)
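A minimal CDK (TypeScript) sketch of this subnet and NAT layout is shown below; the AZ count, the one-NAT-gateway-per-AZ choice, and the CIDR masks are assumptions for illustration.

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { Construct } from "constructs";

export class NetworkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Public subnets hold the ALB nodes and the NAT gateways; private subnets hold the ECS tasks.
    // Outbound calls from the private subnets (e.g. to OpenAI) leave via a NAT gateway,
    // so they appear on the internet with that NAT gateway's single public IP.
    new ec2.Vpc(this, "Vpc", {
      maxAzs: 3,      // illustrative AZ count
      natGateways: 3, // one NAT gateway per AZ -- assumed count
      subnetConfiguration: [
        { name: "public", subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: "private", subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
      ],
    });
  }
}
```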
Customer proxies
- Some tools used by customers may be self-hosted in the customer's VNET and not accessible from the internet, in which case the worker and the poller for worker investigations won't be able to reach those tools
- To overcome this we will run a dedicated EC2 instance for each such customer, connect it to the customer's VPN, and use it as a proxy through which the poller and worker reach those self-hosted tools (see the sketch below)
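As a sketch of what the worker side of this could look like, the TypeScript snippet below routes a call to a self-hosted tool through a per-customer proxy using undici's ProxyAgent. The proxy URL, port, environment variable, and auth header are hypothetical; the actual proxy mechanism on the EC2 instance may differ.

```typescript
import { fetch, ProxyAgent } from "undici";

// Hypothetical: route calls to a customer's self-hosted tool through that customer's
// dedicated proxy instance. The proxy URL (host/port) and env var are placeholders.
const customerProxy = new ProxyAgent(process.env.CUSTOMER_PROXY_URL ?? "http://10.0.42.10:3128");

export async function callSelfHostedTool(toolUrl: string, apiKey: string): Promise<unknown> {
  const res = await fetch(toolUrl, {
    dispatcher: customerProxy, // only this customer's traffic goes through their proxy
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) {
    throw new Error(`Tool call failed: ${res.status}`);
  }
  return res.json();
}
```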