Infrastructure Specifications
Pending Approval
These specifications are pending final approval and may change
Architecture Components
Network Design
- Region: All services are meant to be deployed within a single AWS region
- VPC: Custom Virtual Private Cloud with the following subnet structure:
- Public Subnet: Hosts internet-facing resources
- Compute Private Subnets: Isolated environment for processing components
- Database Private Subnets: Isolated environment for data storage components
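As an illustration of the subnet structure above, the sketch below carves a VPC address range into one public block and separate compute and database blocks. The `10.0.0.0/16` range, the `/24` subnet size, the block indices, and the choice of two subnets per private tier are all assumptions made for the example; the specification does not state any of them.

```python
import ipaddress

# Hypothetical VPC CIDR; the specification does not define one.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Split the VPC range into /24 blocks and hand them out per tier.
blocks = list(VPC_CIDR.subnets(new_prefix=24))

# Assumed layout: one public subnet, two subnets for each private tier.
SUBNET_PLAN = {
    "public": [blocks[0]],
    "compute-private": [blocks[10], blocks[11]],
    "database-private": [blocks[20], blocks[21]],
}


def plan_is_valid(plan, vpc=VPC_CIDR):
    """Check that every subnet lies inside the VPC and no two subnets overlap."""
    nets = [net for tier in plan.values() for net in tier]
    inside = all(net.subnet_of(vpc) for net in nets)
    disjoint = not any(
        a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:]
    )
    return inside and disjoint


for tier, nets in SUBNET_PLAN.items():
    print(tier, [str(n) for n in nets])
```

Keeping the tiers in distinct, non-adjacent blocks leaves room to add subnets to a tier later without renumbering the others.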
Component Breakdown
External Access Layer
- Cloudflare: Provides DNS management, DDoS protection with rate limiting, a Web Application Firewall (WAF), and reverse proxy services
- Public Endpoint: External traffic entry point
API Layer
- Webservice Lambda + HTTPS Function Endpoint:
- Serverless webhook receiver
- Exposed only to whitelisted IP addresses from Cloudflare
- Security measures:
- Cloudflare rate limiting protection
- Domain validation within the function code
- Note: IP whitelisting not implemented due to application requirements
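One possible reading of "domain validation within the function code" is a check of the request's `Host` header against an allow-list before any further processing. The sketch below shows that reading only; the `ALLOWED_HOSTS` value, the helper name `is_allowed_host`, and the response bodies are hypothetical, and the actual validation rule is not described in this specification.

```python
# Hypothetical allow-list; the real domain(s) are not given in the specification.
ALLOWED_HOSTS = {"webhooks.example.com"}


def is_allowed_host(headers, allowed_hosts=ALLOWED_HOSTS):
    """Return True if the Host header names an allowed domain."""
    host = ""
    for name, value in headers.items():
        # Header names are matched case-insensitively.
        if name.lower() == "host":
            host = value
            break
    # Drop an optional port, normalise case, and strip a trailing dot.
    host = host.split(":")[0].strip().lower().rstrip(".")
    return host in allowed_hosts


def handler(event, context):
    """Lambda-style entry point: reject requests for unexpected domains."""
    headers = event.get("headers") or {}
    if not is_allowed_host(headers):
        return {"statusCode": 403, "body": "forbidden"}
    # Payload validation and publishing to RabbitMQ would follow here (omitted).
    return {"statusCode": 202, "body": "accepted"}
```

For example, `handler({"headers": {"host": "webhooks.example.com"}}, None)` is accepted, while a request carrying any other host is rejected before it reaches the queue.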
Message Queue
- RabbitMQ EC2 Instance:
- Self-hosted on EC2 instances
- Internal access only (not accessible from public internet)
- Acts as the central message broker between the webhook and processors
Processing Layer
- Elastic Container Service (ECS):
- Hosts the request processor applications
- Configured with autoscaling policies to handle variable loads
- Internal access only (not accessible from public internet)
- Processes messages from RabbitMQ queue
Data Storage
MySQL/MariaDB Database:
- RDS instance for structured data storage
- Internal access only (no publicly exposed endpoint)
- Located in database private subnets
Redis Instance:
- Used for caching and transaction locks
- Internal access only
- Located in database private subnets
Networking Components
- NAT Gateway:
- Managed AWS NAT Gateway solution
- Allows outbound internet connectivity for resources in private subnets
- Provides additional security by preventing direct inbound connections
Data Flow
- External requests arrive through Cloudflare to the public endpoint
- The Lambda function processes the incoming webhook request, validates it, and forwards it to RabbitMQ
- The message is queued in RabbitMQ
- ECS-hosted processors consume messages from the queue
- Processors interact with the database and Redis as needed
- Responses can be sent back to external systems via the NAT Gateway
Security Considerations
Network Security
- Public services are isolated to the public subnet
- Processing and database components are isolated in private subnets
- NAT Gateway manages outbound connections from private subnets
- No direct public access to databases or processing components
Application Security
- Web Service Security:
- Only accessible from Cloudflare IPs
- Cloudflare rate limiting to prevent DDoS attacks
- Cloudflare WAF to restrict endpoint access to approved (whitelisted) sources
Data Security
- Database and Redis instances are not publicly accessible
- All internal communications occur within the VPC
- Data encryption in transit and at rest
Scaling Considerations
Horizontal Scaling
- Lambda functions scale automatically based on request volume
- ECS service uses autoscaling policies to adjust to processing demands
Scaling based on Queue Back Pressure
Since the Request Processor will have a single queue in RabbitMQ, that queue will be monitored for back pressure (the difference between the rate of ingestion and the rate of consumption). If the queue size has been rising consistently for more than 5 minutes and the rate of ingestion has been consistently higher than the rate of consumption, additional instances should be added.
Important Note
Once the queue size has reached 0 for a period of 5 minutes, the additional instances should be discarded.
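The back-pressure rule can be sketched as a small decision function. This is a minimal illustration, assuming one metrics sample per minute, so that six consecutive samples span a 5-minute window; the `QueueSample` type, the function name, and the returned labels are hypothetical and not part of the specification.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueueSample:
    depth: int            # messages waiting in the queue
    ingest_rate: float    # messages published per second
    consume_rate: float   # messages consumed per second


WINDOW_MINUTES = 5  # window length taken from the rule above


def scaling_decision(samples: Sequence[QueueSample], window: int = WINDOW_MINUTES) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for one-minute samples (oldest first)."""
    # window + 1 samples are needed to cover `window` minutes.
    recent = list(samples)[-(window + 1):]
    if len(recent) < window + 1:
        return "hold"
    # Queue empty for the whole window: discard the additional instances.
    if all(s.depth == 0 for s in recent):
        return "scale_in"
    # Queue depth strictly rising and ingestion outpacing consumption throughout.
    rising = all(b.depth > a.depth for a, b in zip(recent, recent[1:]))
    outpaced = all(s.ingest_rate > s.consume_rate for s in recent)
    if rising and outpaced:
        return "scale_out"
    return "hold"
```

Requiring both conditions avoids scaling out on a short burst that the existing processors are already draining.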
Scaling based on Average Resource Utilization
- If the average CPU utilization is at or above 80% for more than 15 minutes, additional instances should be added.
- If the average CPU utilization is at or above 90% for more than 5 minutes, additional instances should be added.
Important Note
Once the queue size has reached 0 for a period of 5 minutes, the additional instances should be discarded. If CPU utilization is at or above 80% with a queue size of 0, vertical scaling may be required.
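The CPU thresholds can be expressed the same way. The sketch below assumes one average-CPU sample per minute and interprets "high CPU with an empty queue" as a hint towards vertical scaling rather than adding instances; the function names and returned labels are hypothetical.

```python
from typing import Sequence


def sustained_at_or_above(cpu: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if the latest minutes + 1 one-minute samples are all at or above threshold."""
    recent = list(cpu)[-(minutes + 1):]
    return len(recent) == minutes + 1 and all(v >= threshold for v in recent)


def cpu_scaling_decision(cpu: Sequence[float], queue_depth: int) -> str:
    """Apply the 80% / 15 minute and 90% / 5 minute rules to average CPU samples."""
    overloaded = (
        sustained_at_or_above(cpu, 90.0, 5)
        or sustained_at_or_above(cpu, 80.0, 15)
    )
    if not overloaded:
        return "hold"
    # High CPU with nothing queued suggests instances are undersized,
    # so more instances would not help.
    if queue_depth == 0:
        return "consider_vertical_scaling"
    return "scale_out"
```

For instance, sixteen consecutive samples at 85% with a non-empty queue yield `scale_out`, while only six such samples yield `hold` because the 80% rule needs the longer window.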
Vertical Scaling
- EC2 instances for RabbitMQ can be resized as needed
- Database instances can be upgraded to larger instance types
Operations and Monitoring
Monitoring Infrastructure
Infrastructure Monitoring
RDS:
- CloudWatch metrics exported to Grafana dashboards
ECS:
- CloudWatch metrics exported to Grafana dashboards
- Loki for log aggregation displayed in Grafana
Redis:
- CloudWatch metrics exported to Grafana dashboards
RabbitMQ EC2:
- Zabbix monitoring for host-level metrics
- Prometheus for RabbitMQ-specific metrics
- All metrics visualized in Grafana dashboards
Lambda:
- Loki for log aggregation
- CloudWatch for function metrics exported to Grafana
NAT Gateway:
- CloudWatch metrics exported to Grafana dashboards
Alerting Configuration
Alert Triggers and Routing
- RDS: CloudWatch alarms for database performance thresholds
- ECS: CloudWatch alarms for container and service health
- Redis: CloudWatch alarms for cache performance metrics
- RabbitMQ: Zabbix and Grafana alarms for queue and node health
- Lambda: CloudWatch alarms for function errors and performance
- NAT Gateway: CloudWatch alarms for bandwidth and connection limits
All alarms are integrated with PagerDuty for incident management, on-call rotation, and escalation policies.
Breakdown of Work
The estimated effort is 17.5 blocking hours plus 24 non-blocking hours (the 3 days of monitoring setup, counted at 8 hours per day), broken down as follows:
| Component | Deployment Time | Dependencies | Complexity | Blocking |
|---|---|---|---|---|
| VPC & Network Setup | 2 hours | None | Medium | Yes |
| Cloudflare Configuration | 30 minutes | Domain registration | Low | Yes |
| Lambda Function + API Gateway | 2 hours | VPC, Security Groups | Medium | Yes |
| RabbitMQ EC2 Deployment | 6 hours | VPC, Private Subnets | High | Yes |
| ECS Cluster & Services | 4 hours | VPC, Private Subnets, RabbitMQ | High | Yes |
| RDS Database | 1 hour | VPC, DB Subnets | Medium | Yes |
| Redis Deployment | 1 hour | VPC, DB Subnets | Low | Yes |
| NAT Gateway | 1 hour | VPC, Public Subnet | Low | Yes |
| Monitoring Setup | 3 days | All components deployed | High | No |
Blocking vs Non-Blocking Hours
- Blocking Hours refers to work which must be completed before the product can be launched.
- Non-Blocking Hours refers to work which can or should be completed after the product is launched.
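The stated totals can be re-derived from the breakdown table. The short check below copies the durations from the table and assumes an 8-hour working day for the "3 days" entry, which is the only conversion that matches the 24 non-blocking hours; the `HOURS_PER_DAY` constant and the `totals` helper are illustrative names.

```python
HOURS_PER_DAY = 8  # assumption: 3 days x 8 hours = 24 non-blocking hours

# (component, hours, blocking) copied from the breakdown table.
TASKS = [
    ("VPC & Network Setup", 2.0, True),
    ("Cloudflare Configuration", 0.5, True),
    ("Lambda Function + API Gateway", 2.0, True),
    ("RabbitMQ EC2 Deployment", 6.0, True),
    ("ECS Cluster & Services", 4.0, True),
    ("RDS Database", 1.0, True),
    ("Redis Deployment", 1.0, True),
    ("NAT Gateway", 1.0, True),
    ("Monitoring Setup", 3 * HOURS_PER_DAY, False),
]


def totals(tasks):
    """Return (blocking_hours, non_blocking_hours) for the task list."""
    blocking = sum(hours for _, hours, is_blocking in tasks if is_blocking)
    non_blocking = sum(hours for _, hours, is_blocking in tasks if not is_blocking)
    return blocking, non_blocking


blocking_hours, non_blocking_hours = totals(TASKS)
print(f"blocking: {blocking_hours} h, non-blocking: {non_blocking_hours} h")
```

Note that the blocking figure is a sum of task durations; the elapsed time could be shorter if independent tasks, such as the RDS and Redis deployments, are carried out in parallel.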