High Availability & Disaster Recovery
High availability (HA) and disaster recovery (DR) are critical for production workloads. AWS infrastructure spans multiple Availability Zones (AZs) and regions to support resilient designs.
Availability Zones and Regions
An Availability Zone is one or more physically isolated data centers with independent power, cooling, and networking. A Region is a geographic area containing multiple AZs (AWS states a minimum of three) connected by low-latency fiber.
| Failure Type | Failure Scope | AWS Mitigation |
|---|---|---|
| AZ failure | Outage of one AZ (one or more data centers) | Deploy across 3 AZs |
| Region failure | Entire geographic area outage | Multi-region DR strategy |
| Service failure | AWS service degradation | Use multi-service architecture |
| Human error | Misconfiguration | IaC, automated rollback, change management |
Multi-AZ Architecture
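The resources in this section take their subnet IDs from variables. As a rough sketch of how those subnets might be spread over three AZs, the fragment below looks up the available zones and creates one private subnet in each. The names `var.vpc_id`, `var.vpc_cidr`, `aws_subnet.private`, and the `/24` sizing are assumptions for illustration and do not come from the original configuration.

```hcl
# Hypothetical inputs (not defined in the original configuration).
variable "vpc_id" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

# Look up the AZs currently available in the provider's Region.
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Take the first three AZ names; slice() fails if fewer than three exist.
  azs = slice(data.aws_availability_zones.available.names, 0, 3)
}

# One private subnet per AZ, each a /24 carved out of the VPC CIDR.
resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = var.vpc_id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```

The `private_subnet_ids` output could then feed a variable such as `var.private_subnet_ids`.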
```hcl
# Application Load Balancer across 3 AZs
resource "aws_lb" "web_alb" {
  name               = "web-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids # 3 AZs
}

# Auto Scaling group across AZs
resource "aws_autoscaling_group" "web_asg" {
  name                = "web-asg"
  min_size            = 3
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids # 3 AZs

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  health_check_type         = "ELB"
  health_check_grace_period = 60
}

# RDS Multi-AZ
resource "aws_db_instance" "main" {
  engine         = "postgres"
  instance_class = "db.r6g.large"
  multi_az       = true # Synchronous standby replica in another AZ
  storage_type   = "io1"
  iops           = 3000
}
```
RPO and RTO
- Recovery Point Objective (RPO): the maximum acceptable data loss, measured in time. How far back is the most recent recoverable copy of the data?
- Recovery Time Objective (RTO): the maximum acceptable downtime. How long can the system take to be restored?
```text
      RPO ── data loss window ──►|◄── RTO (downtime) ──►
Time: [--backup--]         [--FAILURE--]  [--recovery--]
```
DR Strategies
| Strategy | RTO | RPO | Cost | Complexity | Description |
|---|---|---|---|---|---|
| Backup & Restore | Hours | 24 hours | Low | Low | Restore from S3/Glacier backups |
| Pilot Light | ~10 min | ~10 min | Medium | Medium | Core services running in DR region (smallest footprint) |
| Warm Standby | ~1 min | ~1 min | High | Medium | DR region running at reduced capacity |
| Active-Active | Near-zero | Near-zero | Very High | High | Both regions serving traffic simultaneously |
Backup & Restore (RTO: hours, RPO: 24h)
The simplest and cheapest strategy. Backups are stored in S3 or Glacier and restored when needed.
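Scheduled backups with an off-region copy can also be declared rather than scripted. The following is a minimal sketch using AWS Backup, assuming a daily schedule to match the 24-hour RPO. The provider alias `aws.dr`, the vault and plan names, the 35-day retention, the `Backup = daily` tag, and `var.backup_role_arn` are all hypothetical choices, not part of the original setup.

```hcl
# Assumption: a second provider alias pointing at the DR Region.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Hypothetical IAM role that AWS Backup assumes to take backups.
variable "backup_role_arn" {
  type = string
}

resource "aws_backup_vault" "primary" {
  name = "daily-backups"
}

resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "daily-backups-dr"
}

resource "aws_backup_plan" "daily" {
  name = "daily-backup-plan"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 3 * * ? *)" # 03:00 UTC every day

    lifecycle {
      delete_after = 35 # days to keep each recovery point
    }

    # Copy each recovery point to the vault in the DR Region.
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
    }
  }
}

# Back up every supported resource tagged Backup = daily.
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = var.backup_role_arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "daily"
  }
}
```

Selecting by tag keeps the plan independent of individual instance or volume IDs.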
```bash
# Automated EBS snapshots
aws ec2 create-snapshots \
  --instance-specification InstanceId=i-1234567890abcdef0,ExcludeBootVolume=false \
  --description "Daily backup $(date +%Y-%m-%d)"

# Cross-region copy: the destination is the Region the command runs against (--region)
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-1234567890abcdef0 \
  --description "DR copy"
```
Pilot Light (RTO: ~10 min, RPO: ~10 min)
Core infrastructure (data, networking) runs in the DR region. Application servers are stopped or scaled to minimum.
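As a rough sketch of this split, the fragment below keeps a cross-Region read replica of the primary database running while the application tier is defined with zero instances. It assumes that `aws_db_instance.main` has automated backups enabled (a prerequisite for read replicas). The provider alias `aws.dr`, `var.dr_private_subnet_ids`, and `aws_launch_template.dr_web` are hypothetical names.

```hcl
# Assumption: a second provider alias pointing at the DR Region.
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Hypothetical list of private subnets in the DR Region.
variable "dr_private_subnet_ids" {
  type = list(string)
}

# Data layer stays live: a cross-Region read replica of the primary database.
resource "aws_db_instance" "dr_replica" {
  provider            = aws.dr
  identifier          = "main-dr-replica"
  replicate_source_db = aws_db_instance.main.arn # cross-Region replicas reference the source by ARN
  instance_class      = "db.r6g.large"
  skip_final_snapshot = true
}

# Application layer is defined but idle: zero instances until failover.
resource "aws_autoscaling_group" "dr_web_asg" {
  provider            = aws.dr
  name                = "dr-web-asg"
  min_size            = 0
  max_size            = 15
  desired_capacity    = 0
  vpc_zone_identifier = var.dr_private_subnet_ids

  launch_template {
    id      = aws_launch_template.dr_web.id # hypothetical launch template in the DR Region
    version = "$Latest"
  }
}
```

On failover, the replica would be promoted to a standalone instance and the group's capacity raised before DNS is switched.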
```text
   Primary Region               DR Region
┌──────────────────┐      ┌──────────────────┐
│ App Servers (ON) │      │ App Servers (OFF)│
│ DB (Active)      │─repl─│ DB (Standby)     │
│ Route53 (Active) │      │ Route53 (Standby)│
└──────────────────┘      └──────────────────┘
         │                         │
         └──────── Traffic ────────┘
          (All traffic to primary)
→ On failover: start app servers, switch DNS
```
Warm Standby (RTO: ~1 min, RPO: ~1 min)
DR region runs at reduced capacity (e.g., 50%). Scaling resources up during failover is faster than provisioning new ones.
```hcl
resource "aws_autoscaling_group" "dr_asg" {
  name             = "dr-app-asg"
  min_size         = 2 # Reduced capacity
  max_size         = 20
  desired_capacity = 2

  launch_template {
    id      = aws_launch_template.dr_app.id
    version = "$Latest"
  }
}
```
Active-Active (RTO: near-zero, RPO: near-zero)
Both regions serve traffic using a global load balancer. Data is replicated bidirectionally.
```text
               ┌─────────────────────┐
               │ Route53 (Latency)   │
               │ or Global Accel.    │
               └──────┬──────────────┘
                      │
         ┌────────────┴────────────┐
         │                         │
┌────────┴────────┐       ┌────────┴────────┐
│    us-east-1    │       │    eu-west-1    │
│ ┌────────────┐  │       │ ┌────────────┐  │
│ │    ALB     │  │       │ │    ALB     │  │
│ ├────────────┤  │       │ ├────────────┤  │
│ │ App (ASG)  │  │       │ │ App (ASG)  │  │
│ ├────────────┤  │       │ ├────────────┤  │
│ │ DynamoDB   │──┼─repl──┼─│ DynamoDB   │  │
│ │ Global     │  │       │ │ Global     │  │
│ │ Tables     │  │       │ │ Tables     │  │
│ └────────────┘  │       │ └────────────┘  │
└─────────────────┘       └─────────────────┘
```
```hcl
# DynamoDB global tables for active-active
resource "aws_dynamodb_table" "global_app" {
  name         = "AppData"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  # Streams must be enabled for a table that has replicas
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "id"
    type = "S"
  }

  # The block name is singular: one replica block per additional Region
  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }
}
```
Danger
Active-active is complex — you must handle data conflicts, eventual consistency, and cross-region latency. Use DynamoDB Global Tables (last-writer-wins) or multi-master CockroachDB. Not all databases support this model.
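On the traffic side, the latency-based Route53 option from the diagram could be declared roughly as follows. This is a sketch under assumptions: `aws_lb.use1_alb` and `aws_lb.euw1_alb` are hypothetical load balancers, one per Region, and `var.hosted_zone_id` is taken to be the same hosted zone used elsewhere in this document.

```hcl
# One latency record per Region; Route53 answers with the Region that has
# the lowest latency for the resolver making the query.
resource "aws_route53_record" "app_use1" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    zone_id                = aws_lb.use1_alb.zone_id # hypothetical ALB in us-east-1
    name                   = aws_lb.use1_alb.dns_name
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_euw1" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    zone_id                = aws_lb.euw1_alb.zone_id # hypothetical ALB in eu-west-1
    name                   = aws_lb.euw1_alb.dns_name
    evaluate_target_health = true
  }
}
```

With `evaluate_target_health` enabled on both aliases, an unhealthy Region is left out of DNS answers and its traffic falls through to the other Region.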
Health Checks and Failover
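Failover routing only has somewhere to send traffic if a SECONDARY record exists alongside the PRIMARY record defined in this section. A minimal sketch of that counterpart is given here, assuming a hypothetical DR-region load balancer named `aws_lb.dr_alb`; the record name and hosted zone match the primary.

```hcl
# SECONDARY counterpart to the PRIMARY failover record.
# Route53 serves this record only while the primary is considered unhealthy.
resource "aws_route53_record" "app_secondary" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    zone_id                = aws_lb.dr_alb.zone_id # hypothetical ALB in the DR Region
    name                   = aws_lb.dr_alb.dns_name
    evaluate_target_health = true
  }
}
```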
```hcl
# Route53 failover routing
resource "aws_route53_record" "app" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier = "primary"

  alias {
    zone_id                = aws_lb.web_alb.zone_id
    name                   = aws_lb.web_alb.dns_name
    evaluate_target_health = true
  }
}

resource "aws_route53_health_check" "primary" {
  fqdn              = "app.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}
```
Key Takeaways
- AWS Regions contain multiple AZs (AWS states a minimum of three), each with independent power, cooling, and networking; deploy production workloads across 3 AZs
- RPO (data loss tolerance) and RTO (downtime tolerance) drive DR strategy selection — define both before choosing a pattern
- Four DR strategies: Backup & Restore (low cost, hours RTO), Pilot Light (medium cost, ~10 min RTO), Warm Standby (high cost, ~1 min RTO), Active-Active (very high cost, near-zero RTO)
- Multi-AZ for high availability: ALB + Auto Scaling across 3 AZs, RDS Multi-AZ (synchronous standby), Route53 health checks
- Multi-region requires Route53 routing (failover/latency/geolocation), cross-region replication (DynamoDB Global Tables, Aurora Global, S3 CRR), and health checks for automated failover
- Active-Active is the most complex — requires bi-directional data replication, conflict handling, and global traffic management