High Availability & Disaster Recovery

High availability (HA) and disaster recovery (DR) are critical for production workloads. AWS infrastructure spans multiple Availability Zones (AZs) and regions to support resilient designs.

Availability Zones and Regions

An Availability Zone (AZ) consists of one or more physically separate data centers with independent power, cooling, and networking. A Region is a geographic area containing multiple AZs (AWS documents a minimum of three per Region) connected by low-latency fiber.

| Concept | Failure Scope | AWS Mitigation |
|---|---|---|
| AZ failure | Single data center outage | Deploy across 3 AZs |
| Region failure | Entire geographic area outage | Multi-region DR strategy |
| Service failure | AWS service degradation | Use multi-service architecture |
| Human error | Misconfiguration | IaC, automated rollback, change management |

Multi-AZ Architecture

# Application Load Balancer across 3 AZs
resource "aws_lb" "web_alb" {
  name               = "web-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids # 3 AZs
}

# Auto Scaling group across AZs
resource "aws_autoscaling_group" "web_asg" {
  name                = "web-asg"
  min_size            = 3
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids # 3 AZs

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # ELB health checks only take effect once the group is attached to a target
  # group (assumed to be defined elsewhere as aws_lb_target_group.web)
  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 60
}

# RDS Multi-AZ
resource "aws_db_instance" "main" {
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  multi_az          = true # Synchronous standby replica in another AZ
  allocated_storage = 100  # Assumed size; io1 storage needs at least 100 GiB
  storage_type      = "io1"
  iops              = 3000
  # Credentials, subnet group, and backup settings omitted for brevity
}
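
One way to reason about why the configuration above spreads capacity over three AZs is to size each AZ so that the remaining ones can still carry peak load after one is lost. The following is a minimal sketch of that arithmetic; the function name `instances_per_az` and the peak figure of 4 instances are illustrative assumptions, not values from the configuration.

```python
import math


def instances_per_az(peak_instances_needed: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing one AZ still covers peak load."""
    if az_count < 2:
        raise ValueError("surviving an AZ failure needs at least 2 AZs")
    # After one AZ fails, (az_count - 1) AZs must carry the whole peak.
    return math.ceil(peak_instances_needed / (az_count - 1))


# Hypothetical workload that needs 4 instances at peak.
for azs in (2, 3):
    per_az = instances_per_az(4, azs)
    print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} in total")
```

Under these assumptions, two AZs need 8 instances in total to survive an AZ loss, while three AZs need only 6, which is one reason three-AZ layouts are a common default.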

RPO and RTO

  • Recovery Point Objective (RPO) — maximum acceptable data loss measured in time. How far back in time will data be lost?
  • Recovery Time Objective (RTO) — maximum acceptable downtime. How long until the system is restored?
          ◄── RPO (data loss) ─────►│◄──── RTO (downtime) ─────►
Time: [backup] ──────────────── [FAILURE] ──────────────── [recovery]
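
To make the two definitions concrete, here is a small worked sketch. It assumes the worst case, in which the failure happens just before the next backup would have run, so the data loss window equals the full backup interval. The names `RecoveryEstimate`, `estimate_recovery`, and `meets_targets`, and all the durations used, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryEstimate:
    rpo_minutes: float  # worst-case data loss window
    rto_minutes: float  # worst-case downtime


def estimate_recovery(backup_interval_min: float,
                      detection_min: float,
                      restore_min: float) -> RecoveryEstimate:
    """Worst case: the failure hits just before the next backup would run."""
    return RecoveryEstimate(
        rpo_minutes=backup_interval_min,
        rto_minutes=detection_min + restore_min,
    )


def meets_targets(estimate: RecoveryEstimate,
                  rpo_target_min: float,
                  rto_target_min: float) -> bool:
    """True only if both the data loss and the downtime targets are met."""
    return (estimate.rpo_minutes <= rpo_target_min
            and estimate.rto_minutes <= rto_target_min)


# Hypothetical figures: daily backups, 15 min to detect, 3 h to restore.
daily = estimate_recovery(backup_interval_min=24 * 60,
                          detection_min=15,
                          restore_min=180)
print(daily)
print(meets_targets(daily, rpo_target_min=60, rto_target_min=240))
```

With these numbers the downtime target is met, but a one-hour RPO target is not, because up to a day of data could be lost.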

DR Strategies

| Strategy | RTO | RPO | Cost | Complexity | Description |
|---|---|---|---|---|---|
| Backup & Restore | Hours | 24 hours | Low | Low | Restore from S3/Glacier backups |
| Pilot Light | ~10 min | ~10 min | Medium | Medium | Core services running in DR region (smallest footprint) |
| Warm Standby | ~1 min | ~1 min | High | Medium | DR region running at reduced capacity |
| Active-Active | Near-zero | Near-zero | Very High | High | Both regions serving traffic simultaneously |
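
The table can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO both fit your targets. The sketch below encodes that reading. The numeric thresholds are assumptions derived from the table ("hours" is taken as 4 hours and "near-zero" as 0), and `cheapest_strategy` is a hypothetical helper, not an AWS API.

```python
# (name, approx. RTO in minutes, approx. RPO in minutes), cheapest first.
# Assumed values: "hours" is modelled as 240 min, "near-zero" as 0.
STRATEGIES = [
    ("Backup & Restore", 240, 24 * 60),
    ("Pilot Light", 10, 10),
    ("Warm Standby", 1, 1),
    ("Active-Active", 0, 0),
]


def cheapest_strategy(rto_target_min: float, rpo_target_min: float) -> str:
    """Return the first (cheapest) strategy that satisfies both targets."""
    if rto_target_min < 0 or rpo_target_min < 0:
        raise ValueError("targets must be non-negative")
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target_min and rpo <= rpo_target_min:
            return name
    raise AssertionError("unreachable: Active-Active matches any target")


print(cheapest_strategy(rto_target_min=480, rpo_target_min=1440))
print(cheapest_strategy(rto_target_min=480, rpo_target_min=60))
print(cheapest_strategy(rto_target_min=5, rpo_target_min=5))
```

Note how tightening only the RPO target in the second call is enough to rule out Backup & Restore, even though its RTO would still be acceptable.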

Backup & Restore (RTO: hours, RPO: 24h)

The simplest and cheapest strategy. Backups are stored in S3 or Glacier and restored when needed.

# Snapshot all EBS volumes attached to an instance (run on a schedule for daily backups)
aws ec2 create-snapshots \
  --instance-specification InstanceId=i-1234567890abcdef0,ExcludeBootVolume=false \
  --description "Daily backup $(date +%Y-%m-%d)"

# Cross-region copy: the request goes to the destination Region, selected with --region
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-1234567890abcdef0 \
  --description "DR copy"

Pilot Light (RTO: ~10 min, RPO: ~10 min)

Core infrastructure (data, networking) runs in the DR region. Application servers are stopped or scaled to minimum.

Primary Region            DR Region
┌──────────────────┐      ┌──────────────────┐
│ App Servers (ON) │      │ App Servers (OFF)│
│ DB (Active)      │─repl─│ DB (Standby)     │
│ Route53 (Active) │      │ Route53 (Standby)│
└──────────────────┘      └──────────────────┘
         │                         │
         └──────── Traffic ────────┘  (All traffic to primary)
→ On failover: start app servers, switch DNS

Warm Standby (RTO: ~1 min, RPO: ~1 min)

The DR region runs a fully functional copy of the stack at reduced capacity (e.g., 50%). Scaling existing resources up during failover is faster than provisioning new ones.

resource "aws_autoscaling_group" "dr_asg" {
  name                = "dr-app-asg"
  min_size            = 2 # Reduced capacity
  max_size            = 20
  desired_capacity    = 2
  vpc_zone_identifier = var.dr_private_subnet_ids # Assumed variable: subnets in the DR region

  launch_template {
    id      = aws_launch_template.dr_app.id
    version = "$Latest"
  }
}
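
The failover step for this pattern is mostly "raise the group's capacity". A minimal sketch of that step follows. It assumes a boto3 Auto Scaling client is passed in (its `update_auto_scaling_group` call accepts the keyword arguments used here); `scale_up_standby`, `RecordingClient`, and the full capacity of 6 are hypothetical and exist only for illustration and dry runs.

```python
class RecordingClient:
    """Dry-run stand-in for boto3.client("autoscaling"); records calls only."""

    def __init__(self):
        self.calls = []

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(kwargs)


def scale_up_standby(client, asg_name: str, full_capacity: int,
                     max_size: int = 20) -> None:
    """Raise the standby group to full capacity during failover."""
    if not 0 < full_capacity <= max_size:
        raise ValueError("full_capacity must be between 1 and max_size")
    # Raising MinSize as well keeps scale-in policies from shrinking the
    # group back to its standby size while it is serving production traffic.
    client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=full_capacity,
        DesiredCapacity=full_capacity,
    )


# Dry run. In a real failover the client would be created with
# boto3.client("autoscaling", region_name=<DR region>) instead.
dry_run = RecordingClient()
scale_up_standby(dry_run, "dr-app-asg", full_capacity=6)
print(dry_run.calls)
```

Passing the client in keeps the function easy to rehearse in a game day or test without touching real infrastructure.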

Active-Active (RTO: near-zero, RPO: near-zero)

Both regions serve traffic behind global traffic routing (Route 53 latency-based routing or Global Accelerator). Data is replicated bidirectionally.

           ┌─────────────────────┐
           │ Route53 (Latency)   │
           │ or Global Accel.    │
           └──────────┬──────────┘
         ┌────────────┴────────────┐
         │                         │
┌────────┴────────┐       ┌────────┴────────┐
│ us-east-1       │       │ eu-west-1       │
│ ┌────────────┐  │       │ ┌────────────┐  │
│ │ ALB        │  │       │ │ ALB        │  │
│ ├────────────┤  │       │ ├────────────┤  │
│ │ App (ASG)  │  │       │ │ App (ASG)  │  │
│ ├────────────┤  │       │ ├────────────┤  │
│ │ DynamoDB   │──┼─repl──┼─│ DynamoDB   │  │
│ │ Global     │  │       │ │ Global     │  │
│ │ Tables     │  │       │ │ Tables     │  │
│ └────────────┘  │       │ └────────────┘  │
└─────────────────┘       └─────────────────┘
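# DynamoDB global tables for active-active
resource "aws_dynamodb_table" "global_app" {
  name         = "AppData"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  # Global table replicas require DynamoDB Streams with old and new images
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "id"
    type = "S"
  }

  # The block name is singular: one replica block per additional Region
  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }
}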

Danger

Active-active is complex: you must handle data conflicts, eventual consistency, and cross-region latency. Use a datastore designed for multi-region writes, such as DynamoDB Global Tables (which resolve conflicts with last-writer-wins) or CockroachDB. Not all databases support this model.
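
To show what last-writer-wins means in practice, here is a simplified sketch of the idea. It is not DynamoDB's implementation: the `Version` type, the millisecond timestamps, and the tie-break on region name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # time of the write, as recorded by the writing region
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the write with the later timestamp.

    Ties are broken on region name (an assumption) so that every replica
    picks the same winner and the copies converge.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions update the same item within a few milliseconds of each other.
east = Version(value="shipped", timestamp_ms=1_000, region="us-east-1")
west = Version(value="cancelled", timestamp_ms=1_005, region="eu-west-1")

winner = last_writer_wins(east, west)
print(winner.value)
```

The earlier write is discarded without any error being raised, which is why applications on this model should avoid concurrent updates to the same item from different regions where that loss would matter.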

Health Checks and Failover

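# Route53 failover routing
resource "aws_route53_record" "app" {
  zone_id         = var.hosted_zone_id
  name            = "app.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    zone_id                = aws_lb.web_alb.zone_id
    name                   = aws_lb.web_alb.dns_name
    evaluate_target_health = true
  }
}
# Failover only has somewhere to go once a matching record with
# type = "SECONDARY" points at the DR endpoint (not shown here).

resource "aws_route53_health_check" "primary" {
  # Check the primary endpoint directly; checking app.example.com itself
  # would follow the failover and start passing again against the DR side.
  fqdn              = aws_lb.web_alb.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}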
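
The health check settings put a floor under the achievable RTO. As a rough estimate, failure is declared after `failure_threshold` consecutive failed checks spaced `request_interval` seconds apart, and clients may keep using the old answer until their cached DNS record expires. The sketch below works through that arithmetic; the 60-second TTL and the function name are assumptions, and real detection times vary.

```python
def failover_detection_seconds(request_interval_s: int,
                               failure_threshold: int,
                               dns_ttl_s: int) -> int:
    """Rough upper bound on the time before clients reach the DR endpoint."""
    time_to_mark_unhealthy = request_interval_s * failure_threshold
    return time_to_mark_unhealthy + dns_ttl_s


# Values from the health check above, plus an assumed 60 s DNS TTL.
standard = failover_detection_seconds(30, 3, dns_ttl_s=60)
# Same threshold with a 10 s request interval.
fast = failover_detection_seconds(10, 3, dns_ttl_s=60)
print(standard, fast)
```

Under these assumptions the settings above give roughly 150 seconds before traffic shifts, so an RTO target near one minute would need a shorter request interval, a lower threshold, or both.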

Key Takeaways

  • AWS Regions contain multiple AZs (a minimum of three, per AWS) with independent power, cooling, and networking; deploy production workloads across 3 AZs
  • RPO (data loss tolerance) and RTO (downtime tolerance) drive DR strategy selection — define both before choosing a pattern
  • Four DR strategies: Backup & Restore (low cost, hours RTO), Pilot Light (medium cost, ~10 min RTO), Warm Standby (high cost, ~1 min RTO), Active-Active (very high cost, near-zero RTO)
  • Multi-AZ for high availability: ALB + Auto Scaling across 3 AZs, RDS Multi-AZ (synchronous standby), Route53 health checks
  • Multi-region requires Route53 routing (failover/latency/geolocation), cross-region replication (DynamoDB Global Tables, Aurora Global, S3 CRR), and health checks for automated failover
  • Active-Active is the most complex — requires bi-directional data replication, conflict handling, and global traffic management