High Availability & Disaster Recovery

IAM and PAM systems are critical infrastructure. When the IdP is unavailable, users cannot sign in to any application that depends on it for SSO. When the PAM platform is unavailable, administrators cannot reach the servers they manage through it. HA/DR design for IAM/PAM is therefore not optional; it is a business continuity requirement.

Availability Requirements

Defining Availability Targets

| System | Recommended SLA | Max Annual Downtime | Recovery Priority |
| --- | --- | --- | --- |
| Identity Provider (IdP) | 99.99% | 52 minutes | Critical: all SSO apps depend on IdP |
| PAM platform | 99.99% | 52 minutes | Critical: admin access to all systems blocked |
| IGA platform | 99.9% | 8.76 hours | High: certification and provisioning can tolerate brief outages |
| Directory (AD/LDAP) | 99.99% | 52 minutes | Critical: authentication and authorisation dependency |
| MFA system | 99.99% | 52 minutes | Critical: second factor required for authentication |
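
The downtime figures in the table follow directly from the SLA percentage. As a minimal sketch (the function name is illustrative, and a 365-day year is assumed), the arithmetic looks like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def max_annual_downtime_minutes(sla_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        minutes = max_annual_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

For 99.99% this gives roughly 52.6 minutes per year, and for 99.9% roughly 8.76 hours, which matches the table after rounding.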

HA Architecture Patterns

Active-Passive (Cold Standby)

┌─────────────────┐          ┌─────────────────┐
│  Primary Site   │          │     DR Site     │
│                 │          │                 │
│ ┌─────────────┐ │          │ ┌─────────────┐ │
│ │ IdP Primary │ │          │ │ IdP Standby │ │
│ │  (Active)   │ │          │ │ (Inactive)  │ │
│ └──────┬──────┘ │          │ └──────┬──────┘ │
│        │        │          │        │        │
│ ┌──────┴──────┐ │          │ ┌──────┴──────┐ │
│ │ DB Primary  │─┼──repl────┼>│ DB Standby  │ │
│ │ (Read/Write)│ │          │ │ (Read-only) │ │
│ └─────────────┘ │          │ └─────────────┘ │
│                 │          │                 │
│ Health check ───┤          │ Health check ───┤
└─────────────────┘          └─────────────────┘
         │                            │
         └─────────── DNS ────────────┘
          Traffic routed to active site

  • RTO: 15-60 minutes
  • RPO: 5-15 minutes (database replication)
  • Cost: Low (standby infrastructure cost)
  • Complexity: Low
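
The diagram above implies a simple control loop: health checks watch the primary site, and DNS is repointed at the DR site once the primary is judged to be down. The sketch below illustrates that decision logic only. The class name, the `probe_primary` callback, and the threshold of three consecutive failures are assumptions for illustration, not part of any particular product; a real deployment would also have to start the standby and update the DNS record.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverController:
    """Tracks health-check results and decides which site DNS should point to."""

    probe_primary: Callable[[], bool]  # hypothetical probe: True when the primary is healthy
    failure_threshold: int = 3         # assumed: consecutive failures before failing over
    active_site: str = "primary"
    consecutive_failures: int = 0

    def tick(self) -> str:
        """Run one health check and return the site that should receive traffic."""
        if self.active_site == "dr":
            # Failback is deliberately left as a manual decision in this sketch.
            return self.active_site
        if self.probe_primary():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # A real controller would activate the standby and update DNS here.
                self.active_site = "dr"
        return self.active_site
```

Requiring several consecutive failures before switching is one way to avoid failing over on a single dropped probe; the cost is that detection time adds to the RTO.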

Active-Passive (Warm Standby)

┌─────────────────┐              ┌─────────────────┐
│  Primary Site   │              │     DR Site     │
│                 │              │                 │
│ ┌─────────────┐ │              │ ┌─────────────┐ │
│ │ IdP Primary │ │              │ │ IdP Standby │ │
│ │  (Active)   │ │              │ │  (Running)  │ │
│ └──────┬──────┘ │              │ └──────┬──────┘ │
│        │        │              │        │        │
│ ┌──────┴──────┐ │              │ ┌──────┴──────┐ │
│ │ DB Primary  │─┼──async-repl──┼>│ DB Standby  │ │
│ │ (Read/Write)│ │              │ │ (Read-only) │ │
│ └─────────────┘ │              │ └─────────────┘ │
│                 │              │                 │
│ Session Cache ──┤              │ Session Cache ──┤
│ (Redis Primary) │              │ (Redis Replica) │
└─────────────────┘              └─────────────────┘
         │                                │
         └───────────── DNS ──────────────┘
            Traffic routed to active site

  • RTO: 1-5 minutes
  • RPO: < 1 minute (low-lag asynchronous DB replication)
  • Cost: Medium (running standby + DB replication)
  • Complexity: Medium

Active-Active (Multi-Region)

┌─────────────────┐          ┌─────────────────┐
│    Region 1     │          │    Region 2     │
│                 │          │                 │
│ ┌─────────────┐ │          │ ┌─────────────┐ │
│ │  IdP Node   │ │          │ │  IdP Node   │ │
│ │  (Active)   │ │          │ │  (Active)   │ │
│ └──────┬──────┘ │          │ └──────┬──────┘ │
│        │        │          │        │        │
│ ┌──────┴──────┐ │          │ ┌──────┴──────┐ │
│ │ DB (Multi-  │─┼──sync────┼>│ DB (Multi-  │ │
│ │  master)    │ │          │ │  master)    │ │
│ └──────┬──────┘ │          │ └──────┬──────┘ │
│        │        │          │        │        │
│ ┌──────┴──────┐ │          │ ┌──────┴──────┐ │
│ │ Cache (Rgn) │ │          │ │ Cache (Rgn) │ │
│ │   (Local)   │ │          │ │   (Local)   │ │
│ └─────────────┘ │          │ └─────────────┘ │
└─────────────────┘          └─────────────────┘
         │                            │
         └─── Global Load Balancer ───┘
           (GeoDNS / Traffic Manager)
         Traffic routed to nearest region

  • RTO: Near-zero (automatic failover via global LB)
  • RPO: Near-zero (synchronous multi-master replication)
  • Cost: High (full infrastructure in each region)
  • Complexity: High (data consistency, conflict resolution)
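
Choosing between the three patterns comes down to finding the cheapest one whose worst-case RTO and RPO still satisfy the business requirement. The following sketch encodes the figures quoted for each pattern and makes that comparison explicit. It assumes that "near-zero" can be modelled as 0 minutes and that cost can be reduced to a simple rank; both are simplifications for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HaPattern:
    name: str
    worst_case_rto_min: float  # upper bound of the RTO range quoted for the pattern
    worst_case_rpo_min: float  # upper bound of the RPO range quoted for the pattern
    cost_rank: int             # 1 = low, 2 = medium, 3 = high


# "Near-zero" is modelled as 0 minutes purely for comparison purposes.
PATTERNS = [
    HaPattern("active-passive (cold standby)", 60, 15, 1),
    HaPattern("active-passive (warm standby)", 5, 1, 2),
    HaPattern("active-active (multi-region)", 0, 0, 3),
]


def cheapest_pattern(required_rto_min: float, required_rpo_min: float) -> Optional[HaPattern]:
    """Return the lowest-cost pattern that meets both requirements, or None."""
    candidates = [
        p for p in PATTERNS
        if p.worst_case_rto_min <= required_rto_min
        and p.worst_case_rpo_min <= required_rpo_min
    ]
    return min(candidates, key=lambda p: p.cost_rank, default=None)
```

For example, a requirement of RTO 10 minutes and RPO 5 minutes rules out cold standby and selects warm standby, while an RTO of 1 minute leaves only active-active.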

HA Design for Specific Components

IdP HA

| Component | HA Strategy | Implementation |
| --- | --- | --- |
| IdP service | Active-active, stateless | Multiple containers/VMs behind a load balancer |
| Session store | Distributed cache (Redis cluster) | Redis Sentinel or Redis Cluster for automatic failover |
| Token signing keys | HSM cluster or key replication | AWS CloudHSM, Azure Dedicated HSM, on-prem HSM cluster |
| User directory | Multi-master replication | AD with multiple domain controllers; Entra ID is a managed service with built-in redundancy |
| Certificate management | Automated rotation with overlap | cert-manager for internal certs, automated IdP metadata update |
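
Because the IdP depends on every component in the table, its overall availability is bounded by the product of their individual availabilities, while redundancy within a component raises that component's own figure. The sketch below shows both calculations. It assumes failures are independent, which real systems only approximate, and the availability numbers in the usage note are hypothetical.

```python
from math import prod


def redundant(node_availability: float, nodes: int) -> float:
    """Availability of N independent nodes where any single node can serve traffic."""
    return 1 - (1 - node_availability) ** nodes


def serial(*availabilities: float) -> float:
    """Availability of a dependency chain in which every component must be up."""
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical figures: two IdP nodes at 99% each behind a load balancer.
    idp_tier = redundant(0.99, 2)
    # Hypothetical figures: session store and directory at 99.99% each.
    print(f"IdP tier: {idp_tier:.6f}")
    print(f"End to end: {serial(idp_tier, 0.9999, 0.9999):.6f}")
```

Under these assumptions, two 99% nodes reach 99.99% as a tier, but chaining three 99.99% components yields about 99.97% end to end. This is why each dependency generally needs a higher availability than the target for the service as a whole.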

PAM HA

| Component | HA Strategy | Implementation |
| --- | --- | --- |
| PAM gateway/proxy | Active-active, stateless | Multiple PAM proxy nodes behind a load balancer |
| Credential vault | Active-passive with synchronous replication | PAM vault cluster with DB replication |
| Session recording storage | Distributed file system or object storage | S3/NFS with replication, WORM-enabled for compliance |
| Activity database | Active-passive SQL cluster | SQL Server Always On availability groups, Oracle Data Guard |
| Connectors / agents | Dual-connected (primary + secondary) | Two connector nodes per target system |
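
The dual-connected connector row can be read as "try the primary connector, and fall back to the secondary if it cannot reach the target". A minimal sketch of that behaviour follows. The function and exception names are hypothetical, and each connector is modelled as a plain callable rather than any vendor's connector API.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllConnectorsFailed(Exception):
    """Raised when no connector could reach the target system."""


def run_via_connectors(connectors: Sequence[Callable[[], T]]) -> T:
    """Try each connector in order (primary first) and return the first success."""
    errors = []
    for connect in connectors:
        try:
            return connect()
        except ConnectionError as exc:
            # Only connectivity failures trigger a fallback; other errors propagate.
            errors.append(exc)
    raise AllConnectorsFailed(f"{len(errors)} connector(s) failed: {errors}")
```

Limiting the fallback to `ConnectionError` is a design choice in this sketch: an authentication or policy error would most likely fail on the secondary as well, so retrying it would only hide the real problem.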

Disaster Recovery Planning

RTO and RPO Targets

| Scenario | RTO Target | RPO Target | DR Strategy |
| --- | --- | --- | --- |
| Single system failure | < 5 minutes | Zero | Automatic failover within cluster |
| Data centre outage | < 60 minutes | < 15 minutes | Warm standby at DR site |
| Regional outage | < 4 hours | < 30 minutes | Cross-region DR (cold/warm) |
| Catastrophic failure | < 24 hours | < 1 hour | Manual restore from backups |
| Ransomware / data corruption | < 24 hours | < 4 hours | Immutable backups, point-in-time restore |
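
Targets like these are most useful when a DR test or real incident can be checked against them mechanically. The sketch below encodes the table and compares a measured recovery time and data loss against the relevant row. The dictionary and function names are illustrative, and treating "Zero" RPO as "no data loss at all" is an interpretation of the table.

```python
from datetime import timedelta

# (RTO target, RPO target) per scenario, copied from the table above.
DR_TARGETS = {
    "single system failure": (timedelta(minutes=5), timedelta(0)),
    "data centre outage": (timedelta(minutes=60), timedelta(minutes=15)),
    "regional outage": (timedelta(hours=4), timedelta(minutes=30)),
    "catastrophic failure": (timedelta(hours=24), timedelta(hours=1)),
    "ransomware / data corruption": (timedelta(hours=24), timedelta(hours=4)),
}


def meets_targets(scenario: str, recovery_time: timedelta, data_loss: timedelta) -> bool:
    """Return True when the measured recovery satisfies both targets for the scenario."""
    rto, rpo = DR_TARGETS[scenario]
    rto_ok = recovery_time < rto
    # Targets are "less than X", except a zero RPO, which allows no data loss.
    rpo_ok = data_loss == timedelta(0) if rpo == timedelta(0) else data_loss < rpo
    return rto_ok and rpo_ok
```

A data centre failover that took 42 minutes and lost 3 minutes of data would pass, whereas a cluster failover that lost even 10 seconds of data would fail the zero-RPO target.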

Backup Strategy

| Data Type | Backup Method | Frequency | Retention | Restore Time Target |
| --- | --- | --- | --- | --- |
| IdP configuration | Configuration export | After every change + daily | 90 days | 1 hour |
| Directory (AD/LDAP) | System state backup | Daily | 60 days | 4 hours |
| PAM vault database | Encrypted database backup | Daily + transaction log (15 min) | 90 days | 4 hours |
| PAM session recordings | Immutable object storage | Continuous (write-once) | Compliance-defined + 1 year | 2 hours |
| Certificate / key store | HSM backup | Weekly | 1 year | 1 hour |
| IGA database | Database backup | Daily | 90 days | 4 hours |
| Access certification evidence | Document store backup | Daily | 7 years (compliance) | 4 hours |
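
A backup schedule only protects the system if something notices when a backup is late or when old copies fall outside retention. The sketch below shows one way to express two rows of the table as policies and check them. The policy keys and function names are hypothetical, and the remaining rows would follow the same shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class BackupPolicy:
    max_age: timedelta    # longest acceptable gap since the last backup
    retention: timedelta  # how long a backup must be kept


# Two rows from the table above, used as examples.
POLICIES = {
    "directory": BackupPolicy(timedelta(days=1), timedelta(days=60)),
    "pam_vault_log": BackupPolicy(timedelta(minutes=15), timedelta(days=90)),
}


def is_stale(data_type: str, last_backup: datetime, now: datetime) -> bool:
    """Return True when the most recent backup is older than the policy allows."""
    return now - last_backup > POLICIES[data_type].max_age


def expired(data_type: str, backups: List[datetime], now: datetime) -> List[datetime]:
    """Return the backup timestamps that are past the retention period."""
    cutoff = now - POLICIES[data_type].retention
    return [taken_at for taken_at in backups if taken_at < cutoff]
```

In practice the `max_age` check would feed an alert, since a silently failing backup job widens the achievable RPO without anyone noticing.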

DR Testing

| Test Type | Frequency | Scope | Success Criteria |
| --- | --- | --- | --- |
| Component failover test | Monthly | Single component failover (IdP, PAM, directory) | Automatic failover in < 5 minutes |
| Tabletop exercise | Quarterly | Walk through DR scenarios with team | All team members know their roles |
| Site failover test | Semi-annual | Full failover from primary to DR site | RTO/RPO met, all systems operational |
| End-to-end DR test | Annual | Full DR with business applications testing | All critical apps functional at DR site |
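
A testing cadence like this is easy to let slip, so it can help to track when each test type last ran and flag the ones that are overdue. The sketch below assumes approximate day counts for each frequency (30, 91, 182, and 365 days); the names and intervals are illustrative rather than prescribed.

```python
from datetime import date, timedelta
from typing import Dict, List

# Approximate intervals for the frequencies in the table above.
TEST_INTERVALS = {
    "component failover test": timedelta(days=30),
    "tabletop exercise": timedelta(days=91),
    "site failover test": timedelta(days=182),
    "end-to-end DR test": timedelta(days=365),
}


def overdue_tests(last_run: Dict[str, date], today: date) -> List[str]:
    """Return test types that were never run or last ran longer ago than their interval."""
    overdue = []
    for test_type, interval in TEST_INTERVALS.items():
        last = last_run.get(test_type)
        if last is None or today - last > interval:
            overdue.append(test_type)
    return overdue
```

A test type with no recorded run is treated as overdue, on the reasoning that an untested scenario should be surfaced rather than assumed to be fine.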

Key Takeaways

  • IAM/PAM systems are critical infrastructure: an IdP outage makes all SSO apps unavailable, a PAM outage blocks all admin access, and a directory outage prevents all authentication. The recommended target is 99.99% availability for the IdP, PAM platform, directory, and MFA system, and 99.9% for the IGA platform
  • Three HA patterns exist: active-passive cold standby (RTO 15-60 min, lowest cost), active-passive warm standby (RTO 1-5 min, medium cost), and active-active multi-region (near-zero RTO, highest cost) — the appropriate pattern depends on business criticality and budget
  • IdP HA requires stateless application tier with load balancing, distributed session cache (Redis cluster), replicated token signing keys (HSM cluster), and multi-master directory replication
  • PAM HA requires active-passive vault cluster with synchronous replication, active-active proxy nodes, distributed session recording storage (object storage), and dual-connected connectors per target system
  • DR planning requires defined RTO/RPO targets for each failure scenario: component failure (5 min / 0 data loss), data centre outage (60 min / 15 min), regional outage (4 hours / 30 min), catastrophic failure (24 hours / 1 hour), and ransomware or data corruption (24 hours / 4 hours)
  • Backup strategy covers configuration exports, directory backups, encrypted database backups, immutable session recordings, HSM key backups, IGA database backups, and access certification evidence, each with a defined frequency, retention, and restore target
  • DR testing runs at four levels: monthly component failover, quarterly tabletop, semi-annual site failover, and annual end-to-end DR with business applications. Untested DR plans are not reliable