IAM and PAM systems are critical infrastructure. When the identity provider (IdP) is unavailable, users cannot sign in to any SSO-integrated application. When the PAM platform is unavailable, administrators lose privileged access to the servers they manage. HA/DR design for IAM/PAM is therefore not optional; it is a business continuity requirement.
Availability Requirements
Defining Availability Targets
| System | Recommended SLA | Max Annual Downtime | Recovery Priority |
|---|---|---|---|
| Identity Provider (IdP) | 99.99% | 52.6 minutes | Critical: all SSO apps depend on the IdP |
| PAM platform | 99.99% | 52.6 minutes | Critical: admin access to all systems is blocked |
| IGA platform | 99.9% | 8.76 hours | High: certification and provisioning can tolerate brief outages |
| Directory (AD/LDAP) | 99.99% | 52.6 minutes | Critical: authentication and authorisation dependency |
| MFA system | 99.99% | 52.6 minutes | Critical: second factor required for authentication |
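The downtime figures above follow directly from the SLA percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not part of any product):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def max_annual_downtime_minutes(sla_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an availability SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - sla_percent) / 100


for sla in (99.9, 99.99):
    minutes = max_annual_downtime_minutes(sla)
    print(f"{sla}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

For 99.99% this gives roughly 52.6 minutes per year, and for 99.9% roughly 8.76 hours per year.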
Active-Passive (Warm Standby)
RTO: 1-5 minutes
RPO: < 1 minute (near-synchronous DB replication)
Cost: Medium (running standby + DB replication)
Complexity: Medium
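Whether a warm-standby design actually lands in the 1-5 minute RTO range depends on how quickly a failure is detected and how long promotion and traffic redirection take. The sketch below adds those pieces up. All timing values are hypothetical placeholders, not measurements or vendor defaults.

```python
from dataclasses import dataclass


@dataclass
class FailoverTimings:
    """Hypothetical timings, in seconds, for a warm-standby failover."""
    check_interval: float    # time between health checks
    failure_threshold: int   # consecutive failed checks before failover starts
    promotion: float         # time to promote the standby node and its database
    traffic_switch: float    # time for DNS or the load balancer to redirect clients


def estimated_rto_seconds(t: FailoverTimings) -> float:
    """Worst-case detection time plus promotion and traffic redirection."""
    detection = t.check_interval * t.failure_threshold
    return detection + t.promotion + t.traffic_switch


# Assumed example values; replace them with figures measured in failover tests.
timings = FailoverTimings(check_interval=10, failure_threshold=3,
                          promotion=60, traffic_switch=60)
rto = estimated_rto_seconds(timings)
print(f"Estimated RTO: {rto / 60:.1f} minutes")
```

With these assumed values the estimate is 150 seconds, which sits inside the stated range. Long DNS TTLs or a slow database promotion can push it outside.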
Active-Active (Multi-Region)
┌─────────────────┐          ┌─────────────────┐
│    Region 1     │          │    Region 2     │
│                 │          │                 │
│ ┌─────────────┐ │          │ ┌─────────────┐ │
│ │  IdP Node   │ │          │ │  IdP Node   │ │
│ │  (Active)   │ │          │ │  (Active)   │ │
│ └──────┬──────┘ │          │ └──────┬──────┘ │
│        │        │          │        │        │
│ ┌──────┴──────┐ │          │ ┌──────┴──────┐ │
│ │ DB (Multi-  │─┼── sync ──┼>│ DB (Multi-  │ │
│ │ master)     │ │          │ │ master)     │ │
│ └──────┬──────┘ │          │ └──────┬──────┘ │
│        │        │          │        │        │
│ ┌──────┴──────┐ │          │ ┌──────┴──────┐ │
│ │ Cache (Rgn) │ │          │ │ Cache (Rgn) │ │
│ │ (Local)     │ │          │ │ (Local)     │ │
│ └─────────────┘ │          │ └─────────────┘ │
└────────┬────────┘          └────────┬────────┘
         │                            │
         └─── Global Load Balancer ───┘
           (GeoDNS / Traffic Manager)
                       │
        Traffic routed to nearest region
RTO: Near-zero (automatic failover via global LB)
RPO: Near-zero (synchronous multi-master replication)
Cost: High (full infrastructure in each region)
Complexity: High (data consistency, conflict resolution)
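The routing decision made by the global load balancer can be sketched as "send the client to the nearest region that is still healthy". This is a simplified illustration, not how any specific GeoDNS or traffic manager product is implemented. The region names and latency figures are made up.

```python
from typing import Optional


def pick_region(latency_ms: dict[str, float], healthy: dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


# Hypothetical client-to-region latencies in milliseconds.
latencies = {"region-1": 20.0, "region-2": 85.0}

print(pick_region(latencies, {"region-1": True, "region-2": True}))   # nearest region
print(pick_region(latencies, {"region-1": False, "region-2": True}))  # failover target
```

Because both regions are active, failover is only a routing change. No standby has to be promoted, which is why the RTO is near-zero.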
HA Design for Specific Components
IdP HA
| Component | HA Strategy | Implementation |
|---|---|---|
| IdP service | Active-active, stateless | Multiple containers/VMs behind load balancer |
| Session store | Distributed cache (Redis cluster) | Redis with sentinel or cluster mode for automatic failover |
|  |  | Cert-manager for internal certs, automated IdP metadata update |
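The reason a stateless IdP tier needs a shared session store can be shown with a toy model: a session created on one node must be readable from any other node behind the load balancer. The classes below are hypothetical stand-ins. In a real deployment the store would be a distributed cache such as a Redis cluster, not a Python dictionary.

```python
from typing import Optional


class SharedSessionStore:
    """In-memory stand-in for a distributed cache (e.g. a Redis cluster)."""

    def __init__(self) -> None:
        self._sessions: dict[str, str] = {}

    def put(self, session_id: str, user: str) -> None:
        self._sessions[session_id] = user

    def get(self, session_id: str) -> Optional[str]:
        return self._sessions.get(session_id)


class IdPNode:
    """Stateless IdP node: all session state lives in the shared store."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, session_id: str, user: str) -> None:
        self.store.put(session_id, user)

    def whoami(self, session_id: str) -> Optional[str]:
        return self.store.get(session_id)


store = SharedSessionStore()
node_a, node_b = IdPNode("idp-a", store), IdPNode("idp-b", store)

node_a.login("sess-123", "alice")   # user signs in via node A
print(node_b.whoami("sess-123"))    # node B serves the next request
```

If node A fails after the login, the load balancer can send the next request to node B without forcing the user to authenticate again.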
PAM HA
| Component | HA Strategy | Implementation |
|---|---|---|
| PAM gateway/proxy | Active-active, stateless | Multiple PAM proxy nodes behind load balancer |
| Credential vault | Active-passive with synchronous replication | PAM vault cluster with DB replication |
| Session recording storage | Distributed file system or object storage | S3/NFS with replication, WORM-enabled for compliance |
| Activity database | Active-passive SQL cluster | SQL Always On, Oracle Data Guard |
| Connectors / agents | Dual-connected (primary + secondary) | Two connector nodes per target system |
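The dual-connected connector pattern amounts to trying the primary connector first and falling back to the secondary when it is unreachable. A minimal sketch, in which the connector functions and the task name are hypothetical and do not correspond to any vendor's API:

```python
from typing import Callable, Sequence


class AllConnectorsFailed(Exception):
    """Raised when neither the primary nor the secondary connector responds."""


def run_via_connectors(task: str, connectors: Sequence[Callable[[str], str]]) -> str:
    """Try each connector in order and return the first successful result."""
    errors = []
    for connector in connectors:
        try:
            return connector(task)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next connector
    raise AllConnectorsFailed(f"{len(errors)} connector(s) failed for task {task!r}")


def primary(task: str) -> str:
    # Simulates an outage of the primary connector node.
    raise ConnectionError("primary connector unreachable")


def secondary(task: str) -> str:
    return f"secondary handled {task}"


result = run_via_connectors("rotate-password", [primary, secondary])
print(result)
```

Only connection failures trigger the fallback here. Other errors propagate, so a task that fails for a genuine reason is not silently retried on the second node.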
Disaster Recovery Planning
RTO and RPO Targets
| Scenario | RTO Target | RPO Target | DR Strategy |
|---|---|---|---|
| Single system failure | < 5 minutes | Zero | Automatic failover within cluster |
| Data centre outage | < 60 minutes | < 15 minutes | Warm standby at DR site |
| Regional outage | < 4 hours | < 30 minutes | Cross-region DR (cold/warm) |
| Catastrophic failure | < 24 hours | < 1 hour | Manual restore from backups |
| Ransomware / data corruption | < 24 hours | < 4 hours | Immutable backups, point-in-time restore |
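Targets like these are easiest to enforce when DR test results are checked against them mechanically. The sketch below encodes the targets listed above and compares a measured recovery time and data-loss window against them. The scenario keys and the function name are illustrative choices.

```python
from datetime import timedelta

# Targets from the RTO/RPO list above: scenario -> (RTO, RPO).
TARGETS = {
    "single_system_failure": (timedelta(minutes=5), timedelta(0)),
    "data_centre_outage": (timedelta(minutes=60), timedelta(minutes=15)),
    "regional_outage": (timedelta(hours=4), timedelta(minutes=30)),
    "catastrophic_failure": (timedelta(hours=24), timedelta(hours=1)),
    "ransomware": (timedelta(hours=24), timedelta(hours=4)),
}


def meets_targets(scenario: str, recovery_time: timedelta, data_loss: timedelta) -> bool:
    """Return True if a measured recovery stays strictly under both targets."""
    rto, rpo = TARGETS[scenario]
    rto_ok = recovery_time < rto
    # Zero data loss always passes; otherwise the loss must be under the RPO.
    rpo_ok = data_loss == timedelta(0) or data_loss < rpo
    return rto_ok and rpo_ok


# Hypothetical results from a site failover test.
print(meets_targets("data_centre_outage", timedelta(minutes=42), timedelta(minutes=10)))
print(meets_targets("single_system_failure", timedelta(minutes=3), timedelta(seconds=30)))
```

The second call fails because a single system failure allows no data loss at all, even though the recovery time is within target.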
Backup Strategy
| Data Type | Backup Method | Frequency | Retention | Restore Target |
|---|---|---|---|---|
| IdP configuration | Configuration export | After every change + daily | 90 days | 1 hour |
| Directory (AD/LDAP) | System state backup | Daily | 60 days | 4 hours |
| PAM vault database | Encrypted database backup | Daily + transaction log (15 min) | 90 days | 4 hours |
| PAM session recordings | Immutable object storage | Continuous (write-once) | Compliance-defined + 1 year | 2 hours |
| Certificate / key store | HSM backup | Weekly | 1 year | 1 hour |
| IGA database | Database backup | Daily | 90 days | 4 hours |
| Access certification evidence | Document store backup | Daily | 7 years (compliance) | 4 hours |
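A backup schedule only protects the RPO if the backups actually run. One way to monitor this is to compare the age of the newest backup for each data type against its scheduled frequency. The sketch below assumes the timestamps come from a backup catalogue. The dictionary keys are illustrative names for a subset of the data types listed above.

```python
from datetime import datetime, timedelta

# Maximum allowed age of the newest backup, derived from the stated frequencies.
MAX_AGE = {
    "idp_configuration": timedelta(days=1),
    "directory": timedelta(days=1),
    "pam_vault_transaction_log": timedelta(minutes=15),
    "certificate_key_store": timedelta(weeks=1),
    "iga_database": timedelta(days=1),
}


def stale_backups(last_backup: dict[str, datetime], now: datetime) -> list[str]:
    """Return data types whose newest backup is missing or older than allowed."""
    stale = []
    for data_type, max_age in MAX_AGE.items():
        taken = last_backup.get(data_type)
        if taken is None or now - taken > max_age:
            stale.append(data_type)
    return sorted(stale)


# Hypothetical catalogue snapshot.
now = datetime(2024, 6, 15, 12, 0)
catalogue = {
    "idp_configuration": datetime(2024, 6, 15, 2, 0),
    "directory": datetime(2024, 6, 13, 2, 0),                    # missed a daily run
    "pam_vault_transaction_log": datetime(2024, 6, 15, 11, 50),
    "certificate_key_store": datetime(2024, 6, 10, 2, 0),
}
print(stale_backups(catalogue, now))
```

In this snapshot the directory backup is two days old and the IGA database has no backup recorded, so both are reported.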
DR Testing
| Test Type | Frequency | Scope | Success Criteria |
|---|---|---|---|
| Component failover test | Monthly | Single component failover (IdP, PAM, directory) | Automatic failover in < 5 minutes |
| Tabletop exercise | Quarterly | Walk through DR scenarios with team | All team members know their roles |
| Site failover test | Semi-annual | Full failover from primary to DR site | RTO/RPO met, all systems operational |
| End-to-end DR test | Annual | Full DR with business applications testing | All critical apps functional at DR site |
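The testing cadence above can be tracked with a small scheduler that flags overdue tests. This is a sketch under the assumption that monthly, quarterly, semi-annual, and annual map to intervals of 1, 3, 6, and 12 calendar months. The test names and dates are illustrative.

```python
import calendar
from datetime import date

# Interval in months for each DR test type.
INTERVAL_MONTHS = {
    "component_failover": 1,
    "tabletop": 3,
    "site_failover": 6,
    "end_to_end": 12,
}


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return test types that have never run or whose next due date has passed."""
    return sorted(
        name
        for name, months in INTERVAL_MONTHS.items()
        if name not in last_run or add_months(last_run[name], months) < today
    )


# Hypothetical test history.
history = {
    "component_failover": date(2024, 5, 1),
    "tabletop": date(2024, 1, 10),
    "site_failover": date(2024, 3, 1),
    "end_to_end": date(2023, 9, 1),
}
print(overdue_tests(history, today=date(2024, 6, 15)))
```

Here the monthly component failover and the quarterly tabletop exercise are flagged, while the site failover and end-to-end tests are still within their windows.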
Key Takeaways
IAM/PAM systems are critical infrastructure: an IdP outage makes all SSO apps unavailable, a PAM outage blocks all admin access, and a directory outage prevents all authentication. The recommended availability target is 99.99% for the IdP, PAM platform, directory, and MFA system, and 99.9% for the IGA platform.
Three HA patterns exist: active-passive cold standby (RTO 15-60 min, lowest cost), active-passive warm standby (RTO 1-5 min, medium cost), and active-active multi-region (near-zero RTO, highest cost). The appropriate pattern depends on business criticality and budget.
IdP HA requires a stateless application tier behind a load balancer, a distributed session cache (Redis cluster), replicated token signing keys (HSM cluster), and multi-master directory replication.
PAM HA requires an active-passive vault cluster with synchronous replication, active-active proxy nodes, distributed session recording storage (object storage), and dual-connected connectors for each target system.
DR planning requires defined RTO/RPO targets for each failure scenario: component failure (5 min / zero data loss), data centre outage (60 min / 15 min), regional outage (4 hours / 30 min), catastrophic failure (24 hours / 1 hour), and ransomware or data corruption (24 hours / 4 hours).
The backup strategy covers configuration exports, directory backups, encrypted database backups, immutable session recordings, HSM key backups, IGA database backups, and access certification evidence, each with a defined frequency, retention period, and restore target.
DR testing runs at four levels: monthly component failover, quarterly tabletop exercises, semi-annual site failover, and annual end-to-end DR with business applications. A DR plan that has not been tested cannot be relied on.