Day-2 operations are where IAM/PAM programmes succeed or fail. A well-designed architecture is worthless without competent operations: monitoring, incident response, backup/restore, upgrades, and continuous improvement. This page covers the operational practices that keep IAM/PAM systems running reliably.

| Level | Name | Characteristics | Monitoring | Incident Response | Change Management |
|---|---|---|---|---|---|
| 1 | Reactive | No monitoring, manual checks | None | Firefighting | Unplanned |
| 2 | Basic | Basic health checks, email alerts | CPU, memory, disk | Manual, inconsistent | Change windows |
| 3 | Proactive | Full monitoring, dashboards, SLAs | Latency, throughput, errors | Documented runbooks | CAB approval |
| 4 | Automated | Automated remediation, self-healing | Predictive analytics, trend analysis | Automated containment | CI/CD for IAM config |
| 5 | Predictive | AI-driven operations, capacity forecasting | ML-based anomaly detection | Self-healing | Continuous delivery |

| Layer | What to Monitor | Tooling | Alert Threshold |
|---|---|---|---|
| IdP | Authentication success/failure rate, latency, MFA delivery | IdP health dashboard, SIEM | Success rate < 99.5%, p99 latency > 2s |
| PAM | Session start/end, credential check-out rate, vault availability | PAM monitoring, SIEM | Session failure > 1%, vault offline |
| Directory | LDAP bind latency, replication latency, CPU/memory | Directory monitoring, Windows Admin Center | Bind time > 50ms, replication lag > 5 min |
| IGA | Provisioning success rate, certification completion rate | IGA platform monitoring | Provisioning success < 99%, cert completion < 95% |
| Infrastructure | CPU, memory, disk, network, TLS certificate expiry | Infrastructure monitoring (Prometheus, Datadog) | CPU > 80%, disk > 85%, cert < 30 days |
| Security | Anomalous authentication patterns, privilege misuse | SIEM, UEBA | Rule-based alerting, scoring threshold |
| Compliance | Certification overdue, SoD violations, orphaned accounts | IGA compliance dashboard | Cert overdue > 7 days, SoD violation detected |
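
As an illustration of how the IdP row above might be evaluated, the following is a minimal sketch, not a vendor implementation. It computes the success rate and p99 latency for one window of authentication events and compares them with the thresholds from the table (success rate < 99.5%, p99 latency > 2s). The `AuthEvent` record shape and the function names are hypothetical, and the nearest-rank percentile method is an assumption; a real monitoring stack may compute percentiles differently.

```python
from dataclasses import dataclass


@dataclass
class AuthEvent:
    """One authentication attempt (hypothetical record shape)."""
    success: bool
    latency_s: float


def p99_nearest_rank(latencies: list[float]) -> float:
    """Return the 99th percentile using the nearest-rank method."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    # ceil(0.99 * n) in integer arithmetic to avoid float rounding surprises
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def idp_alerts(events: list[AuthEvent]) -> list[str]:
    """Evaluate the IdP thresholds from the table over one window of events."""
    if not events:
        return ["no authentication events in window"]
    success_rate = 100.0 * sum(e.success for e in events) / len(events)
    p99 = p99_nearest_rank([e.latency_s for e in events])
    alerts = []
    if success_rate < 99.5:
        alerts.append(f"success rate {success_rate:.2f}% < 99.5%")
    if p99 > 2.0:
        alerts.append(f"p99 latency {p99:.2f}s > 2s")
    return alerts
```

An empty window is reported as an alert in its own right here, on the assumption that a silent IdP is more likely broken than idle.
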

| Metric | Good | Warning | Critical | Aggregation |
|---|---|---|---|---|
| Authentication latency (p50) | < 500ms | > 1s | > 2s | 5-minute window |
| Authentication latency (p99) | < 2s | > 3s | > 5s | 5-minute window |
| Authentication success rate | > 99.9% | < 99.5% | < 99% | 1-minute window |
| MFA push delivery time | < 2s | > 5s | > 10s | 5-minute average |
| Directory bind latency | < 10ms | > 50ms | > 200ms | 5-minute average |
| Provisioning success rate | > 99.5% | < 99% | < 95% | Per execution batch |
| Certificate expiry | > 90 days | < 30 days | < 7 days | Daily check |
| PAM session failure rate | < 0.1% | > 0.5% | > 1% | Hourly count |
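
The warning and critical boundaries above can be encoded as data so that every metric is classified by the same logic. The sketch below is one possible shape for that, with the boundaries copied from the table. The metric key names, the `Threshold` class, and the `classify` function are hypothetical. Values that fall between the Good and Warning columns (for example a p50 latency of 700ms) are treated as not alerting, which is an assumption the table does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical boundaries for one metric.

    higher_is_worse=True  -> breach when value > boundary (latency, failure rate)
    higher_is_worse=False -> breach when value < boundary (success rate, days left)
    """
    warning: float
    critical: float
    higher_is_worse: bool


# Boundaries copied from the table; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "auth_latency_p50_s": Threshold(1.0, 2.0, True),
    "auth_latency_p99_s": Threshold(3.0, 5.0, True),
    "auth_success_rate_pct": Threshold(99.5, 99.0, False),
    "mfa_push_delivery_s": Threshold(5.0, 10.0, True),
    "directory_bind_latency_ms": Threshold(50.0, 200.0, True),
    "provisioning_success_rate_pct": Threshold(99.0, 95.0, False),
    "certificate_days_to_expiry": Threshold(30.0, 7.0, False),
    "pam_session_failure_rate_pct": Threshold(0.5, 1.0, True),
}


def classify(metric: str, value: float) -> str:
    """Return "critical", "warning", or "ok" for an aggregated metric value."""
    t = THRESHOLDS[metric]

    def breached(boundary: float) -> bool:
        return value > boundary if t.higher_is_worse else value < boundary

    # Check the most severe boundary first so critical wins over warning.
    if breached(t.critical):
        return "critical"
    if breached(t.warning):
        return "warning"
    return "ok"
```

For example, `classify("auth_latency_p99_s", 4.0)` returns `"warning"` and `classify("auth_success_rate_pct", 98.5)` returns `"critical"`.
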

| Step | Action | Owner | Timeframe |
|---|---|---|---|
| 1. Detect | IdP health check fails, authentication errors spike | Monitoring system | Immediate |
| 2. Acknowledge | On-call engineer acknowledges alert | On-call IAM Engineer | 5 minutes |
| 3. Assess | Determine scope (single IdP node, all nodes, data centre, region) | On-call Engineer | 10 minutes |
| 4. Communicate | Post incident to #iam-alerts channel, notify service desk | On-call Engineer | 10 minutes |
| 5. Failover | If primary IdP cluster affected, trigger DNS failover to DR IdP | On-call Engineer | 15 minutes |
| 6. Validate | Verify authentication is working through DR IdP | On-call Engineer | 15 minutes |
| 7. Investigate | Root cause analysis: check IdP logs, infrastructure, directory connectivity | IAM Engineering | Ongoing |
| 8. Remediate | Apply fix (restart service, patch, configuration change) | IAM Engineering | Per fix |
| 9. Failback | If the failover was temporary, schedule failback to primary during a maintenance window | IAM Engineering | Per schedule |
| 10. Post-mortem | Document incident, root cause, lessons learned, preventive actions | IAM Lead | Within 5 business days |
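
Steps 3 and 5 of the IdP outage runbook above (assess scope, then decide on failover) could be supported by a small helper like the one below. This is a sketch under stated assumptions: it models node-level scope only (not data centre or region), takes a hypothetical mapping of node name to health-check result, and assumes that only a full primary-cluster outage should trigger DNS failover. An organisation may well choose a lower trigger point.

```python
from enum import Enum


class Scope(Enum):
    HEALTHY = "healthy"
    SINGLE_NODE = "single node"
    MULTIPLE_NODES = "multiple nodes"
    ALL_NODES = "all nodes"


def assess_scope(node_healthy: dict[str, bool]) -> Scope:
    """Classify an outage from per-node health results (True means healthy)."""
    if not node_healthy:
        raise ValueError("no nodes reported")
    failed = [name for name, ok in node_healthy.items() if not ok]
    if not failed:
        return Scope.HEALTHY
    # Checked before the single-node case so a one-node cluster counts as "all".
    if len(failed) == len(node_healthy):
        return Scope.ALL_NODES
    if len(failed) == 1:
        return Scope.SINGLE_NODE
    return Scope.MULTIPLE_NODES


def should_fail_over(scope: Scope) -> bool:
    """Assumption: only a full primary-cluster outage triggers DNS failover."""
    return scope is Scope.ALL_NODES
```

With three nodes and one failed, `assess_scope` returns `Scope.SINGLE_NODE` and `should_fail_over` returns `False`, which matches the low-complexity path of removing the node from the load balancer instead of failing over.
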

| Step | Action | Owner | Timeframe |
|---|---|---|---|
| 1. Detect | PAM vault health check fails, credential check-out errors | Monitoring system | Immediate |
| 2. Acknowledge | On-call engineer acknowledges alert | On-call PAM Engineer | 5 minutes |
| 3. Impact assessment | Determine which systems are affected (all credentials vs. some) | On-call Engineer | 10 minutes |
| 4. Activate break-glass | If administrators cannot access systems, activate emergency access procedure | PAM Admin | 15 minutes |
| 5. PAM failover | If the primary vault cluster is affected, activate the standby vault | PAM Engineer | 30 minutes |
| 6. Credential verification | Verify that emergency credentials are functional | PAM Admin | 15 minutes |
| 7. Root cause | Check vault service, database, network, storage | PAM Engineer | Ongoing |
| 8. Restore | Restore vault service from backup if necessary | PAM Engineer | Per scenario |
| 9. Credential rotation | Rotate any credentials exposed during outage | PAM Admin | 24 hours |
| 10. Post-mortem | Document incident, root cause, lessons learned | PAM Lead | Within 5 business days |
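
Step 9 of the PAM runbook above (rotate credentials exposed during the outage) needs a list of affected credentials and a deadline. The sketch below shows one way to derive both. The `CredentialUse` record and `rotation_plan` function are hypothetical, "exposed" is assumed to mean "used through the break-glass procedure during the outage window", and the 24-hour timeframe is assumed to start when the outage ends. None of these definitions come from a specific PAM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CredentialUse:
    """One recorded use of a privileged credential (hypothetical shape)."""
    credential_id: str
    used_at: datetime
    via_break_glass: bool


def rotation_plan(
    uses: list[CredentialUse],
    outage_start: datetime,
    outage_end: datetime,
) -> tuple[list[str], datetime]:
    """Return (credential ids to rotate, rotation deadline)."""
    exposed = {
        u.credential_id
        for u in uses
        if u.via_break_glass and outage_start <= u.used_at <= outage_end
    }
    # Assumption: the 24-hour rotation window is counted from the end of the outage.
    deadline = outage_end + timedelta(hours=24)
    return sorted(exposed), deadline
```

Using a set keeps each credential in the plan once even if it was used several times through break-glass access.
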

| Component | Backup Type | Frequency | Retention | Restore Time Target |
|---|---|---|---|---|
| IdP configuration | Configuration export (JSON/YAML) | After each change + daily | 90 days | 1 hour |
| IdP database | Encrypted SQL dump | Daily + transaction log (15 min) | 90 days | 4 hours |
| PAM vault database | Encrypted database backup | Daily + transaction log (15 min) | 90 days | 4 hours |
| PAM session recordings | Immutable object storage (S3/WORM) | Continuous (write-once) | Compliance-defined + 1 year | 2 hours |
| Directory (AD) | System state backup | Daily | 60 days | 4 hours |
| IGA database | Database backup | Daily + transaction log (hourly during campaigns) | 90 days | 4 hours |
| HSM / key store | Encrypted key backup | Weekly | 1 year | 1 hour |
| TLS certificates | Certificate backup (PKCS12) | After each renewal | Lifetime + 1 year | 1 hour |
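
A backup schedule is only useful if missed runs are noticed. The following sketch flags components whose newest backup is older than the Frequency column above allows. The component keys, the `stale_backups` function, and the two-hour grace period are assumptions for illustration. Session recordings and TLS certificates are left out because their schedules are continuous or event-driven, not periodic.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the newest full backup, derived from the
# Frequency column. Keys are hypothetical component identifiers.
MAX_BACKUP_AGE = {
    "idp_configuration": timedelta(days=1),
    "idp_database": timedelta(days=1),
    "pam_vault_database": timedelta(days=1),
    "directory_ad": timedelta(days=1),
    "iga_database": timedelta(days=1),
    "hsm_key_store": timedelta(weeks=1),
}


def stale_backups(
    last_backup: dict[str, datetime],
    now: datetime,
    grace: timedelta = timedelta(hours=2),  # assumed tolerance for job run time
) -> list[str]:
    """Return components whose newest backup is missing or too old."""
    stale = []
    for component, max_age in MAX_BACKUP_AGE.items():
        taken = last_backup.get(component)
        if taken is None or now - taken > max_age + grace:
            stale.append(component)
    return sorted(stale)
```

A component with no recorded backup is reported as stale, so a job that never ran is caught the same way as one that stopped running.
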

| Scenario | Procedure | RTO | Complexity |
|---|---|---|---|
| Single IdP node failure | Remove node from LB, deploy new node from image, join to cluster | 30 min | Low |
| Full IdP cluster failure | Restore from database backup, re-initialise cluster nodes | 4 hours | Medium |
| PAM vault corruption | Restore vault database from backup, verify credential integrity | 4-8 hours | High |
| Directory corruption | Restore from system state backup, verify replication | 4 hours | High |
| Certificate compromise | Revoke certificate, issue new cert, update all services | 2 hours | Medium |
| HSM / key loss | Restore HSM from key backup, verify signing operations | 4 hours | High |

| Upgrade Type | Frequency | Duration | Migration Approach | Risk Level |
|---|---|---|---|---|
| Patch (security fix) | As needed | Hours | In-place or rolling | Low |
| Minor version | Quarterly | Days | Blue-green or rolling | Low-Medium |
| Major version | Annually | Weeks | Side-by-side with migration | Medium-High |
| Platform migration | Every 3-5 years | Months | Phased coexistence | High |

| Phase | Activity | Duration | Validation |
|---|---|---|---|
| 1. Pre-upgrade assessment | Review release notes, check compatibility, identify deprecated features | 1 week | Compatibility matrix complete |
| 2. Lab deployment | Deploy new version in test environment | 1 week | All features functional in lab |
| 3. Integration testing | Test all integrations (SIEM, ITSM, directory) | 2 weeks | Integration tests pass |
| 4. User acceptance testing | Admin team validates new version | 2 weeks | All workflows validated |
| 5. Staging deployment | Deploy to staging, run parallel with production | 1 week | Staging mirrors production |
| 6. Production rollout | Deploy to production (rolling or blue-green) | 1-2 weeks | Zero upgrade-related incidents |
| 7. Post-upgrade monitoring | Monitor for issues for 2 weeks post-upgrade | 2 weeks | Performance at or above baseline |
| 8. Rollback plan | If a critical issue occurs, roll back to the previous version | Per incident | Rollback tested in lab |
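
Adding up the Duration column gives a rough end-to-end estimate for one upgrade. The short calculation below assumes the phases run strictly one after another with no overlap, which is a simplification, and it excludes phase 8 because rollback is "per incident" and has no fixed duration. The `PHASES` list and `total_weeks` function are illustrative names.

```python
# (phase, min_weeks, max_weeks) taken from the Duration column.
# Phase 8 (rollback) is "per incident" and is therefore excluded.
PHASES = [
    ("Pre-upgrade assessment", 1, 1),
    ("Lab deployment", 1, 1),
    ("Integration testing", 2, 2),
    ("User acceptance testing", 2, 2),
    ("Staging deployment", 1, 1),
    ("Production rollout", 1, 2),
    ("Post-upgrade monitoring", 2, 2),
]


def total_weeks(phases: list[tuple[str, int, int]]) -> tuple[int, int]:
    """Return (minimum, maximum) total duration assuming sequential phases."""
    shortest = sum(low for _, low, _ in phases)
    longest = sum(high for _, _, high in phases)
    return shortest, longest


print(total_weeks(PHASES))  # (10, 11)
```

Under these assumptions a full upgrade cycle takes 10 to 11 weeks, of which the last two are post-upgrade monitoring.
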

| Role | Responsibilities | Typical Size (10K+ Org) |
|---|---|---|
| IAM Operations Lead | Team management, SLA oversight, vendor management | 1 |
| IdP Engineer | IdP configuration, SSO integrations, MFA management | 1-2 |
| PAM Engineer | Credential vaulting, session management, connector management | 2-3 |
| Directory Engineer | AD/LDAP operations, replication, performance tuning | 1-2 |
| IGA Analyst | Certification campaigns, provisioning operations, SoD analysis | 1-2 |
| Security Engineer (IAM) | SIEM integration, threat detection for IAM, incident response | 1 |
| Automation Engineer | Runbook automation, CI/CD for IAM config, tooling | 1 |

- IAM/PAM operational maturity progresses from reactive (Level 1) through basic, proactive, and automated to predictive (Level 5). Most organisations operate at Level 2-3: health checks and change windows are in place, runbooks are documented by Level 3, but automation remains limited
- Monitoring covers seven dimensions: IdP, PAM, directory, IGA, infrastructure, security, and compliance — each dimension has specific metrics, tooling, and alert thresholds with good/warning/critical boundaries
- Incident response runbooks must be documented and tested for the two most critical scenarios: IdP outage (failover to DR, validate auth, investigate root cause, remediate, failback, post-mortem) and PAM vault unavailable (activate break-glass, failover vault, verify credentials, restore, rotate exposed creds, post-mortem)
- Backup strategy requires defined backup types, frequency, retention, and restore time targets for each component — IdP configuration (daily + after change, 90-day retention, 1-hour restore), PAM vault database (daily + 15-min t-logs, 90-day retention, 4-hour restore), directory system state (daily, 60-day retention, 4-hour restore)
- Platform upgrades follow a phased approach: assessment, lab deployment, integration testing, UAT, staging, production rollout, post-upgrade monitoring, with a defined rollback plan for each phase — major version upgrades require side-by-side deployment with phased migration
- The IAM operations team for a 10K+ user organisation typically includes 8-12 people: IAM Operations Lead, IdP Engineer(s), PAM Engineer(s), Directory Engineer(s), IGA Analyst(s), Security Engineer, and Automation Engineer
- Day-2 operations are where IAM/PAM programmes succeed or fail: the most mature architecture is ineffective without competent monitoring, documented runbooks, tested backup/restore procedures, and ongoing operational investment