Operations & Runbooks

Day-2 operations are where IAM/PAM programs succeed or fail. A well-designed architecture is worthless without competent operations — monitoring, incident response, backup/restore, upgrades, and continuous improvement. This page covers the operational practices that keep IAM/PAM systems running reliably.

Operational Maturity Model

Level | Name | Characteristics | Monitoring | Incident Response | Change Management
1 | Reactive | No monitoring, manual checks | None | Firefighting | Unplanned
2 | Basic | Basic health checks, email alerts | CPU, memory, disk | Manual, inconsistent | Change windows
3 | Proactive | Full monitoring, dashboards, SLAs | Latency, throughput, errors | Documented runbooks | CAB approval
4 | Automated | Automated remediation, self-healing | Predictive analytics, trend analysis | Automated containment | CI/CD for IAM config
5 | Predictive | AI-driven operations, capacity forecasting | ML-based anomaly detection | Self-healing | Continuous delivery

Monitoring and Alerting

IAM/PAM Monitoring Dimensions

Layer | What to Monitor | Tooling | Alert Threshold
IdP | Authentication success/failure rate, latency, MFA delivery | IdP health dashboard, SIEM | Success rate < 99.5%, p99 latency > 2s
PAM | Session start/end, credential check-out rate, vault availability | PAM monitoring, SIEM | Session failure > 1%, vault offline
Directory | LDAP bind latency, replication latency, CPU/memory | Directory monitoring, Windows Admin Center | Bind time > 50ms, replication lag > 5 min
IGA | Provisioning success rate, certification completion rate | IGA platform monitoring | Provisioning success < 99%, certification completion < 95%
Infrastructure | CPU, memory, disk, network, TLS certificate expiry | Infrastructure monitoring (Prometheus, Datadog) | CPU > 80%, disk > 85%, TLS certificate expiry < 30 days
Security | Anomalous authentication patterns, privilege misuse | SIEM, UEBA | Rule-based alerting, scoring threshold
Compliance | Certification overdue, SoD violations, orphaned accounts | IGA compliance dashboard | Certification overdue > 7 days, SoD violation detected
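
As a rough illustration of how the IdP success-rate threshold above might be evaluated, the following Python sketch keeps authentication outcomes in a sliding time window and reports whether the rate has dropped below the alert level. The class name, its methods, and the assumption that events arrive in chronological order are all hypothetical; a real deployment would more likely compute this inside the SIEM or monitoring platform.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class SuccessRateWindow:
    """Tracks authentication outcomes over a sliding time window (hypothetical helper)."""

    def __init__(self, window: timedelta, alert_below: float) -> None:
        self.window = window
        self.alert_below = alert_below  # e.g. 0.995 for the IdP row above
        self._events = deque()  # (timestamp, success) pairs, oldest first

    def record(self, when: datetime, success: bool) -> None:
        # Assumes events are recorded in chronological order.
        self._events.append((when, success))

    def success_rate(self, now: datetime) -> Optional[float]:
        # Drop events that have aged out of the window.
        cutoff = now - self.window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return None  # no traffic: report "unknown" rather than 0% or 100%
        successes = sum(1 for _, ok in self._events if ok)
        return successes / len(self._events)

    def should_alert(self, now: datetime) -> bool:
        rate = self.success_rate(now)
        return rate is not None and rate < self.alert_below
```

Returning None for an empty window is a deliberate choice in this sketch: a total absence of authentication traffic is usually a separate signal and should not be silently counted as healthy.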

Key Monitoring Metrics

Metric | Good | Warning | Critical | Aggregation
Authentication latency (p50) | < 500ms | > 1s | > 2s | 5-minute average
Authentication latency (p99) | < 2s | > 3s | > 5s | 5-minute average
Authentication success rate | > 99.9% | < 99.5% | < 99% | 1-minute window
MFA push delivery time | < 2s | > 5s | > 10s | 5-minute average
Directory bind latency | < 10ms | > 50ms | > 200ms | 5-minute average
Provisioning success rate | > 99.5% | < 99% | < 95% | Per execution batch
Certificate expiry | > 90 days | < 30 days | < 7 days | Daily check
PAM session failure rate | < 0.1% | > 0.5% | > 1% | Hourly count
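
One way to turn the warning and critical columns above into alert severities is a small classifier like the Python sketch below. The Threshold type, the metric keys, and the classify function are hypothetical names. The table leaves a gap between "good" and "warning" (for example, p99 latency between 2s and 3s); this sketch assumes such values do not alert and labels them "ok".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool  # True for latency, False for success rates or days remaining


def classify(value: float, t: Threshold) -> str:
    """Return 'critical', 'warning', or 'ok' for a metric value."""
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"  # assumption: anything short of "warning" does not alert


# Values taken from the table above (latency in seconds, rates in percent).
THRESHOLDS = {
    "auth_latency_p99_s": Threshold(warning=3.0, critical=5.0, higher_is_worse=True),
    "auth_success_rate_pct": Threshold(warning=99.5, critical=99.0, higher_is_worse=False),
    "cert_days_remaining": Threshold(warning=30, critical=7, higher_is_worse=False),
}
```

For instance, classify(99.2, THRESHOLDS["auth_success_rate_pct"]) falls below the warning boundary but not the critical one, so it yields "warning".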

Incident Response Runbooks

Runbook: IdP Outage

Step | Action | Owner | Timeframe
1. Detect | IdP health check fails, authentication errors spike | Monitoring system | Immediate
2. Acknowledge | On-call engineer acknowledges alert | On-call IAM Engineer | 5 minutes
3. Assess | Determine scope (single IdP node, all nodes, data centre, region) | On-call Engineer | 10 minutes
4. Communicate | Post incident to #iam-alerts channel, notify service desk | On-call Engineer | 10 minutes
5. Failover | If primary IdP cluster affected, trigger DNS failover to DR IdP | On-call Engineer | 15 minutes
6. Validate | Verify authentication is working through DR IdP | On-call Engineer | 15 minutes
7. Investigate | Root cause analysis: check IdP logs, infrastructure, directory connectivity | IAM Engineering | Ongoing
8. Remediate | Apply fix (restart service, patch, configuration change) | IAM Engineering | Per fix
9. Failback | If temporary failover, schedule failback to primary during maintenance window | IAM Engineering | Per schedule
10. Post-mortem | Document incident, root cause, lessons learned, preventive actions | IAM Lead | Within 5 business days
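
The timed steps of this runbook can also be held as data so that tooling can flag steps that are running late. The Python sketch below is one possible shape, not a prescribed implementation: the Step type and overdue_steps function are hypothetical, and it assumes each timeframe is measured from the moment of detection, which the table does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Step:
    name: str
    deadline_min: int  # minutes after detection (assumed interpretation)
    completed_at: Optional[datetime] = None


def overdue_steps(steps: List[Step], detected_at: datetime, now: datetime) -> List[str]:
    """Return names of steps that are still incomplete past their deadline."""
    late = []
    for s in steps:
        due = detected_at + timedelta(minutes=s.deadline_min)
        if s.completed_at is None and now > due:
            late.append(s.name)
    return late


# Steps 2-6 of the IdP outage runbook, which carry fixed timeframes.
IDP_OUTAGE = [
    Step("Acknowledge", 5),
    Step("Assess", 10),
    Step("Communicate", 10),
    Step("Failover", 15),
    Step("Validate", 15),
]
```

A chat-ops bot or paging tool could call overdue_steps periodically during an incident and escalate when the returned list is non-empty.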

Runbook: PAM Vault Unavailable

Step | Action | Owner | Timeframe
1. Detect | PAM vault health check fails, credential check-out errors | Monitoring system | Immediate
2. Acknowledge | On-call engineer acknowledges alert | On-call PAM Engineer | 5 minutes
3. Impact assessment | Determine which systems are affected (all credentials vs. some) | On-call Engineer | 10 minutes
4. Activate break-glass | If administrators cannot access systems, activate emergency access procedure | PAM Admin | 15 minutes
5. PAM failover | If primary vault cluster is down, activate standby vault | PAM Engineer | 30 minutes
6. Credential verification | Verify that emergency credentials are functional | PAM Admin | 15 minutes
7. Root cause | Check vault service, database, network, storage | PAM Engineer | Ongoing
8. Restore | Restore vault service from backup if necessary | PAM Engineer | Per scenario
9. Credential rotation | Rotate any credentials exposed during outage | PAM Admin | 24 hours
10. Post-mortem | Document incident, root cause, lessons learned | PAM Lead | Within 5 business days
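
Step 9 depends on knowing which credentials were actually exposed. As a minimal sketch, assuming break-glass usage is logged as events with a credential identifier and timestamp (the AccessEvent type and its fields are hypothetical), the Python below selects credentials used through break-glass during the outage and attaches a rotation deadline. It further assumes the 24-hour timeframe starts when the outage ends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class AccessEvent:
    credential_id: str
    used_at: datetime
    via_break_glass: bool


def credentials_to_rotate(
    events: Iterable[AccessEvent], outage_start: datetime, outage_end: datetime
) -> Dict[str, datetime]:
    """Map each credential used via break-glass during the outage to its rotation deadline."""
    deadline = outage_end + timedelta(hours=24)  # assumption: clock starts at outage end
    exposed = {
        e.credential_id
        for e in events
        if e.via_break_glass and outage_start <= e.used_at <= outage_end
    }
    return {cred: deadline for cred in sorted(exposed)}
```

Credentials retrieved through normal vault check-out are left out here on the assumption that the vault's own rotation policy covers them.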

Backup and Restore

Backup Schedule

Component | Backup Type | Frequency | Retention | Restore Time Target
IdP configuration | Configuration export (JSON/YAML) | After each change + daily | 90 days | 1 hour
IdP database | Encrypted SQL dump | Daily + transaction log (15 min) | 90 days | 4 hours
PAM vault database | Encrypted database backup | Daily + transaction log (15 min) | 90 days | 4 hours
PAM session recordings | Immutable object storage (S3/WORM) | Continuous (write-once) | Compliance-defined + 1 year | 2 hours
Directory (AD) | System state backup | Daily | 60 days | 4 hours
IGA database | Database backup | Daily + transaction log (hourly during campaigns) | 90 days | 4 hours
HSM / key store | Encrypted key backup | Weekly | 1 year | 1 hour
TLS certificates | Certificate backup (PKCS12) | After each renewal | Lifetime + 1 year | 1 hour
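
A schedule like this is only useful if missed backups are noticed. The following Python sketch shows one way to check backup freshness against the Frequency column for a few of the components. The function name, the two-hour grace period, and the idea that a job reports the timestamp of each component's latest backup are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Maximum age of the most recent full backup, from the Frequency column above.
MAX_BACKUP_AGE = {
    "IdP database": timedelta(days=1),
    "PAM vault database": timedelta(days=1),
    "Directory (AD)": timedelta(days=1),
    "HSM / key store": timedelta(weeks=1),
}


def stale_backups(
    last_backup: Dict[str, datetime],
    now: datetime,
    grace: timedelta = timedelta(hours=2),  # hypothetical allowance for job run time
) -> List[str]:
    """Return components whose latest backup is missing or older than allowed."""
    stale = []
    for component, max_age in MAX_BACKUP_AGE.items():
        taken = last_backup.get(component)
        if taken is None or now - taken > max_age + grace:
            stale.append(component)
    return stale
```

A component with no recorded backup at all is treated as stale, which errs on the side of raising an alert.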

Restore Scenarios

Scenario | Procedure | RTO | Complexity
Single IdP node failure | Remove node from LB, deploy new node from image, join to cluster | 30 min | Low
Full IdP cluster failure | Restore from database backup, re-initialise cluster nodes | 4 hours | Medium
PAM vault corruption | Restore vault database from backup, verify credential integrity | 4-8 hours | High
Directory corruption | Restore from system state backup, verify replication | 4 hours | High
Certificate compromise | Revoke certificate, issue new cert, update all services | 2 hours | Medium
HSM / key loss | Restore HSM from key backup, verify signing operations | 4 hours | High

Platform Upgrades

Upgrade Strategy

Upgrade Type | Frequency | Duration | Migration Approach | Risk Level
Patch (security fix) | As needed | Hours | In-place or rolling | Low
Minor version | Quarterly | Days | Blue-green or rolling | Low-Medium
Major version | Annually | Weeks | Side-by-side with migration | Medium-High
Platform migration | Every 3-5 years | Months | Phased coexistence | High

Upgrade Runbook (PAM Major Version)

Phase | Activity | Duration | Validation
1. Pre-upgrade assessment | Review release notes, check compatibility, identify deprecated features | 1 week | Compatibility matrix complete
2. Lab deployment | Deploy new version in test environment | 1 week | All features functional in lab
3. Integration testing | Test all integrations (SIEM, ITSM, directory) | 2 weeks | Integration tests pass
4. User acceptance testing | Admin team validates new version | 2 weeks | All workflows validated
5. Staging deployment | Deploy to staging, run parallel with production | 1 week | Staging mirrors production
6. Production rollout | Deploy to production (rolling or blue-green) | 1-2 weeks | Zero upgrade-related incidents
7. Post-upgrade monitoring | Monitor for issues for 2 weeks post-upgrade | 2 weeks | Performance at or above baseline
8. Rollback plan | If critical issue, roll back to previous version | Per incident | Rollback tested in lab
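
Adding up the Duration column gives a sense of the end-to-end timeline for a major PAM upgrade. The short Python sketch below does that arithmetic, assuming the phases run strictly one after another with no overlap; the PHASES list and total_weeks function are illustrative names only.

```python
# (phase, min_weeks, max_weeks) from the Duration column above.
# Phase 8 (rollback) is conditional and has no fixed duration, so it is omitted.
PHASES = [
    ("Pre-upgrade assessment", 1, 1),
    ("Lab deployment", 1, 1),
    ("Integration testing", 2, 2),
    ("User acceptance testing", 2, 2),
    ("Staging deployment", 1, 1),
    ("Production rollout", 1, 2),
    ("Post-upgrade monitoring", 2, 2),
]


def total_weeks(phases):
    """Sum phase durations, assuming phases run strictly in sequence."""
    return sum(p[1] for p in phases), sum(p[2] for p in phases)


low, high = total_weeks(PHASES)
print(f"End-to-end: {low}-{high} weeks")  # prints "End-to-end: 10-11 weeks"
```

Under that sequential assumption the upgrade spans roughly 10 to 11 weeks, consistent with the "Weeks" duration listed for major versions in the upgrade strategy table.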

IAM Operations Team Structure

Role | Responsibilities | Typical Size (10K+ Org)
IAM Operations Lead | Team management, SLA oversight, vendor management | 1
IdP Engineer | IdP configuration, SSO integrations, MFA management | 1-2
PAM Engineer | Credential vaulting, session management, connector management | 2-3
Directory Engineer | AD/LDAP operations, replication, performance tuning | 1-2
IGA Analyst | Certification campaigns, provisioning operations, SoD analysis | 1-2
Security Engineer (IAM) | SIEM integration, threat detection for IAM, incident response | 1
Automation Engineer | Runbook automation, CI/CD for IAM config, tooling | 1

Key Takeaways

  • IAM/PAM operational maturity progresses from reactive (Level 1) through basic, proactive, automated, to predictive (Level 5) — most organisations operate at Level 2-3 with documented monitoring and runbooks but limited automation
  • Monitoring covers seven dimensions: IdP, PAM, directory, IGA, infrastructure, security, and compliance. Each dimension has its own metrics, tooling, and alert thresholds, and the key metrics carry explicit good/warning/critical boundaries
  • Incident response runbooks must be documented and tested for the two most critical scenarios: IdP outage (failover to DR, validate auth, investigate root cause, remediate, failback, post-mortem) and PAM vault unavailable (activate break-glass, failover vault, verify credentials, restore, rotate exposed creds, post-mortem)
  • Backup strategy requires defined backup types, frequency, retention, and restore time targets for each component — IdP configuration (daily + after change, 90-day retention, 1-hour restore), PAM vault database (daily + 15-min t-logs, 90-day retention, 4-hour restore), directory system state (daily, 60-day retention, 4-hour restore)
  • Platform upgrades follow a phased approach: assessment, lab deployment, integration testing, UAT, staging, production rollout, and post-upgrade monitoring, backed by a rollback plan that has been tested in the lab. Major version upgrades use side-by-side deployment with migration
  • The IAM operations team for a 10K+ user organisation typically includes 8-12 people: IAM Operations Lead, IdP Engineer(s), PAM Engineer(s), Directory Engineer(s), IGA Analyst(s), Security Engineer, and Automation Engineer
  • Day-2 operations are where IAM/PAM programs succeed or fail — the most mature architecture is ineffective without competent monitoring, documented runbooks, tested backup/restore procedures, and ongoing operational investment