HA & DR Runbooks for Production Infrastructure

High availability runbooks and disaster recovery runbooks are not paperwork. They are the difference between a controlled incident and a long production outage where every engineer is guessing under pressure. A good SRE runbook gives the team a clear path from alert to diagnosis, mitigation, failover, restore validation, and post-incident follow-up.

For SteadyOps clients, I treat HA and DR documentation as production infrastructure. If the runbook is not tested, it is only an assumption. If restore was never rehearsed, backups are not yet a recovery strategy. If failover requires one engineer’s memory, the system still has a human single point of failure.

What every HA and DR runbook must contain

A production runbook should be short enough to use during stress and detailed enough to prevent dangerous improvisation. It should answer five questions immediately: what is the impact, what should be checked first, who can make the failover decision, how do we recover safely, and how do we prove the service is healthy after recovery.

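The minimum structure I use follows those five questions: impact, first checks, failover decision owner, recovery steps, and post-recovery validation.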

The most common failure is writing runbooks as long wiki pages. During an outage nobody wants to read a novel. The runbook must guide the next safe action: check replication lag, verify quorum, confirm backups, stop writes, promote standby, run smoke tests, and communicate status.
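One way to keep runbooks short but complete is to lint them for the sections that answer the five questions above. The following is a minimal sketch, not a definitive tool: the section titles and the function name `missing_runbook_sections` are hypothetical, and the check is a plain case-insensitive substring match.

```shell
# Hypothetical section titles, one per question the runbook must answer.
REQUIRED_SECTIONS='Impact
First checks
Failover decision owner
Recovery steps
Post-recovery validation'

# missing_runbook_sections FILE
# Prints every required section title that does not appear in FILE.
# Returns 0 when nothing is missing, 1 otherwise.
missing_runbook_sections() {
  _runbook=$1
  _status=0
  while IFS= read -r _section; do
    # -q quiet, -i case-insensitive, -F fixed string (no regex)
    if ! grep -qiF -- "$_section" "$_runbook"; then
      printf '%s\n' "$_section"
      _status=1
    fi
  done <<EOF
$REQUIRED_SECTIONS
EOF
  return "$_status"
}
```

Run in CI against each runbook file, a check like this fails the review when a section such as the failover decision owner is missing, which is cheaper to discover in a pull request than during an outage.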

Define RPO and RTO before choosing technology

Many teams start with tools: Patroni, etcd, keepalived, HAProxy, cloud snapshots, object storage, or Kubernetes operators. Tools matter, but the runbook starts with business tolerances.

RPO means how much data the business can afford to lose. RTO means how long the business can tolerate downtime. These two numbers drive the architecture. A system with a 15-minute RPO can use a different backup and replication strategy than a financial system that needs near-zero data loss.

A practical DR runbook should explicitly document:

| Area | Question | Example target |
|---|---|---|
| RPO | Maximum acceptable data loss | 0-5 minutes |
| RTO | Maximum acceptable recovery time | 15-30 minutes |
| Restore source | Where recovery data comes from | WAL archive + base backup |
| Decision owner | Who approves failover | Incident commander / SRE lead |
| Validation | How recovery is proven | API health, DB consistency, synthetic transaction |

Without these numbers, every incident becomes a debate. With them, the team can make decisions based on agreed operational risk.
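Once the targets are written down, a drill result can be compared against them mechanically. The sketch below assumes the upper bounds of the example targets (5 minutes RPO, 30 minutes RTO); the variable and function names are hypothetical, and the measured values are expected in whole seconds.

```shell
# Upper bounds of the example targets, in seconds.
RPO_SECONDS=300    # 5 minutes of acceptable data loss
RTO_SECONDS=1800   # 30 minutes of acceptable downtime

# within_target MEASURED_SECONDS TARGET_SECONDS
# Succeeds when the measured value does not exceed the target.
within_target() {
  [ "$1" -le "$2" ]
}

# check_drill DATA_LOSS_SECONDS RECOVERY_SECONDS
# Prints one PASS/FAIL line per objective.
# Returns 1 when either objective was missed.
check_drill() {
  _rc=0
  if within_target "$1" "$RPO_SECONDS"; then
    echo "RPO PASS: ${1}s lost, target ${RPO_SECONDS}s"
  else
    echo "RPO FAIL: ${1}s lost, target ${RPO_SECONDS}s"
    _rc=1
  fi
  if within_target "$2" "$RTO_SECONDS"; then
    echo "RTO PASS: ${2}s to recover, target ${RTO_SECONDS}s"
  else
    echo "RTO FAIL: ${2}s to recover, target ${RTO_SECONDS}s"
    _rc=1
  fi
  return "$_rc"
}
```

For example, `check_drill 120 900` reports both objectives as met, while `check_drill 600 900` flags the RPO as missed and returns a non-zero status.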

PostgreSQL disaster recovery runbook

PostgreSQL disaster recovery needs more than pg_dump or snapshots. You need a tested path for primary failure, replica lag, corrupted data, accidental deletes, and full environment rebuild. The runbook should separate high availability failover from disaster recovery restore.

For HA, the first checks are usually:

patronictl list
systemctl status patroni
curl -s http://127.0.0.1:8008/health
psql -c "select pg_is_in_recovery();"
psql -c "select now() - pg_last_xact_replay_timestamp() as replica_lag;"
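The replica lag check above can be turned into a pass/fail gate before promotion. This is a sketch under stated assumptions: `MAX_LAG_SECONDS` and both function names are hypothetical, the limit is assumed to equal the RPO target, and `psql` is assumed to pick up connection settings from the environment.

```shell
# Maximum tolerated replay lag in seconds (assumed equal to the RPO target).
MAX_LAG_SECONDS=300

# Prints replay lag in whole seconds.
# Prints 0 on a primary, where pg_last_xact_replay_timestamp() is NULL.
replica_lag_seconds() {
  # -A unaligned output, -t tuples only: prints just the number
  psql -At -c "select coalesce(floor(extract(epoch from now() - pg_last_xact_replay_timestamp())), 0)::bigint;"
}

# Succeeds when lag is within the limit.
# Returns 2 when the lag could not be read at all.
lag_is_acceptable() {
  _lag=$(replica_lag_seconds) || return 2
  [ "$_lag" -le "$MAX_LAG_SECONDS" ]
}
```

One caveat: this value is the time since the last replayed transaction, so it also grows on a healthy replica when the primary is idle. Treat a high number as a prompt to check `patronictl list`, not as proof of data loss.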

For DR, the runbook must include backup inventory and restore commands. The exact commands depend on the backup tool, but the checks are universal:

A strong PostgreSQL disaster recovery runbook includes a restore test date. If the last restore test is unknown, the backup strategy is not proven.
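The restore test date can be checked automatically instead of by memory. A minimal sketch, assuming GNU `date` (the `-d` option) and a hypothetical policy of one restore test every 90 days; the function names are illustrative only.

```shell
# Maximum age of the last successful restore test, in days (hypothetical policy).
MAX_RESTORE_TEST_AGE_DAYS=90

# days_since DATE [TODAY]
# Prints whole days between two YYYY-MM-DD dates. Requires GNU date.
days_since() {
  _then=$(date -u -d "$1" +%s) || return 2
  _now=$(date -u -d "${2:-now}" +%s) || return 2
  echo $(( (_now - _then) / 86400 ))
}

# restore_test_is_current LAST_TEST_DATE [TODAY]
# Succeeds when the last restore test is recent enough.
restore_test_is_current() {
  _age=$(days_since "$1" "${2:-now}") || return 2
  [ "$_age" -le "$MAX_RESTORE_TEST_AGE_DAYS" ]
}
```

Storing the last test date next to the runbook and running `restore_test_is_current` on a schedule turns "unknown" into an alert rather than a surprise.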

Kubernetes and service failover runbooks

In Kubernetes, many incidents look like application problems but are really capacity, rollout, networking, DNS, or dependency failures. A useful Kubernetes HA runbook starts by scoping the blast radius.

kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get events -A --sort-by=.lastTimestamp | tail -50
kubectl rollout status deployment/app -n production
kubectl describe hpa -n production app

For production services, the runbook should document rollback criteria before the release. If p99 latency grows, 5xx increases, queue depth rises, or database connections spike, the team should not debate whether rollback is allowed. The runbook should say when to roll back and how to validate it.
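Rollback criteria are easiest to follow when they are written as thresholds rather than prose. The sketch below covers two of the signals named above; the limits, the function names, and the way metrics are passed in as arguments are all assumptions, and `deployment/app` in `production` is reused from the earlier commands.

```shell
# Hypothetical rollback thresholds, agreed before the release.
P99_LIMIT_MS=800
ERROR_RATE_LIMIT=0.02   # 2% of requests returning 5xx

# should_rollback P99_MS ERROR_RATE
# Succeeds (exit 0) when either metric exceeds its limit.
should_rollback() {
  # awk handles the floating-point comparison that plain sh cannot
  awk -v p99="$1" -v err="$2" \
      -v p99_max="$P99_LIMIT_MS" -v err_max="$ERROR_RATE_LIMIT" \
    'BEGIN { exit !((p99 + 0 > p99_max + 0) || (err + 0 > err_max + 0)) }'
}

# rollback_if_needed P99_MS ERROR_RATE
# Rolls back and waits for the previous revision to become ready.
rollback_if_needed() {
  if should_rollback "$1" "$2"; then
    kubectl rollout undo deployment/app -n production &&
      kubectl rollout status deployment/app -n production --timeout=120s
  else
    echo "metrics within limits, no rollback"
  fi
}
```

Keeping the decision in `should_rollback` separate from the action makes the criteria reviewable on their own, and the `rollout status` call doubles as the first validation step after the rollback.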

A good service failover runbook also includes dependency order. For example, after restoring PostgreSQL you may need to restart API workers, rebuild read models, drain queues, or invalidate caches. Restarting everything randomly can create a second incident.
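Dependency order can be encoded so that a failed step stops the sequence instead of letting later steps run against a broken dependency. This is a sketch only: `run_in_order` and the step functions are hypothetical, and the placeholder bodies need to be replaced with the real commands for each system.

```shell
# run_in_order STEP...
# Runs each named step (a function or command) in the order given
# and stops at the first failure.
run_in_order() {
  for _step in "$@"; do
    echo "running: $_step"
    if ! "$_step"; then
      echo "failed: $_step, stopping" >&2
      return 1
    fi
  done
  echo "all steps completed"
}

# Hypothetical step functions for the example order in the text.
restart_api_workers() { kubectl rollout restart deployment/app -n production; }
rebuild_read_models() { echo "rebuild read models (placeholder)"; }
invalidate_caches()   { echo "invalidate caches (placeholder)"; }
```

After a PostgreSQL restore has been validated, the sequence would be started with `run_in_order restart_api_workers rebuild_read_models invalidate_caches`.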

Incident communication and evidence

HA and DR are not only technical. During a production incident, business stakeholders need status, customer support needs language, and engineering needs one source of truth. The runbook should define where updates happen and who sends them.

Use a simple timeline:

This timeline becomes audit evidence and postmortem input. For SOC 2-ready infrastructure, the ability to show incident handling, access decisions, backup tests, and recovery verification is as important as the recovery itself.
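A timeline is only useful as evidence if every entry has a consistent timestamp and author. A minimal sketch of an append-only timeline file follows; the file location, the tab-separated format, and the function name `log_event` are assumptions, not a required layout.

```shell
# Hypothetical location of the incident timeline file.
TIMELINE_FILE="${TIMELINE_FILE:-./incident-timeline.log}"

# log_event ACTOR MESSAGE
# Appends one tab-separated line: UTC timestamp, actor, message.
log_event() {
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$TIMELINE_FILE"
}
```

Typical entries would be `log_event "sre-lead" "Failover approved"` followed by `log_event "oncall" "Standby promoted, smoke tests passed"`. Using UTC avoids arguments about time zones when the file is later copied into the postmortem.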

Decision matrix

| Approach | Best for | Stability impact | Complexity |
|---|---|---|---|
| Manual recovery notes | Very small systems | Better than memory, still risky | Low |
| Versioned runbooks in Git | Growing production teams | Consistent recovery steps and review history | Medium |
| Automated failover with manual approval | Databases and critical services | Reduces downtime while keeping control | Medium |
| Regular DR drills | Mission-critical systems | Proves RPO/RTO before real outage | High |
| Full SRE operating model | Regulated or high-traffic platforms | Strong incident response and audit evidence | High |

Operational takeaway

Write the runbook before the outage, test it before the business depends on it, and keep it close to the infrastructure code. A reliable recovery plan is a production control, not documentation decoration.

Need production-grade HA and DR?

SteadyOps can review your current failover, backup, and incident response process and turn it into a practical HA/DR runbook with restore tests and clear recovery steps.
