High availability runbooks and disaster recovery runbooks are not paperwork. They are the difference between a controlled incident and a long production outage where every engineer is guessing under pressure. A good SRE runbook gives the team a clear path from alert to diagnosis, mitigation, failover, restore validation, and post-incident follow-up.
For SteadyOps clients, I treat HA and DR documentation as production infrastructure. If the runbook is not tested, it is only an assumption. If restore was never rehearsed, backups are not yet a recovery strategy. If failover requires one engineer’s memory, the system still has a human single point of failure.
What every HA and DR runbook must contain
A production runbook should be short enough to use during stress and detailed enough to prevent dangerous improvisation. It should answer five questions immediately: what is the impact, what should be checked first, who can make the failover decision, how do we recover safely, and how do we prove the service is healthy after recovery.
The minimum structure I use is:
- Impact and severity definition.
- First checks and dashboards.
- Safe mitigation steps.
- Failover or rollback criteria.
- Recovery commands and validation.
- Escalation owner and communication channel.
- Post-incident cleanup and evidence.
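Because the structure is fixed, it can be checked mechanically when runbooks live in Git. The sketch below is one possible lint, not an established tool: `REQUIRED_SECTIONS` and `missing_sections` are names invented for this example, and the phrases are shortened forms of the list above that you would adapt to your own template.

```sh
# Hypothetical section phrases, derived from the minimum structure above.
REQUIRED_SECTIONS='Impact and severity
First checks
Safe mitigation
Failover or rollback criteria
Recovery commands
Escalation owner
Post-incident cleanup'

# missing_sections FILE
# Prints every required phrase that does not appear in FILE (case-insensitive).
# Returns 0 when all sections are present, 1 otherwise.
missing_sections() {
    missing=0
    while IFS= read -r section; do
        if ! grep -qiF -- "$section" "$1"; then
            printf 'missing: %s\n' "$section"
            missing=1
        fi
    done <<EOF
$REQUIRED_SECTIONS
EOF
    return "$missing"
}

# Example (the path is an assumption):
#   missing_sections runbooks/postgres-failover.md || echo "runbook incomplete"
```

A check like this only proves the headings exist. It says nothing about whether the steps under them still work, which is what drills are for.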
The most common failure is writing runbooks as long wiki pages. During an outage, nobody wants to read a novel. The runbook must guide the next safe action: check replication lag, verify quorum, confirm backups, stop writes, promote the standby, run smoke tests, and communicate status.
Define RPO and RTO before choosing technology
Many teams start with tools: Patroni, etcd, keepalived, HAProxy, cloud snapshots, object storage, or Kubernetes operators. Tools matter, but the runbook starts with business tolerances.
RPO (recovery point objective) means how much data the business can afford to lose. RTO (recovery time objective) means how long the business can tolerate downtime. These two numbers drive the architecture. A system with a 15-minute RPO can use a different backup and replication strategy than a financial system that needs near-zero data loss.
A practical DR runbook should explicitly document:
| Area | Question | Example target |
|---|---|---|
| RPO | Maximum acceptable data loss | 0-5 minutes |
| RTO | Maximum acceptable recovery time | 15-30 minutes |
| Restore source | Where recovery data comes from | WAL archive + base backup |
| Decision owner | Who approves failover | Incident commander / SRE lead |
| Validation | How recovery is proven | API health, DB consistency, synthetic transaction |
Without these numbers, every incident becomes a debate. With them, the team can make decisions based on agreed operational risk.
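Once the targets are written down, a drill or an incident can be scored against them. The following is a minimal sketch: `check_target`, `RPO_SECONDS`, and `RTO_SECONDS` are names made up for this example, and the values are simply the upper bounds from the example table converted to seconds.

```sh
# Illustrative targets in seconds (5-minute RPO, 30-minute RTO).
RPO_SECONDS=300
RTO_SECONDS=1800

# check_target NAME MEASURED_SECONDS TARGET_SECONDS
# Prints a one-line verdict and returns non-zero when the target is exceeded.
check_target() {
    if [ "$2" -le "$3" ]; then
        printf '%s ok: %ss <= %ss\n' "$1" "$2" "$3"
    else
        printf '%s BREACHED: %ss > %ss\n' "$1" "$2" "$3"
        return 1
    fi
}

# Example: a drill that lost 120 seconds of data and took 40 minutes to recover.
#   check_target RPO 120  "$RPO_SECONDS"
#   check_target RTO 2400 "$RTO_SECONDS"
```

Recording the verdict after every drill turns the RPO and RTO from stated intentions into measured results.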
PostgreSQL disaster recovery runbook
PostgreSQL disaster recovery needs more than pg_dump or snapshots. You need a tested path for primary failure, replica lag, corrupted data, accidental deletes, and full environment rebuild. The runbook should separate high availability failover from disaster recovery restore.
For HA, the first checks are usually:
patronictl list
systemctl status patroni
curl -s http://127.0.0.1:8008/health
psql -c "select pg_is_in_recovery();"
psql -c "select now() - pg_last_xact_replay_timestamp() as replica_lag;"
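In a stressful moment it helps if the health check prints a verdict rather than raw output. The sketch below interprets the HTTP status of the Patroni `/health` call shown above. `classify_health` and its verdict strings are invented for this example. It assumes the endpoint answers 200 when PostgreSQL is up, and it treats every other code as unhealthy without trying to explain it.

```sh
# classify_health HTTP_CODE
# Maps the HTTP status of Patroni's /health endpoint to a short verdict.
classify_health() {
    case "$1" in
        200) echo "postgres-up" ;;
        000) echo "patroni-unreachable" ;;  # curl prints 000 when it got no HTTP response
        *)   echo "unhealthy" ;;
    esac
}

# Example (needs a local Patroni node, so it is left commented out):
#   code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' http://127.0.0.1:8008/health)
#   classify_health "$code"
```

One caveat on the lag query above: `pg_last_xact_replay_timestamp()` returns NULL on a primary, and on a replica the computed lag keeps growing while the primary is idle, so read it together with `patronictl list` rather than on its own.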
For DR, the runbook must include backup inventory and restore commands. The exact commands depend on the backup tool, but the checks are universal:
- Is the latest base backup complete?
- Are WAL files available for the target recovery point?
- Was the restore tested in a clean environment recently?
- Can the application connect after restore?
- Are background jobs, queues, and caches safe to restart?
A strong PostgreSQL disaster recovery runbook includes a restore test date. If the last restore test is unknown, the backup strategy is not proven.
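The restore test date can be enforced in the same way as any other threshold. This is a sketch under assumptions: the function names are invented, the date is assumed to be stored as a Unix epoch value somewhere in the runbook metadata, and the 30-day limit in the example is a placeholder rather than a recommendation.

```sh
# restore_test_age_days LAST_TEST_EPOCH NOW_EPOCH
# Prints the whole number of days since the last restore test.
restore_test_age_days() {
    echo $(( ($2 - $1) / 86400 ))
}

# restore_test_is_fresh LAST_TEST_EPOCH NOW_EPOCH MAX_AGE_DAYS
# Returns 0 when the last restore test is recent enough.
# An empty (unknown) test date always counts as stale.
restore_test_is_fresh() {
    [ -n "$1" ] || return 1
    [ "$(restore_test_age_days "$1" "$2")" -le "$3" ]
}

# Example (LAST_RESTORE_TEST_EPOCH is a hypothetical variable):
#   restore_test_is_fresh "$LAST_RESTORE_TEST_EPOCH" "$(date +%s)" 30 \
#     || echo "restore test is stale or unknown"
```

Treating an unknown date as a failure is deliberate: it matches the rule that an untested backup is not yet a proven recovery path.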
Kubernetes and service failover runbooks
In Kubernetes, many incidents look like application problems but are really capacity, rollout, networking, DNS, or dependency failures. A useful Kubernetes HA runbook starts with read-only checks that locate the failing layer, so that mitigation can keep the blast radius small.
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get events -A --sort-by=.lastTimestamp | tail -50
kubectl rollout status deployment/app -n production
kubectl describe hpa -n production app
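The pod listing above filters on phase, so it also shows pods from finished jobs (phase `Succeeded`, displayed as `Completed`). For a quick count that ignores those, one option is to summarize the STATUS column instead. In this sketch, `count_unhealthy_pods` is a name invented for the example, and it assumes the default six-column layout of `kubectl get pods -A --no-headers`.

```sh
# count_unhealthy_pods
# Reads `kubectl get pods -A --no-headers` output on stdin and prints how many
# pods have a STATUS (column 4) other than Running or Completed.
count_unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Example (needs cluster access, so it is left commented out):
#   kubectl get pods -A --no-headers | count_unhealthy_pods
```

A pod that is `Running` but not ready (for example `0/1` in the READY column) is not counted here, so this number is a first signal and not proof of health.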
For production services, the runbook should document rollback criteria before the release. If p99 latency grows, the 5xx rate increases, queue depth rises, or database connections spike past the agreed thresholds, the team should not debate whether rollback is allowed. The runbook should say when to roll back and how to validate the rollback.
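Rollback criteria are easiest to follow when they are written as numbers that a script can compare. The sketch below shows the idea with integer inputs. Every threshold is a placeholder, and `should_roll_back` is a name invented for this example. How the four metrics are collected is left out because it depends on your monitoring stack.

```sh
# Hypothetical thresholds; agree on real values before the release.
MAX_P99_MS=800
MAX_5XX_PERCENT=2
MAX_QUEUE_DEPTH=5000
MAX_DB_CONNECTIONS=180

# should_roll_back P99_MS ERROR_5XX_PERCENT QUEUE_DEPTH DB_CONNECTIONS
# Returns 0 and names the breached criteria if any metric exceeds its
# threshold; otherwise prints "hold" and returns 1.
should_roll_back() {
    breached=""
    if [ "$1" -gt "$MAX_P99_MS" ]; then breached="$breached p99_latency"; fi
    if [ "$2" -gt "$MAX_5XX_PERCENT" ]; then breached="$breached 5xx_rate"; fi
    if [ "$3" -gt "$MAX_QUEUE_DEPTH" ]; then breached="$breached queue_depth"; fi
    if [ "$4" -gt "$MAX_DB_CONNECTIONS" ]; then breached="$breached db_connections"; fi
    if [ -n "$breached" ]; then
        echo "roll back:$breached"
        return 0
    fi
    echo "hold"
    return 1
}

# Example: p99 at 950 ms, 1% 5xx, 100 queued jobs, 50 DB connections.
#   should_roll_back 950 1 100 50
```

Any single breach triggers the rollback verdict. That is a design choice for the sketch; a team could equally require two signals, as long as the rule is agreed before the release.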
A good service failover runbook also includes dependency order. For example, after restoring PostgreSQL you may need to restart API workers, rebuild read models, drain queues, or invalidate caches. Restarting everything randomly can create a second incident.
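Dependency order can be made explicit by running recovery steps through a small wrapper that stops at the first failure. This is a sketch only: `run_in_order` and the four step functions are hypothetical, and the steps are empty placeholders where the real command and its validation check would go.

```sh
# run_in_order STEP...
# Runs each named step (a shell function or command) in the order given and
# stops at the first failure, so later steps never run on a broken dependency.
run_in_order() {
    for step in "$@"; do
        printf 'step: %s\n' "$step"
        if ! "$step"; then
            printf 'failed: %s (stopping)\n' "$step"
            return 1
        fi
    done
    printf 'all steps completed\n'
}

# Placeholder steps; replace each body with the real action and its check.
restart_api_workers() { true; }
rebuild_read_models() { true; }
drain_queues()        { true; }
invalidate_caches()   { true; }

# Example, in the order the runbook prescribes after a PostgreSQL restore:
#   run_in_order restart_api_workers rebuild_read_models drain_queues invalidate_caches
```

The order itself is the valuable part. Writing it as an argument list means a reviewer can see and challenge it in a pull request.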
Incident communication and evidence
HA and DR are not only technical. During a production incident, business stakeholders need status, customer support needs language, and engineering needs one source of truth. The runbook should define where updates happen and who sends them.
Use a simple timeline:
- Time detected.
- Impact confirmed.
- Mitigation started.
- Failover or restore decision.
- Service restored.
- Validation completed.
- Follow-up actions created.
This timeline becomes audit evidence and postmortem input. For SOC 2-ready infrastructure, the ability to show incident handling, access decisions, backup tests, and recovery verification is as important as the recovery itself.
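The timeline is more reliable when entries are stamped by a command instead of typed from memory. A minimal sketch follows, assuming a plain tab-separated file is acceptable as evidence. `log_event`, the event names, and the file name are invented for this example.

```sh
# log_event FILE EVENT [NOTE]
# Appends one tab-separated line: UTC timestamp, event name, optional note.
log_event() {
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "${3:-}" >> "$1"
}

# Example:
#   log_event incident-timeline.tsv detected "checkout 5xx alert fired"
#   log_event incident-timeline.tsv failover_decision "approved by incident commander"
#   log_event incident-timeline.tsv service_restored
```

UTC timestamps avoid timezone arguments in the postmortem, and an append-only file is easy to attach to the incident ticket afterwards.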
Decision matrix
| Approach | Best for | Stability impact | Complexity |
|---|---|---|---|
| Manual recovery notes | Very small systems | Better than memory, still risky | Low |
| Versioned runbooks in Git | Growing production teams | Consistent recovery steps and review history | Medium |
| Automated failover with manual approval | Databases and critical services | Reduces downtime while keeping control | Medium |
| Regular DR drills | Mission-critical systems | Proves RPO/RTO before a real outage | High |
| Full SRE operating model | Regulated or high-traffic platforms | Strong incident response and audit evidence | High |
Related SteadyOps reading
- PostgreSQL at Scale — HA and DR runbooks are strongest when PostgreSQL replication, backups, and failover design are explicit.
- SOC 2-ready Ops Model — incident response, backup evidence, and access control are core parts of audit-ready operations.
- Zero-Downtime Blue/Green Deployments — release rollback runbooks reduce the number of incidents that become full disaster recovery events.
Key takeaways
- A runbook is useful only if it can be executed during stress.
- RPO and RTO must be defined before choosing tools.
- PostgreSQL backups are not real until restore is tested.
- Kubernetes runbooks should include rollback criteria, dependency order, and validation.
- Incident timelines create both operational clarity and audit evidence.
Operational takeaway
Write the runbook before the outage, test it before the business depends on it, and keep it close to the infrastructure code. A reliable recovery plan is a production control, not documentation decoration.
Need production-grade HA and DR?
SteadyOps can review your current failover, backup, and incident response process and turn it into a practical HA/DR runbook with restore tests and clear recovery steps.