Disaster Recovery Runbook Template: PostgreSQL + K8s

Disaster Recovery Runbook Template for Production Systems

Reusable production assets included: 3 downloadable templates · MIT licensed

Practical guide scope

Who this is for

CTOs, SREs, platform engineers, database owners, and incident commanders

Where it applies

Production systems that need documented failover, backup restore, RPO/RTO, and recovery ownership

Problems this guide helps solve

Backups exist, but nobody has proved that a clean restore works.
Failover decisions depend on one engineer remembering undocumented steps.
Recovery actions have no owner, stop condition, or business validation.
Incident communication and evidence are assembled manually after the outage.

A disaster recovery runbook is an executable recovery procedure, not a long policy document. It tells an engineer what failure has occurred, who owns the decision, which checks come first, when to stop, how to restore or fail over, and how to prove that the customer-facing service is healthy again.

This guide is designed for CTOs, SREs, database owners, and platform engineers who need a usable DR runbook example for PostgreSQL, Kubernetes, web applications, queues, object storage, and supporting infrastructure. It can be copied into Git, Notion, Confluence, or an incident-management repository, but the commands and thresholds must be adapted and tested against the real environment.

The most important rule is simple: a backup report is not recovery evidence. Recovery becomes credible only after a clean restore or failover drill produces a measured recovered point, elapsed time, technical validation, and business validation.

How to use this disaster recovery runbook template

Start with one critical service, not the entire company infrastructure. Choose the system whose outage, data loss, or prolonged degradation would create the largest customer or business impact. Then complete the template in this order:

Define service impact and severity.
Agree RPO and RTO with the business owner.
Name the incident commander, recovery operator, and business validator.
Document dependencies and recovery order.
Write exact checks, commands, expected results, and stop conditions.
Run a drill in a clean environment.
Record evidence and convert every gap into an owned action.

A reusable Markdown skeleton and configuration examples are included in the practical implementation sections below. The same template is also available in the SteadyOps disaster recovery runbook repository.

What every DR runbook must contain

A production runbook should be short enough to use under pressure and precise enough to prevent dangerous improvisation. At minimum, it needs:

Service name, business impact, and incident severity.
Recovery trigger and explicit stop conditions.
RPO and RTO.
Incident commander and recovery operator.
Dependency inventory and recovery order.
Backup, snapshot, WAL, or replica source.
Exact commands with placeholders and expected output.
Technical health checks.
Critical business transaction validation.
Communication timeline and stakeholder owner.
Evidence location, follow-up actions, and next drill date.

The most common failure is writing a runbook as prose. During an outage nobody wants to interpret a ten-page wiki article. The procedure must make the next safe action obvious: capture state, verify the recovery source, stop unsafe writes, restore or promote, reconnect dependencies, validate, and communicate.
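
One way to picture "making the next safe action obvious" is to model each step as an action with a check and a stop message, and to halt at the first check that fails. The following is a minimal Python sketch under that assumption; `Step` and `run_steps` are hypothetical names, and the checks are plain callables standing in for real commands.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One runbook step (hypothetical structure, not part of the template)."""
    action: str
    check: Callable[[], bool]  # returns True when the expected result is observed
    stop_message: str          # shown to the operator when the check fails


def run_steps(steps: list[Step]) -> list[str]:
    """Execute checks in order and stop at the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        if not step.check():
            log.append(f"{number}. STOP: {step.action}: {step.stop_message}")
            return log
        log.append(f"{number}. OK: {step.action}")
    return log


if __name__ == "__main__":
    # Illustrative checks only; real checks would inspect the environment.
    steps = [
        Step("Capture state", lambda: True, "state could not be captured"),
        Step("Verify recovery source", lambda: False, "backup source is missing"),
        Step("Restore or promote", lambda: True, "promotion failed"),
    ]
    for line in run_steps(steps):
        print(line)
```

The point of the structure is that a later step is never reached once an earlier check fails, which is the behavior a stop condition is meant to enforce.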

Define RPO and RTO before choosing tools

Many teams begin with products: Patroni, etcd, Kubernetes operators, snapshots, object storage, cloud backup services, or database replicas. Those tools matter, but the recovery contract begins with business tolerances.

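RPO defines the maximum acceptable data loss, measured as the time between the last recoverable commit and the failure. RTO defines the maximum acceptable recovery time, measured from the start of the outage until the service is validated as healthy. A service with a 30-minute RPO can use a different architecture from a payment or authentication system that requires near-zero data loss.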

Area	Runbook question	Example target
RPO	How much committed data may be lost?	0–5 minutes
RTO	How long may the service remain unavailable?	15–30 minutes
Recovery source	Which tested source is used?	Base backup plus WAL archive
Decision owner	Who approves failover or restore?	Incident commander / SRE lead
Validation	What proves recovery?	API health, consistency check, synthetic transaction
Evidence	Where are outputs and timestamps stored?	Incident record and drill repository

Without these answers, every incident becomes a debate. With them, the team can compare architectures and make recovery decisions against agreed business risk.
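
Once a drill has produced measurements, the contract can be checked mechanically. The following Python sketch is illustrative only: `RecoveryTarget`, `DrillResult`, and the example timestamps are assumptions, not part of the SteadyOps templates.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Business tolerances agreed with the service owner (hypothetical type)."""
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable recovery time


@dataclass(frozen=True)
class DrillResult:
    """Measurements taken during a restore or failover drill (hypothetical type)."""
    failure_at: datetime           # when the failure occurred or was simulated
    recovered_point: datetime      # latest committed data visible after recovery
    service_restored_at: datetime  # when validation passed


def evaluate(target: RecoveryTarget, drill: DrillResult) -> dict:
    """Compare measured data loss and downtime with the agreed targets."""
    data_loss = drill.failure_at - drill.recovered_point
    downtime = drill.service_restored_at - drill.failure_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= target.rpo,
        "rto_met": downtime <= target.rto,
    }


if __name__ == "__main__":
    target = RecoveryTarget(rpo=timedelta(minutes=5), rto=timedelta(minutes=30))
    drill = DrillResult(
        failure_at=datetime(2026, 1, 15, 10, 0),
        recovered_point=datetime(2026, 1, 15, 9, 57),
        service_restored_at=datetime(2026, 1, 15, 10, 42),
    )
    print(evaluate(target, drill))
```

In this invented example the drill loses 3 minutes of data, which satisfies the RPO, but takes 42 minutes to recover, which misses the RTO. That gap would become an owned follow-up action.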

Separate high availability failover from disaster recovery

High availability and disaster recovery solve different failure modes.

HA failover usually promotes an already-running healthy replica after primary or infrastructure failure.
DR restore rebuilds service from backup, WAL, snapshots, object storage, infrastructure code, and secrets after corruption, deletion, regional loss, or broader compromise.

Treating a replica as the only backup is dangerous because corruption or accidental deletes can replicate. Treating restore from backup as a normal application rollback is also dangerous because it is slower, affects data state, and may require coordinated downtime.

The runbook should state which path applies to each trigger:

Trigger	Preferred path	Stop condition
Primary host failure, healthy replica	Controlled HA failover	Replica lag exceeds RPO
Corrupted data replicated to standbys	Point-in-time restore	Recovery target is unknown
Accidental deletion	PITR or application-level recovery	New writes would overwrite evidence
Full environment loss	Infrastructure rebuild plus restore	Secrets, DNS, or backup source unavailable
Bad application deploy	Application rollback	Database migration is not backward compatible
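
The trigger table can also be kept as data, so that the preferred path and its stop condition are looked up rather than remembered. This is a minimal Python sketch; the enum and function names are hypothetical, and deciding whether a stop condition actually holds remains a human judgment passed in as a flag.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    PRIMARY_FAILURE_HEALTHY_REPLICA = "primary host failure, healthy replica"
    CORRUPTION_REPLICATED = "corrupted data replicated to standbys"
    ACCIDENTAL_DELETION = "accidental deletion"
    FULL_ENVIRONMENT_LOSS = "full environment loss"
    BAD_APPLICATION_DEPLOY = "bad application deploy"


@dataclass(frozen=True)
class RecoveryPath:
    preferred_path: str
    stop_condition: str


# Rows copied from the trigger table above.
PATHS = {
    Trigger.PRIMARY_FAILURE_HEALTHY_REPLICA: RecoveryPath(
        "Controlled HA failover", "Replica lag exceeds RPO"),
    Trigger.CORRUPTION_REPLICATED: RecoveryPath(
        "Point-in-time restore", "Recovery target is unknown"),
    Trigger.ACCIDENTAL_DELETION: RecoveryPath(
        "PITR or application-level recovery", "New writes would overwrite evidence"),
    Trigger.FULL_ENVIRONMENT_LOSS: RecoveryPath(
        "Infrastructure rebuild plus restore", "Secrets, DNS, or backup source unavailable"),
    Trigger.BAD_APPLICATION_DEPLOY: RecoveryPath(
        "Application rollback", "Database migration is not backward compatible"),
}


def next_action(trigger: Trigger, stop_condition_holds: bool) -> str:
    """Return the documented path, or a stop instruction if the stop condition holds."""
    path = PATHS[trigger]
    if stop_condition_holds:
        return f"STOP and escalate: {path.stop_condition}"
    return f"PROCEED: {path.preferred_path}"


if __name__ == "__main__":
    print(next_action(Trigger.PRIMARY_FAILURE_HEALTHY_REPLICA, stop_condition_holds=False))
    print(next_action(Trigger.BAD_APPLICATION_DEPLOY, stop_condition_holds=True))
```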

PostgreSQL disaster recovery procedure

PostgreSQL recovery needs more than pg_dump or a snapshot. The runbook must cover primary failure, unsafe replica lag, corrupted data, accidental deletes, missing WAL, credential recovery, routing, and application validation.

First capture current state with read-only checks:

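# Cluster view from Patroni; add -c <path to patroni.yml> if patronictl cannot find its configuration.
patronictl list
systemctl status patroni --no-pager
# Patroni REST API, assumed here to listen on port 8008; adjust host and port to the local configuration.
curl -fsS http://127.0.0.1:8008/cluster
# Returns t on a standby and f on the writable primary.
psql -X -v ON_ERROR_STOP=1 -c "select pg_is_in_recovery();"
# Run on a standby. Returns NULL on a primary, and overstates lag when the primary has been idle.
psql -X -v ON_ERROR_STOP=1 -c "select now() - pg_last_xact_replay_timestamp() as replica_lag;"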

Before promotion or restore, confirm:

Which node is currently writable.
Whether any replica satisfies the RPO.
Whether the latest base backup completed.
Whether required WAL files exist.
Which client route will point to the recovered primary.
How application retries behave during the switch.
Whether backup and monitoring jobs will continue afterward.
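
The second item, whether any replica satisfies the RPO, can be sketched as a small selection function. The Python below is a hedged illustration: the replica names and the input mapping of measured lag per replica are assumptions, and how the lag is collected (for example from the replay-timestamp query above) is left to the environment.

```python
from datetime import timedelta
from typing import Optional


def pick_promotion_candidate(
    replica_lag: dict[str, Optional[timedelta]],
    rpo: timedelta,
) -> Optional[str]:
    """Return the least-lagged replica whose lag is within the RPO, else None.

    A lag of None means the lag could not be measured. Such replicas are
    never selected, because an unknown lag cannot be shown to satisfy the RPO.
    """
    eligible = {
        name: lag
        for name, lag in replica_lag.items()
        if lag is not None and lag <= rpo
    }
    if not eligible:
        return None  # stop condition: no replica satisfies the RPO
    return min(eligible, key=lambda name: eligible[name])


if __name__ == "__main__":
    # Hypothetical replica names and measurements.
    lags = {
        "pg-2": timedelta(seconds=12),
        "pg-3": timedelta(minutes=9),
        "pg-4": None,
    }
    print(pick_promotion_candidate(lags, rpo=timedelta(minutes=5)))
```

Returning `None` rather than the "least bad" replica is deliberate: when nothing satisfies the RPO, the decision goes back to the incident commander instead of being made by a script.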

After recovery, validate database state and a real application operation. A successful psql connection is necessary but not sufficient.

Kubernetes and service recovery order

Kubernetes can restart workloads, but it does not automatically restore databases, queues, secrets, object storage, external DNS, or application consistency. A service-level DR runbook must document dependency order.

A common order is:

Recover network, identity, secrets, and required storage access.
Recover the database or confirm the healthy primary.
Recover queues and object storage.
Start APIs in a controlled mode.
Start workers only after their dependencies are safe.
Run technical health checks.
Run a critical business transaction.
Re-enable normal traffic and background processing.

Useful initial checks include:

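kubectl get nodes -o wide
# Phase Running does not mean Ready: crash-looping pods still report Running, so also check the READY and RESTARTS columns.
kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -50
# deployment/app and namespace production are examples; the timeout keeps the check from blocking on a stuck rollout.
kubectl rollout status deployment/app -n production --timeout=60s
kubectl get pvc -A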

Do not restart every component simultaneously. Random restart order can create a second incident through queue duplication, connection storms, stale cache, or workers processing data before the database is consistent.
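
A documented recovery order can be derived from a dependency map instead of being maintained by hand. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The component names and edges are an illustrative assumption, not a description of any real system.

```python
from graphlib import TopologicalSorter

# Each key lists the components that must be healthy before it starts.
# Illustrative map only; replace it with the real dependency inventory.
DEPENDS_ON = {
    "secrets": {"network"},
    "database": {"network", "secrets"},
    "queue": {"network", "secrets"},
    "object-storage": {"network", "secrets"},
    "api": {"database", "queue", "object-storage"},
    "workers": {"database", "queue", "api"},
}


def recovery_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group components into waves; each wave starts only after the previous one."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(number, ", ".join(wave))
```

Components that land in the same wave have no ordering constraint in this map. If the database must be confirmed before queues are recovered, add that edge explicitly so the order is enforced rather than implied.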

Recovery validation must include the business path

A runbook is incomplete if it ends with “pods are Running” or “database accepts connections.” Recovery must be proved at three levels:

Infrastructure validation

Required nodes, storage, network, and certificates are healthy.
Only the intended primary is writable.
Replication, backups, and monitoring resumed.
Queues and workers are not accumulating errors.

Application validation

Health and readiness endpoints succeed.
Error rate and p95/p99 latency return to baseline.
Authentication and dependency calls work.
Logs contain no repeated recovery-related failures.

Business validation

A user can log in.
A customer can complete the critical transaction.
The recovered data point is acceptable.
Support or the business owner confirms that customer impact is falling.

The business validator should be named before the incident. SREs can prove infrastructure health, but only the product or service owner can confirm that the recovered behavior is correct for customers.
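
The three-level rule can be expressed as a gate that refuses to declare recovery while any level has a failed check or no recorded check at all. This is a minimal Python sketch; `ValidationReport` and the check names are hypothetical.

```python
LEVELS = ("infrastructure", "application", "business")


class ValidationReport:
    """Collects named check results per validation level (hypothetical helper)."""

    def __init__(self) -> None:
        self.results: dict[str, dict[str, bool]] = {level: {} for level in LEVELS}

    def record(self, level: str, check: str, passed: bool) -> None:
        if level not in self.results:
            raise ValueError(f"unknown validation level: {level}")
        self.results[level][check] = passed

    def unproved_levels(self) -> list[str]:
        """Levels with no recorded checks or with at least one failed check."""
        return [
            level
            for level in LEVELS
            if not self.results[level] or not all(self.results[level].values())
        ]

    def recovery_proved(self) -> bool:
        return not self.unproved_levels()


if __name__ == "__main__":
    report = ValidationReport()
    report.record("infrastructure", "only the intended primary is writable", True)
    report.record("application", "readiness endpoint succeeds", True)
    print(report.recovery_proved(), report.unproved_levels())
```

Treating an empty level as unproved is the important design choice: healthy infrastructure and application checks cannot substitute for a business validation that nobody ran.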

Incident communication and recovery evidence

The runbook should define one source of truth and a simple timeline:

Detection time.
Impact confirmation.
Mitigation start.
Failover or restore decision.
Recovery completion.
Technical validation.
Business validation.
Follow-up actions.

Store commands, timestamps, selected outputs, recovered point, restore duration, screenshots or dashboard links, and approvals in the incident record. This evidence supports postmortems, customer communication, audits, and future runbook improvement.
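
A timeline is easier to trust when it is checked for gaps before the incident record is closed. The Python sketch below assumes a simple mapping of milestone name to timestamp; the milestone keys mirror the list above, except follow-up actions, which are tracked as tickets rather than a timestamp. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

MILESTONES = [
    "detected",
    "impact_confirmed",
    "mitigation_started",
    "decision_made",
    "recovery_completed",
    "technical_validation",
    "business_validation",
]


def review_timeline(timeline: dict[str, datetime]) -> list[str]:
    """Return problems: missing milestones or timestamps that run backwards."""
    problems = []
    previous_name, previous_time = None, None
    for name in MILESTONES:
        stamp = timeline.get(name)
        if stamp is None:
            problems.append(f"missing: {name}")
            continue
        if previous_time is not None and stamp < previous_time:
            problems.append(f"out of order: {name} is earlier than {previous_name}")
        previous_name, previous_time = name, stamp
    return problems


def elapsed(timeline: dict[str, datetime], start: str, end: str) -> timedelta:
    """Time between two recorded milestones."""
    return timeline[end] - timeline[start]


if __name__ == "__main__":
    base = datetime(2026, 1, 15, 10, 0, tzinfo=timezone.utc)
    timeline = {name: base + timedelta(minutes=5 * i) for i, name in enumerate(MILESTONES[:-1])}
    print(review_timeline(timeline))
    print(elapsed(timeline, "detected", "recovery_completed"))
```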

DR runbook decision matrix

Approach	Best for	Reliability impact	Complexity
Manual recovery notes	Small internal tools	Better than memory, but difficult to validate	Low
Versioned runbook in Git	Growing production teams	Clear review history and executable recovery path	Medium
Automated HA with manual approval	Databases and critical services	Faster recovery while preserving decision control	Medium
Scheduled restore drills	Business-critical systems	Proves RPO/RTO before a real outage	Medium/High
Multi-region DR operating model	Strict availability and regional failure requirements	Strongest resilience with significant cost and complexity	High

Related SteadyOps reading

PostgreSQL at Scale: replication, Patroni, connection control, backup, and restore behavior.
Kubernetes Production Readiness Checklist: controls that should exist before a production launch.
Kubernetes Rollback Checklist: when the correct response is application rollback rather than disaster recovery.
Security Evidence Operations Model: recovery records, access events, and incident timelines as operational evidence.

Key takeaways

A disaster recovery runbook is an executable and tested procedure.
RPO, RTO, owners, triggers, and stop conditions must exist before an incident.
HA failover and DR restore are different recovery paths.
Database connectivity is not enough; validate the real customer transaction.
A clean restore drill is the evidence that turns backups into a recovery strategy.
Every drill should produce measured results and owned improvements.

Operational takeaway

Write the runbook before the outage, execute it in a clean environment, measure the recovered point and elapsed time, and keep the evidence beside the infrastructure code. Recovery confidence comes from drills, not from backup status alone.

Need a production-grade DR runbook?

SteadyOps can review failover, backups, dependencies, RPO/RTO, and incident ownership, then build and test a practical recovery procedure for the real environment.

Implementation blueprint

Use this sequence to turn the theory into an auditable production change. Adjust commands, thresholds, and ownership to the real environment before execution.

Define the recovery contract

Agree the service scope, business impact, RPO, RTO, recovery owner, escalation path, and the evidence required to declare recovery complete.

- RPO and RTO are explicit
- Decision owner is named
- Critical customer journey is identified

Inventory dependencies and recovery order

Document databases, queues, object storage, secrets, DNS, certificates, external APIs, workers, and the order in which they must recover.

- Dependency map is current
- Credentials path is documented
- Restart order is tested

Write executable procedures

Use exact commands, expected outputs, abort criteria, rollback steps, communication checkpoints, and validation queries instead of narrative-only documentation.

- Commands use placeholders safely
- Expected output is shown
- Dangerous actions require approval

Run a restore drill and record evidence

Restore into a clean environment, measure elapsed time and recovered point, run application smoke tests, and create follow-up actions for every gap.

- Restore time is measured
- Recovered timestamp is verified
- Application smoke test passes

Configuration and command examples

Examples are conservative starting points. Review security, version compatibility, failure behavior, and rollback before production use.

Copyable disaster recovery runbook skeleton

Keep this file in Git next to infrastructure code and replace every placeholder before the first drill.

# Disaster Recovery Runbook

## Service and impact
- Service: <name>
- Business impact: <customer journey>
- Severity: <SEV-1/SEV-2>
- RPO: <minutes>
- RTO: <minutes>

## Ownership
- Incident commander: <role>
- Recovery operator: <role>
- Business validator: <role>

## Trigger and stop conditions
- Trigger: <measurable condition>
- Do not continue when: <data corruption / unknown primary / missing backup>

## Recovery steps
1. Freeze risky writes or traffic.
2. Capture current state and timestamps.
3. Validate backup and recovery data availability.
4. Restore or fail over using the approved procedure.
5. Reconnect dependencies in documented order.
6. Run technical and business smoke tests.

## Validation
- Health endpoint: <URL>
- Database consistency query: <query>
- Critical transaction: <test>
- Monitoring returned to baseline: <dashboard>

## Communication timeline
- Detected:
- Mitigation started:
- Recovery completed:
- Business validation completed:

## Follow-up
- Evidence location: <link>
- Action items: <tickets>
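
Because every placeholder has to be replaced before the first drill, a small check can list the ones that remain. The Python sketch below assumes placeholders follow the `<text>` convention used in the skeleton; the function name is hypothetical, and a runbook that legitimately contains angle-bracketed text (for example Markdown autolinks) would need a stricter pattern.

```python
import re

# Matches <anything> on a single line, the placeholder style used in the skeleton.
PLACEHOLDER = re.compile(r"<[^<>\n]+>")


def unreplaced_placeholders(runbook_text: str) -> list[tuple[int, str]]:
    """Return (line number, placeholder) pairs still present in the runbook."""
    found = []
    for number, line in enumerate(runbook_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((number, match.group(0)))
    return found


if __name__ == "__main__":
    sample = "- Service: billing-api\n- RPO: <minutes>\n- Severity: <SEV-1/SEV-2>\n"
    for number, placeholder in unreplaced_placeholders(sample):
        print(f"line {number}: {placeholder}")
```

A check like this could run in CI for the repository that holds the runbook, so that an unfinished template does not pass review unnoticed.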

PostgreSQL recovery evidence checks

Run read-only checks after promotion or restore before reopening normal traffic.

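# Expect exactly one Leader in the member list.
patronictl list
# Expect f on the recovered primary; t means the node is still a standby.
psql -X -v ON_ERROR_STOP=1 -c "select pg_is_in_recovery();"
psql -X -v ON_ERROR_STOP=1 -c "select now(), current_database();"
# Session count (background processes included) should stay well below max_connections while traffic is restricted.
psql -X -v ON_ERROR_STOP=1 -c "select count(*) from pg_stat_activity;"
# service.example.com is a placeholder for the real health endpoint.
curl -fsS https://service.example.com/health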
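
To turn checks like these into evidence, each command's output can be stored together with a timestamp and exit code. The following Python sketch shows one possible shape for such a record. The `capture` function and its field names are assumptions, and the demo runs the Python interpreter itself so that it works on any machine; in a drill the command would be one of the read-only checks above.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def capture(command: list[str], timeout_seconds: float = 30.0) -> dict:
    """Run one read-only check and return a timestamped evidence record."""
    started = datetime.now(timezone.utc)
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
        exit_code = completed.returncode
        stdout, stderr = completed.stdout, completed.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = None, "", f"timed out after {timeout_seconds}s"
    except OSError as error:
        # For example, the command is not installed on this host.
        exit_code, stdout, stderr = None, "", str(error)
    return {
        "started_at_utc": started.isoformat(),
        "command": command,
        "exit_code": exit_code,
        "stdout": stdout,
        "stderr": stderr,
    }


if __name__ == "__main__":
    record = capture([sys.executable, "-c", "print('ok')"])
    print(json.dumps(record, indent=2))
```

Passing the command as a list avoids shell interpretation of its arguments, and recording failures instead of raising keeps the evidence file complete even when a check cannot run.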

Production validation checklist

The latest backup and required recovery data are available.
The runbook was executed in a clean environment within the agreed review period.
The measured restore time satisfies the stated RTO.
The recovered point satisfies the stated RPO.
Technical health and a real business transaction both pass.
The timeline, commands, outputs, and follow-up actions are stored.

Reusable assets

Download templates and validation files

Use these files as reviewed starting points. Keep the source link and version when sharing or adapting them.

DR runbook template (Markdown)

Copyable Markdown structure with owners, RPO/RTO, triggers, stop conditions, recovery steps, validation, communications, and evidence.

Recovery validation checklist (Markdown)

Infrastructure, application, data, and business validation after failover or restore.

Incident timeline template (Markdown)

A compact timeline for detection, decisions, recovery actions, validation, and follow-up ownership.

Templates are provided under the MIT License. Production use still requires environment-specific review and testing.

Stable reference

Version, testing scope, and citation

Version: 1.0.0
Last reviewed: Jul 10, 2026
Tested with: PostgreSQL 12–16 · Patroni 3.x · Kubernetes 1.29–1.31 · Linux/systemd
License: CC BY 4.0 for the article; MIT for downloadable templates
Permanent URL: https://steadyops.best/articles/ha-dr-runbooks/

Yuri Osipov. "SteadyOps Disaster Recovery Runbook Template." SteadyOps, version 1.0.0, reviewed 2026-07-10. https://steadyops.best/articles/ha-dr-runbooks/

Disaster recovery review

Need your recovery runbook tested against a real failure scenario?

Send the architecture, backup method, target RPO/RTO, and date of the last restore test. SteadyOps will identify the highest-risk recovery gaps and define a practical drill.
