PostgreSQL at Scale: HA, Performance, and DR

PostgreSQL high availability is not achieved by adding one replica and hoping failover works. At scale, PostgreSQL becomes a reliability system: storage behavior, replication lag, connection storms, backups, failover automation, observability, and application retry logic all interact. A database can look healthy from the outside while p99 latency grows, locks accumulate, autovacuum falls behind, and user requests start timing out.

For SteadyOps work, PostgreSQL at scale means designing for predictable failure. The goal is not only more throughput. The goal is to keep latency stable, failover controlled, restore tested, and operational decisions clear enough that an on-call engineer can act without guessing.

Start with workload and failure model

Before choosing bare metal, Kubernetes, cloud managed PostgreSQL, Patroni, or a database operator, define the workload. Is it read-heavy or write-heavy? Does the application tolerate stale reads? How many connections can appear during a traffic spike? What is the acceptable RPO/RTO? Which tables grow fastest? Which queries dominate p95 and p99 latency?

Scaling without this model usually creates expensive fragility. Teams add replicas, but the application still sends all writes and most reads to the primary. They increase instance size, but the bottleneck is connection count or lock contention. They add Kubernetes automation, but persistent storage and failover behavior are not tested.

A practical PostgreSQL scaling review starts with:

Bare metal versus Kubernetes for PostgreSQL

Bare metal often gives the most predictable I/O path. There are fewer layers between PostgreSQL and disks, easier control over kernel parameters, and clearer ownership of network and storage behavior. For latency-sensitive databases, this simplicity can be a major advantage.

Kubernetes can still be useful, especially when the team already operates a mature platform with strong storage classes, node isolation, backup automation, and tested operators. But Kubernetes does not remove database engineering. It adds scheduling, networking, and storage abstractions that must be understood during failover.

The decision is less about ideology and more about operational maturity:

PlatformBest forStability impactComplexity
Bare metal PostgreSQLPredictable I/O and critical databasesStrong control over latency and storageMedium
VM-based PostgreSQLTeams needing isolation and simpler operationsGood balance of manageability and controlMedium
Kubernetes operatorMature platform teams with tested storageConsistent automation when designed wellHigh
Managed PostgreSQLTeams prioritizing operational offloadGood baseline, less low-level controlLow/Medium

Patroni, etcd, and failover discipline

Patroni is a strong choice for PostgreSQL high availability when the team understands the failure model. It coordinates leader election through a distributed consensus store such as etcd or Consul and manages promotion of replicas. But Patroni is not magic. If etcd is unstable, network partitions are ignored, or health checks are wrong, failover can become dangerous.

Useful checks:

patronictl list
etcdctl endpoint health --cluster
curl -s http://127.0.0.1:8008/cluster
psql -c "select pg_is_in_recovery();"
psql -c "select client_addr, state, sync_state, replay_lag from pg_stat_replication;"

The runbook must define who approves manual failover, when automatic failover is allowed, and how to validate the new primary. After promotion, check application writes, replication rebuild, backup continuity, and monitoring targets.

PgBouncer and connection control

Many PostgreSQL incidents are connection incidents. Application pods scale out, each pod opens too many connections, deploys restart workers simultaneously, and PostgreSQL spends resources managing sessions instead of queries. PgBouncer can protect the database, but it must be configured for the application behavior.

Typical checks:

psql -c "select count(*) from pg_stat_activity;"
psql -c "select state, count(*) from pg_stat_activity group by state;"
psql -c "select wait_event_type, wait_event, count(*) from pg_stat_activity group by 1,2 order by 3 desc;"

Connection pooling should be paired with sane application pool limits. PgBouncer cannot fix unbounded concurrency, missing indexes, or long transactions. It gives the database breathing room while you correct the workload.

Monitoring PostgreSQL before users feel pain

A reliable PostgreSQL monitoring setup should show saturation early. CPU is useful, but it is rarely enough. Watch disk latency, checkpoint behavior, cache hit ratio, replication lag, locks, dead tuples, autovacuum, query latency, and connection count.

For production dashboards, I want to see:

Alerts should point to action. “Database CPU high” is weak. “Primary has 90% connection saturation and API p99 doubled after deploy” is useful.

Backup and restore testing

Backups are not a recovery strategy until restore is tested. PostgreSQL disaster recovery must include base backups, WAL availability, restore procedure, target recovery point, credentials path, and application validation.

A practical restore test should answer:

Restore tests are also SEO-worthy because many teams search for “PostgreSQL disaster recovery” after learning the hard way that backup success does not mean recovery success.

Key takeaways

Operational takeaway

Treat PostgreSQL as a reliability product. Measure saturation, control connections, test failover, rehearse restore, and make every recovery step boring before production is under pressure.

Need PostgreSQL HA help?

SteadyOps can review your PostgreSQL architecture, Patroni setup, PgBouncer configuration, backups, and restore process to produce a practical reliability plan.

Ask DevOps Copilot Request audit