Zero-Downtime Deployment: Blue/Green & Rollback

Reusable production assets included: 1 downloadable template · MIT licensed

Practical guide scope

Who this is for

Release engineers, SREs, backend leads, and SaaS teams with customer-facing deploys

Where it applies

Web applications and APIs behind Nginx, HAProxy, Kubernetes Services, ingress controllers, or cloud load balancers

Problems this guide helps solve

Deployments require an outage or risky in-place restart.
Teams switch traffic before validating real dependencies.
Rollback is slow because old and new versions are not simultaneously available.
Database migrations turn application rollback into disaster recovery.

Zero-downtime deployment is not a marketing phrase. It is an operating model for releasing production changes without turning every deploy into a user-visible outage. The Blue/Green pattern is one of the clearest ways to do it: keep the current version stable, deploy the candidate version beside it, validate it, shift traffic, and keep rollback simple.

For SteadyOps, the goal is not only “the page stayed up.” A production release is safe when the team can prove that the new version is healthy, observe p95/p99 latency during the switch, stop the rollout quickly, and return traffic to the previous version without guessing.

What Blue/Green deployment really means

In a Blue/Green deployment, Blue is the currently serving environment and Green is the candidate environment. Both should be production-like: same configuration model, same secrets injection process, same database compatibility expectations, same monitoring, and the same external dependencies where possible.

The pattern fails when Green is treated as a demo environment. If Green has different resources, missing observability, untested migrations, or incomplete dependency access, the traffic switch becomes a production experiment.

A good Blue/Green deployment answers these questions before traffic moves:

Is Green running the exact artifact expected for the release?
Are readiness and health checks meaningful?
Are database migrations backward compatible?
Are queues, cron jobs, and workers safe to run in parallel?
Can we switch traffic back to Blue immediately?
Are dashboards and alerts tagged by version?
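The first question can be answered mechanically. The sketch below assumes Green exposes a version endpoint that returns JSON with a "commit" field; that response shape and the name artifact_matches are illustrative assumptions, not something this guide defines.

```python
import json


def artifact_matches(version_body: str, expected_sha: str) -> bool:
    """Return True only when the reported commit equals the expected release SHA."""
    try:
        # Assumption: the version endpoint returns a JSON object with a "commit" key.
        reported = json.loads(version_body).get("commit", "")
    except (ValueError, AttributeError):
        # Unparseable or non-object responses are treated as a mismatch.
        return False
    return isinstance(reported, str) and reported != "" and reported == expected_sha
```

Treating every malformed response as a mismatch keeps the check fail-safe: an unknown artifact never counts as the expected one.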

Pre-release checklist

The safest deployment is the one that can be stopped early. Before the switch, validate the candidate with smoke tests, dependency checks, and operational signals.

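# -f makes curl exit non-zero on HTTP 4xx/5xx, so a failed check stops a script.
curl -fsS https://green.example.com/health
curl -fsS https://green.example.com/api/health
# Block until the Green rollout completes or fails.
kubectl rollout status deployment/app-green -n production
# Scan recent startup logs for dependency errors.
kubectl logs -n production deploy/app-green --tail=100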

For Kubernetes, do not rely only on “pods are running.” A pod can be Running while the application cannot reach PostgreSQL, Redis, RabbitMQ, Keycloak, or an external API. Readiness checks must verify real dependencies or at least fail safely when the service cannot process requests.
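A readiness handler that reflects real dependencies could be structured roughly as follows. This is a minimal sketch: the probe functions are hypothetical stand-ins, and a real probe would run something like a short-timeout SELECT 1 or a Redis PING.

```python
from typing import Callable, Dict, Tuple

# A probe returns None on success and raises an exception on failure.
Probe = Callable[[], None]


def readiness(probes: Dict[str, Probe]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe and return (http_status, per-dependency report)."""
    report: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            probe()
            report[name] = "ok"
        except Exception as exc:  # any probe failure marks the instance not ready
            report[name] = f"failed: {exc}"
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report


# Hypothetical probes used only to illustrate the contract.
def postgres_ok() -> None:
    return None


def redis_down() -> None:
    raise ConnectionError("connection refused")
```

Returning 503 with a per-dependency report lets the load balancer or kubelet withhold traffic while still telling the operator which dependency failed.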

For VM or bare-metal setups behind Nginx/HAProxy, validate backend health directly and through the load balancer. The release is not ready until the same path users will hit has passed checks.

Database migration safety

Most zero-downtime failures are database failures disguised as deployment failures. Application code can be rolled back quickly, but destructive schema changes can make rollback impossible.

Use expand-and-contract migrations:

Add new nullable columns or new tables.
Deploy code that writes to both the old and new structures where needed.
Backfill data safely.
Switch reads to the new structure.
Remove old structures only after the previous version is no longer needed.

Avoid deploying code that requires an irreversible migration at the same moment traffic is switched. If rollback requires restoring a database backup, the deployment is not zero-downtime. It is a high-risk release.
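The expand, dual-write, and backfill steps can be sketched end to end. The example uses an in-memory SQLite database only so that it runs anywhere; the users table and its columns are hypothetical, and PostgreSQL locking and batching behavior will differ.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add a nullable column; code that ignores it keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def insert_dual_write(conn: sqlite3.Connection, name: str) -> None:
    """New code writes both columns so either version can read the row."""
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )


def backfill(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Copy old values into the new column in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = full_name "
            "WHERE id IN (SELECT id FROM users WHERE display_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch to keep each transaction short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (full_name) VALUES (?)", [("Ada",), ("Linus",), ("Grace",)]
)
expand(conn)
insert_dual_write(conn, "Margaret")
updated = backfill(conn)
```

The contract step (dropping full_name) is deliberately absent: it belongs in a later release, after Blue no longer needs the old column.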

Traffic switching and rollback criteria

Traffic switching should be boring. Whether you use DNS, Nginx, HAProxy, Kubernetes Service selectors, Argo Rollouts, or a cloud load balancer, the rollback path must be known before the deployment starts.

Good rollback criteria are objective:

The 5xx rate rises above the agreed threshold.
p99 latency for critical endpoints exceeds the agreed limit.
Queue depth keeps rising instead of draining.
PostgreSQL connection count spikes.
Error logs show repeated dependency failures.
A business smoke test fails.

A simple SRE rule: if the team has to debate whether rollback is allowed, the release plan is incomplete.
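One way to remove the debate is to encode the criteria as data and evaluate them mechanically. The sketch below is illustrative: every threshold value is a placeholder to be agreed per service, and it simplifies "rising" signals such as queue depth to absolute limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    """Agreed before the release; the values here are placeholders."""
    max_5xx_ratio: float = 0.01
    max_p99_ms: float = 800.0
    max_queue_depth: int = 1000
    max_db_connections: int = 180


@dataclass(frozen=True)
class Signals:
    ratio_5xx: float
    p99_ms: float
    queue_depth: int
    db_connections: int
    dependency_errors: int
    smoke_test_passed: bool


def rollback_reasons(s: Signals, limits: Limits = Limits()) -> List[str]:
    """Return every breached criterion; an empty list means the rollout continues."""
    reasons = []
    if s.ratio_5xx > limits.max_5xx_ratio:
        reasons.append("5xx ratio above threshold")
    if s.p99_ms > limits.max_p99_ms:
        reasons.append("p99 latency above threshold")
    if s.queue_depth > limits.max_queue_depth:
        reasons.append("queue depth above threshold")
    if s.db_connections > limits.max_db_connections:
        reasons.append("database connections above threshold")
    if s.dependency_errors > 0:
        reasons.append("dependency failures in logs")
    if not s.smoke_test_passed:
        reasons.append("business smoke test failed")
    return reasons
```

Any non-empty result triggers rollback, and the returned reasons can go straight into the release record.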

Observability during the release

Zero-downtime deployment requires release-aware observability. Dashboards should show version, environment, traffic share, error rate, latency, saturation, and dependency health. Logs should include release version or commit SHA.
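Stamping every log line with release metadata can be done once, at logger setup. The following is a minimal sketch using Python's standard logging module; the field names, the version "1.4.2", and the commit "3f2a9c1" are made-up examples.

```python
import io
import json
import logging


class ReleaseFilter(logging.Filter):
    """Attach release metadata to every record that passes through the handler."""

    def __init__(self, version: str, commit: str, environment: str) -> None:
        super().__init__()
        self.fields = {"version": version, "commit": commit, "environment": environment}

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in self.fields.items():
            setattr(record, key, value)
        return True  # never drop records, only enrich them


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log pipelines can index by release."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "version": getattr(record, "version", None),
            "commit": getattr(record, "commit", None),
            "environment": getattr(record, "environment", None),
        })


def build_logger(stream, version: str, commit: str, environment: str) -> logging.Logger:
    logger = logging.getLogger(f"app.{environment}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(ReleaseFilter(version, commit, environment))
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = build_logger(buffer, "1.4.2", "3f2a9c1", "green")
log.info("payment accepted")
```

With the commit on every line, a spike in errors after the switch can be attributed to Blue or Green without guessing.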

Useful release checks:

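# Images and selectors per deployment, to confirm which version is live.
kubectl get deploy -n production -o wide
# Autoscaler targets, current replicas, and recent scaling events.
kubectl describe hpa -n production app
# Per-pod CPU and memory (requires metrics-server).
kubectl top pods -n production
# Version reported through the path users actually reach.
curl -s https://service.example.com/version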

For APIs, watch RED metrics: rate, errors, duration. For infrastructure, watch USE metrics: utilization, saturation, errors. For databases, watch locks, slow queries, active connections, and replication lag. A deployment that looks healthy at the web layer can still overload the database.
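For reference, the RED numbers for one observation window can be derived from raw request samples as sketched below. The sample shape (duration in milliseconds plus HTTP status) and the nearest-rank percentile method are assumptions made for illustration; a metrics backend would normally compute these from histograms.

```python
import math
from typing import Dict, List, Tuple

# One sample per request: (duration_ms, http_status).
Sample = Tuple[float, int]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(samples: List[Sample], window_seconds: float) -> Dict[str, float]:
    """Rate, errors, and duration for one window of requests."""
    durations = [duration for duration, _ in samples]
    errors = sum(1 for _, status in samples if status >= 500)
    return {
        "rate_rps": len(samples) / window_seconds,
        "error_ratio": errors / len(samples),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Computing the summary separately for Blue and Green samples gives a like-for-like comparison during the switch.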

Decision matrix

Approach	Best for	Stability impact	Complexity
Manual deploy + manual rollback	Small internal tools	Better than uncontrolled changes, but slow under pressure	Low
Blue/Green with load balancer switch	Web apps and APIs	Strong rollback path and clear validation point	Medium
Canary release	High-traffic services	Reduces blast radius with gradual exposure	Medium/High
Argo Rollouts / automated analysis	Kubernetes platforms	Automates progressive delivery with metrics	High
Feature flags + Blue/Green	Product changes with risk separation	Decouples deployment from feature exposure	High

Related SteadyOps reading

Load Balancing: Comparative Architectures. Traffic switching depends on correct L4/L7 routing and health-check behavior.
HA & DR Runbooks. Release rollback criteria should be documented like any other incident runbook.
PostgreSQL at Scale. Database migrations and connection behavior often decide whether rollback is safe.

Key takeaways

Zero-downtime deployment means validated release safety, not only avoiding a restart.
Blue/Green works when both environments are production-like and observable.
Database migrations must preserve rollback paths.
Rollback criteria should be objective and agreed before traffic moves.
The best deployment process makes aborting a bad release fast and boring.

Operational takeaway

Design every production release around the rollback path first. If you can switch back safely, measure impact clearly, and validate dependencies before traffic moves, zero-downtime deployment becomes an engineering control instead of hope.

Need safer production releases?

SteadyOps can review your CI/CD pipeline, rollback strategy, health checks, and release observability to build a safer deployment process for production systems.

Implementation blueprint

Use this sequence to turn the theory into an auditable production change. Adjust commands, thresholds, and ownership to the real environment before execution.

Create equivalent blue and green environments

Use the same artifact model, configuration injection, dependency access, health checks, observability, and capacity assumptions.
- Artifact SHA is visible
- Secrets and config are equivalent
- Green has production-like dependency access

Validate before traffic switch

Run synthetic transactions, dependency checks, warm-up, migration compatibility, queue and worker checks, and release-aware monitoring.
- Critical transaction passes
- Latency is within target
- Workers are safe to run in parallel

Switch gradually and keep blue intact

Move a controlled share of traffic where possible, watch objective rollback signals, and do not destroy blue until the observation window ends.
- Rollback switch is one action
- Blue remains healthy
- Observation window is defined
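The gradual switch in the last step could be driven by a loop like the one below. It is a sketch under assumptions: set_green_weight and is_healthy are hypothetical callables standing in for the traffic router API and the rollback-signal check, and a real is_healthy would wait out the observation window before answering.

```python
from typing import Callable, Sequence


def progressive_switch(
    steps: Sequence[int],
    set_green_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Raise Green's traffic share step by step; return to Blue on the first bad signal."""
    for weight in steps:
        set_green_weight(weight)
        if not is_healthy():
            # Single rollback action: send all traffic back to Blue.
            set_green_weight(0)
            return False
    return True
```

For example, progressive_switch([10, 50, 100], router.set_weight, checks.pass_window) would stop and revert at the first failed window, where router and checks are whatever objects wrap the real load balancer and monitoring.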

Configuration and command examples

Examples are conservative starting points. Review security, version compatibility, failure behavior, and rollback before production use.

Nginx upstream switch pattern

Keep upstream definitions versioned and validate configuration before reload.

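# Active upstream: switch environments by changing the server address below.
upstream app_active {
    server 127.0.0.1:8082 max_fails=3 fail_timeout=10s;
    keepalive 32;
}

server {
    # On nginx 1.25.1+ the "http2" listen parameter is deprecated; use "http2 on;" instead.
    listen 443 ssl http2;
    server_name app.example.com;

    # ssl_certificate and ssl_certificate_key are required here; paths are environment-specific.

    location / {
        proxy_pass http://app_active;
        # Upstream keepalive needs HTTP/1.1 and an empty Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Request-ID $request_id;
    }
}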

Run nginx -t before reload.
Keep the previous upstream definition ready for immediate rollback.
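The test-then-reload discipline can be scripted so that a rejected configuration never stays on disk. The sketch below assumes the app_active upstream lives in its own include file; the file path and backend addresses passed in are hypothetical, while nginx -t and nginx -s reload are the standard commands.

```python
import os
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    return subprocess.run(cmd, check=False).returncode


def render_upstream(backend: str) -> str:
    """Render the app_active upstream block for one backend address."""
    return (
        "upstream app_active {\n"
        f"    server {backend} max_fails=3 fail_timeout=10s;\n"
        "    keepalive 32;\n"
        "}\n"
    )


def switch_upstream(conf_path: str, backend: str, runner: Runner = run) -> bool:
    """Point app_active at backend; restore the previous file if nginx -t rejects it."""
    with open(conf_path, encoding="utf-8") as fh:
        previous = fh.read()
    tmp_path = conf_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(render_upstream(backend))
    os.replace(tmp_path, conf_path)  # atomic swap on the same filesystem
    if runner(["nginx", "-t"]) != 0:
        with open(conf_path, "w", encoding="utf-8") as fh:
            fh.write(previous)  # keep the known-good definition in place
        return False
    return runner(["nginx", "-s", "reload"]) == 0
```

Rollback is the same call with Blue's address, which keeps the switch and its reversal a single, rehearsed action.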

Production validation checklist

Green serves the intended artifact and configuration.
Health checks include required dependencies.
Critical business transactions pass before and after traffic switch.
Error rate, latency, saturation, queue depth, and database connections remain stable.
Blue can receive traffic immediately without data incompatibility.
The release record contains commit SHA, switch time, signals, and outcome.
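The release record from the last item can be as small as one serializable structure. This is an illustrative shape only; the field names and outcome values are assumptions rather than a SteadyOps format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class ReleaseRecord:
    commit_sha: str
    switch_time: str  # ISO 8601 timestamp in UTC
    signals: Dict[str, float] = field(default_factory=dict)
    outcome: str = "pending"  # for example "promoted" or "rolled_back"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def new_record(commit_sha: str, switched_at: datetime) -> ReleaseRecord:
    """Create a record with the switch time normalized to UTC."""
    return ReleaseRecord(
        commit_sha=commit_sha,
        switch_time=switched_at.astimezone(timezone.utc).isoformat(),
    )
```

Storing the record next to the deployment logs gives later reviews the commit, the switch time, and the signals that justified the outcome.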

Reusable assets

Download templates and validation files

Use these files as reviewed starting points. Keep the source link and version when sharing or adapting them.

Markdown

Blue/Green release checklist

Traffic switch, migration safety, smoke validation, rollback trigger, and cleanup checks.

Templates are provided under the MIT License. Production use still requires environment-specific review and testing.

Stable reference

Version, testing scope, and citation

Version: 1.0.0
Last reviewed: Jul 10, 2026
Tested with: Kubernetes Services · NGINX/HAProxy · Helm 3 · GitHub Actions/GitLab CI
License: CC BY 4.0 for the article; MIT for downloadable templates
Permanent URL: https://steadyops.best/articles/zero-downtime-bluegreen-deployments/

Yuri Osipov. "SteadyOps Zero-Downtime Deployment Checklist." SteadyOps, version 1.0.0, reviewed 2026-07-10. https://steadyops.best/articles/zero-downtime-bluegreen-deployments/

Deployment safety review

Need a zero-downtime deployment path that includes rollback?

Send the runtime, traffic router, deployment method, database migration pattern, and current smoke checks. SteadyOps will map the safest release sequence.

