Kubernetes Production Checklist: YAML, Tests & Rollback

Kubernetes Production Readiness Checklist for SaaS Teams

Reusable production assets included: 4 downloadable templates · MIT licensed

Practical guide scope

Who this is for

Platform engineers, SREs, backend leads, and teams preparing a Kubernetes production launch

Where it applies

Managed or self-managed clusters running customer-facing APIs, workers, and stateful dependencies

Problems this guide helps solve

The cluster deploys successfully but failure behavior has never been tested.
Readiness, capacity, security, rollback, and recovery controls are reviewed separately.
Teams have dashboards and manifests but no go-live decision criteria.
Production ownership depends on tribal knowledge.

A Kubernetes production readiness checklist is a go-live decision tool, not a generic list of best practices. It should prove that the cluster and its workloads can survive realistic pressure: node loss, failed deploys, traffic spikes, DNS problems, certificate rotation, dependency degradation, database failover, security events, and capacity growth.

For a SaaS team, readiness means the platform has clear SLOs, observable failure modes, tested rollback, controlled access, backup evidence, and enough capacity headroom to keep customer-facing latency and error rate within target. A successful kubectl apply does not prove any of those things.

Use this checklist before a production launch, after a major platform change, during a quarterly reliability review, or before asking an external Kubernetes consultant to audit the environment.

Copyable Kubernetes go-live checklist

A team should be able to answer yes to each item before production traffic is considered safe:

Critical workloads survive a node drain and expected zone or fault-domain failure.
Resource requests, replica count, PDB, topology spread, and graceful shutdown are configured.
Readiness verifies real serving ability; liveness cannot create a restart loop during dependency slowdown.
p95/p99 latency, error rate, saturation, queue depth, and dependency health are visible.
Every critical alert has an owner, dashboard, runbook, severity, and first safe action.
Production access is named, revocable, least-privileged, and separated from staging.
A bad release can be paused or rolled back without an incompatible database change.
Backup, restore, secret recovery, and service recovery order have been tested.
Capacity headroom covers deploy surge, traffic burst, and expected node replacement.

A “no” does not always block launch, but it must become an explicit accepted risk with an owner, mitigation, and deadline. Hidden gaps are more dangerous than known and consciously accepted gaps.

1. Define the production failure model

A production cluster should not depend on one node, one zone, one ingress path, one storage path, or one engineer who remembers how recovery works. Start by documenting what the platform must survive.

Ask:

What happens when a worker node disappears?
What happens when an availability zone or rack is degraded?
What happens when the ingress controller restarts during peak traffic?
What happens when PostgreSQL or another external dependency becomes slow?
What happens when a deployment passes scheduling but fails business validation?
What happens when secrets, DNS, or certificates cannot be refreshed?

The readiness review should use these failure scenarios, not only a static architecture diagram.

2. Cluster architecture and high availability

Minimum architecture checks:

Control plane is managed or deployed with quorum-aware redundancy.
Worker nodes are spread across zones or fault domains where the business requirement justifies it.
Critical workloads use topology spread constraints or anti-affinity.
PodDisruptionBudgets protect important services during maintenance.
Cluster autoscaler and node pools have enough headroom for replacement and bursts.
Ingress, DNS, certificate, metrics, and logging components have no accidental single replica.
Persistent storage behavior during rescheduling and zone failure is understood.

Example PodDisruptionBudget for a critical API:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api

The YAML is not the proof. Drain a node in a controlled environment and verify the real customer-facing path while the disruption occurs.
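
The topology spread check from the list above can be expressed in the pod template. The fragment below is a minimal sketch, assuming the pods carry the app: payment-api label used by the PDB example and that nodes expose the well-known topology.kubernetes.io/zone and kubernetes.io/hostname labels. The maxSkew values and the whenUnsatisfiable policies are illustrative, not values prescribed by this checklist.

```yaml
# Pod template fragment: place under spec.template.spec of the Deployment.
# Assumption: nodes carry the well-known zone and hostname labels.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard rule: pod stays Pending if the skew would be exceeded
    labelSelector:
      matchLabels:
        app: payment-api
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft rule: prefer separate nodes, but do not block scheduling
    labelSelector:
      matchLabels:
        app: payment-api
```

A hard zone constraint can leave surge pods Pending during a rollout when one zone has no free capacity, so test it together with the PDB during the node-drain exercise.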

3. Resource requests, limits, and capacity headroom

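Production scheduling depends on requests: the scheduler places pods by their declared requests, not by measured usage. Autoscaling and failure recovery also depend on requests, because HPA resource utilization is calculated as a percentage of the request and rescheduled pods need nodes with enough unreserved capacity. Workloads without meaningful requests can make a cluster look healthy until a node is lost or a rollout needs temporary surge capacity.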

Check:

Every production workload has reviewed CPU and memory requests.
Memory limits are based on measured behavior and OOM risk.
CPU limits are used consciously because throttling can increase latency.
HPA uses a signal that reflects demand and does not fight with slow dependencies.
Node pools can absorb expected rescheduling after a node loss.
Deployment maxSurge fits within available capacity.
Storage and log growth have forecasts and alerts.

Do not optimize for perfect average utilization. Production needs headroom for failure, deployment, backlog drain, and unpredictable customer demand.
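
As one possible starting point for the HPA check, the sketch below scales a Deployment named api on CPU utilization. It assumes CPU actually tracks demand for this workload. For queue workers or I/O-bound services a custom or external metric is usually a better signal. The replica bounds, the 70% target, and the scale-down window are hypothetical values to replace with measured ones.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3                 # assumption: never drop below the replica count needed for the PDB
  maxReplicas: 10                # assumption: node pools can actually schedule this many pods
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percentage of the CPU request, not of node capacity
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing replicas
```

Because the target is a percentage of the request, changing the CPU request silently changes when the HPA scales. Review both together.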

4. Probes and graceful lifecycle

Readiness and liveness solve different problems.

Readiness decides whether a pod should receive traffic.
Liveness decides whether the process should be restarted.
Startup probes protect slow-starting applications from premature liveness failure.

A common anti-pattern is using the same dependency-heavy endpoint for readiness and liveness. If PostgreSQL becomes slow, liveness can restart every healthy application pod and amplify the incident.

Validate:

Readiness fails before the pod receives traffic it cannot serve.
Liveness detects a genuinely stuck process, not a temporary dependency issue.
Graceful shutdown stops new traffic before process termination.
terminationGracePeriodSeconds matches real drain behavior.
Long-running workers stop or finish jobs safely.
Pre-stop hooks do not create hidden shutdown delays.
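
The lifecycle checks above can be sketched as a container fragment. This is a hedged example, assuming the application exposes a lightweight /live endpoint on port 8080 (as in the deployment baseline later in this guide), that the image ships a sleep binary, and that 10 seconds is enough for load balancers and endpoints to stop sending new traffic. All timings are placeholders to validate against real drain behavior.

```yaml
# Pod template fragment: place under spec.template.spec of the Deployment.
terminationGracePeriodSeconds: 45   # must cover the preStop delay plus in-flight request drain
containers:
  - name: api
    image: registry.example.com/api:1.4.2
    startupProbe:
      httpGet:
        path: /live
        port: 8080
      periodSeconds: 5
      failureThreshold: 24          # allows up to 120 s of startup before the container is restarted
    lifecycle:
      preStop:
        exec:
          # Assumption: the image contains a sleep binary.
          # The delay gives endpoint removal time to propagate before SIGTERM is sent.
          command: ["sleep", "10"]
```

The preStop delay counts against terminationGracePeriodSeconds, so a longer sleep leaves less time for the process to finish in-flight work before it is killed.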

5. Observability and alerting

Kubernetes observability must cover cluster health, workloads, ingress, dependencies, and business-facing behavior. Node CPU alone does not explain whether users are affected.

A production-ready setup should include:

Metrics for nodes, pods, deployments, HPA, ingress, DNS, and persistent volumes.
p95/p99 latency, request rate, and error rate for critical APIs.
Queue depth, worker throughput, and dead-letter behavior.
PostgreSQL connection pressure, locks, slow queries, and replica lag where relevant.
Centralized logs with service, environment, version, request ID, and trace ID.
Deployment markers and release version on dashboards.
Alert routing with severity, owner, dashboard, runbook, and escalation path.

An alert saying “CPU high” is weak. An alert saying “checkout p99 is above SLO after release 1.4.2 and database connection saturation reached 90%” supports an operational decision.
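
An alert of the stronger kind might look like the sketch below. It uses the PrometheusRule resource from Prometheus Operator. The metric name http_request_duration_seconds_bucket, the service label, the 500 ms threshold, the owner, and both URLs are hypothetical and must be replaced with what the environment actually exposes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo
  namespace: production
  # Assumption: add whatever labels the Prometheus ruleSelector in this cluster requires.
spec:
  groups:
    - name: checkout.slo
      rules:
        - alert: CheckoutP99LatencyAboveSLO
          # Hypothetical histogram metric and label; p99 over 5 minutes compared to a 0.5 s target.
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))
            ) > 0.5
          for: 10m
          labels:
            severity: page
            owner: payments-team
          annotations:
            summary: "checkout p99 latency is above the 500 ms SLO"
            dashboard_url: https://grafana.example.com/d/checkout
            runbook_url: https://runbooks.example.com/checkout-latency
```

The labels and annotations carry the severity, owner, dashboard, and runbook that the alert routing item asks for, so the on-call engineer does not have to look them up during an incident.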

6. Security and access control

Production Kubernetes security starts with least privilege and clear boundaries.

Security checks:

No default cluster-admin access for users or service accounts.
RBAC roles are reviewed and tied to named owners.
Production access uses SSO and MFA where possible.
Shared kubeconfigs and long-lived personal tokens are removed.
Secrets are injected from an approved source and not stored in plain manifests.
Sensitive namespaces use default-deny NetworkPolicy where the networking model supports it.
Image provenance and vulnerability policy are part of deployment.
Audit logs and privileged events are retained.
Staging credentials cannot modify production.
Break-glass access is controlled, logged, and tested.
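
The default-deny item can be sketched as two policies. This example assumes the CNI plugin enforces NetworkPolicy and that cluster DNS runs in kube-system. Clusters using a node-local DNS cache or a different DNS namespace need a different egress rule. Workload-specific allow rules are still required on top of these.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS lookups; without this, denying egress also breaks name resolution.
# Assumption: cluster DNS pods live in the kube-system namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Apply this in staging first and watch the customer-facing path, because a missing allow rule shows up as timeouts rather than as a failed apply.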

Useful review command:

kubectl auth can-i --list --namespace production --as system:serviceaccount:production:api

The output needs human review. The goal is not an empty permission set; it is the smallest explicit set required by the workload.
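
A smallest-explicit-set grant might look like the following sketch. It assumes the workload runs as a service account named api in the production namespace (matching the review command) and only needs to read one ConfigMap. The Role name and the ConfigMap name api-config are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-config-reader        # hypothetical name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["api-config"] # hypothetical ConfigMap; restricts the grant to one object
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
```

A namespaced Role with a RoleBinding keeps the grant inside production. Anything that needs a ClusterRole deserves a separate, explicit review.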

7. Release safety and rollback

Kubernetes makes deployment convenient, but safety depends on release controls.

Release checklist:

Immutable image tag or digest identifies the artifact.
Rollout status is checked in CI/CD.
Readiness blocks bad pods from traffic.
Helm or GitOps rollback procedure is documented.
Database migrations stay compatible with the previous application version throughout the rollback window.
Risky changes use canary or Blue/Green exposure where practical.
Dashboards separate old and new versions.
Rollback criteria are agreed before traffic moves.
A business smoke test runs after deploy and rollback.

Example Helm command:

helm upgrade api ./chart \
  --namespace production \
  --install \
  --atomic \
  --timeout 5m

--atomic rolls the release back automatically when the upgrade fails or times out, but it does not make irreversible database migrations safe. Application and data rollback must be designed together.
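
Checking rollout status in CI/CD only works if a stalled rollout is reported as failed within a bounded time. The Deployment fields below are a sketch of that idea. The values are illustrative and should be aligned with real startup time and with the Helm timeout.

```yaml
# Deployment fragment: these fields sit directly under spec.
progressDeadlineSeconds: 300   # report the rollout as failed after 5 minutes without progress
minReadySeconds: 10            # a new pod must stay Ready for 10 s before it counts as available
revisionHistoryLimit: 5        # keep previous ReplicaSets so a rollback has a target
```

When the deadline is exceeded, kubectl rollout status exits with a non-zero code, which lets the pipeline fail. Kubernetes itself only records the condition and does not roll back, so the pipeline or Helm still has to act on it.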

8. Backups, disaster recovery, and stateful dependencies

Persistent volumes are not automatically a recovery strategy. Every stateful dependency needs explicit backup, restore, ownership, and validation.

Verify:

Backup jobs are monitored and failures page an owner when required.
Restore tests are performed in a clean environment.
RPO and RTO are documented.
PostgreSQL, Redis, object storage, queues, and external systems have appropriate recovery paths.
Secrets, manifests, and infrastructure configuration can be recovered.
Recovery order is documented.
Technical checks and a critical business transaction are part of validation.

If PostgreSQL runs inside or near the cluster, include replication lag, WAL availability, failover routing, application reconnect, and backup continuity after promotion.
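
For the monitored backup job item, one possible shape is a CronJob whose failures are visible as failed Jobs. This is a sketch only: the image tag, the Secret api-db with key url, and the PVC pg-backups are hypothetical, the pg_dump client version must be compatible with the server, and a logical dump on a volume in the same cluster is not a complete recovery strategy on its own.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-logical-backup      # hypothetical name
  namespace: production
spec:
  schedule: "15 2 * * *"             # daily at 02:15
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid          # never run two backups at once
  startingDeadlineSeconds: 600       # skip the run if it cannot start within 10 minutes
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5          # keep failed Jobs around for investigation
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 3600    # fail the Job if the dump runs longer than 1 hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:16     # assumption: matches or exceeds the server major version
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pg_dump --format=custom \
                    --file="/backup/app-$(date -u +%Y%m%dT%H%M%SZ).dump" \
                    "$DATABASE_URL"
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: api-db   # hypothetical Secret holding the connection URI
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: pg-backups   # hypothetical PVC; copy dumps off-cluster as well
```

If kube-state-metrics is installed, a metric such as kube_job_status_failed can feed the alert that pages the backup owner. The restore test, not the green Job, is what proves the backup.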

9. Operational ownership and runbooks

The cluster is not production-ready if only one engineer can operate it safely.

Required ownership records:

Who approves production deployment?
Who owns cluster upgrades?
Who owns ingress, DNS, certificates, and storage?
Who can approve rollback or failover?
Who owns critical alerts and dashboards?
Who validates business recovery?
Where are incident timelines and follow-up actions stored?

Critical alerts should link directly to short runbooks. Runbooks should contain checks, safe actions, stop conditions, escalation, and validation—not only background theory.

Kubernetes readiness decision matrix

Approach	Best for	Stability impact	Complexity
Single-zone cluster	Proof of concept and internal tools	Low resilience; unsuitable for many customer-facing workloads	Low
Multi-zone managed Kubernetes	Most SaaS production workloads	Strong baseline with lower control-plane burden	Medium
Self-managed multi-control-plane cluster	Bare metal or strict control requirements	High control with significant operational responsibility	High
Canary or Blue/Green delivery	Customer-facing APIs and risky releases	Reduces release blast radius	Medium/High
Full SRE operating model	High-traffic or business-critical platforms	Strong ownership, evidence, and incident response	High

Related SteadyOps reading

Kubernetes Rollback Checklist: objective rollback triggers and validation.
Zero-Downtime Blue/Green Deployments: safe traffic switching and migration compatibility.
Disaster Recovery Runbook Template: recovery ownership, restore drills, and RPO/RTO.
Infrastructure Cost Optimization: capacity and cost control without removing resilience.
Security Evidence Operations Model: access, change, and incident evidence.

Key takeaways

Production readiness is evidence that operational controls work under failure.
Requests, PDB, topology, replicas, and headroom must support disruption.
Probes must distinguish traffic readiness from a genuinely stuck process.
Observability should connect customer impact, dependencies, and release identity.
Rollback and database migration compatibility must be designed together.
Backups matter only after a clean restore and business validation.
Named ownership and executable runbooks prevent tribal knowledge.

Operational takeaway

Run this checklist before launch and after every major platform change. Record every failed item as an accepted risk or an owned remediation, then prove the important paths with node drain, failed rollout, restore, access review, and dependency-failure exercises.

Need a Kubernetes production readiness review?

SteadyOps can audit cluster architecture, workloads, CI/CD, observability, security, rollback, recovery, and capacity, then produce a prioritized production-readiness plan with implementation checks.

Implementation blueprint

Use this sequence to turn the theory into an auditable production change. Adjust commands, thresholds, and ownership to the real environment before execution.

Define workload criticality and SLOs

Classify services, critical user paths, availability targets, latency targets, data-loss tolerance, and acceptable degraded modes.
- Critical workloads are named
- p95/p99 and error SLOs exist
- RPO/RTO are defined for stateful services

Make workloads disruption-safe

Set resource requests, topology spread, PodDisruptionBudgets, rollout strategy, probes, graceful shutdown, and enough replicas for node maintenance.
- Node drain was tested
- PDB allows maintenance without outage
- Termination grace matches shutdown behavior

Harden access and network boundaries

Review RBAC, service accounts, secrets, audit logs, admission policy, NetworkPolicy, image provenance, and staging/production separation.
- No default cluster-admin
- Sensitive namespaces use default deny
- Production access is named and revocable

Prove rollout and recovery paths

Run a failed rollout exercise, workload restart, node drain, backup restore, and dependency outage test while watching customer-facing signals.
- Rollback command is documented
- Restore test has a date
- Business smoke test is automated

Configuration and command examples

Examples are conservative starting points. Review security, version compatibility, failure behavior, and rollback before production use.

Production deployment baseline

A compact example combining rollout safety, resources, probes, and graceful termination.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: registry.example.com/api:1.4.2
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            periodSeconds: 10
            failureThreshold: 3

Pre-launch verification commands

Run these against the intended production namespace and review the output rather than only checking exit status.

kubectl get nodes -o wide
kubectl get deploy,pod,pdb,hpa -n production
kubectl auth can-i --list --namespace production --as system:serviceaccount:production:api
kubectl rollout status deployment/api -n production
kubectl get events -n production --sort-by=.metadata.creationTimestamp | tail -50

Production validation checklist

A node can be drained without breaking the critical user path.
A bad image or failed readiness check stops the rollout.
Rollback restores the previous version without incompatible data changes.
Critical alerts contain an owner, dashboard, runbook, and first action.
Backup restore and secret recovery have been tested.
Capacity headroom covers a deploy surge or node replacement.

Reusable assets

Download templates and validation files

Use these files as reviewed starting points. Keep the source link and version when sharing or adapting them.

Production evidence checklist (Markdown)

A review worksheet for failure model, scheduling, lifecycle, security, observability, releases, recovery, and ownership.

PodDisruptionBudget example (YAML)

A conservative PDB example for a critical replicated API.

Node-drain validation (Markdown)

A practical test plan for disruption, customer-path checks, capacity, and recovery evidence.

Readiness validation script (Shell)

A shell starting point for workload status, rollout, endpoints, events, and service health checks.

Templates are provided under the MIT License. Production use still requires environment-specific review and testing.

Stable reference

Version, testing scope, and citation

Version: 1.0.0
Last reviewed: Jul 10, 2026
Tested with: Kubernetes 1.29–1.31 · Prometheus Operator · NGINX Ingress · Helm 3
License: CC BY 4.0 for the article; MIT for downloadable templates
Permanent URL: https://steadyops.best/articles/kubernetes-production-readiness-checklist/

Yuri Osipov. "SteadyOps Kubernetes Production Evidence Checklist." SteadyOps, version 1.0.0, reviewed 2026-07-10. https://steadyops.best/articles/kubernetes-production-readiness-checklist/

Kubernetes readiness review

Need an evidence-based Kubernetes go-live review?

Send the cluster version, workload type, ingress, deployment method, and the failure scenario you are least confident about. SteadyOps will map readiness gaps and validation tests.
