Deployment lifecycle, rollout strategies, container delivery, CI/CD flow, and production release best practices.

| Strategy | Downtime | Risk | Complexity | Best For |
|---|---|---|---|---|
| Rolling | None | Medium | Low | Stateless services, small teams |
| Blue-Green | None | Low | Medium | Full switch deployments, databases |
| Canary | None | Low | High | Gradual rollout, metric validation |
| A/B Testing | None | Low | High | Feature experiments, user testing |
| Shadow | None | Very Low | High | Load testing on production traffic |
| Feature Flags | None | Very Low | Low | Progressive feature rollout |
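
A common way to implement blue-green on Kubernetes is a label switch on the Service selector: both versions run side by side, and one edit moves all traffic. A minimal sketch (service and label names are hypothetical):

```yaml
# Sketch: blue-green cutover by flipping a Service selector label
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: blue   # flip to "green" to cut all traffic over at once
  ports:
    - port: 80
      targetPort: 3000
```

Because the old Deployment stays running, rollback is the same one-line change in reverse.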

| # | Principle | Summary |
|---|---|---|
| 1 | Codebase | One codebase per app, many deploys |
| 3 | Config | Store config in env vars, not in code |
| 4 | Backing Services | Treat databases/cache as attached resources |
| 5 | Build/Release/Run | Strictly separate stages |
| 8 | Concurrency | Scale out via process model |
| 9 | Disposability | Fast startup, graceful shutdown |
| 10 | Dev/Prod Parity | Keep environments as similar as possible |
| 11 | Logs | Treat logs as event streams |
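
Factor 3 in practice: keep config in the environment so the same image runs unchanged in every deploy. A hypothetical docker-compose sketch:

```yaml
# Sketch: config injected via environment, never baked into the image
services:
  web:
    image: registry.example.com/web-app:v1.2.0
    env_file: .env            # per-environment values, not committed to Git
    environment:
      NODE_ENV: production    # non-secret overrides are fine inline
```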

| Registry | Provider | Auth Method |
|---|---|---|
| Docker Hub | Docker Inc. | docker login (username/token) |
| ECR | AWS | aws ecr get-login-password \| docker login |
| GCR / Artifact Registry | Google Cloud | gcloud auth configure-docker |
| ACR | Azure | az acr login |
| GitHub Container Registry | GitHub | ghcr.io (PAT or GITHUB_TOKEN) |
| Quay | Red Hat | docker login quay.io |

```dockerfile
# ── Multi-stage Dockerfile (Node.js) ──
# Stage 1: Build (the build step needs dev dependencies, so install everything)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm prune --omit=dev && npm cache clean --force

# Stage 2: Runtime
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER nonroot:nonroot
EXPOSE 3000
# Distroless has no shell or curl, so the health check must be an exec-form node call
HEALTHCHECK --interval=30s --timeout=3s \
  CMD ["/nodejs/bin/node", "-e", "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"]
# The distroless nodejs image already sets node as ENTRYPOINT; pass the script as CMD
CMD ["dist/server.js"]
```

| Driver | Scope | Use Case |
|---|---|---|
| bridge (default) | Single host | Containers on same host communicate via internal DNS |
| host | Single host | Container shares host network stack (no isolation) |
| overlay | Multi-host (Swarm) | Cross-host communication for Docker Swarm |
| macvlan | Single host | Container gets a real MAC address on the LAN |
| none | Single host | No networking — fully isolated container |
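
In Compose, the default bridge behavior shows up as a user-defined network with built-in DNS: containers reach each other by service name. A minimal sketch with hypothetical service names:

```yaml
# Sketch: two services on one bridge network; "api" reaches "db" by hostname "db"
services:
  api:
    image: registry.example.com/api:v1.0.0
    networks: [backend]
  db:
    image: postgres:16
    networks: [backend]
networks:
  backend:
    driver: bridge
```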

| Practice | How |
|---|---|
| Non-root user | USER 1000:1000 in Dockerfile |
| Read-only FS | docker run --read-only --tmpfs /tmp |
| No SUID binaries | Use distroless or scratch images |
| Secrets | Use Docker secrets or mounted files, never ENV for secrets |
| Image scanning | Trivy, Snyk, docker scout |
| Minimal image | Alpine, distroless, or DockerSlim |
| Capabilities drop | --cap-drop ALL --cap-add NET_BIND_SERVICE |
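
Several of these practices combine naturally in one Compose service definition. A hardening sketch (the image name is hypothetical):

```yaml
services:
  app:
    image: registry.example.com/web-app:v1.2.0
    user: "1000:1000"           # run as non-root
    read_only: true             # read-only root filesystem
    tmpfs:
      - /tmp                    # writable scratch space only
    cap_drop: [ALL]             # drop every capability...
    cap_add: [NET_BIND_SERVICE] # ...then re-add only what the app needs
```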

| Resource | Use Case | Key Features |
|---|---|---|
| Deployment | Stateless apps | Rolling updates, rollbacks, ReplicaSet management |
| StatefulSet | Databases, queues | Stable identities, persistent volumes, ordered pods |
| DaemonSet | Logging, monitoring agents | Runs on every node (or subset via taints) |
| Job | Batch processing | Runs to completion, supports parallelism |
| CronJob | Scheduled tasks | Cron schedule, concurrency policy, history limits |
| ReplicaSet | Pod replication | Ensures N pods running (usually via Deployment) |
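
As an example of the Job row: a batch workload that runs five pods to completion, two at a time (names are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: image-resize            # hypothetical batch task
spec:
  completions: 5                # run to completion 5 times total
  parallelism: 2                # at most 2 pods at once
  backoffLimit: 3               # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/resize-worker:v1.0.0
```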

| Service | Reachability | Use Case |
|---|---|---|
| ClusterIP | Cluster internal only | Internal microservice communication |
| NodePort | NodeIP:Port (30000-32767) | Dev/test external access |
| LoadBalancer | Cloud LB + NodePort | Production external traffic |
| Headless (None) | Direct pod IPs via DNS | StatefulSet pods, custom load balancing |

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:v1.2.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: db-url
      terminationGracePeriodSeconds: 30
```

| Resource | Purpose |
|---|---|
| ConfigMap | Non-sensitive configuration (env vars, config files) |
| Secret | Sensitive data (passwords, tokens, TLS certs) |
| PersistentVolume (PV) | Cluster-wide storage resource |
| PersistentVolumeClaim (PVC) | Pod request for storage (bound to PV) |
| StorageClass | Dynamic provisioning (gp3, io1, standard) |
| Ingress | HTTP/S routing with TLS, path-based routing |
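
The ConfigMap row in practice: hold non-secret settings in a ConfigMap and load them into a pod as environment variables (a sketch with hypothetical names and keys):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-app-config
data:
  LOG_LEVEL: info
  CACHE_TTL: "300"       # values are always strings in a ConfigMap
---
# In the pod spec, every key becomes an environment variable:
# containers:
#   - name: web-app
#     envFrom:
#       - configMapRef:
#           name: web-app-config
```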

| Feature | What It Does |
|---|---|
| HPA | Scales pods based on CPU/memory/custom metrics |
| VPA | Adjusts resource requests/limits automatically |
| Cluster Autoscaler | Adds/removes nodes based on pending pods |
| PodDisruptionBudget | Min/max available pods during disruptions |
| TopologySpreadConstraints | Spread pods across zones/nodes |
| PriorityClass | Preemption ordering for resource contention |
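
The HPA row as a manifest, targeting the web-app Deployment above at 70% average CPU (a sketch under those assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```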

Never run kubectl apply manually in production — if it is not in Git, it does not exist.

| Platform | Language Support | Timeout | Key Features |
|---|---|---|---|
| AWS Lambda | Node, Python, Java, Go, Rust, .NET | 15 min | Layers, SnapStart, Provisioned Concurrency |
| Azure Functions | C#, JS, Python, Java, PowerShell | 10 min (Consumption) | Durable Functions, deployment slots, managed identity |
| Google Cloud Functions | Node, Python, Go, Java, Ruby | 9 min (1st gen), 60 min (2nd gen) | Eventarc, Cloud Run-based (2nd gen) |
| Cloudflare Workers | JS, TS, Wasm | 10ms CPU (free), 30s CPU (paid) | Edge runtime, KV storage, R2 |
| Vercel Edge | JS, TS, Go, Rust (Wasm) | 30s | Edge SSR, ISR, streaming |
| Deno Deploy | TS, JS, Wasm | 60s | Edge-native, zero-config deploy |

| Trigger Type | Examples |
|---|---|
| HTTP / REST | API Gateway, ALB, Function URL |
| Queue / Message | SQS, SNS, EventBridge, Pub/Sub, Service Bus |
| Timer / Schedule | EventBridge Scheduler, CronTrigger |
| Blob / Storage | S3, Azure Blob, GCS object changes |
| Database | DynamoDB Streams, Firebase, Supabase triggers |
| Stream | Kinesis, Kafka (via EventBridge) |

```yaml
# ── Serverless Framework config ──
service: my-api
frameworkVersion: '3'
provider:
  name: aws
  runtime: nodejs20.x
  architecture: arm64
  stage: production
  region: us-east-1
  environment:
    NODE_ENV: production
  httpApi:
    cors: true
functions:
  api:
    handler: src/handler.handler
    events:
      - httpApi:
          path: /{proxy+}
          method: ANY
    provisionedConcurrency: 5 # Reduce cold starts
resources:
  Resources:
    ApiLogGroup:
      Type: AWS::Logs::LogGroup
      Properties:
        RetentionInDays: 30
```

| Platform | Best For | Key Features |
|---|---|---|
| Heroku | Quick startups, low ops | Git push deploy, add-ons, Review Apps |
| Render | Modern Heroku alternative | Docker, background workers, auto-deploy |
| Railway | Full-stack apps | PR environments, infra-as-code, monorepo support |
| Fly.io | Edge/low-latency | Docker-based, global regions, persistent volumes |
| Northflank | Multi-service apps | Docker compose, CI/CD, managed DBs |
| Google App Engine | Google Cloud native | Auto-scaling, versions, traffic splitting |

| Concept | Details |
|---|---|
| Buildpacks | Auto-detect language, install deps, build artifact |
| Procfile | Defines process types: web, worker, clock |
| Environment vars | Primary config mechanism (12-factor) |
| Add-ons | Managed services (Postgres, Redis, SendGrid) |
| Slug/Image | Compiled artifact uploaded to dyno/instance |
| Release | Immutable artifact + config version |

```
# ── Procfile (Heroku / Render / Railway) ──
web: npm start
worker: node worker.js
release: npx prisma migrate deploy
clock: node scheduler.js
```

| Platform | Build | Key Features |
|---|---|---|
| Vercel | Git → auto-detect | Edge SSR, ISR, Preview deploys, Analytics |
| Netlify | Git → auto-detect | Edge Functions, Split Testing, Forms, Identity |
| Cloudflare Pages | Git → auto-detect | Global CDN, Workers integration, R2 |
| GitHub Pages | Git branch | Free, Jekyll native, Actions-based deploys |
| AWS Amplify | Git → auto-detect | Fullstack (auth, API, hosting), SSR support |
| Firebase Hosting | firebase deploy | CDN, Cloud Functions, real-time DB |

| Strategy | When to Use | Platforms |
|---|---|---|
| SSG (Static) | Blogs, docs, marketing pages | All platforms |
| SSR (Server) | Dynamic, personalized content | Vercel, Netlify, Cloudflare Pages |
| ISR (Incremental) | Frequently updated static pages | Vercel, Netlify ISR |
| CSR (Client) | Highly interactive apps, dashboards | All platforms |
| Streaming SSR | Large pages, progressive loading | Next.js, Remix on Vercel |
| Edge SSR | Ultra-low latency dynamic pages | Vercel Edge, Cloudflare Workers |

```js
// ── Next.js Deployment Configuration (next.config.js) ──
/** @type {import('next').NextConfig} */
const nextConfig = {
  output: 'standalone', // For Docker / bare metal
  images: {
    remotePatterns: [{
      protocol: 'https',
      hostname: 'cdn.example.com',
    }],
  },
  experimental: {
    serverActions: true,
  },
};

module.exports = nextConfig;

// For Vercel: git push → auto-detect & deploy
// For Docker: use output: 'standalone' with node:alpine
// For Netlify: use @netlify/plugin-nextjs
```

| Layer | Tool | Purpose |
|---|---|---|
| Reverse Proxy | nginx / Caddy | TLS termination, static serving, rate limiting |
| App Server | PM2 / systemd | Process management, auto-restart, log rotation |
| Database | PostgreSQL / MySQL | Managed via systemd, daily backups |
| Load Balancer | HAProxy / nginx | Distribute traffic across app instances |
| SSL/TLS | Let's Encrypt / Certbot | Free auto-renewing TLS certificates |
| Firewall | UFW / iptables | Port filtering, IP whitelisting |

| Manager | Type | Best For |
|---|---|---|
| systemd | System-level | Production Linux servers (built-in) |
| PM2 | Node.js | Node.js apps, cluster mode, graceful reload |
| supervisord | General purpose | Multi-language process management |
| runit | System-level | Lightweight init system (used by Void Linux) |

```bash
#!/bin/bash
# ── Bare Metal Deploy Script ──
set -euo pipefail
APP_DIR="/opt/myapp"
BRANCH="main"
echo ">>> Pulling latest code..."
cd "$APP_DIR" && git pull origin "$BRANCH"
echo ">>> Installing dependencies..."
npm ci                      # dev dependencies are needed for the build step
echo ">>> Running database migrations..."
npx prisma migrate deploy
echo ">>> Building application..."
npm run build
echo ">>> Restarting service..."
sudo systemctl restart myapp
sudo systemctl status myapp --no-pager
echo ">>> Health check..."
sleep 3 && curl -sf http://localhost:3000/health || exit 1
echo "Deploy successful!"
```

| Service | Cloud | Engines |
|---|---|---|
| RDS | AWS | PostgreSQL, MySQL, MariaDB, SQL Server, Oracle |
| Cloud SQL | GCP | PostgreSQL, MySQL, SQL Server |
| Azure SQL / Azure Database | Azure | SQL Server, PostgreSQL, MySQL |
| PlanetScale | Serverless | MySQL-compatible, branching, non-blocking schema changes |
| Neon | Serverless | PostgreSQL, branching, auto-scaling, serverless driver |
| Supabase | Serverless | PostgreSQL, auth, realtime, storage built-in |

| Tool | Language | Key Features |
|---|---|---|
| Prisma Migrate | Node/TS | Declarative schema, auto-generate migrations |
| Flyway | Java/Kotlin | Version-controlled SQL migrations |
| Liquibase | Java/Kotlin | XML/YAML/JSON changelogs, rollback support |
| EF Core Migrations | C#/.NET | Code-first model changes |
| Atlas | Go | Declarative HCL schema, plan-and-apply workflow |
| golang-migrate | Go | CLI + library, Go/SQL source files |
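
Migrations belong in the pipeline, not on a laptop. A hypothetical GitHub Actions step running Prisma Migrate as a gated deploy step (the secret name is an assumption):

```yaml
# Sketch: apply pending migrations before rolling out the new version
- name: Run database migrations
  run: npx prisma migrate deploy
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}   # hypothetical secret name
```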

| Strategy | How | Trade-off |
|---|---|---|
| Vertical | Bigger instance (more CPU/RAM) | Simple but has a ceiling, brief downtime |
| Read replicas | Direct reads to replicas | Write is single-node; eventual consistency |
| Connection pooling | PgBouncer, ProxySQL | Reduces connections to the DB server |
| Sharding | Partition data across DB instances | Complex; app must be shard-aware |
| Partitioning | Split tables by range/hash/list | Transparent to app; single DB |
| Caching | Redis, Memcached in front of DB | Reduces read load; cache invalidation complexity |

| Feature | Details |
|---|---|
| Automated backups | Daily snapshots (AWS RDS, Cloud SQL) |
| Point-in-time recovery | Restore to any second (PITR), 7-35 day retention |
| Cross-region replication | DR standby in another region |
| Export/Import | pg_dump, mysqldump, or managed export |
| Backup testing | Regularly restore to a test environment |
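
For clusters without a managed backup service, a nightly pg_dump can be scheduled as a Kubernetes CronJob. A sketch (names, the secret, and the PVC destination are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "0 3 * * *"         # nightly at 03:00
  concurrencyPolicy: Forbid     # never let backup runs overlap
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/db-$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: app-secrets      # hypothetical; provides DATABASE_URL
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: pg-backup-pvc   # hypothetical claim
```

Pair this with the backup-testing row above: a backup that has never been restored is not a backup.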

| Platform | Config Format | Runners | Notes |
|---|---|---|---|
| GitHub Actions | YAML (.github/workflows) | GitHub-hosted or self-hosted | Largest marketplace; free for public repos |
| GitLab CI | .gitlab-ci.yml | GitLab-hosted or self-hosted | Built-in container registry, environments |
| Azure DevOps | azure-pipelines.yml | Microsoft-hosted or self-hosted | Deep Azure integration, release gates |
| Jenkins | Jenkinsfile | Self-hosted agents | Most flexible, plugin ecosystem, heavy setup |
| CircleCI | .circleci/config.yml | Cloud or self-hosted | Docker layer caching, orbs |
| Bitbucket Pipelines | bitbucket-pipelines.yml | Atlassian-hosted | Integrated with Bitbucket, Docker support |

| Practice | Why |
|---|---|
| Cache dependencies | Save 50-80% build time (npm, pip, Docker layers) |
| Parallel jobs | Run lint, test, security scan simultaneously |
| Artifact promotion | Build once, deploy same artifact everywhere |
| Environment protection | Require approval for production deploys |
| Secret scanning | TruffleHog, gitleaks in CI pipeline |
| Fast feedback | Under 10 min total pipeline time |
| Immutable artifacts | Version-tagged Docker images, never mutable tags |

```yaml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run test:e2e
  build-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write   # required for GITHUB_TOKEN to push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/org/app:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy:
    needs: build-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - run: echo "Deploy image ${{ github.sha }} to production"
```

| Tool | Language | State Mgmt | Best For |
|---|---|---|---|
| Terraform | HCL | Remote state (S3, GCS) | Multi-cloud, mature ecosystem, modules |
| Pulumi | Python, TS, Go, C# | Pulumi Cloud / backend | Teams preferring real languages over HCL |
| AWS CDK | TypeScript, Python, Java | CloudFormation | AWS-native, object-oriented IaC |
| CloudFormation | JSON/YAML | AWS managed | AWS-native, no extra tooling |
| Bicep | Bicep DSL | Azure managed | Azure-native, cleaner than ARM templates |
| Crossplane | YAML (CRDs) | Kubernetes | K8s-native multi-cloud IaC |

| Criteria | Terraform | Pulumi |
|---|---|---|
| Language | HCL (declarative DSL) | General-purpose (TS, Python, Go) |
| Learning curve | Moderate (learn HCL) | Low (use existing language skills) |
| Ecosystem | Massive (3,000+ providers) | Growing (Terraform providers compatible) |
| Testing | terratest (Go), pytest | Native unit testing per language |
| State | S3/GCS/remote backend | Pulumi Cloud / S3 / local |
| GitOps | Atlantis, TF Controller | Pulumi Deployments |

```hcl
# ── Terraform: EC2 + RDS + VPC ──
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" { region = "us-east-1" }

variable "db_password" {
  type      = string
  sensitive = true
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.3.0/24", "10.0.4.0/24"]
}

resource "aws_db_instance" "app" {
  engine              = "postgres"
  engine_version      = "16"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  db_name             = "appdb"
  username            = "admin"
  password            = var.db_password
  skip_final_snapshot = true # fine for dev; keep final snapshots in prod
}
```

| Solution | Provider | Key Features |
|---|---|---|
| HashiCorp Vault | Self-hosted or HCP | Dynamic secrets, PKI, encryption-as-service, transit |
| AWS Secrets Manager | AWS | Auto-rotation, RDS integration, cross-account access |
| AWS SSM Parameter Store | AWS | Hierarchical params, KMS encryption, generous free tier |
| Azure Key Vault | Azure | Keys, secrets, certificates, managed HSM |
| GCP Secret Manager | GCP | Auto-replication, versioning, IAM integration |
| 1Password / Doppler | SaaS | Developer-friendly, CLI sync, team management |

| Approach | How | Trade-off |
|---|---|---|
| K8s native Secrets | Base64 encoded, etcd | Simple but not encrypted by default |
| Sealed Secrets | Bitnami, encrypt with cert | GitOps-friendly, store sealed data in Git |
| External Secrets Operator | Sync from Vault/AWS/GCP | Centralized, auto-sync, enterprise-grade |
| SOPS + Flux | Encrypt with age/KMS | GitOps-native, encrypted files in repo |
| CSI Secrets Store | Mount secrets as volumes | Pod-level secret injection, auto-rotation |
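
The External Secrets Operator row as a manifest: sync a value out of a cloud secret manager into a native Kubernetes Secret (a sketch; store, key path, and names are hypothetical):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager    # hypothetical SecretStore
    kind: ClusterSecretStore
  target:
    name: app-secrets            # the K8s Secret to create and keep in sync
  data:
    - secretKey: db-url
      remoteRef:
        key: prod/app/db-url     # hypothetical path in the backing store
```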

| Pillar | Tool | What It Tells You |
|---|---|---|
| Metrics | Prometheus, Grafana, Datadog | Numbers over time (CPU, latency, error rate) |
| Logs | ELK Stack, Loki, CloudWatch | Discrete events (errors, requests, stack traces) |
| Traces | Jaeger, Zipkin, OpenTelemetry | Request lifecycle across services |

| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator (the metric) | Request latency p99 < 200ms |
| SLO | Service Level Objective (the target) | 99.9% of requests under 200ms |
| SLA | Service Level Agreement (the contract) | 99.9% uptime or credit refund |
| Error Budget | Allowed failures per period | 0.1% of 30 days = 43.2 min downtime |
| Burn Rate | How fast error budget is consumed | Burn rate 1x = normal; 6x = page immediately |
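
A burn-rate alert in Prometheus rule form: for a 99.9% SLO, a 6x burn rate means errors exceed roughly 0.6% of requests, which exhausts the monthly budget in about five days. A sketch (metric names are hypothetical):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetBurnHigh
        # 6x burn on a 99.9% SLO: >0.6% of requests failing over the last hour
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > 0.006
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >6x the sustainable rate"
```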

| Tool | Type | Best For |
|---|---|---|
| PagerDuty | On-call management | Escalation policies, incident workflows |
| OpsGenie | On-call management | Team routing, alert deduplication |
| Grafana Alerts | Metric-based | Prometheus/Grafana stack native alerts |
| AWS CloudWatch Alarms | Cloud-native | AWS resource metrics and logs |
| Sentry | Error tracking | Application exceptions, breadcrumbs, releases |

| Signal | Question | Metric Example |
|---|---|---|
| Latency | Is the service fast enough? | p50, p95, p99 response time |
| Traffic | How much load? | Requests/sec, concurrent connections |
| Errors | Are things failing? | Error rate %, HTTP 5xx count |
| Saturation | Are we near capacity? | CPU %, memory %, queue depth |

| Category | Tool | What It Scans |
|---|---|---|
| Container Image | Trivy, Snyk, Grype | CVEs in OS packages and dependencies |
| SAST | SonarQube, CodeQL, Semgrep | Static code analysis for vulnerabilities |
| DAST | OWASP ZAP, Burp Suite | Running application vulnerability scanning |
| Dependency | npm audit, Snyk, Dependabot | Known CVEs in npm/pip/maven packages |
| Secret Detection | gitleaks, trufflehog | Hardcoded secrets in code/repo history |
| IaC Scanning | Checkov, tfsec, Kics | Misconfigurations in Terraform/CloudFormation |
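
Image scanning fits naturally as a CI gate. A hypothetical GitHub Actions step using the Trivy action (image name and version pin are assumptions):

```yaml
# Sketch: fail the pipeline on critical/high CVEs in the built image
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@0.24.0
  with:
    image-ref: ghcr.io/org/app:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: '1'          # non-zero exit fails the job on findings
```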

| Practice | How |
|---|---|
| TLS Everywhere | Let's Encrypt, managed certs, no HTTP |
| VPC / Private Subnets | No public DB endpoints, private subnets for app tiers |
| Security Groups | Least-privilege inbound/outbound rules |
| WAF | AWS WAF, Cloudflare WAF, OWASP rule sets |
| RBAC / IAM | Principle of least privilege, no admin access for apps |
| Service Accounts | Workload identity, IRSA, managed identity |

| Pattern | Description | Complexity |
|---|---|---|
| Active-Active | All regions serve traffic simultaneously | High (data sync, split brain) |
| Active-Passive | One active, standby takes over on failure | Medium (DNS failover) |
| Pilot Light | Minimal resources in DR region, scale on failover | Low (slower RTO than warm standby) |
| Warm Standby | Scaled-down replica in DR, scale up on failover | Medium |

| Term | Definition |
|---|---|
| RPO | Recovery Point Objective — max acceptable data loss time |
| RTO | Recovery Time Objective — max acceptable downtime |
| DNS Failover | Route 53 health checks, Cloudflare failover routing |
| Global LB | AWS Global Accelerator, Cloudflare, GCP Global LB |
| Anycast | Single IP announced from multiple POPs worldwide |

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: web-app-canary
      stableService: web-app-stable
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: web-app-canary
```

| Platform | Hosting | Key Features |
|---|---|---|
| LaunchDarkly | SaaS | Enterprise-grade, targeting, experimentation, gradual rollout |
| Unleash | Self-hosted / SaaS | Open-source, SDKs for all languages, gradual rollout |
| Flipt | Self-hosted | Open-source, Git-backed, declarative config |
| Statsig | SaaS | Feature flags + experimentation + A/B testing |
| PostHog | SaaS / Self-hosted | Feature flags + product analytics |
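
Flipt's Git-backed approach means flags live in the repo as declarative config. A sketch of the shape (flag names hypothetical; consult the Flipt docs for the full schema):

```yaml
# Sketch: Flipt-style declarative flag definition stored in Git
namespace: default
flags:
  - key: new-checkout        # hypothetical flag
    name: New Checkout
    description: Gradual rollout of the rewritten checkout flow
    enabled: false           # off by default; ramp up via rollout rules
```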

| Tool | What It Does | Integrates With |
|---|---|---|
| Argo Rollouts | Canary/blue-green for K8s | Istio, Nginx, ALB, SMI |
| Flagger | Canary + A/B + blue-green | Istio, Linkerd, App Mesh, Nginx |
| Traefik Mesh | Service mesh with traffic mgmt | K8s native, mTLS, traffic splitting |

| # | Check | Owner |
|---|---|---|
| 1 | All tests pass (unit, integration, E2E) | CI/CD |
| 2 | Security scan clean (no critical CVEs) | Security |
| 3 | Database migration tested on staging | DBA / Backend |
| 4 | Backward-compatible API changes verified | Backend |
| 5 | Environment variables configured | DevOps |
| 6 | Rollback procedure documented and tested | DevOps |
| 7 | Monitoring dashboards and alerts in place | SRE |
| 8 | Change request approved (if required) | Manager |

| # | Verification | Method |
|---|---|---|
| 1 | Health endpoint returns 200 | curl / Synthetic monitor |
| 2 | Error rate is within normal range | Grafana / Datadog |
| 3 | Latency p99 is acceptable | APM / traces |
| 4 | Key user flows work (smoke tests) | Automated E2E |
| 5 | Database queries performing well | DB monitoring |
| 6 | No new error spikes in logs | Log aggregation |
| 7 | Feature flags enabled as planned | Feature flag dashboard |
| 8 | Communicate deploy to stakeholders | Slack / Status page |

| # | Step |
|---|---|
| 1 | Identify the previous stable version / image tag |
| 2 | Run rollback command (kubectl rollout undo, etc.) |
| 3 | Verify old version is running (health checks) |
| 4 | Run database migration rollback (if needed) |
| 5 | Check error rates and latency return to normal |
| 6 | Communicate rollback and RCA timeline |
| 7 | Create incident ticket and schedule post-mortem |

| # | Rule |
|---|---|
| 1 | Always write a down migration alongside every up migration |
| 2 | Add columns first (deploy), remove columns second (next release) |
| 3 | Never rename columns — add new + deprecate old |
| 4 | Test migrations on a production data copy |
| 5 | Lock tables only if necessary and keep it brief |
| 6 | Use a migration lock to prevent concurrent execution |
| 7 | Verify data integrity after migration with automated checks |
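
Rules 1-3 in practice, as a Liquibase-style YAML changeset: the expand step adds a nullable/defaulted column with a paired rollback, and the old column (if any) is only dropped in a later release (table and column names are hypothetical):

```yaml
databaseChangeLog:
  - changeSet:
      id: add-email-verified        # expand step: safe with old and new app code
      author: deploy-bot
      changes:
        - addColumn:
            tableName: users
            columns:
              - column:
                  name: email_verified
                  type: boolean
                  defaultValueBoolean: false
      rollback:                     # explicit down migration (rule 1)
        - dropColumn:
            tableName: users
            columnName: email_verified
```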

| Criteria | Docker | Kubernetes | Serverless | PaaS | Bare Metal |
|---|---|---|---|---|---|
| Setup Complexity | Low | High | Low | Very Low | Medium |
| Ops Overhead | Medium | High | Very Low | Low | High |
| Cost (small) | Low | Medium | Free tier | Low | Fixed |
| Cost (large scale) | Medium | Optimized | Can spike | Premium | Most cost-effective |
| Scalability | Manual | Auto (HPA/CA) | Instant auto | Auto (limited) | Manual |
| Latency | Low | Low | Cold start risk | Low | Lowest |
| Control | High | Full | Very Limited | Limited | Full |
| Vendor Lock-in | Low | Low | High | Medium | None |
| Best Team Size | 1-5 | 5+ | 1-3 | 1-5 | 3+ (with ops) |
| Good For | Simple services | Microservices | APIs, events | Startups, MVPs | Regulated, legacy |

| Company | Approach | Why |
|---|---|---|
| Stripe | Kubernetes + Edge | Microservices at scale, global payments |
| Vercel | Edge Functions | Frontend SSR at the edge, globally |
| Linear | Serverless + Edge | Real-time app, minimal ops team |
| GitLab | Kubernetes (GKE) | Self-hosted by customers, complex workloads |
| Basecamp | Bare Metal | Full control, predictable costs, 20+ years |
| Notion | Kubernetes | Complex collaboration, high availability |