Observability fundamentals: metrics, logs, traces, alerts, SLOs and incident response basics.
# ── Prometheus Configuration ──
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

rule_files:
  - 'alerts/*.yml'       # Alerting rules
  - 'recording/*.yml'    # Recording rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # ── Prometheus self-monitoring ──
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # ── Kubernetes Service Discovery ──
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # ── Kubernetes Pods ──
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
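For the annotation-driven kubernetes-pods job above to scrape a workload, the pod must opt in; a sketch of the pod-template metadata (the prometheus.io/* keys are the conventional annotations this relabel config reads):

```yaml
# Pod template metadata (illustrative deployment fragment)
metadata:
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule
    prometheus.io/port: "9102"     # becomes the scrape port via relabeling
```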
  # ── Node Exporter (DaemonSet) ──
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: node-exporter

# ── PromQL: Essential Queries ──
# ── Rate & Counter ──
# Request rate (requests per second)
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average request duration
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
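histogram_quantile() works on cumulative le buckets: it locates the bucket containing the requested rank, then interpolates linearly inside it. A simplified sketch of that estimation (ignoring the +Inf and edge-case handling Prometheus applies):

```javascript
// Estimate quantile q from cumulative histogram buckets.
// `buckets` is [[upperBound, cumulativeCount], ...] sorted by upperBound.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1][1];
  const rank = q * total; // position of the target observation
  let lower = 0, prevCount = 0;
  for (const [upper, count] of buckets) {
    if (count >= rank) {
      // Interpolate the rank's position between the bucket's bounds.
      return lower + (upper - lower) * ((rank - prevCount) / (count - prevCount));
    }
    lower = upper;
    prevCount = count;
  }
  return lower;
}
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.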
# ── Availability ──
# Uptime percentage
sum(up) / count(up) * 100
# Service availability (non-5xx responses)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# ── Resource Utilization ──
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage by pod
sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / (1024 * 1024 * 1024)
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100)
/ node_filesystem_size_bytes{mountpoint="/"})
# Network I/O
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# ── K8s Pod Status ──
# Pods not in Running phase
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, pod)
# Pods restarting frequently
rate(kube_pod_container_status_restarts_total[1h]) > 0.05
# Pending pods
kube_pod_status_phase{phase="Pending"}
# ── RED Method (Rate, Errors, Duration) ──
# Rate: requests/sec per service
sum(rate(http_requests_total{service="myapp"}[5m]))
# Errors: error rate
sum(rate(http_requests_total{service="myapp",code=~"5.."}[5m]))
# Duration: p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="myapp"}[5m])) by (le))
# ── USE Method (Utilization, Saturation, Errors) ──
# Utilization: CPU % per node
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
# Saturation: load average vs CPU count
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
# Errors: CPU throttling
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])

# ── PromQL Functions ──
| Function | Purpose | Example |
|---|---|---|
| rate() | Per-second rate of counter | rate(cpu_seconds[5m]) |
| increase() | Total increase over time | increase(errors[1h]) |
| histogram_quantile() | Percentile from histogram | histogram_quantile(0.99, ...) |
| sum() | Aggregate across labels | sum by (pod)(memory) |
| avg() | Average across series | avg(cpu_usage) |
| max() | Maximum value | max(memory_usage) |
| topk() | Top N series by value | topk(5, request_rate) |
| bottomk() | Bottom N series | bottomk(3, latency) |
| clamp_max() | Upper bound a value | clamp_max(cpu, 100) |
| predict_linear() | Predict future values | predict_linear(disk[1h], 4*3600) |
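Expensive expressions built from these functions are typically precomputed as recording rules (loaded via the recording/*.yml rule_files entry); a minimal sketch, with illustrative rule names:

```yaml
# recording/app.yml (sketch; the service:...:rate5m names are illustrative)
groups:
  - name: app-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
```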
Always use rate() with counters: counters only increase, and rate() normalizes them to per-second values while handling counter resets. For raw counts over a time window (in recording rules, for example), prefer increase().

# ── Grafana Dashboard JSON ──
{
  "dashboard": {
    "title": "Application Overview",
    "uid": "app-overview",
    "tags": ["production", "kubernetes"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"myapp\"}[5m])) by (method, path)",
          "legendFormat": "{{method}} {{path}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "color": { "mode": "palette-classic" }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"myapp\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"myapp\"}[5m])) * 100"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            },
            "noValue": "0%"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "datasource",
          "type": "datasource",
          "query": "prometheus"
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": { "type": "prometheus" },
          "refresh": 1,
          "query": "label_values(kube_pod_info, namespace)"
        },
        {
          "name": "service",
          "type": "query",
          "datasource": { "type": "prometheus" },
          "refresh": 1,
          "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
          "allValue": ".*"
        }
      ]
    }
  }
}

# ── Grafana Panel Types ──
| Panel | Best For | Key Options |
|---|---|---|
| Time Series | Metrics over time, multi-series | Axes, legends, tooltips |
| Stat | Single value, big number | Thresholds, sparkline, color |
| Gauge | Value within a range | Min/max, thresholds, arc |
| Bar Chart | Categorical comparison | Horizontal/vertical, stacking |
| Table | Structured data, top N's | Sorting, pagination, coloring |
| Heatmap | Distribution over time | Color scales, x/y axes |
| Logs | Log exploration | Filter, level, live tail |
| Trace | Distributed traces | Span details, waterfall |
| Pie Chart | Proportional breakdown | Donut mode, legends |

# ── Dashboard Template Variables ──
| Variable | Type | Description |
|---|---|---|
| $namespace | Query | Filter by Kubernetes namespace |
| $service | Query | Filter by service name |
| $pod | Query | Filter by pod (depends on $service) |
| $interval | Interval | Auto step: 1m/5m/15m based on range |
| $datasource | Datasource | Switch Prometheus server |
| $__rate_interval | Built-in | Auto-calculated rate window |
| $__range | Built-in | Current time range in seconds |
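Wired together, a panel query interpolates these variables; a sketch (the metric and label names are illustrative), using $__rate_interval so the rate window tracks the dashboard's resolution:

```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[$__rate_interval])) by (pod)
```

Note the regex matcher (=~) on $service: it is what makes an "All" value of ".*" work.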
# ── Grafana Alerting (Unified) ──
apiVersion: 1
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value }}% of requests are returning 5xx"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
        condition: |
          100 * (
            sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 5

      - alert: HighLatency
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "P99 latency > 2s on {{ $labels.service }}"
        condition: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service=~"$service"}[5m])) by (le, service)
          ) > 2

      - alert: PodCrashLooping
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
        condition: |
          rate(kube_pod_container_status_restarts_total{namespace="$namespace"}[15m]) > 0.1

# ── Contact Points (Notification Channels) ──
# ── Notification Policies (Route alerts by label) ──
# severity=critical → PagerDuty + Slack #incidents
# severity=warning  → Slack #alerts + email
# team=platform     → Slack #platform-alerts

Set "allValue" on template variables to add an "All" option, letting users view aggregated data across all services or drill into a specific one. Chain variables ($namespace → $service → $pod) for hierarchical filtering.

# ── Filebeat Configuration ──
filebeat.inputs:
  # ── Application Logs ──
  - type: container
    paths:
      - /var/log/containers/myapp-*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: "app"
          overwrite_keys: true
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

  # ── Nginx Access Logs ──
  - type: log
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx_access
    fields_under_root: true

  # ── System Logs ──
  - type: log
    paths:
      - /var/log/syslog
      - /var/log/auth.log
    fields:
      log_type: system
    fields_under_root: true

# ── Elasticsearch Output ──
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: elastic
  password: ${ELASTIC_PASSWORD}
  ssl.certificate_authorities: ["/usr/share/filebeat/certs/ca.crt"]

# ── Logstash Pipeline (alternative) ──
output.logstash:
  hosts: ["logstash:5044"]

# ── Kafka Buffer (for high-volume) ──
output.kafka:
  hosts: ["kafka:9092"]
  topic: "logs-raw"
  required_acks: 1

# ── Kibana index pattern setup ──
setup.kibana:
  host: "https://kibana:5601"
  username: elastic
  password: ${ELASTIC_PASSWORD}

# ── ILM (Index Lifecycle Management) ──
setup.ilm.enabled: true
setup.ilm.rollover_alias: "app-logs"
setup.ilm.pattern: "{now/d}-000001"

# ── Logstash Pipeline Configuration ──
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["logs-raw"]
    consumer_threads => 4
  }
}

filter {
  # Parse JSON log lines
  if [message] =~ /^{.*}$/ {
    json {
      source => "message"
      target => "parsed"
    }
  }
  # Parse timestamp
  date {
    match => ["timestamp", "ISO8601", "UNIX"]
    target => "@timestamp"
  }
  # Add GeoIP for IP addresses
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  # Grok pattern for unstructured logs (must run before "message" is renamed)
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    overwrite => ["message"]
  }
  # Remove sensitive fields
  mutate {
    remove_field => ["password", "credit_card", "token"]
    rename => { "message" => "original_message" }
  }
  # Add tags for filtering
  if [status] and [status] =~ /^5/ {
    mutate { add_tag => ["error"] }
  }
  if [level] == "ERROR" {
    mutate { add_tag => ["app_error"] }
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
  }
}

# ── Kibana Query Language (KQL) ──
# Simple search
message: "error" and level: "ERROR"
# Field match + numeric comparison
service: "myapp" and status >= 500
# Wildcard
path: "/api/users/*"
# Boolean
(level: "ERROR" or level: "WARN") and service: "auth-service"
# Negation
not level: "DEBUG" and service: "payment"
# Range
response_time > 1000 and status: 200
# Nested
request.user.id: "alice" and response.status: 500
# ── Elasticsearch Query (for Logstash/curator) ──
# Get error logs from last hour
GET app-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "by_service": {
      "terms": { "field": "service.keyword", "size": 10 }
    },
    "errors_over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" }
    }
  },
  "size": 0
}

# ── ELK Stack Components ──
| Component | Role | Port |
|---|---|---|
| Elasticsearch | Search & analytics engine | 9200 (HTTP), 9300 (transport) |
| Logstash | Log processing pipeline | 5044 (Beats input) |
| Kibana | Visualization & dashboard | 5601 (HTTP) |
| Filebeat | Log shipper (lightweight) | 5066 |
| Metricbeat | Metrics shipper | 5140 |
| Heartbeat | Uptime monitoring | 5220 |
| Curator | Index lifecycle management | CLI tool |

# ── Log Aggregation Alternatives ──
| Tool | Type | Best For |
|---|---|---|
| Grafana Loki | Log aggregation | K8s-native, labels not full-text index |
| Fluentd/Fluent Bit | Log shipper | CNCF, lightweight, flexible routing |
| Splunk | Enterprise SIEM/Log | Advanced analytics, compliance |
| Datadog Logs | Managed | All-in-one monitoring platform |
| AWS CloudWatch Logs | Managed | AWS ecosystem integration |
| OpenSearch | Open-source ELK | Elasticsearch fork, Apache 2.0 license |
# ── Alertmanager Configuration ──
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alert-bot@example.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

# ── Templates ──
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# ── Routing Tree ──
route:
  # Default receiver
  receiver: 'default-slack'
  # Grouping
  group_by: ['alertname', 'service', 'cluster']
  group_wait: 30s        # Wait before sending first notification
  group_interval: 5m     # Wait before sending next group notification
  repeat_interval: 4h    # Re-notify if alert is still firing
  # Child routes (evaluated top-down)
  routes:
    # Critical alerts → PagerDuty + on-call Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: 'db-oncall'
    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 2m
      repeat_interval: 4h
    # Infrastructure alerts → infra team
    - match:
        category: infra
      receiver: 'infra-team'
      routes:
        - match:
            team: kubernetes
          receiver: 'k8s-oncall'

# ── Inhibition Rules ──
inhibit_rules:
  # If a critical alert is firing, suppress warnings for the same service
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service']
  # If a service is down, suppress its individual alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '.*'
    equal: ['service']

# ── Receivers (Notification Channels) ──
receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Severity:* {{ .Labels.severity }}
          *Service:* {{ .Labels.service }}
          *Details:* {{ .Annotations.description }}
          {{ end }}
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'
        client_url: 'https://grafana.example.com/alerts'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        send_resolved: true
  - name: 'infra-team'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#infra-alerts'
  - name: 'db-oncall'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_DB_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }} - DB Team'
  - name: 'k8s-oncall'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#k8s-alerts'

# ── Prometheus Alert Rules ──
groups:
  - name: application.alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          category: application
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of requests are failing (threshold: 5%)"
          runbook_url: "https://wiki/runbooks/high-error-rate"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "P99 latency exceeds 2s on {{ $labels.service }}"
          description: "Current P99: {{ $value }}s"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        labels:
          severity: critical
          category: infra
          team: kubernetes
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "Restart rate: {{ $value | humanize }}/s"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) < 0.15
        for: 5m
        labels:
          severity: warning
          category: infra
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "{{ $value | humanizePercentage }} remaining"

      - alert: ReplicasUnavailable
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
          category: infra
          team: kubernetes
        annotations:
          summary: "Deployment {{ $labels.deployment }} has unavailable replicas"

# ── Alert Lifecycle ──
| Stage | Component | Action |
|---|---|---|
| 1. Rule evaluation | Prometheus | Check expr every evaluation_interval |
| 2. State: Pending | Prometheus | Alert condition met, waiting for "for:" duration |
| 3. State: Firing | Prometheus | Alert active, sent to Alertmanager |
| 4. Routing | Alertmanager | Match rules tree, group, inhibit, dedupe |
| 5. Notification | Alertmanager | Send to receiver (Slack, PagerDuty, email) |
| 6. Resolved | Alertmanager | Condition cleared, send resolved notification |
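The Pending to Firing transition above can be sketched as a small state machine: the alert fires only once its expression has been continuously true for the full "for:" duration. A sketch, with evaluations simplified to a fixed interval:

```javascript
// Simulate Prometheus alert states from a series of boolean evaluations.
// samples[i] is whether the alert expression was true at evaluation i.
function alertStates(samples, evalIntervalSec, forSec) {
  const states = [];
  let trueSince = null; // timestamp when the expression first became true
  for (let i = 0; i < samples.length; i++) {
    const t = i * evalIntervalSec;
    if (!samples[i]) {
      trueSince = null; // any false evaluation resets the pending timer
      states.push('inactive');
    } else {
      if (trueSince === null) trueSince = t;
      states.push(t - trueSince >= forSec ? 'firing' : 'pending');
    }
  }
  return states;
}
```

This is why a flapping condition never pages: each dip back to false restarts the "for:" clock.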
// ── OpenTelemetry Setup (Node.js) ──
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'my-api-service',
    [ATTR_SERVICE_VERSION]: process.env.npm_package_version,
    'deployment.environment': process.env.NODE_ENV, // resource attributes are flat keys
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),
  ],
});

sdk.start();
// ── Manual Span Creation ──
import { trace, SpanStatusCode } from '@opentelemetry/api';

async function processOrder(orderId: string) {
  const tracer = trace.getTracer('order-processor');
  // Create a span
  return tracer.startActiveSpan('processOrder', {
    attributes: { 'order.id': orderId },
  }, async (span) => {
    try {
      // Add events as the work progresses
      await validateOrder(orderId);
      span.addEvent('order-validated', { orderId });
      await processPayment(orderId);
      span.addEvent('payment-processed');
      await updateInventory(orderId);
      span.addEvent('inventory-updated');
      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true };
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

# ── OpenTelemetry Collector Configuration ──
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Add k8s metadata to spans
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
  # Redact sensitive data
  attributes:
    actions:
      - key: authorization
        action: delete
      - key: password
        action: delete

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  otlphttp:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, k8sattributes, attributes, batch]
      exporters: [jaeger, otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

# ── Tracing Concepts ──
| Concept | Description |
|---|---|
| Trace | End-to-end journey of a request through all services |
| Span | Single unit of work within a trace |
| Context | Propagated between services (trace ID + span ID) |
| SpanContext | Identifies span (trace ID, span ID, flags) |
| Baggage | Key-value pairs propagated through entire trace |
| Sampler | Determines which traces to collect (head/tail-based) |
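Context propagation between services is, concretely, the W3C traceparent header (version-traceid-spanid-flags); a minimal parser sketch:

```javascript
// Parse a W3C trace-context "traceparent" header:
//   version (2 hex) - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // flag bit 0 = sampled
  };
}
```

Instrumentation libraries do this for you; the sketch only shows what travels on the wire between spans.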

# ── Distributed Tracing Tools ──
| Tool | Type | Best For |
|---|---|---|
| Jaeger | Open-source, CNCF | Feature-rich, K8s-native, Tempo-compatible |
| Zipkin | Open-source | Simple, widely adopted, Apache |
| Grafana Tempo | Open-source | Object storage backend, Grafana integration |
| AWS X-Ray | Managed | AWS ecosystem, service map |
| Datadog APM | Managed | Full-stack observability platform |
| Honeycomb | Managed | Burn rate alerts, BubbleUp analysis |

# ── RED Method Signals ──
| Signal | What | PromQL |
|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Failed requests rate | rate(http_requests_total{code=~"5.."}[5m]) |
| Duration | Latency distribution | histogram_quantile(0.99, rate(duration_bucket[5m])) |

# ── USE Method Signals ──
| Signal | What | PromQL |
|---|---|---|
| Utilization | Resource usage % | cpu_usage / cpu_capacity * 100 |
| Saturation | Queue depth, load | node_load1 / cpu_count |
| Errors | Hardware/software errors | rate(node_hw_errors[5m]) |

# ── Four Golden Signals ──
| Signal | Question | Example Metric |
|---|---|---|
| Latency | How long do requests take? | http_request_duration_seconds |
| Traffic | How much demand is there? | http_requests_total per second |
| Errors | How many requests fail? | 5xx response rate |
| Saturation | How full is the system? | CPU, memory, queue depth |

# ── Prometheus Metric Types ──
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | http_requests_total, bytes_sent |
| Gauge | Can go up and down | temperature, active_connections |
| Histogram | Distribution + buckets | request_duration_seconds_bucket |
| Summary | Quantiles (client-side) | request_duration_seconds |
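A counter's only odd behavior is the reset to zero on process restart; rate() and increase() compensate by treating any drop as a restart. A sketch of that compensation (ignoring the window extrapolation rate() also applies):

```javascript
// Total increase of a counter series, tolerating resets: a drop means the
// process restarted, so the post-reset value counts as fresh increase.
function counterIncrease(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    total += delta >= 0 ? delta : samples[i]; // reset: count value since restart
  }
  return total;
}
```

This is why subtracting raw counter values across a restart undercounts, while rate()/increase() do not.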
Prefer histograms over summaries: histogram buckets can be aggregated across instances with sum() and rate(), but summary quantiles cannot. Histograms enable accurate global percentiles.

// ── Instrumenting a Node.js App with prom-client ──
import promClient from 'prom-client';

// Create a Registry (for multi-app / default)
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register, prefix: 'myapp_' });

// ── Counter ──
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status_code'],
  registers: [register],
});

// ── Histogram ──
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status_code'],
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// ── Gauge ──
const activeConnections = new promClient.Gauge({
  name: 'active_db_connections',
  help: 'Number of active database connections',
  labelNames: ['pool'],
  registers: [register],
});

// ── Express Middleware ──
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestsTotal.labels(req.method, route, res.statusCode).inc();
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
  });
  next();
}

// ── Expose /metrics endpoint ──
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// ── Business Metrics ──
const ordersTotal = new promClient.Counter({
  name: 'orders_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [register],
});
const orderAmount = new promClient.Histogram({
  name: 'order_amount_dollars',
  help: 'Order amount distribution',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
  registers: [register],
});

# ── Essential Dashboards ──
| Dashboard | Audience | Key Panels |
|---|---|---|
| Service Overview | Engineering | Rate, latency, errors, saturation per service |
| Infrastructure | SRE / Platform | CPU, memory, disk, network per node |
| Kubernetes | SRE / Platform | Pods, deployments, resource quotas, HPA |
| Business Metrics | Product / Exec | Orders, revenue, DAU, conversion rate |
| CI/CD Pipeline | DevOps | Build duration, success rate, deploy frequency |
| Cost Overview | FinOps | Per-service spend, forecast, budget vs actual |
| SLA / SLO | SRE / Management | Error budget, availability %, SLO burn rate |
| On-call | SRE on-call | Active alerts, MTTR, alert frequency |
# ── SLO / Error Budget Dashboard Queries ──
# ── SLO: 99.9% Availability ──
# Error budget = 0.1% (0.001)
# Target: less than 0.1% of requests fail
# Current error rate (30-day window)
sum(rate(http_requests_total{code=~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
# ── 30-Day SLO Compliance ──
# (1 - error_rate) * 100
(1 - (
sum(increase(http_requests_total{code=~"5.."}[30d]))
/ sum(increase(http_requests_total[30d]))
)) * 100
# ── Error Budget Remaining ──
# If SLO is 99.9%, budget = total * 0.001
(
(sum(increase(http_requests_total[30d])) * 0.001)
- sum(increase(http_requests_total{code=~"5.."}[30d]))
) / (sum(increase(http_requests_total[30d])) * 0.001) * 100
# ── Error Budget Burn Rate ──
# How fast are we burning the error budget?
# burn_rate > 1 = burning faster than allowed
(
sum(rate(http_requests_total{code=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) / 0.001
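The budget and burn-rate queries above are plain arithmetic; a sketch in code, assuming a 30-day request total and a 99.9% target:

```javascript
// Error-budget arithmetic for an availability SLO (e.g. 0.999 = 99.9%).
function errorBudget(totalRequests, failedRequests, sloTarget) {
  const budgetFraction = 1 - sloTarget;                 // 0.001 for 99.9%
  const budgetRequests = totalRequests * budgetFraction; // failures you may "spend"
  return {
    errorRate: failedRequests / totalRequests,
    budgetRequests,
    remainingPct: ((budgetRequests - failedRequests) / budgetRequests) * 100,
  };
}

// Burn rate: observed error rate over the budget fraction.
// > 1 means the budget is being consumed faster than allowed.
function burnRate(errorRate, sloTarget) {
  return errorRate / (1 - sloTarget);
}
```

At burn rate 1 the budget lasts exactly the SLO window; at burn rate 4 a 30-day budget is gone in about a week.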
# ── Time to Error Budget Exhaustion ──
# At the current burn rate, when will the budget be exhausted?
# (remaining_budget / burn_rate) in hours, typically shown as a prediction graph

# ── Incident Response Process ──
# Phase 1: Detection
# - Alert fires (PagerDuty/Slack/Email)
# - On-call acknowledges the alert
# - Determine severity (SEV1-SEV4)
# Phase 2: Triage
# - Assess blast radius (how many users affected?)
# - Check dashboards for correlated alerts
# - Look at recent deployments (did this cause it?)
# - Communicate status to stakeholders
# Phase 3: Mitigation
# - Follow the runbook (if available)
# - Identify root cause (logs, traces, metrics)
# - Apply fix (rollback, scale, restart, config change)
# - Verify fix is working (monitor for 15-30 min)
# Phase 4: Resolution
# - Confirm service is healthy
# - Send resolution notification
# - Stand down responders
# Phase 5: Post-Mortem (within 48 hours)
# - Blameless post-mortem meeting
# - Document timeline, root cause, action items
# - Create follow-up tickets for improvements
# - Update runbook if new knowledge gained

# ── Incident Severity Levels ──
| Level | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, all users | < 5 min | Site down, data loss risk |
| SEV2 | Major feature down, many users | < 15 min | Payment broken, 50% errors |
| SEV3 | Minor feature degraded | < 1 hour | Slow API, degraded UX |
| SEV4 | Non-critical, low impact | < 24 hours | Monitoring gap, typo in UI |
# ── Runbook Template: High Error Rate ──
# Alert: HighErrorRate (service error rate > 5%)
## 1. Quick Diagnosis
```bash
# Check which endpoints are failing
curl -s 'http://prometheus:9090/api/v1/query?query=topk(10,sum(rate(http_requests_total{code=~"5..",service="myapp"}[5m]))by(path))' | jq
# Check recent deployments
kubectl rollout history deployment/myapp -n production
kubectl get events -n production --sort-by=.lastTimestamp | tail -20
```
## 2. Common Causes
| Cause | How to Check | Fix |
|-------|-------------|-----|
| Bad deployment | Rollout history | `kubectl rollout undo` |
| DB connection pool | Check DB metrics | Scale DB or increase pool |
| Dependency down | Check upstream health | Enable circuit breaker |
| OOM kills | Check pod events | Increase memory limit |
| Config error | Check ConfigMap/Env | Revert config change |
## 3. Emergency Actions
```bash
# Rollback deployment (fastest fix)
kubectl rollout undo deployment/myapp -n production
# Scale up (if resource-constrained)
kubectl scale deployment/myapp --replicas=10 -n production
# Restart all pods
kubectl rollout restart deployment/myapp -n production
# Check pod logs for errors
kubectl logs -l app=myapp -n production --tail=100 --since=10m
```
## 4. Escalation
- L2: #platform-oncall (Slack)
- L3: @sre-lead (PagerDuty)
## 5. Post-Incident
- Update this runbook with new findings
- Create ticket for root cause fix
- Add monitoring for the root cause

# ── Incident Management Tools ──
| Tool | Purpose |
|---|---|
| PagerDuty | Incident management, on-call schedules, escalation |
| Opsgenie | Atlassian incident management (Jira integration) |
| VictorOps (Splunk On-Call) | Splunk incident management |
| Incident.io | Modern incident management with status pages |
| Slack | Alert channels, war rooms, updates |
| FireHydrant | Incident lifecycle management |
| GoToWebinar / Zoom | Bridge calls for SEV1 incidents |