Observability fundamentals: metrics, logs, traces, alerts, SLOs and incident response basics.
# ── Prometheus Configuration ──
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

rule_files:
  - 'alerts/*.yml'       # Alerting rules
  - 'recording/*.yml'    # Recording rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # ── Prometheus self-monitoring ──
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # ── Kubernetes Service Discovery ──
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # ── Kubernetes Pods ──
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
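For the annotation-driven kubernetes-pods job above to scrape a workload, the pod must opt in; a sketch of the pod-template metadata (the prometheus.io/* keys are the conventional annotations this relabel config reads):

```yaml
# Pod template metadata (illustrative deployment fragment)
metadata:
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule
    prometheus.io/port: "9102"     # becomes the scrape port via relabeling
```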
  # ── Node Exporter (DaemonSet) ──
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: node-exporter

# ── PromQL: Essential Queries ──
# ── Rate & Counter ──
# Request rate (requests per second)
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average request duration
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])
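histogram_quantile() works on cumulative le buckets: it locates the bucket containing the requested rank, then interpolates linearly inside it. A simplified sketch of that estimation (ignoring the +Inf and edge-case handling Prometheus applies):

```javascript
// Estimate quantile q from cumulative histogram buckets.
// `buckets` is [[upperBound, cumulativeCount], ...] sorted by upperBound.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1][1];
  const rank = q * total; // position of the target observation
  let lower = 0, prevCount = 0;
  for (const [upper, count] of buckets) {
    if (count >= rank) {
      // Interpolate the rank's position between the bucket's bounds.
      return lower + (upper - lower) * ((rank - prevCount) / (count - prevCount));
    }
    lower = upper;
    prevCount = count;
  }
  return lower;
}
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.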
# ── Availability ──
# Uptime percentage
sum(up) / count(up) * 100
# Service availability (non-5xx responses)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# ── Resource Utilization ──
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage by pod
sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / (1024 * 1024 * 1024)
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100)
/ node_filesystem_size_bytes{mountpoint="/"})
# Network I/O
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# ── K8s Pod Status ──
# Pods not in Running phase
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, pod)
# Pods restarting frequently
rate(kube_pod_container_status_restarts_total[1h]) > 0.05
# Pending pods
kube_pod_status_phase{phase="Pending"}
# ── RED Method (Rate, Errors, Duration) ──
# Rate: requests/sec per service
sum(rate(http_requests_total{service="myapp"}[5m]))
# Errors: error rate
sum(rate(http_requests_total{service="myapp",code=~"5.."}[5m]))
# Duration: p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="myapp"}[5m])) by (le))
# ── USE Method (Utilization, Saturation, Errors) ──
# Utilization: CPU % per node
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
# Saturation: load average vs CPU count
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
# Errors: CPU throttling
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])

# ── PromQL Functions ──
| Function | Purpose | Example |
|---|---|---|
| rate() | Per-second rate of counter | rate(cpu_seconds[5m]) |
| increase() | Total increase over time | increase(errors[1h]) |
| histogram_quantile() | Percentile from histogram | histogram_quantile(0.99, ...) |
| sum() | Aggregate across labels | sum by (pod)(memory) |
| avg() | Average across series | avg(cpu_usage) |
| max() | Maximum value | max(memory_usage) |
| topk() | Top N series by value | topk(5, request_rate) |
| bottomk() | Bottom N series | bottomk(3, latency) |
| clamp_max() | Upper bound a value | clamp_max(cpu, 100) |
| predict_linear() | Predict future values | predict_linear(disk[1h], 4*3600) |
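Expensive expressions built from these functions are typically precomputed as recording rules (loaded via the recording/*.yml rule_files entry); a minimal sketch, with illustrative rule names:

```yaml
# recording/app.yml (sketch; the service:...:rate5m names are illustrative)
groups:
  - name: app-recording-rules
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
```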
Always use rate() with counters: counters only increase, and rate() normalizes them to per-second values while handling counter resets. For raw counts over a time window (in recording rules, for example), prefer increase().

# ── Grafana Dashboard JSON ──
{
  "dashboard": {
    "title": "Application Overview",
    "uid": "app-overview",
    "tags": ["production", "kubernetes"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"myapp\"}[5m])) by (method, path)",
          "legendFormat": "{{method}} {{path}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "color": { "mode": "palette-classic" }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"myapp\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"myapp\"}[5m])) * 100"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            },
            "noValue": "0%"
          }
        }
      }
    ],
    "templating": {
      "list": [
        {
          "name": "datasource",
          "type": "datasource",
          "query": "prometheus"
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": { "type": "prometheus" },
          "refresh": 1,
          "query": "label_values(kube_pod_info, namespace)"
        },
        {
          "name": "service",
          "type": "query",
          "datasource": { "type": "prometheus" },
          "refresh": 1,
          "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
          "allValue": ".*"
        }
      ]
    }
  }
}

# ── Grafana Panel Types ──
| Panel | Best For | Key Options |
|---|---|---|
| Time Series | Metrics over time, multi-series | Axes, legends, tooltips |
| Stat | Single value, big number | Thresholds, sparkline, color |
| Gauge | Value within a range | Min/max, thresholds, arc |
| Bar Chart | Categorical comparison | Horizontal/vertical, stacking |
| Table | Structured data, top N's | Sorting, pagination, coloring |
| Heatmap | Distribution over time | Color scales, x/y axes |
| Logs | Log exploration | Filter, level, live tail |
| Trace | Distributed traces | Span details, waterfall |
| Pie Chart | Proportional breakdown | Donut mode, legends |

# ── Dashboard Template Variables ──
| Variable | Type | Description |
|---|---|---|
| $namespace | Query | Filter by Kubernetes namespace |
| $service | Query | Filter by service name |
| $pod | Query | Filter by pod (depends on $service) |
| $interval | Interval | Auto step: 1m/5m/15m based on range |
| $datasource | Datasource | Switch Prometheus server |
| $__rate_interval | Built-in | Auto-calculated rate window |
| $__range | Built-in | Current time range in seconds |
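Wired together, a panel query interpolates these variables; a sketch (the metric and label names are illustrative), using $__rate_interval so the rate window tracks the dashboard's resolution:

```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[$__rate_interval])) by (pod)
```

Note the regex matcher (=~) on $service: it is what makes an "All" value of ".*" work.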
# ── Grafana Alerting (Unified) ──
apiVersion: 1
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value }}% of requests are returning 5xx"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
        condition: |
          100 * (
            sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 5

      - alert: HighLatency
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "P99 latency > 2s on {{ $labels.service }}"
        condition: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service=~"$service"}[5m])) by (le, service)
          ) > 2

      - alert: PodCrashLooping
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
        condition: |
          rate(kube_pod_container_status_restarts_total{namespace="$namespace"}[15m]) > 0.1

# ── Contact Points (Notification Channels) ──
# ── Notification Policies (Route alerts by label) ──
# severity=critical → PagerDuty + Slack #incidents
# severity=warning  → Slack #alerts + email
# team=platform     → Slack #platform-alerts

Set "allValue" on template variables to add an "All" option, letting users view aggregated data across all services or drill into a specific one. Chain variables ($namespace → $service → $pod) for hierarchical filtering.

# ── Filebeat Configuration ──
filebeat.inputs:
  # ── Application Logs ──
  - type: container
    paths:
      - /var/log/containers/myapp-*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: "app"
          overwrite_keys: true
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

  # ── Nginx Access Logs ──
  - type: log
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx_access
    fields_under_root: true

  # ── System Logs ──
  - type: log
    paths:
      - /var/log/syslog
      - /var/log/auth.log
    fields:
      log_type: system
    fields_under_root: true

# ── Elasticsearch Output ──
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: elastic
  password: ${ELASTIC_PASSWORD}
  ssl.certificate_authorities: ["/usr/share/filebeat/certs/ca.crt"]

# ── Logstash Pipeline (alternative) ──
output.logstash:
  hosts: ["logstash:5044"]

# ── Kafka Buffer (for high-volume) ──
output.kafka:
  hosts: ["kafka:9092"]
  topic: "logs-raw"
  required_acks: 1

# ── Kibana index pattern setup ──
setup.kibana:
  host: "https://kibana:5601"
  username: elastic
  password: ${ELASTIC_PASSWORD}

# ── ILM (Index Lifecycle Management) ──
setup.ilm.enabled: true
setup.ilm.rollover_alias: "app-logs"
setup.ilm.pattern: "{now/d}-000001"

# ── Logstash Pipeline Configuration ──
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["logs-raw"]
    consumer_threads => 4
  }
}

filter {
  # Parse JSON log lines
  if [message] =~ /^{.*}$/ {
    json {
      source => "message"
      target => "parsed"
    }
  }
  # Parse timestamp
  date {
    match => ["timestamp", "ISO8601", "UNIX"]
    target => "@timestamp"
  }
  # Add GeoIP for IP addresses
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  # Grok pattern for unstructured logs (must run before "message" is renamed)
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    overwrite => ["message"]
  }
  # Remove sensitive fields
  mutate {
    remove_field => ["password", "credit_card", "token"]
    rename => { "message" => "original_message" }
  }
  # Add tags for filtering
  if [status] and [status] =~ /^5/ {
    mutate { add_tag => ["error"] }
  }
  if [level] == "ERROR" {
    mutate { add_tag => ["app_error"] }
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
  }
}

# ── Kibana Query Language (KQL) ──
# Simple search
message: "error" and level: "ERROR"
# Field match + numeric comparison
service: "myapp" and status >= 500
# Wildcard
path: "/api/users/*"
# Boolean
(level: "ERROR" or level: "WARN") and service: "auth-service"
# Negation
not level: "DEBUG" and service: "payment"
# Range
response_time > 1000 and status: 200
# Nested
request.user.id: "alice" and response.status: 500
# ── Elasticsearch Query (for Logstash/curator) ──
# Get error logs from last hour
GET app-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "by_service": {
      "terms": { "field": "service.keyword", "size": 10 }
    },
    "errors_over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" }
    }
  },
  "size": 0
}

# ── ELK Stack Components ──
| Component | Role | Port |
|---|---|---|
| Elasticsearch | Search & analytics engine | 9200 (HTTP), 9300 (transport) |
| Logstash | Log processing pipeline | 5044 (Beats input) |
| Kibana | Visualization & dashboard | 5601 (HTTP) |
| Filebeat | Log shipper (lightweight) | 5066 |
| Metricbeat | Metrics shipper | 5140 |
| Heartbeat | Uptime monitoring | 5220 |
| Curator | Index lifecycle management | CLI tool |

# ── Log Aggregation Alternatives ──
| Tool | Type | Best For |
|---|---|---|
| Grafana Loki | Log aggregation | K8s-native, labels not full-text index |
| Fluentd/Fluent Bit | Log shipper | CNCF, lightweight, flexible routing |
| Splunk | Enterprise SIEM/Log | Advanced analytics, compliance |
| Datadog Logs | Managed | All-in-one monitoring platform |
| AWS CloudWatch Logs | Managed | AWS ecosystem integration |
| OpenSearch | Open-source ELK | Elasticsearch fork, Apache 2.0 license |
# ── Alertmanager Configuration ──
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alert-bot@example.com'
  smtp_auth_password: '${SMTP_PASSWORD}'

# ── Templates ──
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# ── Routing Tree ──
route:
  # Default receiver
  receiver: 'default-slack'
  # Grouping
  group_by: ['alertname', 'service', 'cluster']
  group_wait: 30s        # Wait before sending first notification
  group_interval: 5m     # Wait before sending next group notification
  repeat_interval: 4h    # Re-notify if alert is still firing
  # Child routes (evaluated top-down)
  routes:
    # Critical alerts → PagerDuty + on-call Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: 'db-oncall'
    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 2m
      repeat_interval: 4h
    # Infrastructure alerts → infra team
    - match:
        category: infra
      receiver: 'infra-team'
      routes:
        - match:
            team: kubernetes
          receiver: 'k8s-oncall'

# ── Inhibition Rules ──
inhibit_rules:
  # If a critical alert is firing, suppress warnings for the same service
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service']
  # If a service is down, suppress its individual alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '.*'
    equal: ['service']

# ── Receivers (Notification Channels) ──
receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Severity:* {{ .Labels.severity }}
          *Service:* {{ .Labels.service }}
          *Details:* {{ .Annotations.description }}
          {{ end }}
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }} - {{ .CommonAnnotations.summary }}'
        client_url: 'https://grafana.example.com/alerts'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
        send_resolved: true
  - name: 'infra-team'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#infra-alerts'
  - name: 'db-oncall'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_DB_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }} - DB Team'
  - name: 'k8s-oncall'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#k8s-alerts'

# ── Prometheus Alert Rules ──
groups:
  - name: application.alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          category: application
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of requests are failing (threshold: 5%)"
          runbook_url: "https://wiki/runbooks/high-error-rate"

      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "P99 latency exceeds 2s on {{ $labels.service }}"
          description: "Current P99: {{ $value }}s"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        labels:
          severity: critical
          category: infra
          team: kubernetes
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "Restart rate: {{ $value | humanize }}/s"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) < 0.15
        for: 5m
        labels:
          severity: warning
          category: infra
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "{{ $value | humanizePercentage }} remaining"

      - alert: ReplicasUnavailable
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
          category: infra
          team: kubernetes
        annotations:
          summary: "Deployment {{ $labels.deployment }} has unavailable replicas"

# ── Alert Lifecycle ──
| Stage | Component | Action |
|---|---|---|
| 1. Rule evaluation | Prometheus | Check expr every evaluation_interval |
| 2. State: Pending | Prometheus | Alert condition met, waiting for "for:" duration |
| 3. State: Firing | Prometheus | Alert active, sent to Alertmanager |
| 4. Routing | Alertmanager | Match rules tree, group, inhibit, dedupe |
| 5. Notification | Alertmanager | Send to receiver (Slack, PagerDuty, email) |
| 6. Resolved | Alertmanager | Condition cleared, send resolved notification |
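The Pending to Firing transition above can be sketched as a small state machine: the alert fires only once its expression has been continuously true for the full "for:" duration. A sketch, with evaluations simplified to a fixed interval:

```javascript
// Simulate Prometheus alert states from a series of boolean evaluations.
// samples[i] is whether the alert expression was true at evaluation i.
function alertStates(samples, evalIntervalSec, forSec) {
  const states = [];
  let trueSince = null; // timestamp when the expression first became true
  for (let i = 0; i < samples.length; i++) {
    const t = i * evalIntervalSec;
    if (!samples[i]) {
      trueSince = null; // any false evaluation resets the pending timer
      states.push('inactive');
    } else {
      if (trueSince === null) trueSince = t;
      states.push(t - trueSince >= forSec ? 'firing' : 'pending');
    }
  }
  return states;
}
```

This is why a flapping condition never pages: each dip back to false restarts the "for:" clock.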
// ── OpenTelemetry Setup (Node.js) ──
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'my-api-service',
    [ATTR_SERVICE_VERSION]: process.env.npm_package_version,
    'deployment.environment': process.env.NODE_ENV, // resource attributes are flat keys
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),
  ],
});

sdk.start();
// ── Manual Span Creation ──
import { trace, SpanStatusCode } from '@opentelemetry/api';

async function processOrder(orderId: string) {
  const tracer = trace.getTracer('order-processor');
  // Create a span
  return tracer.startActiveSpan('processOrder', {
    attributes: { 'order.id': orderId },
  }, async (span) => {
    try {
      // Add events as the work progresses
      await validateOrder(orderId);
      span.addEvent('order-validated', { orderId });
      await processPayment(orderId);
      span.addEvent('payment-processed');
      await updateInventory(orderId);
      span.addEvent('inventory-updated');
      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true };
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

# ── OpenTelemetry Collector Configuration ──
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Add k8s metadata to spans
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
  # Redact sensitive data
  attributes:
    actions:
      - key: authorization
        action: delete
      - key: password
        action: delete

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  otlphttp:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, k8sattributes, attributes, batch]
      exporters: [jaeger, otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

# ── Tracing Concepts ──
| Concept | Description |
|---|---|
| Trace | End-to-end journey of a request through all services |
| Span | Single unit of work within a trace |
| Context | Propagated between services (trace ID + span ID) |
| SpanContext | Identifies span (trace ID, span ID, flags) |
| Baggage | Key-value pairs propagated through entire trace |
| Sampler | Determines which traces to collect (head/tail-based) |
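Context propagation between services is, concretely, the W3C traceparent header (version-traceid-spanid-flags); a minimal parser sketch:

```javascript
// Parse a W3C trace-context "traceparent" header:
//   version (2 hex) - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // flag bit 0 = sampled
  };
}
```

Instrumentation libraries do this for you; the sketch only shows what travels on the wire between spans.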

# ── Distributed Tracing Tools ──
| Tool | Type | Best For |
|---|---|---|
| Jaeger | Open-source, CNCF | Feature-rich, K8s-native, Tempo-compatible |
| Zipkin | Open-source | Simple, widely adopted, Apache |
| Grafana Tempo | Open-source | Object storage backend, Grafana integration |
| AWS X-Ray | Managed | AWS ecosystem, service map |
| Datadog APM | Managed | Full-stack observability platform |
| Honeycomb | Managed | Burn rate alerts, BubbleUp analysis |

# ── RED Method Signals ──
| Signal | What | PromQL |
|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Failed requests rate | rate(http_requests_total{code=~"5.."}[5m]) |
| Duration | Latency distribution | histogram_quantile(0.99, rate(duration_bucket[5m])) |

# ── USE Method Signals ──
| Signal | What | PromQL |
|---|---|---|
| Utilization | Resource usage % | cpu_usage / cpu_capacity * 100 |
| Saturation | Queue depth, load | node_load1 / cpu_count |
| Errors | Hardware/software errors | rate(node_hw_errors[5m]) |

# ── Four Golden Signals ──
| Signal | Question | Example Metric |
|---|---|---|
| Latency | How long do requests take? | http_request_duration_seconds |
| Traffic | How much demand is there? | http_requests_total per second |
| Errors | How many requests fail? | 5xx response rate |
| Saturation | How full is the system? | CPU, memory, queue depth |

# ── Prometheus Metric Types ──
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | http_requests_total, bytes_sent |
| Gauge | Can go up and down | temperature, active_connections |
| Histogram | Distribution + buckets | request_duration_seconds_bucket |
| Summary | Quantiles (client-side) | request_duration_seconds |
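A counter's only odd behavior is the reset to zero on process restart; rate() and increase() compensate by treating any drop as a restart. A sketch of that compensation (ignoring the window extrapolation rate() also applies):

```javascript
// Total increase of a counter series, tolerating resets: a drop means the
// process restarted, so the post-reset value counts as fresh increase.
function counterIncrease(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    total += delta >= 0 ? delta : samples[i]; // reset: count value since restart
  }
  return total;
}
```

This is why subtracting raw counter values across a restart undercounts, while rate()/increase() do not.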
Prefer histograms over summaries: histogram buckets can be aggregated across instances with sum() and rate(), but summary quantiles cannot. Histograms enable accurate global percentiles.

// ── Instrumenting a Node.js App with prom-client ──
import promClient from 'prom-client';

// Create a Registry (for multi-app / default)
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register, prefix: 'myapp_' });

// ── Counter ──
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status_code'],
  registers: [register],
});

// ── Histogram ──
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status_code'],
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// ── Gauge ──
const activeConnections = new promClient.Gauge({
  name: 'active_db_connections',
  help: 'Number of active database connections',
  labelNames: ['pool'],
  registers: [register],
});

// ── Express Middleware ──
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestsTotal.labels(req.method, route, res.statusCode).inc();
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
  });
  next();
}

// ── Expose /metrics endpoint ──
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// ── Business Metrics ──
const ordersTotal = new promClient.Counter({
  name: 'orders_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [register],
});
const orderAmount = new promClient.Histogram({
  name: 'order_amount_dollars',
  help: 'Order amount distribution',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
  registers: [register],
});

# ── Essential Dashboards ──
| Dashboard | Audience | Key Panels |
|---|---|---|
| Service Overview | Engineering | Rate, latency, errors, saturation per service |
| Infrastructure | SRE / Platform | CPU, memory, disk, network per node |
| Kubernetes | SRE / Platform | Pods, deployments, resource quotas, HPA |
| Business Metrics | Product / Exec | Orders, revenue, DAU, conversion rate |
| CI/CD Pipeline | DevOps | Build duration, success rate, deploy frequency |
| Cost Overview | FinOps | Per-service spend, forecast, budget vs actual |
| SLA / SLO | SRE / Management | Error budget, availability %, SLO burn rate |
| On-call | SRE on-call | Active alerts, MTTR, alert frequency |
# ── SLO / Error Budget Dashboard Queries ──
# ── SLO: 99.9% Availability ──
# Error budget = 0.1% (0.001)
# Target: less than 0.1% of requests fail
# Current error rate (30-day window)
sum(rate(http_requests_total{code=~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
# ── 30-Day SLO Compliance ──
# (1 - error_rate) * 100
(1 - (
sum(increase(http_requests_total{code=~"5.."}[30d]))
/ sum(increase(http_requests_total[30d]))
)) * 100
# ── Error Budget Remaining ──
# If SLO is 99.9%, budget = total * 0.001
(
(sum(increase(http_requests_total[30d])) * 0.001)
- sum(increase(http_requests_total{code=~"5.."}[30d]))
) / (sum(increase(http_requests_total[30d])) * 0.001) * 100
# ── Error Budget Burn Rate ──
# How fast are we burning the error budget?
# burn_rate > 1 = burning faster than allowed
(
sum(rate(http_requests_total{code=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) / 0.001
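The budget and burn-rate queries above are plain arithmetic; a sketch in code, assuming a 30-day request total and a 99.9% target:

```javascript
// Error-budget arithmetic for an availability SLO (e.g. 0.999 = 99.9%).
function errorBudget(totalRequests, failedRequests, sloTarget) {
  const budgetFraction = 1 - sloTarget;                 // 0.001 for 99.9%
  const budgetRequests = totalRequests * budgetFraction; // failures you may "spend"
  return {
    errorRate: failedRequests / totalRequests,
    budgetRequests,
    remainingPct: ((budgetRequests - failedRequests) / budgetRequests) * 100,
  };
}

// Burn rate: observed error rate over the budget fraction.
// > 1 means the budget is being consumed faster than allowed.
function burnRate(errorRate, sloTarget) {
  return errorRate / (1 - sloTarget);
}
```

At burn rate 1 the budget lasts exactly the SLO window; at burn rate 4 a 30-day budget is gone in about a week.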
# ── Time to Error Budget Exhaustion ──
# At the current burn rate, when will the budget be exhausted?
# (remaining_budget / burn_rate) in hours, typically shown as a prediction graph

# ── Incident Response Process ──
# Phase 1: Detection
# - Alert fires (PagerDuty/Slack/Email)
# - On-call acknowledges the alert
# - Determine severity (SEV1-SEV4)
# Phase 2: Triage
# - Assess blast radius (how many users affected?)
# - Check dashboards for correlated alerts
# - Look at recent deployments (did this cause it?)
# - Communicate status to stakeholders
# Phase 3: Mitigation
# - Follow the runbook (if available)
# - Identify root cause (logs, traces, metrics)
# - Apply fix (rollback, scale, restart, config change)
# - Verify fix is working (monitor for 15-30 min)
# Phase 4: Resolution
# - Confirm service is healthy
# - Send resolution notification
# - Stand down responders
# Phase 5: Post-Mortem (within 48 hours)
# - Blameless post-mortem meeting
# - Document timeline, root cause, action items
# - Create follow-up tickets for improvements
# - Update runbook if new knowledge gained

# ── Incident Severity Levels ──
| Level | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, all users | < 5 min | Site down, data loss risk |
| SEV2 | Major feature down, many users | < 15 min | Payment broken, 50% errors |
| SEV3 | Minor feature degraded | < 1 hour | Slow API, degraded UX |
| SEV4 | Non-critical, low impact | < 24 hours | Monitoring gap, typo in UI |
# ── Runbook Template: High Error Rate ──
# Alert: HighErrorRate (service error rate > 5%)
## 1. Quick Diagnosis
```bash
# Check which endpoints are failing
curl -s 'http://prometheus:9090/api/v1/query?query=topk(10,sum(rate(http_requests_total{code=~"5..",service="myapp"}[5m]))by(path))' | jq
# Check recent deployments
kubectl rollout history deployment/myapp -n production
kubectl get events -n production --sort-by=.lastTimestamp | tail -20
```
## 2. Common Causes
| Cause | How to Check | Fix |
|-------|-------------|-----|
| Bad deployment | Rollout history | `kubectl rollout undo` |
| DB connection pool | Check DB metrics | Scale DB or increase pool |
| Dependency down | Check upstream health | Enable circuit breaker |
| OOM kills | Check pod events | Increase memory limit |
| Config error | Check ConfigMap/Env | Revert config change |
## 3. Emergency Actions
```bash
# Rollback deployment (fastest fix)
kubectl rollout undo deployment/myapp -n production
# Scale up (if resource-constrained)
kubectl scale deployment/myapp --replicas=10 -n production
# Restart all pods
kubectl rollout restart deployment/myapp -n production
# Check pod logs for errors
kubectl logs -l app=myapp -n production --tail=100 --since=10m
```
## 4. Escalation
- L2: #platform-oncall (Slack)
- L3: @sre-lead (PagerDuty)
## 5. Post-Incident
- Update this runbook with new findings
- Create ticket for root cause fix
- Add monitoring for the root cause

# ── Incident Management Tools ──
| Tool | Purpose |
|---|---|
| PagerDuty | Incident management, on-call schedules, escalation |
| Opsgenie | Atlassian incident management (Jira integration) |
| VictorOps (Splunk On-Call) | Splunk incident management |
| Incident.io | Modern incident management with status pages |
| Slack | Alert channels, war rooms, updates |
| FireHydrant | Incident lifecycle management |
| GoToWebinar / Zoom | Bridge calls for SEV1 incidents |