Site Reliability Engineering
Reliability isn't designed. It's defended.
A running collection of writing, courses, and tutorials on operating production systems: observability, incident response, performance, identity, secrets, and network security. Less about the tools than about the practice. What to instrument, when to page, how to respond, and what to do before the next time.
This index covers what you do once your systems are running. For what to build them with, see System Architecture. For where they run, see Infrastructure & Hosting.
Cross-cutting writings on running production systems well.
Google's foundational SRE book, free online: SLOs, error budgets, toil, on-call, and postmortems.
Practical companion to the SRE book with worked examples for implementing SLOs and incident response.
Canonical O'Reilly text on events-first observability; free PDF behind a form on Honeycomb.
The definitive deep-dive on Linux performance: CPUs, memory, disks, networks, tracing, and methodology.
USE Method and other systematic approaches to finding real performance bottlenecks fast.
Design so a dependency's failure changes nothing. Pre-provision instead of reacting.
How Amazon designs queue-based systems that recover from backlogs instead of compounding them.
Short free e-book defining monitoring vs observability and the three pillars: logs, metrics, traces.
The case that deploy fear is technical debt; small, frequent, author-owned changes beat blanket freezes.
Latest State of DevOps research linking delivery practices, AI, and reliability to org performance.
Production excellence as a practice: shared ownership, observability, SLOs, and risk-based prioritization.
Active blog from the observability vendor that practitioners actually read: SLOs, OTel, incidents, on-call.
The three pillars (metrics, logs, traces) are a taxonomy, not a plan. What you want is structured events you can slice arbitrarily: OpenTelemetry to instrument, plus a backend that handles high-cardinality (Honeycomb, Logfire) or a predictable metrics stack (Prometheus + Grafana + Loki). Start with one signal you'll actually look at. What to instrument is a design decision (see System Architecture); this list is about the tools that store and query what you've chosen.
Install Prometheus, scrape targets, run PromQL queries, and configure your first alert.
Approachable intro to PromQL data types, selectors, rate, and aggregation operators.
Install Grafana, connect a data source, build dashboards, and configure alerting.
Instrument an app with traces, metrics, and logs using the Collector and language SDKs.
Microservices reference app showing real instrumentation across many languages and signals.
Install Loki, ship logs with Promtail or Alloy, and query them with LogQL in Grafana.
Deploy Tempo, instrument an app, configure a collector, and visualize traces in Grafana.
Spin up a local Tempo instance via Docker Compose to explore distributed tracing end-to-end.
Run Mimir in monolithic mode, scrape Prometheus metrics, and query them through Grafana.
Architecture, operations, and tenancy patterns for long-term Prometheus storage at scale.
Install Logfire, instrument Python apps, and view structured traces and logs in the UI.
Send events via OpenTelemetry, run BubbleUp queries, and investigate production issues.
Install the Datadog Agent, instrument an app, and explore your first traces.
APM landing page: instrumentation, configuration, and trace correlation for production workloads.
SDK install, DSN configuration, and verification across backend languages and frameworks.
Docs root covering error tracking, performance, profiling, alerts, and release health.
Introductory tour of APM, infrastructure, browser, and alerting flows with quick install paths.
Language-agent install guides plus the core APM concepts for instrumenting a production service.
Vendor-maintained primer on metrics, traces, logs, dashboards, and alerting strategies.
Send OTLP traces into Better Stack using the collector or OpenTelemetry directly.
Introduction to Axiom's EventDB and MetricsDB platform for telemetry at scale.
Install SigNoz via cloud, Docker Compose, or Kubernetes Helm to start ingesting OTLP signals.
Overview of SigNoz's open-source logs, traces, and metrics on OpenTelemetry.
The page going off at 3 AM is one piece. The other pieces are knowing who responds, what runbook they follow, how the team coordinates during the incident, what gets fixed afterward, and how customers find out. PagerDuty and Opsgenie are the legacy defaults; FireHydrant and Incident.io are the new wave with deeper opinions about coordination. Statuspage handles the outward-facing part.
Official knowledge-base entry point with onboarding training and configuration guidance.
PagerDuty's open documentation on building an incident response practice, including roles and process.
Configure profile, notification rules, schedules, and integrations to start receiving alerts.
Mental model for teams, schedules, escalations, and routing rules in Opsgenie.
Stand up OnCall, wire integrations, define notification policies, and connect Slack.
Reference for OnCall scheduling, escalations, integrations, and notification policies.
Walks first-time admins through the integrations and configuration needed before the first incident.
Full docs covering Signals alerting, incident lifecycle, catalog, and runbooks.
Searchable docs for On-call, Response, Status Pages, Catalog, Workflows, and Insights.
Configure components, subscriber channels, and your first incident communication.
Pre-launch checklist covering branding, components, automation, and subscriber setup.
Create incidents from monitors, manually, or via API; resolve, group, and post-mortem them.
Product overview tying on-call schedules, Slack workflows, and status pages into one offering.
Shipping logs from your servers to your backend is its own discipline. Vector and Fluent Bit dominate the open-source side. Fluentd is the older sibling that's still widely deployed. Cribl is the commercial heavyweight when you need to filter, transform, or route to multiple backends. The choice is usually about throughput and how much processing you want at the edge before logs hit storage.
Install Vector and build your first observability pipeline from sources, transforms, and sinks.
Reference for every Vector component, plus operating-at-scale guidance.
Install Fluent Bit on Linux, macOS, Windows, or BSD and ship your first pipeline.
Inputs, parsers, filters, outputs, and operating Fluent Bit in production.
Installation, configuration, and the basic log forwarder pattern with Fluentd.
Plugin, deployment, and tuning guides for Fluentd in production.
Install Bento (the WarpStream-maintained Benthos fork) and run your first stream-processing config.
Conceptual intro to Bento's declarative, at-least-once stream processing pipelines.
One-hour hands-on tour of Sources, Routes, Pipelines, Functions, and Destinations.
Documentation root for Cribl Stream covering deployment, processing, and routing.
Performance is the discipline observability won't teach you. Metrics tell you something's slow. Profiles tell you why. Tools split into continuous profiling (always running, sampling) and on-demand profiling (you reach for them during an incident). eBPF unlocks the kernel-level view that used to be a Brendan Gregg exclusive.
Run Pyroscope, instrument your app with an SDK or Alloy, and explore flame graphs in Grafana.
Hands-on demo app that walks through diagnosing CPU and memory issues with continuous profiling.
Intro to Parca's server plus eBPF agent architecture for always-on continuous profiling.
How Polar Signals Cloud combines a zero-instrumentation eBPF agent with hosted symbolization.
Deploy the Polar Signals agent and stream profiles into Polar Signals Cloud.
How Pixie uses eBPF for auto-telemetry on Kubernetes without manual instrumentation.
Profile a sample service, find a real performance problem, and fix it with the Datadog Profiler.
Reference docs for enabling, configuring, and interpreting profile data across runtimes.
Brendan Gregg's canonical page introducing flame graphs, types, and how to read them.
The official flamegraph.pl toolchain with stack-collapse scripts for perf, DTrace, and friends.
Learn bpftrace in 12 lessons through one-liners covering probes, maps, and printf actions.
Reference docs, language guides, and labs for dynamic Linux tracing with bpftrace.
Auth is one of the few things worth outsourcing early. Auth0 if you need it now. Clerk and WorkOS are the new generation with better B2B ergonomics. Keycloak or Ory if you self-host. SuperTokens, FusionAuth, and Logto are the lighter-weight self-hostable options. The decision usually comes down to how much B2B complexity you have (SAML, SCIM, RBAC) and how much you want to pay someone else to manage it. How much identity you build into your domain model is an API decision (see APIs & protocols); this section is about who runs identity for you.
Set up a tenant, create applications, and integrate login via Universal Login and SDKs.
Reference architectures for SPA+API, mobile+API, and B2B/B2C identity scenarios.
Framework-specific quickstarts (Next.js, React, Express, etc.) for adding Clerk auth in minutes.
Docs root covering auth strategies, components, organizations, billing, and deployment.
Add a hosted auth flow with SSO, social login, and passkeys in under ten minutes.
Enterprise-ready auth APIs: SSO, SCIM directory sync, audit logs, user management.
Overview of B2B and Consumer auth suites plus framework-specific quickstart paths.
Full API reference covering magic links, OTP, OAuth, sessions, and fraud detection.
Run Keycloak in Docker, create a realm, register a client, and secure a sample app.
Reference for realms, clients, identity brokering, user federation, and authentication flows.
Get started with Ory Kratos identities, Hydra OAuth2/OIDC, Keto permissions, and Oathkeeper.
CLI-based quickstart for integrating SuperTokens SDKs across common backend and frontend stacks.
Authentication recipes: email/password, passwordless, social/enterprise SSO, and MFA.
Pick a 15-minute quickstart for your language or framework and stand up FusionAuth CIAM.
Reference for installation, identity providers, MFA, passkeys, and APIs.
Logto identity-and-access overview: SSO, MFA, RBAC, multi-tenancy on OIDC and OAuth 2.1.
Framework quickstarts (Python shown; 30+ SDKs available) for adding Logto auth to an app.
Everyone outgrows .env files. The question is where you go next. Cloud-native secrets managers (AWS, GCP, Azure) win on integration if you're already in that cloud. Vault is the heavyweight when you need dynamic credentials and deep audit. Doppler and Infisical are the modern alternatives with better DX. SOPS and age handle the "encrypt at rest, decrypt at deploy" pattern for Git-backed secrets. Wiring secrets through your pipeline is a delivery concern (see Software Delivery).
Hands-on tutorials for KV secrets, dynamic database creds, transit encryption, and auth methods.
Official checklist: end-to-end TLS, root token rotation, auditing, and least-privilege policies.
Conceptual intro plus links into setup, retrieval, and rotation patterns.
Hands-on tutorials covering moving hardcoded secrets, configuring rotation, and CodeGuru integration.
Create, access, and rotate a secret via console, gcloud, or client library.
Concepts, IAM, regional configs, and integration patterns for Secret Manager.
Overview of Service Accounts and Connect server approaches to automating secret retrieval.
Create a service account, scope its vaults, and authenticate the 1Password CLI in CI/CD.
Project and config setup, then sync secrets to local apps, cloud providers, and other stores.
SecretOps workflows, RBAC, audit logs, and integrations.
Sign up for Infisical Cloud or self-host, then manage secrets, certs, and access across environments.
Platform overview for secrets storage, rotation, references, and audit logging.
CNCF-hosted project site for the editor that encrypts values inside YAML/JSON/ENV/INI files.
README with install and usage walkthroughs for SOPS with KMS, age, PGP, and Vault backends.
Official repo and README for age, a simple modern file-encryption tool with small explicit keys.
The perimeter is everywhere. WAFs (Cloudflare, AWS) handle the edge. Host firewalls (iptables, nftables, ufw) handle the server. mTLS and service-mesh policies handle service-to-service. Let's Encrypt handles TLS certificates. Each layer is mandatory if you care about the layer beneath.
Enable Cloudflare's Managed Rulesets and OWASP Core Ruleset, tune paranoia and score thresholds.
Custom rules, rate limiting, exposed credentials, and bot rules.
Build your first Web ACL, attach managed rule groups, and protect a CloudFront or ALB resource.
Rules, rule groups, Firewall Manager, and Shield Advanced reference.
Conceptual intro to Cloud Armor security policies, preconfigured WAF rules, and load-balancer attach points.
Create a security policy, add prioritized rules, and attach it to a backend service.
Upstream wiki primer on tables, chains, rules, and families; the successor to iptables.
Examples, migration-from-iptables guides, and command reference.
Official Ubuntu community wiki on enabling, configuring, and logging the Uncomplicated Firewall.
Canonical's server-docs section on configuring host firewalls with UFW.
Project landing page linking to install media, the docs portal, and configuration walkthroughs.
Netgate-maintained pfSense manual covering install, networking, VPN, and high availability.
Hardware selection, install media, and first-boot configuration for OPNsense.
Install, manual configuration, and the development manual.
Decide between provider-managed and self-managed ACME flows, with Certbot recommended for self-managed.
Rate limits, challenge types, ACME clients, and best practices.
Install the CLI, deploy the control plane to Kubernetes, and mesh your first application.
Conceptual overview of Linkerd's Rust-based service-mesh data plane and control plane.
Engineers consistently publishing on observability, performance, and operations.
Short answers grounded in the work of practitioners running real production systems.
One signal you'll actually look at, instrumented well, beats four signals you glance at during incidents. Charity Majors frames this as observability 1.0 vs. 2.0: the three pillars (metrics, logs, traces) are 1.0. Structured wide events you can slice arbitrarily are 2.0. For most teams, OpenTelemetry to instrument plus one backend that handles high-cardinality (Honeycomb, Logfire) is the smallest setup that pays off. Add the other pillars once you're actually using the first one.
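As a rough illustration of what "structured wide events you can slice arbitrarily" can mean in practice, the sketch below emits one JSON event per request and then groups events by any field. It uses only the Python standard library; the handler, the field names (`plan`, `cache_hit`, `db_rows`), and the `max_duration_by` helper are hypothetical, and a real setup would send events through OpenTelemetry to a backend rather than printing them.

```python
import json
import time
import uuid
from collections import defaultdict


def handle_request(user_id: str, route: str, plan: str) -> dict:
    """Hypothetical handler: build ONE wide event covering the whole request."""
    start = time.monotonic()
    event = {
        "request_id": str(uuid.uuid4()),
        "route": route,
        "user_id": user_id,  # high-cardinality field, kept on purpose
        "plan": plan,
    }
    try:
        # Real work would happen here; these values are placeholders.
        event["cache_hit"] = False
        event["db_rows"] = 3
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        print(json.dumps(event))  # one line per request, every field attached
    return event


def max_duration_by(events: list, field: str) -> dict:
    """Slice by any field after the fact: worst duration per distinct value."""
    worst = defaultdict(float)
    for e in events:
        worst[e[field]] = max(worst[e[field]], e["duration_ms"])
    return dict(worst)
```

Because every field rides on the same event, the question "is it slow only for one plan, one route, or one user?" becomes a change of the `field` argument rather than a new metric to add.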
Static stability is the cleanest mental model: design so that when a dependency fails, your system behaves the same. Pre-provision instead of reacting. Pre-build instead of pulling at request time. Decide what "working" looks like when half your dependencies are down. Most overengineering is reacting to abstract failures instead of ones that have actually hurt you. Re-read your last three postmortems, then design for those.
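One small instance of the idea, sketched under assumptions: keep the dependency off the request path, refresh its data in the background, and keep serving the last value that worked when a refresh fails. The `LastKnownGood` class and the `fetch` callable are hypothetical names for illustration, not an API from any of the resources listed here.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LastKnownGood(Generic[T]):
    """Serve the last successfully fetched value when the dependency is down."""

    def __init__(self, fetch: Callable[[], T], initial: T):
        self._fetch = fetch
        self._value = initial  # pre-provisioned before any request arrives
        self.last_refresh: Optional[float] = None

    def refresh(self) -> bool:
        """Run off the request path (for example from a timer). Never raises."""
        try:
            self._value = self._fetch()
        except Exception:
            return False  # dependency failed: keep serving the old value
        self.last_refresh = time.time()
        return True

    def get(self) -> T:
        """Request path: no network call, so an outage changes nothing here."""
        return self._value
```

The design choice is that `get()` behaves identically whether the dependency is healthy or not; the only visible difference during an outage is that `last_refresh` stops advancing, which is the thing to alert on.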
Start with one SLO on the one user-facing thing you'd be paged for: requests successful within some latency budget. Pick a target you'd actually defend in a meeting (99% is fine; 99.99% is a research project), and set burn-rate alerts that page you before the budget runs out, not after. The Google SRE Workbook is still the canonical step-by-step. Skip the elaborate multi-SLO error-budget machinery until you have one SLO running for six months.
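The arithmetic behind a burn-rate alert is short enough to sketch. The example below assumes a request-based SLO over a 30-day (720-hour) window; the `Slo` class, the function names, and the "page when one hour spends 2% of the budget" threshold are illustrative assumptions, not a prescribed configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Request-based SLO: `target` fraction of requests must be good."""
    target: float        # e.g. 0.99
    window_hours: float  # e.g. 30 days = 720 hours

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the whole window."""
        return 1.0 - self.target


def burn_rate(slo: Slo, bad: int, total: int) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget


def budget_spent(slo: Slo, rate: float, alert_window_hours: float) -> float:
    """Fraction of the whole budget used if `rate` holds for the alert window."""
    return rate * alert_window_hours / slo.window_hours


def should_page(slo: Slo, bad: int, total: int,
                alert_window_hours: float, max_budget_fraction: float) -> bool:
    """Page when the alert window alone consumed too much of the budget."""
    rate = burn_rate(slo, bad, total)
    return budget_spent(slo, rate, alert_window_hours) >= max_budget_fraction
```

With a 99% target, an hour in which 15% of requests fail is a burn rate of about 15, which spends roughly 2.1% of the 30-day budget in that single hour. An hour at 1% failures is a burn rate of about 1, which is exactly on budget and should not page anyone.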
Slack is enough until your team is big enough that the absence of structure costs more than the cost of the tool. Concretely: when you have multiple concurrent incidents, a postmortem backlog that's not getting written, or new on-call engineers asking "what do I do first?", Slack alone isn't covering you. Incident.io and FireHydrant give you channel orchestration, role tracking, and a clean handoff to the postmortem. The cost is per-responder pricing; the value is consistency.
Source: Incident.io: The case for incident management software
When the alternative is one person fielding every page, or nobody fielding pages at night. Both happen earlier than teams admit. A formal rotation buys predictable handoffs, escalation, and a clear answer to "who's the primary?" — but it only works if there's a real runbook for the common pages and a review of what's actually paging you. The cost of a bad rotation (alert fatigue, burnout) is higher than the cost of no rotation. Build the page-quality discipline first.
Outsource. Auth is one of the highest blast-radius things you'll ever ship, and the cost of getting it wrong is everyone else's data. The strongest argument for rolling your own is when your domain model genuinely needs identity primitives the vendors don't offer (rare). For most teams, Auth0 / Clerk / WorkOS / Stytch are worth the price; Keycloak / Ory / SuperTokens are the self-hostable middle ground if you want OSS without writing your own.
Blameless doesn't mean accountability-free. It means the goal is learning, not punishment, and the operator's actions made sense given what they knew at the time. Three habits: write the timeline before the meeting; ask "how did this make sense to do, in the moment?" instead of "who did this?"; and end with concrete, owned action items that target the system, not the operator. John Allspaw's writing is the canonical reference.
Source: John Allspaw: Blameless PostMortems and a Just Culture
Original writing coming.
Smarter Dev walkthroughs and short courses on observability, incident response, and running real production systems will land here.
Join the Discord to be notified.