Running Modern Systems

Reliability isn't designed. It's defended.

A running collection of writing, courses, and tutorials on operating production systems: observability, incident response, performance, identity, secrets, and network security. Less about the tools than about the practice. What to instrument, when to page, how to respond, and what to do before the next time.

This index covers what you do once your systems are running. For what to build them with, see System Architecture. For where they run, see Infrastructure & Hosting.

Best Practices & Discussions

Cross-cutting writings on running production systems well.

Site Reliability Engineering

Tutorial Google · sre.google

Google's foundational SRE book, free online: SLOs, error budgets, toil, on-call, and postmortems.

The Site Reliability Workbook

Tutorial Google · sre.google

Practical companion to the SRE book with worked examples for implementing SLOs and incident response.

Observability Engineering

Tutorial Majors, Fong-Jones, Miranda · Honeycomb / O'Reilly

Canonical O'Reilly text on events-first observability; free PDF behind a form on Honeycomb.

DORA Research 2025

Best Practices DORA · Google Cloud

Latest State of DevOps research linking delivery practices, AI, and reliability to org performance.

The Honeycomb Blog

Discussion Honeycomb

Active blog from the observability vendor that practitioners actually read: SLOs, OTel, incidents, on-call.

Tools by Category

Observability

The three pillars (metrics, logs, traces) are a taxonomy, not a plan. What you want is structured events you can slice arbitrarily: OpenTelemetry to instrument, plus either a backend that handles high-cardinality data (Honeycomb, Logfire) or a predictable metrics stack (Prometheus + Grafana + Loki). Start with one signal you'll actually look at. What to instrument is a design decision (see System Architecture); this list is about the tools that store and query what you've chosen.

  • Prometheus Pull-based metrics collection. Deeply integrated with the Kubernetes ecosystem.
  • Grafana Dashboard tool that fronts Prometheus, Loki, and most observability backends.
  • OpenTelemetry Vendor-neutral instrumentation standard for traces, metrics, and logs.
  • Grafana Loki Log aggregation system designed to pair with Prometheus and Grafana.
  • Grafana Tempo Grafana's distributed tracing backend. Pairs with Loki and Mimir as the LGTM stack.
  • Grafana Mimir Horizontally-scalable Prometheus-compatible metrics backend for very large deployments.
  • Pydantic Logfire Observability platform from the Pydantic team. Structured tracing, OpenTelemetry-native.
  • Honeycomb High-cardinality structured-event observability. Pioneered the "observability 2.0" frame.
  • Datadog The commercial heavyweight. Observability plus APM plus security, with pricing to match.
  • Sentry Error and performance monitoring. The default for catching frontend and backend exceptions.
  • New Relic Long-time APM vendor reorganized around usage-based pricing and a single observability platform.
  • Better Stack Combined logs, uptime, and dashboards in one commercial offering. Lightweight pricing.
  • Axiom Log management and analytics with structured-event focus. The cheap-for-volume alternative.
  • SigNoz Open-source full observability platform. The Datadog alternative if you want to self-host.

Prometheus Getting Started

Tutorial Prometheus docs

Install Prometheus, scrape targets, run PromQL queries, and configure your first alert.

PromQL for Mere Mortals

Tutorial Grafana Labs blog

Approachable intro to PromQL data types, selectors, rate, and aggregation operators.

Grafana Getting Started

Tutorial Grafana docs

Install Grafana, connect a data source, build dashboards, and configure alerting.

OpenTelemetry Demo

Tutorial OpenTelemetry docs

Microservices reference app showing real instrumentation across many languages and signals.

Grafana Loki Get Started

Tutorial Grafana docs

Install Loki, ship logs with Promtail or Alloy, and query them with LogQL in Grafana.

Tempo Quick Start (Docker)

Tutorial Grafana docs

Spin up a local Tempo instance via Docker Compose to explore distributed tracing end-to-end.

Grafana Mimir Get Started

Tutorial Grafana docs

Run Mimir in monolithic mode, scrape Prometheus metrics, and query them through Grafana.

Grafana Mimir Documentation

Best Practices Grafana docs

Architecture, operations, and tenancy patterns for long-term Prometheus storage at scale.

Honeycomb Get Started

Tutorial Honeycomb docs

Send events via OpenTelemetry, run BubbleUp queries, and investigate production issues.

Datadog APM Documentation

Best Practices Datadog docs

APM landing page: instrumentation, configuration, and trace correlation for production workloads.

Sentry Documentation

Best Practices Sentry docs

Docs root covering error tracking, performance, profiling, alerts, and release health.

Get Started with New Relic

Tutorial New Relic docs

Introductory tour of APM, infrastructure, browser, and alerting flows with quick install paths.

Axiom Docs

Best Practices Axiom docs

Introduction to Axiom's EventDB and MetricsDB platform for telemetry at scale.

Get Started with SigNoz

Tutorial SigNoz docs

Install SigNoz via cloud, Docker Compose, or Kubernetes Helm to start ingesting OTLP signals.

Introduction to SigNoz

Best Practices SigNoz docs

Overview of SigNoz's open-source logs, traces, and metrics on OpenTelemetry.

Incident response & alerting

The page going off at 3 AM is one piece. The other pieces are knowing who responds, what runbook they follow, how the team coordinates during the incident, what gets fixed afterward, and how customers find out. PagerDuty and Opsgenie are the legacy defaults; FireHydrant and Incident.io are the new wave with deeper opinions about coordination. Statuspage handles the outward-facing part.

  • PagerDuty The original on-call platform. Deepest integration ecosystem; expensive at scale.
  • Opsgenie Atlassian's PagerDuty competitor. Strong if you're already in the Atlassian stack.
  • Grafana OnCall OSS on-call scheduling. Pairs cleanly with Grafana Alertmanager.
  • FireHydrant Incident response with strong runbook automation and Slack coordination.
  • Incident.io Slack-native incident management; opinionated about coordination patterns.
  • Atlassian Statuspage The outward-facing status-communication standard.
  • Better Stack (incident management) Combined uptime, on-call rotations, and status page in one lightweight offering.

PagerDuty Knowledge Base

Best Practices PagerDuty docs

Official knowledge-base entry point with onboarding training and configuration guidance.

Opsgenie Quickstart Guide

Tutorial Atlassian docs

Configure profile, notification rules, schedules, and integrations to start receiving alerts.

Learn How Opsgenie Works

Best Practices Atlassian docs

Mental model for teams, schedules, escalations, and routing rules in Opsgenie.

FireHydrant Documentation

Best Practices FireHydrant docs

Full docs covering Signals alerting, incident lifecycle, catalog, and runbooks.

incident.io Help Center

Best Practices incident.io docs

Searchable docs for On-call, Response, Status Pages, Catalog, Workflows, and Insights.

Logging pipelines

Shipping logs from your servers to your backend is its own discipline. Vector and Fluent Bit dominate the open-source side. Fluentd is the older sibling that's still widely deployed. Cribl is the commercial heavyweight when you need to filter, transform, or route to multiple backends. The choice is usually about throughput and how much processing you want at the edge before logs hit storage.

  • Vector Datadog's Rust-based log and metric pipeline. The high-performance default for modern stacks.
  • Fluent Bit Lightweight log collector and forwarder. The OpenTelemetry-adjacent default in Kubernetes.
  • Fluentd The older Ruby + C log aggregator. Still widely deployed; broader plugin ecosystem.
  • Bento Stream-processing toolkit (formerly Benthos). Filter, transform, and route at the edge.
  • Cribl Stream Commercial heavyweight for routing logs to many backends with filter, transform, and replay.

Vector Quickstart

Tutorial Vector docs

Install Vector and build your first observability pipeline, from sources through transforms to sinks.

Vector Documentation

Best Practices Vector docs

Reference for every Vector component, plus operating-at-scale guidance.

Fluent Bit Manual

Best Practices Fluent Bit docs

Inputs, parsers, filters, outputs, and operating Fluent Bit in production.

Fluentd Quickstart

Tutorial Fluentd docs

Installation, configuration, and the basic log forwarder pattern with Fluentd.

Fluentd Documentation

Best Practices Fluentd docs

Plugin, deployment, and tuning guides for Fluentd in production.

Bento Getting Started

Tutorial Bento docs

Install Bento (the WarpStream-maintained Benthos fork) and run your first stream-processing config.

What is Bento For?

Best Practices Bento docs

Conceptual intro to Bento's declarative, at-least-once stream processing pipelines.

Performance & profiling

Performance is the discipline observability won't teach you. Metrics tell you something's slow. Profiles tell you why. Tools split into continuous profiling (always running, sampling) and on-demand profiling (you reach for them during an incident). eBPF unlocks the kernel-level view that used to be a Brendan Gregg exclusive.

  • Grafana Pyroscope Grafana's continuous profiling backend. eBPF-powered, language-agnostic.
  • Parca OSS continuous profiling. Polar Signals' open-source foundation.
  • Polar Signals Continuous profiling as a service. Hosted offering built on Parca.
  • Pixie eBPF-based observability for Kubernetes. Auto-instrumented; no code changes.
  • Datadog Continuous Profiler Datadog's profiling product, deeply integrated with their APM and request tracing.
  • perf / flamegraph Brendan Gregg's flamegraph tooling. The Linux profiling stack practitioners reach for.
  • BCC / bpftrace eBPF-based dynamic tracing. Kernel-level visibility without recompiling the kernel.

Get Started with Pyroscope

Tutorial Grafana docs

Run Pyroscope, instrument your app with an SDK or Alloy, and explore flame graphs in Grafana.

Parca Overview

Best Practices Parca docs

Intro to Parca's server plus eBPF agent architecture for always-on continuous profiling.

Polar Signals Overview

Best Practices Polar Signals docs

How Polar Signals Cloud combines a zero-instrumentation eBPF agent with hosted symbolization.

Pixie Overview

Best Practices Pixie docs

How Pixie uses eBPF for auto-telemetry on Kubernetes without manual instrumentation.

Datadog Continuous Profiler

Best Practices Datadog docs

Reference docs for enabling, configuring, and interpreting profile data across runtimes.

Flame Graphs

Best Practices Brendan Gregg

Brendan Gregg's canonical page introducing flame graphs, types, and how to read them.

FlameGraph Toolchain

Tutorial GitHub

The official flamegraph.pl toolchain with stack-collapse scripts for perf, DTrace, and friends.

bpftrace Project Home

Best Practices bpftrace docs

Reference docs, language guides, and labs for dynamic Linux tracing with bpftrace.

Identity & auth

Auth is one of the few things worth outsourcing early. Auth0 if you need it now. Clerk and WorkOS are the new generation with better B2B ergonomics. Keycloak or Ory if you self-host. SuperTokens, FusionAuth, and Logto are the lighter-weight self-hostable options. The decision usually comes down to how much B2B complexity you have (SAML, SCIM, RBAC) and how much you want to pay someone else to manage it. The choice of how much identity you build into your domain model is an API decision (see APIs & protocols); this section is about who runs identity for you.

  • Auth0 Managed identity-as-a-service. Fastest to integrate; expensive at scale.
  • Clerk Developer-first auth-as-a-service. Excellent React/Next.js story and UI components.
  • WorkOS B2B-first auth: SSO, SCIM, audit logs. The "enterprise-ready" middleware.
  • Stytch Passwordless-first auth platform. Strong on email magic links and B2B SSO.
  • Keycloak Self-hostable identity and access management. Full OAuth, OIDC, SAML support.
  • Ory Modern OSS identity stack split into composable services (Kratos, Hydra, Oathkeeper).
  • SuperTokens Self-hostable auth library. Open source, modern, language-agnostic.
  • FusionAuth Self-hostable auth platform. Generous free tier; deep CIAM features.
  • Logto Modern OSS auth and identity platform. The lighter alternative to Keycloak.

Auth0 Get Started

Tutorial Auth0 docs

Set up a tenant, create applications, and integrate login via Universal Login and SDKs.

Clerk Quickstarts

Tutorial Clerk docs

Framework-specific quickstarts (Next.js, React, Express, etc.) for adding Clerk auth in minutes.

Clerk Documentation

Best Practices Clerk docs

Docs root covering auth strategies, components, organizations, billing, and deployment.

WorkOS Documentation

Best Practices WorkOS docs

Enterprise-ready auth APIs: SSO, SCIM directory sync, audit logs, user management.

Stytch Get Started

Tutorial Stytch docs

Overview of B2B and Consumer auth suites plus framework-specific quickstart paths.

Stytch API Reference

Best Practices Stytch docs

Full API reference covering magic links, OTP, OAuth, sessions, and fraud detection.

Ory Documentation

Tutorial Ory docs

Get started with Ory Kratos identities, Hydra OAuth2/OIDC, Keto permissions, and Oathkeeper.

SuperTokens Documentation

Tutorial SuperTokens docs

CLI-based quickstart for integrating SuperTokens SDKs across common backend and frontend stacks.

FusionAuth Get Started

Tutorial FusionAuth docs

Pick a 15-minute quickstart for your language or framework and stand up FusionAuth CIAM.

Logto Introduction

Best Practices Logto docs

Logto identity-and-access overview: SSO, MFA, RBAC, multi-tenancy on OIDC and OAuth 2.1.

Logto Quick Starts

Tutorial Logto docs

Framework quickstarts (Python shown; 30+ SDKs available) for adding Logto auth to an app.

Secrets management

Everyone outgrows .env files. The question is where you go next. Cloud-native secrets managers (AWS, GCP, Azure) win on integration if you're already in that cloud. Vault is the heavyweight when you need dynamic credentials and deep audit. Doppler and Infisical are the modern alternatives with better DX. SOPS and age handle the "encrypt at rest, decrypt at deploy" pattern for Git-backed secrets. Wiring secrets through your pipeline is a delivery concern (see Software Delivery).

  • HashiCorp Vault Secrets management: dynamic credentials, encryption-as-a-service, deep audit.
  • AWS Secrets Manager AWS-native secrets storage with rotation. Deep IAM integration if you're in AWS.
  • Google Cloud Secret Manager GCP-native secrets storage. Versioned, IAM-controlled, integrated with Cloud Build.
  • 1Password Secrets Automation 1Password for app secrets. Strong developer UX; CLI + service tokens for CI.
  • Doppler Modern secrets manager with team workflows. Sync to most platforms.
  • Infisical Open-source Doppler alternative. Self-hostable, modern UI, growing ecosystem.
  • SOPS Secret-encryption tool that started at Mozilla and is now a CNCF project. Encrypt YAML/JSON files in Git.
  • age Modern file encryption tool. One of the encryption backends SOPS supports, alongside KMS, PGP, and Vault.

Vault Tutorials

Tutorial HashiCorp Developer

Hands-on tutorials for KV secrets, dynamic database creds, transit encryption, and auth methods.

Vault Production Hardening

Best Practices HashiCorp Developer

Official checklist: end-to-end TLS, root token rotation, auditing, and least-privilege policies.

1Password Secrets Automation

Best Practices 1Password Developer

Overview of Service Accounts and Connect server approaches to automating secret retrieval.

What is Infisical?

Tutorial Infisical docs

Sign up for Infisical Cloud or self-host, then manage secrets, certs, and access across environments.

SOPS: Secrets OPerationS

Best Practices getsops.io

CNCF-hosted project site for the editor that encrypts values inside YAML/JSON/ENV/INI files.

SOPS on GitHub

Tutorial GitHub

README with install and usage walkthroughs for SOPS with KMS, age, PGP, and Vault backends.

age File Encryption

Tutorial GitHub

Official repo and README for age, a simple modern file-encryption tool with small explicit keys.

Network security & firewalls

The perimeter is everywhere. WAFs (Cloudflare, AWS) handle the edge. Host firewalls (iptables, nftables, ufw) handle the server. mTLS and service-mesh policies handle service-to-service. Let's Encrypt handles TLS certificates. Each layer is mandatory if you care about the layer beneath.

  • Cloudflare WAF Edge WAF rules. The default for most teams already behind Cloudflare.
  • AWS WAF AWS's WAF for CloudFront, ALB, and API Gateway. AWS-native rules and managed rule groups.
  • Google Cloud Armor GCP's WAF and DDoS protection. Integrates with Load Balancing.
  • iptables / nftables Linux's host firewall. The packet-filtering foundation everything else builds on.
  • ufw (Uncomplicated Firewall) Friendlier wrapper around iptables. The Ubuntu/Debian default for simple host firewall rules.
  • pfSense Open-source firewall and router OS. The default for serious on-prem firewalls.
  • OPNsense Fork of pfSense with a different update cadence and license posture.
  • Let's Encrypt Free, automated TLS certificate authority. The reason HTTPS is the default now.
  • Linkerd Service mesh with built-in mTLS. The lightweight alternative to Istio.

Cloudflare WAF: Get Started

Tutorial Cloudflare docs

Enable Cloudflare's Managed Rulesets and OWASP Core Ruleset, tune paranoia and score thresholds.

Get Started with AWS WAF

Tutorial AWS docs

Build your first Web ACL, attach managed rule groups, and protect a CloudFront or ALB resource.

Cloud Armor Product Overview

Best Practices Google Cloud docs

Conceptual intro to Cloud Armor security policies, preconfigured WAF rules, and load-balancer attach points.

nftables in 10 Minutes

Tutorial nftables wiki

Upstream wiki primer on tables, chains, rules, and families; the successor to iptables.

nftables Wiki

Best Practices nftables wiki

Examples, migration-from-iptables guides, and command reference.

UFW Community Help Wiki

Tutorial Ubuntu wiki

Official Ubuntu community wiki on enabling, configuring, and logging the Uncomplicated Firewall.

pfSense Getting Started

Tutorial pfSense

Project landing page linking to install media, the docs portal, and configuration walkthroughs.

pfSense Documentation

Best Practices Netgate docs

Netgate-maintained pfSense manual covering install, networking, VPN, and high availability.

OPNsense Quickstart

Tutorial OPNsense docs

Hardware selection, install media, and first-boot configuration for OPNsense.

Linkerd Getting Started

Tutorial Linkerd docs

Install the CLI, deploy the control plane to Kubernetes, and mesh your first application.

Linkerd Overview

Best Practices Linkerd docs

Conceptual overview of Linkerd's Rust-based service-mesh data plane and control plane.

Creators to follow

Engineers consistently publishing on observability, performance, and operations.

Frequently Asked Questions

Short answers grounded in the work of practitioners running real production systems.

What's the smallest observability stack that's actually useful?

One signal you'll actually look at, instrumented well, beats four signals you glance at during incidents. Charity Majors frames this as observability 1.0 vs. 2.0: the three pillars (metrics, logs, traces) are 1.0. Structured wide events you can slice arbitrarily are 2.0. For most teams, OpenTelemetry to instrument plus one backend that handles high-cardinality data (Honeycomb, Logfire) is the smallest setup that pays off. Add the other pillars once you're actually using the first one.

Source: Charity Majors: Observability 1.0 vs 2.0
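To make "wide events you can slice arbitrarily" a little more concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API: the `make_event` and `slice_by` helpers, the field names, and the sample values are all hypothetical, chosen only to show one event per request carrying many fields, any of which can become a grouping key later.

```python
import json
import time
from collections import defaultdict


def make_event(route: str, user_id: str, status: int, duration_ms: float, **extra) -> dict:
    """Build one wide event per request; extra fields are added freely (hypothetical helper)."""
    return {
        "ts": time.time(),
        "route": route,
        "user_id": user_id,  # high-cardinality field: one value per user
        "status": status,
        "duration_ms": duration_ms,
        **extra,
    }


def slice_by(events: list, key: str) -> dict:
    """Average duration grouped by any field, decided at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(key)].append(event["duration_ms"])
    return {value: sum(ds) / len(ds) for value, ds in groups.items()}


# Made-up sample data for illustration only.
events = [
    make_event("/checkout", "u1", 200, 120, build_id="a1"),
    make_event("/checkout", "u2", 500, 900, build_id="b2"),
    make_event("/checkout", "u3", 200, 100, build_id="a1"),
    make_event("/search", "u1", 200, 40, build_id="b2"),
]

# In a real setup each event would be shipped to a backend; here we only serialize it.
serialized = [json.dumps(event) for event in events]

by_build = slice_by(events, "build_id")
by_route = slice_by(events, "route")
```

The point of the sketch is that nothing about the grouping was decided at instrumentation time: `build_id` was just another field on the event, and it can still be used to ask whether one build is slower than another.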

How do I think about reliability without overengineering?

Static stability is the cleanest mental model: design so that when a dependency fails, your system behaves the same. Pre-provision instead of reacting. Pre-build instead of pulling at request time. Decide what "working" looks like when half your dependencies are down. Most overengineering is reacting to abstract failures instead of ones that have actually hurt you. Re-read your last three postmortems, then design for those.

Source: AWS Builders' Library: Static Stability
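One way to picture the static-stability idea above is a value that is pre-provisioned, refreshed off the request path, and served unchanged when the dependency behind it fails. The sketch below is an illustration under those assumptions, not an AWS pattern verbatim: the `LastKnownGood` class, the `make_flaky_fetch` stand-in for a config service, and the `max_connections` setting are all hypothetical.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LastKnownGood(Generic[T]):
    """Serve a pre-provisioned value; refresh in the background; never fail a read."""

    def __init__(self, fetch: Callable[[], T], initial: T) -> None:
        self._fetch = fetch
        self._value = initial  # pre-provisioned so the first read never waits on the dependency
        self.stale = False

    def refresh(self) -> None:
        """Intended to run from a background job, not from the request path."""
        try:
            self._value = self._fetch()
            self.stale = False
        except Exception:
            # Dependency is down: keep serving the old value and flag it for alerting.
            self.stale = True

    def get(self) -> T:
        """Request path: a plain read that behaves the same whether the dependency is up or down."""
        return self._value


def make_flaky_fetch() -> Callable[[], dict]:
    """Hypothetical config service that answers once, then becomes unreachable."""
    calls = {"n": 0}

    def fetch() -> dict:
        calls["n"] += 1
        if calls["n"] >= 2:
            raise ConnectionError("config service unreachable")
        return {"max_connections": 200}

    return fetch


limits = LastKnownGood(fetch=make_flaky_fetch(), initial={"max_connections": 100})
limits.refresh()          # first refresh succeeds
before = limits.get()
limits.refresh()          # second refresh fails; the value is kept
after = limits.get()
```

The design choice worth noticing is that `get` has no failure mode tied to the dependency. The outage shows up as a `stale` flag to alert on rather than as errors in the request path.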

How do I think about SLOs without overcomplicating them?

Start with one SLO on the one user-facing thing you'd be paged for: requests successful within some latency budget. Pick a target you'd actually defend in a meeting (99% is fine; 99.99% is a research project), and set burn-rate alerts that page you before the budget runs out, not after. The Google SRE Workbook is still the canonical step-by-step. Skip the elaborate multi-SLO error-budget machinery until you have one SLO running for six months.

Source: Google SRE Workbook: Implementing SLOs
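The burn-rate arithmetic behind that answer is small enough to sketch. The example below assumes a 99% SLO over a 30-day window and uses 14.4 as the paging threshold, which is the value the SRE Workbook uses in its worked example for a one-hour window (2% of a 30-day budget spent in one hour). The function names and the request counts are made up for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.01 for a 99% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent in the observed window.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it is gone in half that time.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Time to spend the whole budget if the current burn rate holds."""
    return float("inf") if rate == 0 else window_days / rate


def should_page(failed: int, total: int, slo_target: float, threshold: float = 14.4) -> bool:
    """Page when the observed burn rate reaches the threshold."""
    return burn_rate(failed, total, slo_target) >= threshold


SLO = 0.99  # assumed target

# Hypothetical last-hour counts: 15% of requests failing.
bad_hour = burn_rate(failed=1_500, total=10_000, slo_target=SLO)
# Hypothetical quiet hour: 0.5% of requests failing.
quiet_hour = burn_rate(failed=50, total=10_000, slo_target=SLO)

page_bad = should_page(1_500, 10_000, SLO)
page_quiet = should_page(50, 10_000, SLO)
```

With these numbers the bad hour burns budget at roughly 15 times the sustainable rate, so a 30-day budget would be gone in about two days, while the quiet hour sits near 0.5 and would not page. Production alerting usually pairs a long and a short window to reduce flapping; this sketch leaves that out.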

Do I need a real incident management tool, or is Slack enough?

Slack is enough until your team is big enough that the absence of structure costs more than the cost of the tool. Concretely: when you have multiple concurrent incidents, a postmortem backlog that's not getting written, or new on-call engineers asking "what do I do first?", Slack alone isn't covering you. Incident.io and FireHydrant give you channel orchestration, role tracking, and a clean handoff to the postmortem. The cost is per-responder pricing; the value is consistency.

Source: Incident.io: The case for incident management software

When does my team need a formal on-call rotation?

When the alternative is one person fielding every page, or nobody fielding pages at night. Both happen earlier than teams admit. A formal rotation buys predictable handoffs, escalation, and a clear answer to "who's the primary?" — but it only works if there's a real runbook for the common pages and a review of what's actually paging you. The cost of a bad rotation (alert fatigue, burnout) is higher than the cost of no rotation. Build the page-quality discipline first.

Source: PagerDuty: Setting up your first on-call rotation

Should I outsource auth, or roll my own?

Outsource. Auth is one of the highest blast-radius things you'll ever ship, and the cost of getting it wrong is everyone else's data. The strongest argument for rolling your own is when your domain model genuinely needs identity primitives the vendors don't offer (rare). For most teams, Auth0 / Clerk / WorkOS / Stytch are worth the price; Keycloak / Ory / SuperTokens are the self-hostable middle ground if you want OSS without writing your own.

Source: Reflag: Why we use Clerk for auth

How do I run a blameless postmortem that doesn't devolve into blame?

Blameless doesn't mean accountability-free. It means the goal is learning, not punishment, and the operator's actions made sense given what they knew at the time. Three habits: write the timeline before the meeting; ask "how did this make sense to do, in the moment?" instead of "who did this?"; and end with concrete, owned action items that target the system, not the operator. John Allspaw's writing is the canonical reference.

Source: John Allspaw: Blameless PostMortems and a Just Culture

From Smarter Dev

Original writing coming.

Smarter Dev walkthroughs and short courses on observability, incident response, and running real production systems will land here.

Join the Discord to be notified

Last updated