Running Modern Systems

Reliability isn't designed. It's defended.

A running collection of writing, courses, and tutorials on operating production systems: observability, incident response, performance, identity, secrets, and network security. Less about the tools than about the practice. What to instrument, when to page, how to respond, and what to do before the next time.

This index covers what you do once your systems are running. For what to build them with, see System Architecture. For where they run, see Infrastructure & Hosting.

Best Practices & Discussions

Cross-cutting writings on running production systems well.

Site Reliability Engineering

Tutorial Google · sre.google

Google's foundational SRE book, free online: SLOs, error budgets, toil, on-call, and postmortems.

The Site Reliability Workbook

Tutorial Google · sre.google

Practical companion to the SRE book with worked examples for implementing SLOs and incident response.

Observability Engineering

Tutorial Majors, Fong-Jones, Miranda · Honeycomb / O'Reilly

Canonical O'Reilly text on events-first observability; free PDF behind a form on Honeycomb.

Systems Performance, 2nd Edition

Tutorial Brendan Gregg

The definitive deep-dive on Linux performance: CPUs, memory, disks, networks, tracing, and methodology.

Performance Analysis Methodology

Best Practices Brendan Gregg

USE Method and other systematic approaches to finding real performance bottlenecks fast.

Static Stability Using Availability Zones

Best Practices AWS Builders' Library

Design so a dependency's failure changes nothing. Pre-provision instead of reacting.

Avoiding insurmountable queue backlogs

Best Practices David Yanacek · AWS Builders' Library

How Amazon designs queue-based systems that recover from backlogs instead of compounding them.

Distributed Systems Observability

Tutorial Cindy Sridharan · O'Reilly

Short free e-book defining monitoring vs observability and the three pillars: logs, metrics, traces.

Friday Deploy Freezes Are Exactly Like Murdering Puppies

Discussion Charity Majors

The case that deploy fear is technical debt; small, frequent, author-owned changes beat blanket freezes.

DORA Research 2025

Best Practices DORA · Google Cloud

Latest State of DevOps research linking delivery practices, AI, and reliability to org performance.

Cultivating Production Excellence

Talk Liz Fong-Jones · InfoQ

Production excellence as a practice: shared ownership, observability, SLOs, and risk-based prioritization.

The Honeycomb Blog

Discussion Honeycomb

Active blog from the observability vendor that practitioners actually read: SLOs, OTel, incidents, on-call.

Tools by Category

Observability

The three pillars (metrics, logs, traces) are a taxonomy, not a plan. What you want is structured events you can slice arbitrarily: OpenTelemetry to instrument, plus a backend that handles high-cardinality (Honeycomb, Logfire) or a predictable metrics stack (Prometheus + Grafana + Loki). Start with one signal you'll actually look at. What to instrument is a design decision (see System Architecture); this list is about the tools that store and query what you've chosen.

Prometheus Pull-based metrics collection. Deeply integrated with the Kubernetes ecosystem.
Grafana Dashboard tool that fronts Prometheus, Loki, and most observability backends.
OpenTelemetry Vendor-neutral instrumentation standard for traces, metrics, and logs.
Grafana Loki Log aggregation system designed to pair with Prometheus and Grafana.
Grafana Tempo Grafana's distributed tracing backend. Pairs with Loki and Mimir as the LGTM stack.
Grafana Mimir Horizontally-scalable Prometheus-compatible metrics backend for very large deployments.
Pydantic Logfire Observability platform from the Pydantic team. Structured tracing, OpenTelemetry-native.
Honeycomb High-cardinality structured-event observability. Pioneered the "observability 2.0" frame.
Datadog The commercial heavyweight. Observability plus APM plus security, with pricing to match.
Sentry Error and performance monitoring. The default for catching frontend and backend exceptions.
New Relic Long-time APM vendor reorganized around usage-based pricing and a single observability platform.
Better Stack Combined logs, uptime, and dashboards in one commercial offering. Lightweight pricing.
Axiom Log management and analytics with structured-event focus. The cheap-for-volume alternative.
SigNoz Open-source full observability platform. The Datadog alternative if you want to self-host.

Prometheus Getting Started

Tutorial Prometheus docs

Install Prometheus, scrape targets, run PromQL queries, and configure your first alert.

PromQL for Mere Mortals

Tutorial Grafana Labs blog

Approachable intro to PromQL data types, selectors, rate, and aggregation operators.

Grafana Getting Started

Tutorial Grafana docs

Install Grafana, connect a data source, build dashboards, and configure alerting.

OpenTelemetry Getting Started

Tutorial OpenTelemetry docs

Instrument an app with traces, metrics, and logs using the Collector and language SDKs.

OpenTelemetry Demo

Tutorial OpenTelemetry docs

Microservices reference app showing real instrumentation across many languages and signals.

Grafana Loki Get Started

Tutorial Grafana docs

Install Loki, ship logs with Promtail or Alloy, and query them with LogQL in Grafana.

Grafana Tempo Getting Started

Tutorial Grafana docs

Deploy Tempo, instrument an app, configure a collector, and visualize traces in Grafana.

Tempo Quick Start (Docker)

Tutorial Grafana docs

Spin up a local Tempo instance via Docker Compose to explore distributed tracing end-to-end.

Grafana Mimir Get Started

Tutorial Grafana docs

Run Mimir in monolithic mode, scrape Prometheus metrics, and query them through Grafana.

Grafana Mimir Documentation

Best Practices Grafana docs

Architecture, operations, and tenancy patterns for long-term Prometheus storage at scale.

Pydantic Logfire Documentation

Tutorial Pydantic docs

Install Logfire, instrument Python apps, and view structured traces and logs in the UI.

Honeycomb Get Started

Tutorial Honeycomb docs

Send events via OpenTelemetry, run BubbleUp queries, and investigate production issues.

Getting Started with Datadog APM

Tutorial Datadog docs

Install the Datadog Agent, instrument an app, and explore your first traces.

Datadog APM Documentation

Best Practices Datadog docs

APM landing page: instrumentation, configuration, and trace correlation for production workloads.

Getting Started with Sentry

Tutorial Sentry docs

SDK install, DSN configuration, and verification across backend languages and frameworks.

Sentry Documentation

Best Practices Sentry docs

Docs root covering error tracking, performance, profiling, alerts, and release health.

Get Started with New Relic

Tutorial New Relic docs

Introductory tour of APM, infrastructure, browser, and alerting flows with quick install paths.

Introduction to New Relic APM

Tutorial New Relic docs

Language-agent install guides plus the core APM concepts for instrumenting a production service.

Better Stack: Observability Guide

Best Practices Better Stack docs

Vendor-maintained primer on metrics, traces, logs, dashboards, and alerting strategies.

Distributed Tracing with Better Stack

Tutorial Better Stack docs

Send OTLP traces into Better Stack using the collector or OpenTelemetry directly.

Axiom Docs

Best Practices Axiom docs

Introduction to Axiom's EventDB and MetricsDB platform for telemetry at scale.

Get Started with SigNoz

Tutorial SigNoz docs

Install SigNoz via cloud, Docker Compose, or Kubernetes Helm to start ingesting OTLP signals.

Introduction to SigNoz

Best Practices SigNoz docs

Overview of SigNoz's open-source logs, traces, and metrics on OpenTelemetry.

Incident response & alerting

The page going off at 3 AM is one piece. The other pieces are knowing who responds, what runbook they follow, how the team coordinates during the incident, what gets fixed afterward, and how customers find out. PagerDuty and Opsgenie are the legacy defaults; FireHydrant and Incident.io are the new wave with deeper opinions about coordination. Statuspage handles the outward-facing part.

PagerDuty The original on-call platform. Deepest integration ecosystem; expensive at scale.
Opsgenie Atlassian's PagerDuty competitor. Strong if you're already in the Atlassian stack.
Grafana OnCall OSS on-call scheduling. Pairs cleanly with Grafana Alerting and Prometheus Alertmanager.
FireHydrant Incident response with strong runbook automation and Slack coordination.
Incident.io Slack-native incident management; opinionated about coordination patterns.
Atlassian Statuspage The outward-facing status-communication standard.
Better Stack (incident management) Combined uptime, on-call rotations, and status page in one lightweight offering.

PagerDuty Knowledge Base

Best Practices PagerDuty docs

Official knowledge-base entry point with onboarding training and configuration guidance.

PagerDuty Incident Response: Getting Started

Best Practices PagerDuty Incident Response

PagerDuty's open documentation on building an incident response practice, including roles and process.

Opsgenie Quickstart Guide

Tutorial Atlassian docs

Configure profile, notification rules, schedules, and integrations to start receiving alerts.

Learn How Opsgenie Works

Best Practices Atlassian docs

Mental model for teams, schedules, escalations, and routing rules in Opsgenie.

Get Started with Grafana OnCall

Tutorial Grafana docs

Stand up OnCall, wire integrations, define notification policies, and connect Slack.

Grafana OnCall Documentation

Best Practices Grafana docs

Reference for OnCall scheduling, escalations, integrations, and notification policies.

FireHydrant Onboarding Quickstart

Tutorial FireHydrant docs

Walks first-time admins through the integrations and configuration needed before the first incident.

FireHydrant Documentation

Best Practices FireHydrant docs

Full docs covering Signals alerting, incident lifecycle, catalog, and runbooks.

incident.io Help Center

Best Practices incident.io docs

Searchable docs for On-call, Response, Status Pages, Catalog, Workflows, and Insights.

Get Started with Statuspage

Tutorial Atlassian docs

Configure components, subscriber channels, and your first incident communication.

Statuspage: Launch Your Status Page

Best Practices Atlassian docs

Pre-launch checklist covering branding, components, automation, and subscriber setup.

Better Stack: Get Started with Incidents

Tutorial Better Stack docs

Create incidents from monitors, manually, or via API; resolve, group, and post-mortem them.

Better Stack Incident Management

Best Practices Better Stack docs

Product overview tying on-call schedules, Slack workflows, and status pages into one offering.

Logging pipelines

Shipping logs from your servers to your backend is its own discipline. Vector and Fluent Bit dominate the open-source side. Fluentd is the older sibling that's still widely deployed. Cribl is the commercial heavyweight when you need to filter, transform, or route to multiple backends. The choice is usually about throughput and how much processing you want at the edge before logs hit storage.

Vector Datadog's Rust-based log and metric pipeline. The high-performance default for modern stacks.
Fluent Bit Lightweight log collector and forwarder. The OpenTelemetry-adjacent default in Kubernetes.
Fluentd The older Ruby + C log aggregator. Still widely deployed; broader plugin ecosystem.
Bento Stream-processing toolkit (formerly Benthos). Filter, transform, and route at the edge.
Cribl Stream Commercial heavyweight for routing logs to many backends with filter, transform, and replay.
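
The "filter, transform, or route" steps from the intro above can be sketched in plain Python. This is not the API of Vector, Fluent Bit, or any tool listed here; the function names, the `level` and `email` fields, and the `archive` / `alerts` sink names are all hypothetical, chosen only to show the shape of a source to transform to sink pipeline.

```python
import json
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def parse_json_lines(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turn raw log lines into events, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            yield event


def drop_debug(events: Iterable[Event]) -> Iterator[Event]:
    """Filter: discard debug-level events before they cost storage."""
    for event in events:
        if event.get("level") != "debug":
            yield event


def redact(events: Iterable[Event], field: str = "email") -> Iterator[Event]:
    """Transform: mask a sensitive field (hypothetical field name) at the edge."""
    for event in events:
        if field in event:
            event = {**event, field: "[redacted]"}
        yield event


def route(events: Iterable[Event], sinks: Dict[str, List[Event]]) -> None:
    """Sink: everything goes to 'archive'; errors are also copied to 'alerts'."""
    for event in events:
        sinks["archive"].append(event)
        if event.get("level") == "error":
            sinks["alerts"].append(event)
```

Chaining the stages as `route(redact(drop_debug(parse_json_lines(lines))), sinks)` keeps each step lazy, which is roughly the trade-off the real tools make: the more work you do at the edge, the less volume reaches storage.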

Vector Quickstart

Tutorial Vector docs

Install Vector and build your first observability pipeline from sources, transforms, and sinks.

Vector Documentation

Best Practices Vector docs

Reference for every Vector component, plus operating-at-scale guidance.

Get Started with Fluent Bit

Tutorial Fluent Bit docs

Install Fluent Bit on Linux, macOS, Windows, or BSD and ship your first pipeline.

Fluent Bit Manual

Best Practices Fluent Bit docs

Inputs, parsers, filters, outputs, and operating Fluent Bit in production.

Fluentd Quickstart

Tutorial Fluentd docs

Installation, configuration, and the basic log forwarder pattern with Fluentd.

Fluentd Documentation

Best Practices Fluentd docs

Plugin, deployment, and tuning guides for Fluentd in production.

Bento Getting Started

Tutorial Bento docs

Install Bento (the WarpStream-maintained Benthos fork) and run your first stream-processing config.

What is Bento For?

Best Practices Bento docs

Conceptual intro to Bento's declarative, at-least-once stream processing pipelines.

Get Started with Cribl Stream

Tutorial Cribl docs

One-hour hands-on tour of Sources, Routes, Pipelines, Functions, and Destinations.

Cribl Stream Documentation

Best Practices Cribl docs

Documentation root for Cribl Stream covering deployment, processing, and routing.

Performance & profiling

Performance is the discipline observability won't teach you. Metrics tell you something's slow. Profiles tell you why. Tools split into continuous profiling (always running, sampling) and on-demand profiling (you reach for them during an incident). eBPF unlocks the kernel-level view that used to be a Brendan Gregg exclusive.

Grafana Pyroscope Grafana's continuous profiling backend. eBPF-powered, language-agnostic.
Parca OSS continuous profiling. Polar Signals' open-source foundation.
Polar Signals Continuous profiling as a service. Parca-compatible, hosted offering.
Pixie eBPF-based observability for Kubernetes. Auto-instrumented; no code changes.
Datadog Continuous Profiler Datadog's profiling product, deeply integrated with their APM and request tracing.
perf / flamegraph Brendan Gregg's flamegraph tooling. The Linux profiling stack practitioners reach for.
BCC / bpftrace eBPF-based dynamic tracing. Kernel-level visibility without recompiling the kernel.
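
"Metrics tell you something's slow. Profiles tell you why." A minimal on-demand example of the second half, using Python's built-in `cProfile` and `pstats` rather than any of the tools above. The workload (`slow_sum_of_squares`, `square`) and the `profile_call` helper are hypothetical stand-ins for real code.

```python
import cProfile
import io
import pstats


def square(x: int) -> int:
    return x * x


def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive workload: one Python-level call per element."""
    total = 0
    for i in range(n):
        total += square(i)
    return total


def profile_call(func, *args) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)
    return out.getvalue()
```

Calling `print(profile_call(slow_sum_of_squares, 100_000))` lists the call count for `square` next to the time spent in it, which is the "why" a latency metric alone cannot give you. Continuous profilers collect the same kind of data by sampling, so they can stay on in production.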

Get Started with Pyroscope

Tutorial Grafana docs

Run Pyroscope, instrument your app with an SDK or Alloy, and explore flame graphs in Grafana.

Pyroscope Ride-Share Tutorial

Tutorial Grafana docs

Hands-on demo app that walks through diagnosing CPU and memory issues with continuous profiling.

Parca Overview

Best Practices Parca docs

Intro to Parca's server plus eBPF agent architecture for always-on continuous profiling.

Polar Signals Overview

Best Practices Polar Signals docs

How Polar Signals Cloud combines a zero-instrumentation eBPF agent with hosted symbolization.

Polar Signals Agent Deployment Guide

Tutorial Polar Signals docs

Deploy the Polar Signals agent and stream profiles into Polar Signals Cloud.

Pixie Overview

Best Practices Pixie docs

How Pixie uses eBPF for auto-telemetry on Kubernetes without manual instrumentation.

Getting Started with Datadog Continuous Profiler

Tutorial Datadog docs

Profile a sample service, find a real performance problem, and fix it with the Datadog Profiler.

Datadog Continuous Profiler

Best Practices Datadog docs

Reference docs for enabling, configuring, and interpreting profile data across runtimes.

Flame Graphs

Best Practices Brendan Gregg

Brendan Gregg's canonical page introducing flame graphs, types, and how to read them.

FlameGraph Toolchain

Tutorial GitHub

The official flamegraph.pl toolchain with stack-collapse scripts for perf, DTrace, and friends.

bpftrace One-Liner Tutorial

Tutorial bpftrace docs

Learn bpftrace in 12 lessons through one-liners covering probes, maps, and printf actions.

bpftrace Project Home

Best Practices bpftrace docs

Reference docs, language guides, and labs for dynamic Linux tracing with bpftrace.

Identity & auth

Auth is one of the few things worth outsourcing early. Auth0 if you need it now. Clerk and WorkOS are the new generation with better B2B ergonomics. Keycloak or Ory if you self-host. SuperTokens, FusionAuth, and Logto are the lighter-weight self-hostable options. The decision usually comes down to how much B2B complexity you have (SAML, SCIM, RBAC) and how much you want to pay someone else to manage it. The choice of how much identity you build into your domain model is an API decision (see APIs & protocols); this section is about who runs identity for you.

Auth0 Managed identity-as-a-service. Fastest to integrate; expensive at scale.
Clerk Developer-first auth-as-a-service. Excellent React/Next.js story and UI components.
WorkOS B2B-first auth: SSO, SCIM, audit logs. The "enterprise-ready" middleware.
Stytch Passwordless-first auth platform. Strong on email magic links and B2B SSO.
Keycloak Self-hostable identity and access management. Full OAuth, OIDC, SAML support.
Ory Modern OSS identity stack split into composable services (Kratos, Hydra, Oathkeeper).
SuperTokens Self-hostable auth library. Open source, modern, language-agnostic.
FusionAuth Self-hostable auth platform. Generous free tier; deep CIAM features.
Logto Modern OSS auth and identity platform. The lighter alternative to Keycloak.

Auth0 Get Started

Tutorial Auth0 docs

Set up a tenant, create applications, and integrate login via Universal Login and SDKs.

Auth0 Architecture Scenarios

Best Practices Auth0 docs

Reference architectures for SPA+API, mobile+API, and B2B/B2C identity scenarios.

Clerk Quickstarts

Tutorial Clerk docs

Framework-specific quickstarts (Next.js, React, Express, etc.) for adding Clerk auth in minutes.

Clerk Documentation

Best Practices Clerk docs

Docs root covering auth strategies, components, organizations, billing, and deployment.

Get Started with WorkOS AuthKit

Tutorial WorkOS docs

Add a hosted auth flow with SSO, social login, and passkeys in under ten minutes.

WorkOS Documentation

Best Practices WorkOS docs

Enterprise-ready auth APIs: SSO, SCIM directory sync, audit logs, user management.

Stytch Get Started

Tutorial Stytch docs

Overview of B2B and Consumer auth suites plus framework-specific quickstart paths.

Stytch API Reference

Best Practices Stytch docs

Full API reference covering magic links, OTP, OAuth, sessions, and fraud detection.

Keycloak Getting Started (Docker)

Tutorial Keycloak docs

Run Keycloak in Docker, create a realm, register a client, and secure a sample app.

Keycloak Server Administration Guide

Best Practices Keycloak docs

Reference for realms, clients, identity brokering, user federation, and authentication flows.

Ory Documentation

Tutorial Ory docs

Get started with Ory Kratos identities, Hydra OAuth2/OIDC, Keto permissions, and Oathkeeper.

SuperTokens Documentation

Tutorial SuperTokens docs

CLI-based quickstart for integrating SuperTokens SDKs across common backend and frontend stacks.

SuperTokens Authentication Overview

Best Practices SuperTokens docs

Authentication recipes: email/password, passwordless, social/enterprise SSO, and MFA.

FusionAuth Get Started

Tutorial FusionAuth docs

Pick a 15-minute quickstart for your language or framework and stand up FusionAuth CIAM.

FusionAuth Documentation

Best Practices FusionAuth docs

Install, identity providers, MFA, passkeys, and APIs reference.

Logto Introduction

Best Practices Logto docs

Logto identity-and-access overview: SSO, MFA, RBAC, multi-tenancy on OIDC and OAuth 2.1.

Logto Quick Starts

Tutorial Logto docs

Framework quickstarts (Python shown; 30+ SDKs available) for adding Logto auth to an app.

Secrets management

Everyone outgrows .env files. The question is where you go next. Cloud-native secrets managers (AWS, GCP, Azure) win on integration if you're already in that cloud. Vault is the heavyweight when you need dynamic credentials and deep audit. Doppler and Infisical are the modern alternatives with better DX. SOPS and age handle the "encrypt at rest, decrypt at deploy" pattern for Git-backed secrets. Wiring secrets through your pipeline is a delivery concern (see Software Delivery).

HashiCorp Vault Secrets management: dynamic credentials, encryption-as-a-service, deep audit.
AWS Secrets Manager AWS-native secrets storage with rotation. Deep IAM integration if you're in AWS.
Google Cloud Secret Manager GCP-native secrets storage. Versioned, IAM-controlled, integrated with Cloud Build.
1Password Secrets Automation 1Password for app secrets. Strong developer UX; CLI + service tokens for CI.
Doppler Modern secrets manager with team workflows. Sync to most platforms.
Infisical Open-source Doppler alternative. Self-hostable, modern UI, growing ecosystem.
SOPS Secret-encryption tool that started at Mozilla and is now a CNCF project. Encrypt YAML/JSON files in Git.
age Modern file encryption tool. A common key backend for SOPS and other tools.
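
Whichever manager you pick, the application side usually looks the same: secrets arrive through the environment at deploy time, and the app should fail loudly if one is missing. A minimal sketch under that assumption; `require_secret`, `mask`, `MissingSecretError`, and the `DATABASE_URL` name are hypothetical and not part of any tool listed here.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret was not injected."""


def require_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read a secret from the environment (or a supplied mapping) and refuse empty values."""
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        raise MissingSecretError(f"required secret {name} is not set")
    return value


def mask(value: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret that shows only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

Calling `require_secret("DATABASE_URL")` once at startup turns a missing secret into a failed deploy instead of a runtime surprise, and `mask` keeps the value out of logs when you need to confirm which one was loaded.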

Vault Tutorials

Tutorial HashiCorp Developer

Hands-on tutorials for KV secrets, dynamic database creds, transit encryption, and auth methods.

Vault Production Hardening

Best Practices HashiCorp Developer

Official checklist: end-to-end TLS, root token rotation, auditing, and least-privilege policies.

What Is AWS Secrets Manager?

Best Practices AWS docs

Conceptual intro plus links into setup, retrieval, and rotation patterns.

AWS Secrets Manager Tutorials

Tutorial AWS docs

Hands-on tutorials covering moving hardcoded secrets, configuring rotation, and CodeGuru integration.

GCP Secret Manager Quickstart

Tutorial Google Cloud docs

Create, access, and rotate a secret via console, gcloud, or client library.

GCP Secret Manager Documentation

Best Practices Google Cloud docs

Concepts, IAM, regional configs, and integration patterns for Secret Manager.

1Password Secrets Automation

Best Practices 1Password Developer

Overview of Service Accounts and Connect server approaches to automating secret retrieval.

Get Started with 1Password Service Accounts

Tutorial 1Password Developer

Create a service account, scope its vaults, and authenticate the 1Password CLI in CI/CD.

Getting Started with Doppler

Tutorial Doppler docs

Project and config setup, then sync secrets to local apps, cloud providers, and other stores.

Doppler Documentation

Best Practices Doppler docs

SecretOps workflows, RBAC, audit logs, and integrations.

What is Infisical?

Tutorial Infisical docs

Infisical Secrets Management

Best Practices Infisical docs

Platform overview for secrets storage, rotation, references, and audit logging.

SOPS: Secrets OPerationS

Best Practices getsops.io

CNCF-hosted project site for the editor that encrypts values inside YAML/JSON/ENV/INI files.

SOPS on GitHub

Tutorial GitHub

README with install and usage walkthroughs for SOPS with KMS, age, PGP, and Vault backends.

age File Encryption

Tutorial GitHub

Official repo and README for age, a simple modern file-encryption tool with small explicit keys.

Network security & firewalls

The perimeter is everywhere. WAFs (Cloudflare, AWS) handle the edge. Host firewalls (iptables, nftables, ufw) handle the server. mTLS and service-mesh policies handle service-to-service. Let's Encrypt handles TLS certificates. Each layer is mandatory if you care about the layer beneath.

Cloudflare WAF Edge WAF rules. The default for most teams already behind Cloudflare.
AWS WAF AWS's WAF for CloudFront, ALB, and API Gateway. AWS-native rules and managed rule groups.
Google Cloud Armor GCP's WAF and DDoS protection. Integrates with Load Balancing.
iptables / nftables Linux's host firewall. The packet-filtering foundation everything else builds on.
ufw (Uncomplicated Firewall) Friendlier wrapper around iptables. The Ubuntu/Debian default for simple host firewall rules.
pfSense Open-source firewall and router OS. The default for serious on-prem firewalls.
OPNsense Fork of pfSense with a different update cadence and license posture.
Let's Encrypt Free, automated TLS certificate authority. The reason HTTPS is the default now.
Linkerd Service mesh with built-in mTLS. The lightweight alternative to Istio.
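
For the service-to-service layer, this is roughly what "mTLS" means at the socket level, sketched with Python's standard `ssl` module. A mesh such as Linkerd handles this for you; the function names and the certificate paths passed to `load_identity` are assumptions for illustration.

```python
import ssl


def strict_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that rejects clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED on a server context is the "mutual" part of mTLS:
    # the handshake fails unless the client presents a certificate we trust.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_identity(ctx: ssl.SSLContext, certfile: str, keyfile: str, client_ca_file: str) -> None:
    """Attach this service's own certificate and the CA used to verify client certificates."""
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    ctx.load_verify_locations(cafile=client_ca_file)
```

The hard part in practice is not these few lines but issuing and rotating the certificates, which is the work a service mesh takes off your hands.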

Cloudflare WAF: Get Started

Tutorial Cloudflare docs

Enable Cloudflare's Managed Rulesets and OWASP Core Ruleset, tune paranoia and score thresholds.

Cloudflare WAF Overview

Best Practices Cloudflare docs

Custom rules, rate limiting, exposed credentials, and bot rules.

Get Started with AWS WAF

Tutorial AWS docs

Build your first Web ACL, attach managed rule groups, and protect a CloudFront or ALB resource.

AWS WAF Developer Guide

Best Practices AWS docs

Rules, rule groups, Firewall Manager, and Shield Advanced reference.

Cloud Armor Product Overview

Best Practices Google Cloud docs

Conceptual intro to Cloud Armor security policies, preconfigured WAF rules, and load-balancer attach points.

Configure Cloud Armor Security Policies

Tutorial Google Cloud docs

Create a security policy, add prioritized rules, and attach it to a backend service.

nftables in 10 Minutes

Tutorial nftables wiki

Upstream wiki primer on tables, chains, rules, and families; the successor to iptables.

nftables Wiki

Best Practices nftables wiki

Examples, migration-from-iptables guides, and command reference.

UFW Community Help Wiki

Tutorial Ubuntu wiki

Official Ubuntu community wiki on enabling, configuring, and logging the Uncomplicated Firewall.

Ubuntu Server: Firewalls

Best Practices Ubuntu docs

Canonical's server-docs section on configuring host firewalls with UFW.

pfSense Getting Started

Tutorial pfSense

Project landing page linking to install media, the docs portal, and configuration walkthroughs.

pfSense Documentation

Best Practices Netgate docs

Netgate-maintained pfSense manual covering install, networking, VPN, and high availability.

OPNsense Quickstart

Tutorial OPNsense docs

Hardware selection, install media, and first-boot configuration for OPNsense.

OPNsense Documentation

Best Practices OPNsense docs

Install, manual configuration, and the development manual.

Let's Encrypt: Getting Started

Tutorial Let's Encrypt

Decide between provider-managed and self-managed ACME flows, with Certbot recommended for self-managed.

Let's Encrypt Documentation

Best Practices Let's Encrypt

Rate limits, challenge types, ACME clients, and best practices.

Linkerd Getting Started

Tutorial Linkerd docs

Install the CLI, deploy the control plane to Kubernetes, and mesh your first application.

Linkerd Overview

Best Practices Linkerd docs

Conceptual overview of Linkerd's Rust-based service-mesh data plane and control plane.

Creators to follow

Engineers consistently publishing on observability, performance, and operations.

Brendan Gregg blog · @brendangregg The reference for Linux performance, flame graphs, and BPF observability.
Charity Majors blog · @mipsytipsy Honeycomb co-founder. Defines what modern observability means and where it's going.
Jessica Kerr blog · @jessitron Honeycomb dev advocate; systems-thinking essays on observability and socio-technical design.
Cindy Sridharan blog · @copyconstruct Distributed systems and observability writing; essays on testing in production and reliability.
Liz Fong-Jones x · @lizthegrey Honeycomb principal developer advocate. SRE practice and production-excellence framework.
John Allspaw blog · @allspaw Adaptive Capacity Labs; canonical voice on incident analysis and learning from failure.
Lorin Hochstein blog · @norootcause Netflix SRE alum; chaos engineering, postmortem culture, and resilience engineering.
Casey Rosenthal blog · @caseyrosenthal Formerly Netflix Chaos team; co-author of the canonical chaos engineering book.
Kelly Shortridge blog · @swagitda_ Security chaos engineering; author of the canonical book of the same name.
Will Larson blog · @lethain Engineering leader writing concrete frameworks for ops strategy, on-call, and platform investment.

Frequently Asked Questions

Short answers grounded in the work of practitioners running real production systems.

What's the smallest observability stack that's actually useful?

One signal you'll actually look at, instrumented well, beats four signals you glance at during incidents. Charity Majors frames this as observability 1.0 vs. 2.0: the three pillars (metrics, logs, traces) are 1.0. Structured wide events you can slice arbitrarily are 2.0. For most teams, OpenTelemetry to instrument plus one backend that handles high-cardinality (Honeycomb, Logfire) is the smallest setup that pays off. Add the other pillars once you're actually using the first one.

Source: Charity Majors: Observability 1.0 vs 2.0
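
A minimal sketch of what one "wide event" per request might look like, using only the Python standard library. The handler, the field names (`route`, `user_id`, `cart_items`), and the `emit` helper are hypothetical; with OpenTelemetry the same fields would typically travel as span attributes instead of a hand-built dict.

```python
import json
import time
import uuid


def emit(event: dict) -> str:
    """Serialize one event as a single JSON line (stdout stands in for a backend)."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


def handle_request(route: str, user_id: str, cart_items: int) -> dict:
    """Hypothetical handler that builds exactly one wide event for the whole request."""
    event = {
        "request_id": str(uuid.uuid4()),
        "route": route,
        "user_id": user_id,  # high-cardinality on purpose: this is what you slice by later
        "cart_items": cart_items,
    }
    start = time.perf_counter()
    try:
        if cart_items < 0:
            raise ValueError("cart_items must be non-negative")
        event["status"] = 200
    except ValueError as exc:
        event["status"] = 400
        event["error"] = str(exc)
    finally:
        # Duration and outcome land on the same event as the request context.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(event)
    return event
```

The design choice is one event per unit of work with every field attached, rather than several log lines and counters you have to join back together during an incident.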

How do I think about reliability without overengineering?

Static stability is the cleanest mental model: design so that when a dependency fails, your system behaves the same. Pre-provision instead of reacting. Pre-build instead of pulling at request time. Decide what "works" looks like when half your dependencies are down. Most overengineering is reacting to abstract failures instead of ones that have actually hurt you. Re-read your last three postmortems, then design for those.

Source: AWS Builders' Library: Static Stability
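
One way to picture "when a dependency fails, your system behaves the same": keep the dependency off the request path entirely. This sketch is an illustration of the idea, not code from the AWS article; the class name, the `fetch` callable, and the config keys are hypothetical.

```python
from typing import Callable, Dict


class StaticallyStableConfig:
    """Serve requests from a local snapshot; refresh it off the request path."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], initial: Dict[str, str]):
        self._fetch = fetch
        self._snapshot = dict(initial)  # pre-provisioned before traffic arrives

    def refresh(self) -> bool:
        """Try to update the snapshot; keep the old one if the dependency is down."""
        try:
            fresh = self._fetch()
        except Exception:
            # Broad on purpose for the sketch: any failure means "keep serving what we have".
            return False
        self._snapshot = dict(fresh)
        return True

    def get(self, key: str) -> str:
        """Request path: reads only local state, so a dependency outage changes nothing here."""
        return self._snapshot[key]
```

A background job would call `refresh()` on a timer. If the upstream is down for an hour, `get()` keeps answering from the last good snapshot, and the only thing you lose is freshness.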

How do I think about SLOs without overcomplicating them?

Start with one SLO on the one user-facing thing you'd be paged for: requests successful within some latency budget. Pick a target you'd actually defend in a meeting (99% is fine; 99.99% is a research project), and set burn-rate alerts that page you before the budget runs out, not after. The Google SRE Workbook is still the canonical step-by-step. Skip the elaborate multi-SLO error-budget machinery until you have one SLO running for six months.

Source: Google SRE Workbook: Implementing SLOs
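
The arithmetic behind a burn-rate alert is small enough to sketch. The error budget is one minus the SLO target, and the burn rate is how many times faster than "exactly on budget" you are currently failing. The function names are hypothetical, and the default threshold of 14.4 is an assumption taken from the common example of spending 2% of a 30-day budget in one hour (0.02 × 720 hours).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.01 for a 99% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio in a window divided by the error budget."""
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo_target)


def paging_threshold(budget_fraction: float, alert_window_hours: float,
                     slo_window_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(bad: int, total: int, slo_target: float, threshold: float = 14.4) -> bool:
    """Page when the window's burn rate meets or exceeds the threshold."""
    return burn_rate(bad, total, slo_target) >= threshold
```

With a 99% target, 50 failures in 1,000 requests is a burn rate of about 5: worth a ticket, not a page. 200 failures in 1,000 is about 20, which would exhaust a 30-day budget in a day and a half, so that one pages.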

Do I need a real incident management tool, or is Slack enough?

Slack is enough until your team is big enough that the absence of structure costs more than the cost of the tool. Concretely: when you have multiple concurrent incidents, a postmortem backlog that's not getting written, or new on-call engineers asking "what do I do first?", Slack alone isn't covering you. Incident.io and FireHydrant give you channel orchestration, role tracking, and a clean handoff to the postmortem. The cost is per-responder pricing; the value is consistency.

Source: Incident.io: The case for incident management software

When does my team need a formal on-call rotation?

When the alternative is one person fielding every page, or nobody fielding pages at night. Both happen earlier than teams admit. A formal rotation buys predictable handoffs, escalation, and a clear answer to "who's the primary?" — but it only works if there's a real runbook for the common pages and a review of what's actually paging you. The cost of a bad rotation (alert fatigue, burnout) is higher than the cost of no rotation. Build the page-quality discipline first.

Source: PagerDuty: Setting up your first on-call rotation
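
The "who's the primary?" question has a mechanical answer once the roster and shift length are written down. A minimal sketch, assuming fixed-length shifts and a simple round-robin where next week's primary is this week's secondary; the `on_call` function and the roster names are hypothetical and not tied to any scheduling product.

```python
from datetime import date
from typing import List, Tuple


def on_call(roster: List[str], rotation_start: date, today: date,
            shift_days: int = 7) -> Tuple[str, str]:
    """Return (primary, secondary) for the shift that contains `today`."""
    if not roster:
        raise ValueError("roster must not be empty")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    shift_index = (today - rotation_start).days // shift_days
    primary = roster[shift_index % len(roster)]
    # The secondary is whoever takes over next, so each handoff is a promotion.
    secondary = roster[(shift_index + 1) % len(roster)]
    return primary, secondary
```

For example, `on_call(["ana", "ben", "chloe"], date(2026, 1, 5), date(2026, 1, 12))` falls in the second weekly shift and returns `("ben", "chloe")`. Real tools add overrides, time zones, and escalation timeouts on top of this core.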

Should I outsource auth, or roll my own?

Outsource. Auth is one of the highest blast-radius things you'll ever ship, and the cost of getting it wrong is everyone else's data. The strongest argument for rolling your own is when your domain model genuinely needs identity primitives the vendors don't offer (rare). For most teams, Auth0 / Clerk / WorkOS / Stytch are worth the price; Keycloak / Ory / SuperTokens are the self-hostable middle ground if you want OSS without writing your own.

Source: Reflag: Why we use Clerk for auth

How do I run a blameless postmortem that doesn't devolve into blame?

Blameless doesn't mean accountability-free. It means the goal is learning, not punishment, and the operator's actions made sense given what they knew at the time. Three habits: write the timeline before the meeting; ask "how did this make sense to do, in the moment?" instead of "who did this?"; and end with concrete, owned action items that target the system, not the operator. John Allspaw's writing is the canonical reference.

Source: John Allspaw: Blameless PostMortems and a Just Culture

From Smarter Dev

Original writing coming.

Smarter Dev walkthroughs and short courses on observability, incident response, and running real production systems will land here.

Last updated May 12, 2026