Designing Data-Intensive Applications
The canonical map of databases, replication, consensus, and stream-processing tradeoffs.
Every system has an architecture. Most of them weren't designed; they accumulated.
A running collection of writing, courses, and tutorials on architecting modern systems: how to choose between databases, queues, caches, search engines, APIs, and the rest of the data and integration stack. Less about new tools, more about the judgment behind picking one and living with it.
This index covers the data and integration layer: the systems you build with. For where those systems run, see Infrastructure & Hosting. For keeping them healthy in production, see Production Operations.
Cross-cutting writings on how to think about systems, regardless of which tool you pick.
Principal engineers explaining how Amazon actually builds and operates production services.
AWS distinguished engineer on databases, durability, queues, retries, and metastability.
Empirical safety tests of databases and queues under partitions, clock skew, and faults.
Safety-first engineering doctrine: assertions, static allocation, batching, zero dependencies.
Eight-lecture Cambridge series on clocks, replication, consensus, linearizability, and Spanner.
With enough users, every observable behavior of your API becomes a contract someone depends on.
Multi-tenant fault isolation pattern that limits blast radius without dedicated capacity.
Archive of daily computer-science paper summaries. Paused since 2021; back catalog is foundational.
Why caches create metastable failure modes that load tests miss until production explodes.
Peer tools grouped by what problem they solve. The intro before each list articulates the decision space; the list is what you actually choose between.
Postgres handles 99% of what most teams need. Specialized stores buy specific things: ClickHouse for analytics over billions of rows, DuckDB for local-first columnar work, DynamoDB for flat scaling, SQLite for embedded. Pick by what you'll do too much of.
Official tutorial covering SQL basics, schemas, transactions, inheritance, and Postgres-specific features.
Free comprehensive tutorial covering psql, queries, joins, transactions, indexes, and performance.
Canonical guide to SQL indexing and query performance tuning for application developers.
Official walkthrough of the mysql client, creating databases, tables, and running queries.
Three-hour video course covering installation, SQL syntax, joins, and database design.
Official quickstart for the sqlite3 CLI: creating databases, schemas, and running queries.
When SQLite is the right choice versus a client/server database, with concrete scenarios.
Official getting-started covering documents, collections, CRUD operations, and aggregation.
Free structured courses on data modeling, indexing, aggregation, and operational topics.
Official tutorial walking through tables, items, queries, and the DynamoDB Local environment.
Opinionated guide to single-table design, access patterns, and DynamoDB modeling fundamentals.
Official quickstart for the embedded analytical database, with CLI, Python, and SQL examples.
Hands-on intro to querying CSV, Parquet, and JSON files directly with DuckDB.
Install, load data, and run analytical queries on the columnar OLAP database.
Free courses on data modeling, MergeTree engines, and production operations.
A cache is a second source of truth with worse durability. The decision isn't Redis or Memcached. It's what you're caching, how stale it can be, and what happens when the cache fails. Most cache-related outages are about the failure mode, not throughput.
Official getting-started covering installation, redis-cli, key types, and common commands.
Free structured courses on data structures, caching patterns, and Redis Stack modules.
Introduction to the Linux Foundation Redis fork, including installation and command reference.
Official wiki covering protocol, configuration, tuning, and common usage patterns.
Queues let work outlive the request that asked for it. Streams let multiple consumers read the same log with their own pointers. Pick a queue when downstream is slower than upstream. Pick a stream when you need to replay.
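The difference can be sketched in a few lines of Python (an in-memory toy, not a real broker; all names are illustrative): a queue hands each message to exactly one consumer, while a stream is an append-only log that any number of consumers read at their own offset and can rewind.

```python
from collections import deque

# Toy queue: each message is handed to exactly one consumer.
queue = deque(["job-1", "job-2", "job-3"])
worker_a = queue.popleft()   # "job-1" -- now gone from the queue
worker_b = queue.popleft()   # "job-2"

# Toy stream: an append-only log; each consumer tracks its own offset,
# so the same records are read independently and can be replayed.
log = ["evt-1", "evt-2", "evt-3"]
offsets = {"billing": 0, "analytics": 0}

def read(consumer, n=1):
    start = offsets[consumer]
    batch = log[start:start + n]
    offsets[consumer] = start + len(batch)
    return batch

read("billing", 2)       # ["evt-1", "evt-2"]
read("analytics", 3)     # ["evt-1", "evt-2", "evt-3"] -- same log, own pointer
offsets["billing"] = 0   # replay: rewind the pointer; the log is still there
```

The pop is destructive; the offset is not. That one difference is most of what you're choosing between.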
Six canonical tutorials: work queues, pub/sub, routing, topics, RPC, and acknowledgements.
Full documentation hub: clustering, persistence, flow control, monitoring, and production tuning.
Start a broker, create topics, produce and consume messages, and run Kafka Connect.
Free video course on Kafka fundamentals: topics, partitions, producers, consumers, and brokers.
Concept overview and walkthroughs covering core NATS, JetStream, and key/value stores.
Create queues, send and receive messages, and configure dead-letter queues via console or SDK.
Official guidance on visibility timeouts, polling, idempotency, and queue throughput tuning.
Official guide: XADD, consumer groups, XREADGROUP, acknowledgement, and stream trimming.
Official guide for the Node.js queue library: producers, workers, flows, repeatable jobs.
Official wiki: install, define workers, enqueue jobs, and run the Sidekiq process.
Job idempotency, small arguments, embracing concurrency, and operational guidance.
Official tutorial: define tasks, configure brokers, run workers, and check results.
Canonical post on idempotent tasks, retries, naming, and avoiding common Celery pitfalls.
Most teams reach for Elasticsearch before they need it. Postgres full-text handles more than people think. When you actually need search (relevance tuning, facets, real-time indexing over millions of docs), pick between heavyweight (Elastic, OpenSearch) and lightweight (Meili, Typesense).
Run Elasticsearch locally, index documents, and run match, term, and aggregation queries.
Long-form guide to mapping, analyzers, relevance, aggregations, and cluster scaling.
Run OpenSearch and Dashboards, index data, and run search and aggregation queries.
Install, add documents, and run typo-tolerant searches with filters and ranking rules.
Install, create collections, index documents, and tune ranking and faceting in Typesense.
Object storage is solved. The decisions are cost and lock-in. R2 and B2 have no egress fees, S3 has the deepest ecosystem, MinIO runs on your own hardware. For most workloads, S3-compatible is the only spec that matters.
Create buckets, upload objects, manage access, and configure lifecycle and versioning.
Official guidance on request rates, key naming, multipart uploads, and Transfer Acceleration.
Create R2 buckets, upload objects via Wrangler or the S3-compatible API, and serve them.
Create buckets and application keys, upload files via web UI, CLI, and S3-compatible API.
Install MinIO single-node and distributed, use the mc client, and configure access policies.
For most teams, pgvector inside Postgres is the right answer. Specialized vector databases buy scale (billions of vectors), advanced filtering, or hosted SLAs. The decision point is when the retrieval workload starts to dominate normal load. Usually later than you think.
Install the extension, create vector columns, build HNSW/IVFFlat indexes, and run kNN queries.
Run Qdrant in Docker, create collections, upsert vectors with payloads, and run filtered searches.
Spin up Weaviate Cloud, define collections, import data, and run vector and hybrid queries.
Create an index, upsert vectors with metadata, and run similarity queries via Python SDK.
REST is the default for most public-facing APIs. GraphQL pays off when clients have wildly varying data needs and you can absorb the resolver complexity. gRPC and Protobuf win for internal service-to-service traffic where latency and schema discipline matter. The protocol is a contract: pick once, change rarely.
Official guided entry point that walks new authors through writing their first OpenAPI description.
Step-by-step product-catalog example covering types, required fields, constraints, and $ref.
Reference companion that explains keywords and idiomatic patterns for real-world schema design.
Official per-language quickstarts (Go, Python, Java, C++) covering .proto files and codegen.
Google's vetted rules on tag numbers, enum zero values, and safe schema evolution.
Official tour through schemas, queries, mutations, subscriptions, HTTP transport, and authorization.
Shopify/GitHub veteran's book on schema design, performance, security, and migrating legacy APIs.
Conceptual primer plus links to per-language quickstarts that build a working client and server.
Framework-agnostic walkthrough of routers, queries, mutations, and zod input validation.
Decision-framework video on when tRPC beats GraphQL or REST for typed full-stack TypeScript.
Eight-step TypeScript tutorial building a Books schema, resolvers, and Apollo Sandbox queries.
30-minute course building a realtime todo backend with queries, subscriptions, and authorization.
Spin up Hasura Engine plus Postgres locally and get an auto-generated GraphQL API in minutes.
Install PostgREST and build a todo API backed by a schema with role-based access.
Layer JWT authentication and per-role authorization on top of the Tutorial 0 API.
Overview of Connect's browser- and gRPC-compatible protocol with codegen across Go, TS, and more.
Fifteen-minute walkthrough writing a protobuf schema and serving it with connect-go.
Writers and engineers consistently publishing substantive content on architecture, distributed systems, and performance.
Short answers grounded in the work of practitioners running real production systems.
Forget the enterprise diagrams. For a small team, architecture is the set of recurring decisions you're making every few months: which database, which queue, which observability stack, which auth provider. Will Larson's framing is to write down five real architecture decisions you've made, then find the pattern. That pattern is your architecture. Documents that don't change anything aren't architecture; they're paperwork.
If you need work to outlive the request that asked for it, you need persistence somewhere. A database table with a poll loop is the simplest version. A real queue (RabbitMQ, SQS) buys fair scheduling, backpressure, and retries handled properly. Marc Brooker's point: queues don't actually deliver exactly-once. They deliver at-least-once and you handle the duplicates. If your workload can't tolerate that, you need idempotency in the consumer regardless of which tool.
Source: Marc Brooker: Exactly-Once Delivery May Not Be What You Want
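A minimal sketch of the consumer-side idempotency described above (names illustrative, in-memory only): record the IDs of processed messages and skip redeliveries, so at-least-once delivery behaves like exactly-once processing.

```python
processed = set()  # in production: a table keyed by message ID, or a unique constraint

def handle(message_id, payload, apply_effect):
    """Process a message at most once, even if the queue redelivers it."""
    if message_id in processed:
        return "duplicate-skipped"
    apply_effect(payload)        # the real side effect
    processed.add(message_id)    # record success only after the effect
    return "processed"

ledger = []
handle("m1", 10, ledger.append)   # first delivery: applied
handle("m1", 10, ledger.append)   # redelivery: skipped
assert ledger == [10]             # the effect happened exactly once
```

In a real system the dedup record and the side effect should commit in one transaction; recording after a non-transactional effect still leaves a small reprocessing window if the consumer crashes between the two steps.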
Later than you think. Postgres handles relational workloads, full-text search, JSON, time-series (with extensions), geospatial (with PostGIS), and vector search (with pgvector). The hard line comes when one workload starts to dominate enough that operating Postgres for it becomes harder than running a specialized store: analytics over billions of rows (ClickHouse), key-value at flat scale (DynamoDB), columnar local-first (DuckDB). For most teams, the right answer is to wait for the actual pain before splitting it out.
Caches add a metastable failure mode that load tests usually miss. When your cache is warm, the database sees 10% of traffic. When the cache empties (eviction storm, network blip, restart), the database suddenly sees 100%. Usually more than it can handle, which keeps the cache from refilling, which keeps the database overloaded. Decide what a cold cache under full traffic looks like before you ship. The fix is usually request coalescing, stale-while-revalidate, or some kind of admission control.
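One of those fixes, request coalescing, fits in a short Python sketch (single-process, illustrative only): on a miss, the first caller for a key goes to the backend, and concurrent callers for the same key wait for its result instead of piling onto the database.

```python
import threading

cache = {}      # key -> value
inflight = {}   # key -> Event for the load currently in progress
lock = threading.Lock()

def get(key, load):
    """Return cache[key], coalescing concurrent misses into one backend call."""
    with lock:
        if key in cache:
            return cache[key]
        event = inflight.get(key)
        if event is None:          # first miss for this key: become the leader
            event = threading.Event()
            inflight[key] = event
            leader = True
        else:                      # a load is already running: wait for it
            leader = False
    if leader:
        value = load(key)          # exactly one backend call per cold key
        with lock:
            cache[key] = value
            del inflight[key]
        event.set()                # release the waiters
        return value
    event.wait()
    with lock:
        return cache[key]
```

A production version also needs TTLs, error propagation to waiters, and a stale-while-revalidate path; the point is only that a cold key costs one backend call, not one per concurrent request.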
GraphQL pays off at client-server edges where the client wants to fetch exactly the fields it needs in one round trip, and where the complexity of writing resolvers is a price you can afford. gRPC wins for internal service-to-service traffic: binary framing, HTTP/2 streaming, generated clients, tight schema discipline. REST is still the right default for everything else, especially public APIs where caching, browser support, and developer familiarity matter more than the protocol's expressive power.
Postgres full-text search is more capable than most teams realize. For typical workloads (tens of millions of rows, basic ranking, single-language stemming) it's competitive with Elasticsearch and Meilisearch and skips the operational cost of running a second system. The line is when you need fuzzy/typo-tolerant matching, rich faceting, multi-language ranking, or real-time updates over very large indexes. Until then, the right move is staying in Postgres.
pgvector inside the Postgres you already run wins until the retrieval workload starts to dominate normal load. Specialized vector databases buy you horizontal sharding past tens of millions of vectors, faster index rebuilds, and best-in-class filtered approximate-nearest-neighbor search. If your collection fits comfortably alongside your transactional data and your query patterns are simple kNN with a few filters, stay in Postgres.
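For intuition, this is the computation a kNN query ranks by, sketched as exact brute force in plain Python (pgvector's HNSW and IVFFlat indexes approximate the same ordering at scale; the names and vectors here are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def knn(query, vectors, k=2):
    """Exact nearest neighbors by cosine distance, closest first."""
    ranked = sorted(vectors, key=lambda doc: cosine_distance(query, vectors[doc]))
    return ranked[:k]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
knn([1.0, 0.05], docs)   # ["a", "b"] -- nearest by angle to the query
```

Brute force is O(documents) per query; the specialized stores and index types exist to avoid exactly that scan once the collection gets large.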
Original writing coming.
Smarter Dev essays, walkthroughs, and short courses on architecting production systems will land here as they're written.
Join the Discord to be notified.