Designing Data-Intensive Applications
The canonical map of databases, replication, consensus, and stream-processing tradeoffs.
Every system has an architecture. Most of them weren't designed; they accumulated.
A running collection of writing, courses, and tutorials on architecting modern systems: how to choose between databases, queues, caches, search engines, APIs, and the rest of the data and integration stack. Less about new tools, more about the judgment behind picking one and living with it.
This index covers the data and integration layer: the systems you build with. For where those systems run, see Infrastructure & Hosting. For keeping them healthy in production, see Production Operations.
Cross-cutting writings on how to think about systems, regardless of which tool you pick.
Principal engineers explaining how Amazon actually builds and operates production services.
AWS distinguished engineer on databases, durability, queues, retries, and metastability.
Empirical safety tests of databases and queues under partitions, clock skew, and faults.
Safety-first engineering doctrine: assertions, static allocation, batching, zero dependencies.
Eight-lecture Cambridge series on clocks, replication, consensus, linearizability, and Spanner.
With enough users, every observable behavior of your API becomes a contract someone depends on.
Multi-tenant fault isolation pattern that limits blast radius without dedicated capacity.
Archive of daily computer-science paper summaries. Paused since 2021; back catalog is foundational.
Why caches create metastable failure modes that load tests miss until production explodes.
Peer tools grouped by what problem they solve. The intro before each list articulates the decision space; the list is what you actually choose between.
Postgres handles 99% of what most teams need. Specialized stores buy specific things: ClickHouse for analytics over billions of rows, DuckDB for local-first columnar work, DynamoDB for flat scaling, SQLite for embedded. Pick by what you'll do too much of.
Official tutorial covering SQL basics, schemas, transactions, inheritance, and Postgres-specific features.
Free comprehensive tutorial covering psql, queries, joins, transactions, indexes, and performance.
Canonical guide to SQL indexing and query performance tuning for application developers.
Official walkthrough of the mysql client, creating databases, tables, and running queries.
Three-hour video course covering installation, SQL syntax, joins, and database design.
Official quickstart for the sqlite3 CLI: creating databases, schemas, and running queries.
When SQLite is the right choice versus a client/server database, with concrete scenarios.
Official getting-started covering documents, collections, CRUD operations, and aggregation.
Free structured courses on data modeling, indexing, aggregation, and operational topics.
Official tutorial walking through tables, items, queries, and the DynamoDB Local environment.
Opinionated guide to single-table design, access patterns, and DynamoDB modeling fundamentals.
Official quickstart for the embedded analytical database, with CLI, Python, and SQL examples.
Hands-on intro to querying CSV, Parquet, and JSON files directly with DuckDB.
Install, load data, and run analytical queries on the columnar OLAP database.
Free courses on data modeling, MergeTree engines, and production operations.
A cache is a second source of truth with worse durability. The decision isn't Redis or Memcached. It's what you're caching, how stale it can be, and what happens when the cache fails. Most cache-related outages are about the failure mode, not throughput.
Official getting-started covering installation, redis-cli, key types, and common commands.
Free structured courses on data structures, caching patterns, and Redis Stack modules.
Introduction to the Linux Foundation Redis fork, including installation and command reference.
Official wiki covering protocol, configuration, tuning, and common usage patterns.
Queues let work outlive the request that asked for it. Streams let multiple consumers read the same log with their own pointers. Pick a queue when downstream is slower than upstream. Pick a stream when you need to replay.
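The difference can be sketched in a few lines of Python (an in-memory toy, not a real broker; all names are illustrative): a queue hands each message to exactly one consumer, while a stream is an append-only log that any number of consumers read at their own offset and can rewind.

```python
from collections import deque

# Toy queue: each message is handed to exactly one consumer.
queue = deque(["job-1", "job-2", "job-3"])
worker_a = queue.popleft()   # "job-1" -- now gone from the queue
worker_b = queue.popleft()   # "job-2"

# Toy stream: an append-only log; each consumer tracks its own offset,
# so the same records are read independently and can be replayed.
log = ["evt-1", "evt-2", "evt-3"]
offsets = {"billing": 0, "analytics": 0}

def read(consumer, n=1):
    start = offsets[consumer]
    batch = log[start:start + n]
    offsets[consumer] = start + len(batch)
    return batch

read("billing", 2)       # ["evt-1", "evt-2"]
read("analytics", 3)     # ["evt-1", "evt-2", "evt-3"] -- same log, own pointer
offsets["billing"] = 0   # replay: rewind the pointer; the log is still there
```

The pop is destructive; the offset is not. That one difference is most of what you're choosing between.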
Six canonical tutorials: work queues, pub/sub, routing, topics, RPC, and acknowledgements.
Full documentation hub: clustering, persistence, flow control, monitoring, and production tuning.
Start a broker, create topics, produce and consume messages, and run Kafka Connect.
Free video course on Kafka fundamentals: topics, partitions, producers, consumers, and brokers.
Concept overview and walkthroughs covering core NATS, JetStream, and key/value stores.
Create queues, send and receive messages, and configure dead-letter queues via console or SDK.
Official guidance on visibility timeouts, polling, idempotency, and queue throughput tuning.
Official guide: XADD, consumer groups, XREADGROUP, acknowledgement, and stream trimming.
Official guide for the Node.js queue library: producers, workers, flows, repeatable jobs.
Official wiki: install, define workers, enqueue jobs, and run the Sidekiq process.
Job idempotency, small arguments, embracing concurrency, and operational guidance.
Official tutorial: define tasks, configure brokers, run workers, and check results.
Canonical post on idempotent tasks, retries, naming, and avoiding common Celery pitfalls.
Most teams reach for Elasticsearch before they need it. Postgres full-text handles more than people think. When you actually need search (relevance tuning, facets, real-time indexing over millions of docs), pick between heavyweight (Elastic, OpenSearch) and lightweight (Meili, Typesense).
Run Elasticsearch locally, index documents, and run match, term, and aggregation queries.
Long-form guide to mapping, analyzers, relevance, aggregations, and cluster scaling.
Run OpenSearch and Dashboards, index data, and run search and aggregation queries.
Install, add documents, and run typo-tolerant searches with filters and ranking rules.
Install, create collections, index documents, and tune ranking and faceting in Typesense.
Object storage is solved. The decisions are cost and lock-in. R2 and B2 have no egress fees, S3 has the deepest ecosystem, MinIO runs on your own hardware. For most workloads, S3-compatible is the only spec that matters.
Create buckets, upload objects, manage access, and configure lifecycle and versioning.
Official guidance on request rates, key naming, multipart uploads, and Transfer Acceleration.
Create R2 buckets, upload objects via Wrangler or the S3-compatible API, and serve them.
Create buckets and application keys, upload files via web UI, CLI, and S3-compatible API.
Install MinIO single-node and distributed, use the mc client, and configure access policies.
For most teams, pgvector inside Postgres is the right answer. Specialized vector databases buy scale (billions of vectors), advanced filtering, or hosted SLAs. The decision point is when the retrieval workload starts to dominate normal load. Usually later than you think.
Install the extension, create vector columns, build HNSW/IVFFlat indexes, and run kNN queries.
Run Qdrant in Docker, create collections, upsert vectors with payloads, and run filtered searches.
Spin up Weaviate Cloud, define collections, import data, and run vector and hybrid queries.
Create an index, upsert vectors with metadata, and run similarity queries via Python SDK.
REST is the default for most public-facing APIs. GraphQL pays off when clients have wildly varying data needs and you can absorb the resolver complexity. gRPC and Protobuf win for internal service-to-service traffic where latency and schema discipline matter. The protocol is a contract: pick once, change rarely.
Official guided entry point that walks new authors through writing their first OpenAPI description.
Step-by-step product-catalog example covering types, required fields, constraints, and $ref.
Reference companion that explains keywords and idiomatic patterns for real-world schema design.
Official per-language quickstarts (Go, Python, Java, C++) covering .proto files and codegen.
Google's vetted rules on tag numbers, enum zero values, and safe schema evolution.
Official tour through schemas, queries, mutations, subscriptions, HTTP transport, and authorization.
Shopify/GitHub veteran's book on schema design, performance, security, and migrating legacy APIs.
Conceptual primer plus links to per-language quickstarts that build a working client and server.
Framework-agnostic walkthrough of routers, queries, mutations, and zod input validation.
Decision-framework video on when tRPC beats GraphQL or REST for typed full-stack TypeScript.
Eight-step TypeScript tutorial building a Books schema, resolvers, and Apollo Sandbox queries.
30-minute course building a realtime todo backend with queries, subscriptions, and authorization.
Spin up Hasura Engine plus Postgres locally and get an auto-generated GraphQL API in minutes.
Install PostgREST and build a todo API backed by a schema with role-based access.
Layer JWT authentication and per-role authorization on top of the Tutorial 0 API.
Overview of Connect's browser- and gRPC-compatible protocol with codegen across Go, TS, and more.
Fifteen-minute walkthrough writing a protobuf schema and serving it with connect-go.
Writers and engineers consistently publishing substantive content on architecture, distributed systems, and performance.
Short answers grounded in the work of practitioners running real production systems.
Forget the enterprise diagrams. For a small team, architecture is the set of recurring decisions you're making every few months: which database, which queue, which observability stack, which auth provider. Will Larson's framing is to write down five real architecture decisions you've made, then find the pattern. That pattern is your architecture. Documents that don't change anything aren't architecture; they're paperwork.
If you need work to outlive the request that asked for it, you need persistence somewhere. A database table with a poll loop is the simplest version. A real queue (RabbitMQ, SQS) buys fair scheduling, backpressure, and retries handled properly. Marc Brooker's point: queues don't actually deliver exactly-once. They deliver at-least-once and you handle the duplicates. If your workload can't tolerate that, you need idempotency in the consumer regardless of which tool.
Source: Marc Brooker: Exactly-Once Delivery May Not Be What You Want
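A minimal sketch of the consumer-side idempotency described above (names illustrative, in-memory only): record the IDs of processed messages and skip redeliveries, so at-least-once delivery behaves like exactly-once processing.

```python
processed = set()  # in production: a table keyed by message ID, or a unique constraint

def handle(message_id, payload, apply_effect):
    """Process a message at most once, even if the queue redelivers it."""
    if message_id in processed:
        return "duplicate-skipped"
    apply_effect(payload)        # the real side effect
    processed.add(message_id)    # record success only after the effect
    return "processed"

ledger = []
handle("m1", 10, ledger.append)   # first delivery: applied
handle("m1", 10, ledger.append)   # redelivery: skipped
assert ledger == [10]             # the effect happened exactly once
```

In a real system the dedup record and the side effect should commit in one transaction; recording after a non-transactional effect still leaves a small reprocessing window if the consumer crashes between the two steps.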
Later than you think. Postgres handles relational workloads, full-text search, JSON, time-series (with extensions), geospatial (with PostGIS), and vector search (with pgvector). The hard line comes when one workload starts to dominate enough that operating Postgres for it becomes harder than running a specialized store: analytics over billions of rows (ClickHouse), key-value at flat scale (DynamoDB), columnar local-first (DuckDB). For most teams, the right answer is to wait for the actual pain before splitting it out.
Caches add a metastable failure mode that load tests usually miss. When your cache is warm, the database sees 10% of traffic. When the cache empties (eviction storm, network blip, restart), the database suddenly sees 100%. Usually more than it can handle, which keeps the cache from refilling, which keeps the database overloaded. Decide what a cold cache under full traffic looks like before you ship. The fix is usually request coalescing, stale-while-revalidate, or some kind of admission control.
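One of those fixes, request coalescing, fits in a short Python sketch (single-process, illustrative only): on a miss, the first caller for a key goes to the backend, and concurrent callers for the same key wait for its result instead of piling onto the database.

```python
import threading

cache = {}      # key -> value
inflight = {}   # key -> Event for the load currently in progress
lock = threading.Lock()

def get(key, load):
    """Return cache[key], coalescing concurrent misses into one backend call."""
    with lock:
        if key in cache:
            return cache[key]
        event = inflight.get(key)
        if event is None:          # first miss for this key: become the leader
            event = threading.Event()
            inflight[key] = event
            leader = True
        else:                      # a load is already running: wait for it
            leader = False
    if leader:
        value = load(key)          # exactly one backend call per cold key
        with lock:
            cache[key] = value
            del inflight[key]
        event.set()                # release the waiters
        return value
    event.wait()
    with lock:
        return cache[key]
```

A production version also needs TTLs, error propagation to waiters, and a stale-while-revalidate path; the point is only that a cold key costs one backend call, not one per concurrent request.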
GraphQL pays off at client-server edges where the client wants to fetch exactly the fields it needs in one round trip, and where the complexity of writing resolvers is a price you can afford. gRPC wins for internal service-to-service traffic: binary framing, HTTP/2 streaming, generated clients, tight schema discipline. REST is still the right default for everything else, especially public APIs where caching, browser support, and developer familiarity matter more than the protocol's expressive power.
Postgres full-text search is more capable than most teams realize. For typical workloads (tens of millions of rows, basic ranking, single-language stemming) it's competitive with Elasticsearch and Meilisearch and skips the operational cost of running a second system. The line is when you need fuzzy/typo-tolerant matching, rich faceting, multi-language ranking, or real-time updates over very large indexes. Until then, the right move is staying in Postgres.
pgvector inside the Postgres you already run wins until the retrieval workload starts to dominate normal load. Specialized vector databases buy you horizontal sharding past tens of millions of vectors, faster index rebuilds, and best-in-class filtered approximate-nearest-neighbor search. If your collection fits comfortably alongside your transactional data and your query patterns are simple kNN with a few filters, stay in Postgres.
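For intuition, this is the computation a kNN query ranks by, sketched as exact brute force in plain Python (pgvector's HNSW and IVFFlat indexes approximate the same ordering at scale; the names and vectors here are illustrative):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def knn(query, vectors, k=2):
    """Exact nearest neighbors by cosine distance, closest first."""
    ranked = sorted(vectors, key=lambda doc: cosine_distance(query, vectors[doc]))
    return ranked[:k]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
knn([1.0, 0.05], docs)   # ["a", "b"] -- nearest by angle to the query
```

Brute force is O(documents) per query; the specialized stores and index types exist to avoid exactly that scan once the collection gets large.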
Original writing coming.
Smarter Dev essays, walkthroughs, and short courses on architecting production systems will land here as they're written.
Join the Discord to be notified.