28 · Database and Storage Selection

The thesis in one line: database choice is not just "MySQL or PostgreSQL." It is "what are this data's access patterns, consistency needs, query shapes, growth rate, and failure cost?" Choose storage by drawing the data lifecycle first. Tool names come later.

🧰 Technology Stack Selection Track, Chapter 2 · One thing to practice
Chapter 05 said the hard part of systems is data. Languages can be changed, services can be split, but data placed in the wrong model is expensive to move. This chapter puts databases, caches, search, object storage, and vector stores on one selection map.

Opening: do not make one database solve every problem

Many systems start like this:

   App -> one relational database
          |-- transaction data
          |-- reports
          |-- search filters
          |-- file attachments
          |-- AI retrieval vectors

For an MVP, this can be fine. As the system grows, these access patterns diverge:

Orders need correctness and audit.
Reports scan a lot of history and should not slow the primary database.
Search needs relevance ranking, not only LIKE.
Images and videos need cheap storage and CDN delivery.
RAG needs vector retrieval, metadata, citations, and permission filtering.

Architectural judgment: do not ask "which database is best." Ask "which storage model should carry which kind of data." A system may need multiple stores, but every new store adds synchronization, consistency, ops, and debugging cost.

1. Draw the data lifecycle

Before choosing a database, trace a piece of data:

   write -> validate -> commit -> query/search -> analyze/report -> archive/delete
      |        |          |          |              |                 |
   who?     how?      how consistent? query shape?   how often?        retention?

Ask five questions:

Question	Why it matters
Read-heavy or write-heavy?	Drives read replicas, cache, index model
Where is the transaction boundary?	Decides whether relational transactions can protect it
What is the query shape?	Key lookup, range query, full-text, vector similarity are different
How fast does data grow?	Drives partitioning, hot/cold tiering, archival
What happens if it is wrong?	Drives consistency, backup, audit, recovery

This is the same discipline as the back-of-the-envelope work in Chapter 07: estimate volume, read/write ratio, and retention before tool choice.

2. Primary storage: relational is still the default start

Type	Good for	Poor fit
Relational database	Transactions, orders, permissions, tenants, ledgers	Huge analytics, full-text search, large binary files
Document database	Flexible structure, whole-document reads/writes	Strict transactions, complex joins, rigorous reporting
Key-value store	High-speed key access, simple shape, very large scale	Ad hoc joins and flexible querying
OLAP / analytical store	Reports, aggregation, logs, behavior analytics	High-frequency small transactions

Default: keep core transaction data in a relational source of truth. Split reports, search, logs, and vectors only when they become real pressure.

3. Read models: search, analytics, and vectors are not the primary truth

Need	Common engine	Trade-off
Full-text search	Elasticsearch, OpenSearch, Meilisearch	Strong relevance, but index sync and eventual consistency are required
Analytics	ClickHouse, BigQuery, Snowflake	Fast scans/aggregates, not a transaction source
Object storage	S3, OSS, GCS	Cheap durable files, not complex querying
Vector database	Milvus, Qdrant, pgvector	Similarity retrieval, but permissions and recall need design
Time-series database	Prometheus, InfluxDB	Metrics over time, not generic business objects

The common mistake is making a read model the source of truth:

   Correct:
   primary DB = source of truth
   search / analytics / vector store = read model rebuilt from source

   Wrong:
   user updates profile -> only search index changes -> primary DB never knows

Read models may lag, but you must state how much lag is acceptable, how to repair, and how to rebuild. That is Chapter 11 in practice.

4. RAG: the vector store is not the whole system

An enterprise RAG system such as DocuMind is often oversimplified as:

   chunk document -> vector DB -> topK -> LLM answer

A real system is closer to:

   original object storage -> parsing/chunking -> metadata DB
          |                                      |
          +--------------> keyword index <------+
          +--------------> vector index  <------+
          +--------------> permission filter <--+

Ask:

Is permission filtering before retrieval or after retrieval?
Is retrieval vector-only, or hybrid search (keyword + vector)?
Where are originals and citations stored?
Can indexes be rebuilt from originals and metadata?
Does cost scale with document and query volume?

A vector database solves similarity recall. It does not solve source of truth, permissions, citations, evaluation, or cost by itself.

5. When to split storage

Signal	Meaning	Possible action
Reports slow the transaction DB	OLTP and OLAP interfere	Sync to an analytical store
Search relevance is poor or `LIKE` scans are slow	Query shape is full-text search	Build a search index
Attachments bloat the DB	Binary files do not belong in core tables	Move to object storage + CDN
Table growth slows indexes and backups	No lifecycle strategy	Partition, archive, tier hot/cold data
RAG recall quality is unstable	Single vector retrieval is not enough	Hybrid retrieval + rerank + eval

The opposite matters too: if data volume is small, team size is small, and failure cost is low, one relational database with good indexes and backups may be healthier than five stores.

🎯 Quick check

🤔An e-commerce system needs correct order, payment, and inventory state transitions. What is the best default source of truth?

AA relational database with transactions, constraints, and a state machine
BA full-text search engine because it is flexible
CObject storage because it is cheap

🤔What is the right relationship between search indexes, analytical stores, vector stores, and the primary database?

AAll of them should be separate sources of truth
BThe primary DB or original object store is usually the source of truth; search/analytics/vector stores are read models for specific query shapes
COnce you have a vector store, you do not need metadata or permissions

Chapter summary

Draw the data lifecycle before choosing storage: write, validate, commit, query, analyze, archive.
Relational databases remain the default truth source for many business systems.
Search, analytics, and vector stores are read models for specific query shapes.
RAG is not a vector store: it also needs originals, metadata, permissions, retrieval strategy, citations, and evals.
Split storage on signals: reports hurting transactions, search getting slow, files bloating DB, recall failing.

Next: primary storage answers "where does truth live." Hotspots, traffic spikes, and cross-boundary collaboration bring caches, queues, and events. Chapter 29 covers those middle layers.

Method core: 05 · Data and state · 06 · Quality attributes · 11 · Consistency engineering
Evolution: 13 · Mechanics of scale · 14 · Evolution and splitting
Case references: DocuMind · StarArena

28 · Database and Storage Selection ​

Opening: do not make one database solve every problem ​

1. Draw the data lifecycle ​

2. Primary storage: relational is still the default start ​

3. Read models: search, analytics, and vectors are not the primary truth ​

4. RAG: the vector store is not the whole system ​

5. When to split storage ​

🎯 Quick check ​

Chapter summary ​

Related links ​

💬 Comments