Skip to content

28 · Database and Storage Selection

The thesis in one line: database choice is not just "MySQL or PostgreSQL." It is "what are this data's access patterns, consistency needs, query shapes, growth rate, and failure cost?" Choose storage by drawing the data lifecycle first. Tool names come later.


🧰 Technology Stack Selection Track, Chapter 2 · One thing to practice

Chapter 05 said the hard part of systems is data. Languages can be changed, services can be split, but data placed in the wrong model is expensive to move. This chapter puts databases, caches, search, object storage, and vector stores on one selection map.


Opening: do not make one database solve every problem

Many systems start like this:

   App -> one relational database
          |-- transaction data
          |-- reports
          |-- search filters
          |-- file attachments
          |-- AI retrieval vectors

For an MVP, this can be fine. As the system grows, these access patterns diverge:

  • Orders need correctness and audit.
  • Reports scan a lot of history and should not slow the primary database.
  • Search needs relevance ranking, not only LIKE.
  • Images and videos need cheap storage and CDN delivery.
  • RAG needs vector retrieval, metadata, citations, and permission filtering.

Architectural judgment: do not ask "which database is best." Ask "which storage model should carry which kind of data." A system may need multiple stores, but every new store adds synchronization, consistency, ops, and debugging cost.


1. Draw the data lifecycle

Before choosing a database, trace a piece of data:

   write -> validate -> commit -> query/search -> analyze/report -> archive/delete
      |        |          |          |              |                 |
   who?     how?      how consistent? query shape?   how often?        retention?

Ask five questions:

QuestionWhy it matters
Read-heavy or write-heavy?Drives read replicas, cache, index model
Where is the transaction boundary?Decides whether relational transactions can protect it
What is the query shape?Key lookup, range query, full-text, vector similarity are different
How fast does data grow?Drives partitioning, hot/cold tiering, archival
What happens if it is wrong?Drives consistency, backup, audit, recovery

This is the same discipline as the back-of-the-envelope work in Chapter 07: estimate volume, read/write ratio, and retention before tool choice.


2. Primary storage: relational is still the default start

TypeGood forPoor fit
Relational databaseTransactions, orders, permissions, tenants, ledgersHuge analytics, full-text search, large binary files
Document databaseFlexible structure, whole-document reads/writesStrict transactions, complex joins, rigorous reporting
Key-value storeHigh-speed key access, simple shape, very large scaleAd hoc joins and flexible querying
OLAP / analytical storeReports, aggregation, logs, behavior analyticsHigh-frequency small transactions

Default: keep core transaction data in a relational source of truth. Split reports, search, logs, and vectors only when they become real pressure.


3. Read models: search, analytics, and vectors are not the primary truth

NeedCommon engineTrade-off
Full-text searchElasticsearch, OpenSearch, MeilisearchStrong relevance, but index sync and eventual consistency are required
AnalyticsClickHouse, BigQuery, SnowflakeFast scans/aggregates, not a transaction source
Object storageS3, OSS, GCSCheap durable files, not complex querying
Vector databaseMilvus, Qdrant, pgvectorSimilarity retrieval, but permissions and recall need design
Time-series databasePrometheus, InfluxDBMetrics over time, not generic business objects

The common mistake is making a read model the source of truth:

   Correct:
   primary DB = source of truth
   search / analytics / vector store = read model rebuilt from source

   Wrong:
   user updates profile -> only search index changes -> primary DB never knows

Read models may lag, but you must state how much lag is acceptable, how to repair, and how to rebuild. That is Chapter 11 in practice.


4. RAG: the vector store is not the whole system

An enterprise RAG system such as DocuMind is often oversimplified as:

   chunk document -> vector DB -> topK -> LLM answer

A real system is closer to:

   original object storage -> parsing/chunking -> metadata DB
          |                                      |
          +--------------> keyword index <------+
          +--------------> vector index  <------+
          +--------------> permission filter <--+

Ask:

  • Is permission filtering before retrieval or after retrieval?
  • Is retrieval vector-only, or hybrid search (keyword + vector)?
  • Where are originals and citations stored?
  • Can indexes be rebuilt from originals and metadata?
  • Does cost scale with document and query volume?

A vector database solves similarity recall. It does not solve source of truth, permissions, citations, evaluation, or cost by itself.


5. When to split storage

SignalMeaningPossible action
Reports slow the transaction DBOLTP and OLAP interfereSync to an analytical store
Search relevance is poor or LIKE scans are slowQuery shape is full-text searchBuild a search index
Attachments bloat the DBBinary files do not belong in core tablesMove to object storage + CDN
Table growth slows indexes and backupsNo lifecycle strategyPartition, archive, tier hot/cold data
RAG recall quality is unstableSingle vector retrieval is not enoughHybrid retrieval + rerank + eval

The opposite matters too: if data volume is small, team size is small, and failure cost is low, one relational database with good indexes and backups may be healthier than five stores.


🎯 Quick check

🤔An e-commerce system needs correct order, payment, and inventory state transitions. What is the best default source of truth?
  • AA relational database with transactions, constraints, and a state machine
  • BA full-text search engine because it is flexible
  • CObject storage because it is cheap
🤔What is the right relationship between search indexes, analytical stores, vector stores, and the primary database?
  • AAll of them should be separate sources of truth
  • BThe primary DB or original object store is usually the source of truth; search/analytics/vector stores are read models for specific query shapes
  • COnce you have a vector store, you do not need metadata or permissions

Chapter summary

  • Draw the data lifecycle before choosing storage: write, validate, commit, query, analyze, archive.
  • Relational databases remain the default truth source for many business systems.
  • Search, analytics, and vector stores are read models for specific query shapes.
  • RAG is not a vector store: it also needs originals, metadata, permissions, retrieval strategy, citations, and evals.
  • Split storage on signals: reports hurting transactions, search getting slow, files bloating DB, recall failing.

Next: primary storage answers "where does truth live." Hotspots, traffic spikes, and cross-boundary collaboration bring caches, queues, and events. Chapter 29 covers those middle layers.


💬 Comments