Case 04 · SyncRoom: a realtime collaboration workspace for remote teams

The thesis in one line: this case drills ordering and convergence in realtime collaboration — the hard part is not sending messages, but making everyone eventually see the same trusted collaboration history despite disconnects, reconnects, multiple devices, concurrent edits, and duplicate delivery.

🧪 Case track, case 4 · This case drills one thing
Drill architectural judgment for realtime chat + collaborative documents + notifications: when simple polling is enough, when long connections become necessary, who owns message ordering, how offline users catch up, how concurrent edits avoid overwriting each other, and which states can be eventually consistent.
Important reminder: this is a teaching case, not any collaboration product's internal blueprint. The numbers are for order-of-magnitude reasoning. The goal is judgment, not a single correct answer.

After reading you should be able to	How this case trains it
Explain why realtime systems cannot rely only on request-response	Use WebSocket long connections, connection routing, and heartbeats
Reason about message order, duplicates, and offline catch-up	Use server seq, ack, retry, idempotent dedupe, and cursor-based pulls
See why collaborative editing cannot use last-save-wins	Use operation logs, single-document serialization, and OT / CRDT
Separate strongly consistent data from relaxed states	Treat messages / documents as core state, and presence / typing / read receipts / notifications as side paths

Opening: why "can send messages" is not realtime collaboration

Because remote teams need more than a chat box that refreshes. They need a system that makes people feel they are working in the same room.

SyncRoom is a realtime collaboration workspace for remote teams. Inside a project room, members can chat, mention teammates, co-edit meeting notes, see who is online, and receive offline push notifications. If someone disconnects and returns, messages must not be lost. If two people edit the same note at the same time, one person's work must not overwrite the other's. Phone, web, and desktop clients should converge on the same unread state and history.

At first glance, it looks like a few ordinary features combined:

room chat;
online status;
typing indicator;
read / unread state;
collaborative meeting notes;
mention alerts and offline push.

But after launch, the real incidents are not "a message arrived one second late." They are:

Messages arrive out of order, repeat, or disappear; offline catch-up misses gaps; concurrent document edits overwrite each other; unread and read states fight across devices.

So this chapter is not about "how to build a WebSocket demo." It asks a sharper question:

How do you make chat messages reliable and ordered, collaborative documents convergent, and presence / notifications restrained on an unreliable network?

The pressure source is different from the first three cases:

StarArena fears high concurrency and inventory mistakes.
PatchDesk fears multi-tenant boundaries and uncontrolled side effects.
DocuMind fears untrustworthy answers and broken evidence chains.
SyncRoom fears unreliable realtime networks while users still expect everyone to see the same collaboration history.

Mini glossary before reading

This chapter repeats a few terms. Here they are in plain language:

Term	Plain-language meaning
WebSocket	A long-lived two-way connection between browser and server. The server can push messages to the client.
SSE	Server-Sent Events. The server pushes events one way to the browser. Useful for downstream updates, weaker for complex two-way collaboration.
Long connection	A connection that stays open instead of closing after one HTTP request. Realtime systems use it for low-latency push.
Heartbeat	Client and server periodically say they are alive. If heartbeats stop, the connection is considered dead.
Route table	Records which gateway a user is currently connected to. To push to someone, first find where they are connected.
seq	Sequence number. The server assigns increasing numbers inside one room / conversation, and clients display messages by seq.
ack	Acknowledgement. The client tells the server: I received this message.
Idempotency	Processing the same message multiple times still has the effect only once. Retry creates duplicates, so idempotency is required.
Offline catch-up	After a user disconnects or goes offline, they later pull the missing messages from where they left off.
Cursor	A position marker such as "I have received / read up to here."
Read watermark	A read position: "this user has read up to this seq in this room." Unread counts can be recomputed from it.
Presence	Online status, typing state, cursor position, and similar ambient collaboration state. Useful, but usually not strongly consistent.
OT	Operational Transformation. It rewrites (transforms) each concurrent editing operation against the operations already applied, so every client converges on the same document and edits do not overwrite each other.
CRDT	Conflict-free Replicated Data Type. A data structure where concurrent changes from different clients automatically merge into the same final state.
Operation log	Every edit is recorded as an operation instead of only saving final text. It supports replay, audit, and version history.
Snapshot	A periodic full copy of the document state, so opening a document does not require replaying every operation from the beginning.

1. Starting point: get the realtime basics right first

SyncRoom version one has a simple goal: let a team of fewer than 20 people chat in project rooms, read history, and co-maintain a meeting note.

The starting constraints look roughly like this:

Dimension	Starting phase
Team size	5-8 engineers
Room count	Fewer than 1,000
Members per room	5-30
Concurrent online users	1,000-5,000
Peak message writes	50-200 per second
Collaborative docs	1-3 notes per room
Core goal	Chat must not be lost or disordered; docs must not overwrite edits
Must not fail	A sent message disappears; concurrent editing loses one person's content

The right architecture at this point is not multi-region active-active or a complex CRDT platform from day one. It is a central long-connection gateway + message service + operation log + simple collaboration engine:

Browser / mobile client
      │ WebSocket
      ▼
┌────────────────────────────────────────────┐
│ Long-connection gateway                     │
│ heartbeat, connection management, uplink,   │
│ downstream push                             │
└──────────────┬─────────────────────────────┘
               ▼
┌────────────────────────────────────────────┐
│ SyncRoom backend                            │
│ room messages → assign seq → persist        │
│ → deliver / offline catch-up                │
│ collaborative docs → merge op → op log      │
│ → broadcast                                 │
│ presence / notification → async side path   │
└──────────────┬─────────────────────────────┘
               ▼
       ┌──────────────┐
       │ message / op  │
       │ storage       │
       │ snapshots     │
       │ cursors       │
       └──────────────┘

This is not "adding complexity early." It establishes the basic realtime floor: core history is reliable, ambient presence can be relaxed.

2. Quantified assumptions: large requests will not kill it first; long connections and disorder will

Run the numbers. Suppose SyncRoom has been adopted by mid-sized remote teams for half a year:

Registered teams: 2,000
Active teams: 500
Daily active users: 30,000
Concurrent online users: 8,000-20,000
Devices per user: 1.5-2.2
Peak long connections: 15,000-40,000
Total rooms: 50,000
Active rooms: 5,000
Peak chat messages: 1,000-3,000 per second
Peak presence events: 10,000-50,000 per second
Collaborative document operations: 500-2,000 ops per second
Typical online users per room: 5-50
Large-room online users: 200-1,000
Target: online message P95 < 300ms, offline catch-up P95 < 2s
Collaboration target: local edit shows immediately, remote sync P95 < 500ms
Presence target: may converge within 5-15 seconds
Reliability target: core messages are not lost; duplicate messages display once; same-room messages display by server seq
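
As a rough cross-check of the figures above, the following sketch multiplies the stated ranges. The variable names are illustrative, and the fan-out estimate assumes each chat message is pushed once to every online member of a typical room, which this case does not state explicitly.

```python
# Order-of-magnitude cross-check of the assumptions listed above.
# Assumption: each chat message is pushed once to every online member of a typical room.

online_users = (8_000, 20_000)       # concurrent online users (low, high)
devices_per_user = (1.5, 2.2)        # devices per user (low, high)
chat_writes_per_s = (1_000, 3_000)   # peak chat messages per second
online_per_room = (5, 50)            # typical online users per room


def span(a, b):
    """Multiply two (low, high) ranges and round to whole numbers."""
    return (round(a[0] * b[0]), round(a[1] * b[1]))


connections = span(online_users, devices_per_user)
deliveries_per_s = span(chat_writes_per_s, online_per_room)

print("peak long connections:", connections)
print("peak chat deliveries per second:", deliveries_per_s)
```

The connection estimate (12,000 to 44,000) brackets the 15,000-40,000 figure above. Under the stated assumption, outbound deliveries (5,000 to 150,000 per second) can be far larger than inbound writes, which is one reason the workload is shaped by connections and fan-out rather than by request size.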

The individual messages are small, but the workload shape is unusual:

Connections are stateful: if a user is connected to a gateway, pushes must go to that gateway.
The network will disconnect: mobile backgrounding, weak subway networks, and sleeping laptops all break connections.
Arrival order is not real order: messages and edits can arrive out of order because of network jitter.
State updates are very frequent: online, typing, cursor, and read changes outnumber real messages.

So SyncRoom's architectural center of gravity is not "how to process one large request." It is:

Make core collaboration history reliable and ordered, while making volatile ambient state eventually consistent and degradable.

3. Trigger signals: when version one starts to be insufficient

Once version one is running, do not upgrade by feeling. Watch these signals:

Signal	What it looks like	Why this is architectural
Messages sometimes appear out of order	"Received" appears before "can you see this?"	The client sorts by arrival time; there is no server seq
Users report missing messages	After reconnect, a few middle messages are gone	No durable cursor or offline catch-up
Messages display twice	Under weak network, the same message appears twice	Retry lacks idempotent dedupe
Unread differs across devices	Phone shows 3 unread, web shows 0	Read cursor has no server convergence point
Large rooms stutter	Presence events from a 500-person room overload gateways	Ambient state lacks throttling / aggregation / degradation
Meeting notes get overwritten	A paragraph written by A disappears after B saves	Collaborative docs are saved like ordinary forms; no op merge
Doc jumps after reconnect	Offline edits are replaced by the server snapshot	Offline operations are not merged with OT / CRDT
Mention notifications become spam	One discussion triggers repeated multi-device, multi-channel alerts	Notifications lack dedupe, rate limits, and online / offline separation

These signals are not saying "add more machines." They are saying: the realtime system lacks reliable delivery, ordering, and merge protocols.

4. Core tension: users want realtime feel; the system needs trusted history

SyncRoom has three groups of core objects:

Room / member / connection: who is in the room, and which gateway they are connected to now.
Message / cursor / notification: what was sent, who received up to where, who read up to where, and who needs an alert.
Document / operation / snapshot: what editing intent was made, how it merges, and how the final document is rebuilt.

If you look only at the simplest path, it feels like this:

User sends message → server forwards → others display
User edits doc → save latest text → others refresh

A real system must answer at every step:

What is the server-defined order of this message?
Is the receiver online? Which gateway are they connected to?
After reconnect, from which cursor should the client catch up?
Has this retried message already been processed?
Which device's read state wins?
If two people edit the same paragraph at the same time, whose intent is preserved?
Can online status and typing state be dropped or delayed?

The new architectural statement becomes:

Core history must be reliable and ordered; collaborative editing must merge intent; ambient state and notifications must degrade gracefully.

The easy trap is mixing chat messages and collaborative documents as if they were the same consistency problem.

Chat messages: append history → ordering, delivery, catch-up, dedupe
Collaborative docs: many users mutate one state → operation merge, convergence, no overwrite
Presence: ambient feel → fast, light, expiring, degradable

If all three are forced into one strongly consistent model, the system becomes slow and expensive. If all three are treated as temporary messages, history gets lost, ordering breaks, and edits overwrite each other.

5. Solution reasoning: what actually makes realtime collaboration reliable?

This is the most important decision in the case. Many realtime demos work until weak networks, multiple devices, offline usage, and concurrent edits show up.

Option A: polling + last-save-wins

Client polls every 3 seconds for new messages
Document saves the entire text each time
The last saver overwrites the previous version

Benefit	Cost
Simplest implementation, works with ordinary web stack	Not truly realtime; polling wastes server capacity
Document save logic is simple	Concurrent edits lose content; offline merge is almost impossible
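
A minimal sketch of how the overwrite in Option A happens, assuming two clients that each load the note, edit a different part, and save the whole text. The note contents and variable names are made up for illustration.

```python
# Two clients load the same note, edit different parts, and save the whole text.
server_note = "Agenda\nOld decision"

# Both clients start from the same copy.
copy_a = server_note
copy_b = server_note

# A appends a decision; B rewrites the old paragraph.
copy_a = copy_a + "\nDecision: continue"
copy_b = copy_b.replace("Old decision", "Owner: B")

# Whole-document saves: the last writer replaces everything.
server_note = copy_a   # A saves first
server_note = copy_b   # B saves last and overwrites A's text

lost = "Decision: continue" not in server_note
print(repr(server_note), "A's edit lost:", lost)
```

Neither client did anything wrong locally. The loss comes from the save model, which has no record of what each person intended to change.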

Option B: WebSocket push, but order is client-owned

Client sends message → gateway broadcasts → clients display by local time / arrival time

Benefit	Cost
Realtime feel improves a lot	Network jitter causes disorder; retry creates duplicates
Good enough for a small-room demo	Offline catch-up, multi-device sync, and unread cursors lack a reliable base

Option C: server seq + persistence + ack / retry / dedupe

Send message
  └─▶ server assigns room_seq
      └─▶ persist first
          └─▶ deliver online / store cursor for offline
              └─▶ client ack, dedupe and sort by seq

Benefit	Cost
Same-room messages have one server-defined order	Need seq assignment, ack, retry, and cursors
Offline reconnect can catch up from last_seen_seq	Protocol complexity rises
Duplicate delivery does not duplicate display	Client and server both store message ID / seq
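
The server side of Option C can be sketched as follows. This is an in-memory stand-in, not a production design: MessageService, send, and messages_after are hypothetical names, the dictionaries stand in for durable storage, and a real service would need atomic per-room seq assignment (for example a single writer per room or a database sequence).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    room_id: str
    room_seq: int
    message_id: str   # client-generated ID, reused when the client retries
    body: str


@dataclass
class MessageService:
    """In-memory stand-in for the message service and its durable storage."""
    logs: dict = field(default_factory=dict)    # room_id -> list of Message, ordered by room_seq
    by_id: dict = field(default_factory=dict)   # message_id -> Message, for idempotent retries

    def send(self, room_id: str, message_id: str, body: str) -> Message:
        # Idempotency: a retried send returns the already-persisted message.
        existing = self.by_id.get(message_id)
        if existing is not None:
            return existing
        log = self.logs.setdefault(room_id, [])
        msg = Message(room_id, len(log) + 1, message_id, body)
        # Persist first; only a persisted message may be delivered or acknowledged.
        log.append(msg)
        self.by_id[message_id] = msg
        return msg

    def messages_after(self, room_id: str, last_seen_seq: int) -> list:
        """Cursor-based pull used for offline catch-up."""
        return [m for m in self.logs.get(room_id, []) if m.room_seq > last_seen_seq]


svc = MessageService()
first = svc.send("room-R", "m-1", "can you see this?")
retry = svc.send("room-R", "m-1", "can you see this?")   # client retry after a lost ack
second = svc.send("room-R", "m-2", "received")
catch_up = svc.messages_after("room-R", last_seen_seq=1)
```

The retry returns the message that was already stored instead of creating a second one, so the room history contains each message exactly once regardless of how often the client resends.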

Option D: collaborative documents use operation logs + OT / CRDT

User editing does not save the whole document
It sends an operation: insert "decision" at position 10
Server / CRDT engine merges operations
All clients converge to the same document

Benefit	Cost
Last-save overwrite disappears	OT / CRDT is harder to understand and implement
Offline edits can merge later	Requires operation logs, snapshots, and conflict tests
Version history and audit come naturally	The document model moves from "final text" to "operation sequence"

SyncRoom chooses, for phase one: chat uses server seq + ack / retry / dedupe; documents use centralized OT or a mature CRDT library; presence and notifications use eventually consistent side paths.

The key is not merely "use WebSocket." The key is:

Realtime is experience; ordering, catch-up, merge, and idempotency are the foundation of trustworthy collaboration.

6. Key architecture decisions: record the "why" with ADRs

ADR means Architecture Decision Record. Realtime systems are often questioned later: "Why not sort by client time? Why persist before delivery? Why not save the whole document? Why can online status be inaccurate?" Those answers should be recorded before memory fades.

ADR-01: use WebSocket long-connection gateways + user connection route table, with SSE / HTTP fallback

Context: realtime experience requires the server to push actively; ordinary HTTP request-response cannot notify online users with low latency.
Decision: clients connect to long-connection gateways through WebSocket; connection open / close updates a user_id -> gateway_id route table; delivery checks the route first. Enterprise proxies, weak networks, or read-only observer mode may fall back to SSE downstream + HTTP upstream.
Gave up: short polling as the core realtime mechanism.
Gained: online messages have low latency, the server can find the user's current connection, and read-only scenarios still have a usable fallback.
Risk: long connections are stateful; gateway failure causes many users to reconnect at once, and the route table changes frequently.
Revisit when: connection count exceeds one cluster's capacity or cross-region latency becomes visible; then add nearby access, multi-region gateways, and connection migration. If read-only observers greatly outnumber editors, expand SSE usage.
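
A minimal sketch of the user_id -> gateway_id route table from ADR-01, assuming heartbeats refresh a TTL. The class name, method names, and the 30-second TTL are illustrative assumptions. For brevity it tracks one connection per user; with several devices per user the key would more plausibly be a (user_id, device_id) pair.

```python
from typing import Optional


class RouteTable:
    """user_id -> gateway_id, kept alive by heartbeats (hypothetical in-memory version)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._routes = {}   # user_id -> (gateway_id, expires_at)

    def heartbeat(self, user_id: str, gateway_id: str, now: float) -> None:
        # Connection open and every heartbeat refresh the route and its expiry.
        self._routes[user_id] = (gateway_id, now + self.ttl)

    def disconnect(self, user_id: str, gateway_id: str) -> None:
        # Only remove the route if it still points at the gateway that closed;
        # the user may already have reconnected elsewhere.
        entry = self._routes.get(user_id)
        if entry is not None and entry[0] == gateway_id:
            del self._routes[user_id]

    def lookup(self, user_id: str, now: float) -> Optional[str]:
        entry = self._routes.get(user_id)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if now >= expires_at:
            # Heartbeats stopped: treat the connection as dead.
            del self._routes[user_id]
            return None
        return gateway_id


routes = RouteTable(ttl_seconds=30)
routes.heartbeat("user-B", "gateway-7", now=0)
on_time = routes.lookup("user-B", now=10)   # heartbeat is recent, route is valid
expired = routes.lookup("user-B", now=45)   # no heartbeat for 45 seconds

routes.heartbeat("user-B", "gateway-3", now=50)   # reconnected to another gateway
routes.disconnect("user-B", "gateway-7")          # late close event from the old gateway
after_late_close = routes.lookup("user-B", now=55)
```

The guard in disconnect matters because close events can arrive after the user has already reconnected elsewhere. Without it, a late close would erase a valid route.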

ADR-02: chat messages get room-local seq from the server, persist before delivery

Context: client clocks are untrusted, network arrival order is untrusted, and delivery can repeat.
Decision: each room's messages receive increasing room_seq from the server; messages are written to durable storage before online delivery or offline catch-up; clients sort by room_seq and dedupe by message_id.
Gave up: displaying messages by client timestamp / arrival time.
Gained: one ordered history per room, catch-up from last_seen_seq, and duplicate-safe delivery.
Risk: hot large rooms may make seq assignment and writes a bottleneck.
Revisit when: one room's write rate gets too high; consider room partitioning, channel splitting, or large-room shared streams + cursor reads.

ADR-03: collaborative documents store operation logs + snapshots, not last-save-wins

Context: several people can edit the same note at the same time; saving final text drops concurrent intent.
Decision: clients send editing operations; all operations for one document route to one merge worker and are processed serially; the server assigns doc_seq / document version to ops; OT or CRDT merges operations; operation logs are persisted and snapshots are generated periodically.
Gave up: whole-document saves where the last writer wins.
Gained: concurrent edits do not overwrite each other, offline edits can catch up and merge from last_doc_seq, and version history / audit come naturally.
Risk: OT / CRDT boundaries are complex; rich text, tables, and images magnify algorithmic difficulty.
Revisit when: offline editing becomes common and cross-device concurrency gets complex; then prefer a mature CRDT library. While centralized collaboration is enough, OT + a single-document writer is easier to control.
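
The operation log + snapshot idea in ADR-03 can be sketched like this. It assumes the ops in the log have already been merged and ordered by the server, and it uses a plain string as the document. Op, apply_op, and rebuild are hypothetical names, not an API from any collaboration library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    doc_seq: int      # server-assigned order within one document
    kind: str         # "insert" or "delete"
    pos: int
    text: str = ""    # used by insert
    length: int = 0   # used by delete


def apply_op(doc: str, op: Op) -> str:
    if op.kind == "insert":
        return doc[:op.pos] + op.text + doc[op.pos:]
    if op.kind == "delete":
        return doc[:op.pos] + doc[op.pos + op.length:]
    raise ValueError(f"unknown op kind: {op.kind}")


def rebuild(snapshot_seq: int, snapshot_text: str, op_log: list) -> str:
    """Start from the snapshot and replay only the ops recorded after it."""
    doc = snapshot_text
    for op in sorted(op_log, key=lambda o: o.doc_seq):
        if op.doc_seq > snapshot_seq:
            doc = apply_op(doc, op)
    return doc


op_log = [
    Op(1, "insert", 0, text="Notes"),
    Op(2, "insert", 5, text=": draft"),
    Op(3, "delete", 5, length=7),
    Op(4, "insert", 5, text=": final"),
]

from_start = rebuild(0, "", op_log)                 # replay every op
from_snapshot = rebuild(2, "Notes: draft", op_log)  # snapshot taken at doc_seq 2
print(from_start == from_snapshot, repr(from_snapshot))
```

Both paths produce the same text. The snapshot only shortens the replay, and the log remains the source for version history and audit.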

ADR-04: presence, typing, read receipts, and notifications stay off the core path

Context: online status, typing indicators, cursors, read receipts, and push alerts change frequently but tolerate short inaccuracies.
Decision: presence lives in high-speed storage with TTL; typing broadcasts are rate-limited; read receipts converge through server cursors; notifications enter an async queue, with in-app / long-connection alerts while online and Push / email while offline.
Gave up: strongly consistent transactions for all ambient state.
Gained: core message and document paths are not slowed by high-frequency state; ambient state can degrade; notifications do not spam.
Risk: users may briefly see inaccurate online / read state.
Revisit when: read state becomes a compliance or contractual promise; then raise its durability and acknowledgement level. Presence still should not become strongly consistent.
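
Two of the side-path rules in ADR-04 can be sketched briefly: rate-limited typing broadcasts, and notifications that are deduplicated and routed by online state. TypingThrottle, Notifier, the 2-second interval, and the channel names are assumptions made for illustration.

```python
class TypingThrottle:
    """Allow at most one typing broadcast per (room, user) per interval."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_sent = {}   # (room_id, user_id) -> time of last broadcast

    def should_broadcast(self, room_id: str, user_id: str, now: float) -> bool:
        key = (room_id, user_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False   # dropped: typing state tolerates loss
        self._last_sent[key] = now
        return True


class Notifier:
    """Dedupe by event ID, then pick channels by online state."""

    def __init__(self):
        self._seen = set()   # (event_id, user_id) pairs already alerted

    def channels_for(self, event_id: str, user_id: str, online: bool) -> list:
        key = (event_id, user_id)
        if key in self._seen:
            return []   # this user was already alerted for this event
        self._seen.add(key)
        return ["in_app"] if online else ["push", "email"]


throttle = TypingThrottle(min_interval=2.0)
sent = [throttle.should_broadcast("room-R", "user-A", t) for t in (0.0, 0.5, 1.9, 2.0, 2.1)]

notifier = Notifier()
first_alert = notifier.channels_for("evt-1", "user-B", online=True)
repeat_alert = notifier.channels_for("evt-1", "user-B", online=True)
offline_alert = notifier.channels_for("evt-2", "user-B", online=False)
```

Dropping a typing event costs nothing durable, which is why this path can shed load without touching message or document storage.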

7. Structure and data flow after evolution

SyncRoom is not a WebSocket endpoint. It is a collaboration system that handles messages, documents, and ambient state in separate layers.

Starting path

User sends message
  └─▶ gateway broadcasts
      └─▶ clients display

User edits note
  └─▶ save whole text
      └─▶ overwrite old version

Problem: messages have no shared order, disconnects have no catch-up base, and collaborative docs can be overwritten by the last save.

Evolved structure

Client Web / Mobile / Desktop
      │ WebSocket
      ▼
┌──────────────────────────────────────────────┐
│ Long-connection gateway layer                 │
│ heartbeat, connection management, uplink, push │
└───────────────┬──────────────────────────────┘
                ▼
┌──────────────────────────────────────────────┐
│ SyncRoom core services                         │
│                                              │
│  ┌──────────────┐   ┌──────────────┐         │
│  │ Message       │   │ Collaborative │         │
│  │ service       │   │ doc service   │         │
│  │ room_seq/ack  │   │ op merge      │         │
│  └──────┬───────┘   └──────┬───────┘         │
│         │                  │                 │
│  ┌──────▼───────┐   ┌──────▼───────┐         │
│  │ Message       │   │ op log +      │         │
│  │ storage       │   │ snapshots     │         │
│  └──────────────┘   └──────────────┘         │
│                                              │
│  ┌──────────────┐   ┌──────────────┐         │
│  │ Route table   │   │ Presence /    │         │
│  │ user->gateway │   │ notification  │         │
│  └──────────────┘   │ TTL/queue/rate │         │
│                     └──────────────┘         │
└──────────────────────────────────────────────┘

The core change is not "use WebSocket." The structure is clearer:

Long-connection gateways own connections and forwarding, not business order.
Message service assigns room-local seq, persists before delivery, and uses ack / retry / dedupe for reliability.
Collaborative document service handles editing operations, merges with OT / CRDT, and stores operation logs plus snapshots.
Route table solves "which gateway is this user connected to now?"
Presence / notifications are side paths that can be rate-limited, expired, and degraded.

Follow one "reconnect and catch up" flow end to end

1. User A sends a message in room R.
2. Message service assigns room_seq=1042 and writes it to message storage.
3. User B is online; the route table says B is connected to gateway-7, so the message is pushed there.
4. B's network jitters, so ack does not arrive in time.
5. Server retries delivery; B's client dedupes by message_id / room_seq and displays it once.
6. B disconnects for two minutes; room R receives seq=1043-1050 during that time.
7. B reconnects with last_seen_seq=1042.
8. Server returns messages where seq > 1042; client sorts by seq and catches up.
9. Read cursor is updated on the server, and phone / web eventually converge on the same unread state.

Key points:

Message order is defined by server seq, not client time.
Delivery can repeat; display must be idempotent.
Offline catch-up uses cursors, not the assumption that Push always succeeds.
Unread and read state converge through server cursors instead of each client calculating alone.
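
The client side of this flow can be sketched as below, using the same seq numbers as the walkthrough. ClientInbox and the server_log dictionary are hypothetical stand-ins. The sketch assumes room_seq values are contiguous and that the client already holds history up to seq 1041.

```python
class ClientInbox:
    """Client-side view of one room: dedupe by message_id, display by room_seq."""

    def __init__(self, last_seen_seq: int):
        self.last_seen_seq = last_seen_seq   # highest gap-free seq received so far
        self._by_seq = {}                    # room_seq -> body
        self._seen_ids = set()

    def receive(self, room_seq: int, message_id: str, body: str) -> bool:
        """Return True if the message is new, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self._by_seq[room_seq] = body
        # Advance the cursor only across a gap-free prefix.
        while self.last_seen_seq + 1 in self._by_seq:
            self.last_seen_seq += 1
        return True

    def display(self) -> list:
        return [self._by_seq[seq] for seq in sorted(self._by_seq)]


# Hypothetical server history for room R: seq 1042 through 1050.
server_log = {seq: (f"m-{seq}", f"message {seq}") for seq in range(1042, 1051)}

inbox = ClientInbox(last_seen_seq=1041)
first = inbox.receive(1042, *server_log[1042])       # steps 2-3: pushed while online
duplicate = inbox.receive(1042, *server_log[1042])   # step 5: server retry, shown once

# Steps 6-8: B was offline; on reconnect it pulls everything after its cursor.
missing = [(seq, *server_log[seq]) for seq in sorted(server_log) if seq > inbox.last_seen_seq]
for seq, message_id, body in reversed(missing):      # arrival order does not matter
    inbox.receive(seq, message_id, body)

print(inbox.last_seen_seq, len(inbox.display()))
```

The catch-up batch is fed in reverse on purpose: display order still follows room_seq, and the cursor moves only once the gap behind it is filled.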

Follow one "two people edit the meeting note" flow end to end

1. Current document version is v10.
2. User A inserts "Decision: continue" under the title, producing opA.
3. User B simultaneously deletes an old decision paragraph, producing opB.
4. Both ops arrive through WebSocket; arrival order can vary.
5. Ops for the same document route to the same worker and are processed serially.
6. Server assigns doc_seq to merged ops, and the OT / CRDT engine preserves both editing intents.
7. Merged ops are appended to the operation log, producing version v11.
8. Server broadcasts final ops to all collaborators.
9. Every client applies the same ops and converges on the same note.
10. A periodic snapshot saves the full v11 document, so the next open does not replay from the first op.

Key points:

Collaborative editing stores operations, not only final text.
One document needs a deterministic operation order, otherwise concurrent operations cannot merge reliably.
Reconnecting clients also carry last_doc_seq, so missing document operations can be caught up.
OT / CRDT is not about deciding who wins; it preserves intent and converges.
The operation log is the basis for merging, version history, and audit.
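
To make "preserves intent and converges" concrete, here is a deliberately tiny character-level OT sketch for the opA / opB situation above. It is an assumption-laden illustration, not the algorithm of any particular library: it covers only an insert and a delete that do not overlap, and the note text is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


def apply(doc: str, op) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def transform_delete_after_insert(d: Delete, i: Insert) -> Delete:
    """Rewrite a delete so it removes the same characters once `i` has been applied."""
    if i.pos <= d.pos:
        return Delete(d.pos + len(i.text), d.length)   # insert landed before the range
    if i.pos >= d.pos + d.length:
        return d                                       # insert landed after the range
    raise NotImplementedError("insert inside the deleted range is out of scope here")


def transform_insert_after_delete(i: Insert, d: Delete) -> Insert:
    """Rewrite an insert so it lands at the same place once `d` has been applied."""
    if i.pos <= d.pos:
        return i                                       # delete happened after the insert point
    if i.pos >= d.pos + d.length:
        return Insert(i.pos - d.length, i.text)        # shift left by the removed text
    raise NotImplementedError("insert inside the deleted range is out of scope here")


v10 = "Title\nOld decision\n"
old = "Old decision\n"
op_a = Insert(len("Title\n"), "Decision: continue\n")   # A: insert under the title
op_b = Delete(v10.index(old), len(old))                  # B: delete the old paragraph

# Server happens to serialize A before B: B's op is transformed against A's.
a_then_b = apply(apply(v10, op_a), transform_delete_after_insert(op_b, op_a))
# Server happens to serialize B before A: A's op is transformed against B's.
b_then_a = apply(apply(v10, op_b), transform_insert_after_delete(op_a, op_b))

print(repr(a_then_b), a_then_b == b_then_a)
```

Whichever op the worker processes first, the result keeps A's new decision and removes B's old paragraph. Overlapping ranges, rich text, and offline branches are where real OT and CRDT implementations earn their complexity.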

8. What if it breaks: failure scenarios and fallbacks

Failure	Direct result	Detection	Architectural fallback
Gateway crashes	All users on that gateway disconnect	Connection count drops, reconnect spike	Client exponential backoff reconnect; route table expiry; gateway horizontal scaling
Route table is stale	Message goes to the wrong gateway or cannot be delivered	Delivery failure rate, route miss	Route TTL; re-check latest connection after failed delivery; reconnect refreshes route
Message pushes before persistence	Push succeeds but history is missing, or service failure loses message	Message ID reconciliation, user reports	Persist before delivery; messages that are not persisted cannot ack successfully
Messages sorted by client time	Weak network shows messages out of order	Disorder rate, client logs	Server assigns room_seq; clients display by seq
Retry lacks dedupe	Same message appears multiple times	duplicate message_id	Client and server dedupe by message_id / seq
Offline catch-up relies only on Push	User opens app and still misses history	last_seen_seq gaps	Push only wakes the app; online clients pull missing messages by cursor
Unread is computed locally per device	Phone, web, and desktop disagree	Multi-device state diff	Server stores read watermark / read_cursor; clients eventually converge
Unread count is only cached increment / decrement	Missed decrement or repeated decrement stays wrong	Unread count vs seq diff	Count can be cached, but must be recomputable from read watermark + message seq
Large-room presence broadcasts everything	Online / cursor / typing events overload gateway	Presence QPS, broadcast volume	Rate-limit, sample, aggregate; send only to users viewing the room
Last-save-wins overwrites docs	Concurrent editing loses content	Document diff, user feedback	Operation log + OT / CRDT; forbid whole-document overwrite saves
Operation replay fails	Opening the document shows inconsistent state	Snapshot check, op replay tests	Immutable op log; periodic snapshots; quarantine and repair bad ops
CRDT / OT edge cases are untested	Rich text, table, or image edits drift	Collaboration fuzz tests, replay tests	Build concurrent-edit test sets; split complex structures into independently collaborative blocks
SSE is treated as a universal two-way channel	Upstream still needs HTTP and editing feels worse	Fallback latency, retry volume	SSE is good for read-only / fallback; primary editing uses WebSocket
Notifications spam users	Mentions repeat across devices and channels	Notification dedupe rate, unsubscribe rate	Async notification queue; dedupe by event ID; in-app first while online, Push only when offline

Realtime collaboration maturity is not measured by how fast a demo message flies across the screen. It is measured by whether weak networks, reconnects, duplicates, and concurrent edits are covered by protocol.
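
The two unread-state rows in the table above can be sketched as a forward-only read watermark plus a recomputable count. ReadState and unread_count are hypothetical names, and the count formula assumes room_seq has no gaps and that every message after the watermark counts as unread.

```python
class ReadState:
    """Server-side read watermark for one user in one room."""

    def __init__(self):
        self.read_seq = 0

    def mark_read(self, seq: int) -> int:
        # Watermarks only move forward, so late or repeated updates are harmless.
        self.read_seq = max(self.read_seq, seq)
        return self.read_seq


def unread_count(latest_room_seq: int, read_seq: int) -> int:
    """Recompute unread from the watermark instead of trusting a cached counter."""
    return max(0, latest_room_seq - read_seq)


state = ReadState()
state.mark_read(1047)                  # phone reads up to 1047
after_stale = state.mark_read(1045)    # web reports an older position later
unread = unread_count(latest_room_seq=1050, read_seq=state.read_seq)
print(after_stale, unread)
```

Because the count is derived from two numbers the server already owns, a cached counter that drifts can always be corrected by recomputation.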

📌 Validate your reasoning against the templates

This case is not a rewrite of the realtime chat or collaborative document templates. It separates the state types that are often mixed together in a remote-team product.

Reusable template / chapter	What this case reuses	What this case adds
Realtime Chat	Long connections, route table, server seq, ack / retry / dedupe, offline catch-up	Places chat messages inside team rooms and multi-device unread state
Collaborative Doc	Operation logs, single-document serialization, OT / CRDT, snapshots	Puts collaborative editing next to chat, presence, and notifications in one product
Notification System	Async notifications, dedupe, rate limits, multi-channel delivery	Shows online alerts through long connections and offline wake-up through Push / email
Distributed systems: hard truths	Unreliable networks, partial failure, retry	Treats disconnects, reconnects, and duplicate delivery as default
Data consistency engineering	Idempotency, retry, eventual consistency, compensation	Makes message delivery and read cursors recoverable protocols
Designing for failure	Degradation, isolation, circuit breaking	Presence and notifications can degrade without hurting core messages and documents

Reading suggestion: read this case first, then return to the Realtime Chat template and Collaborative Doc template. Both are realtime, but they solve different reliability problems.

🎯 Quick check

🤔 Why should SyncRoom not sort messages in the same room by client time?

A. Because client clocks and network arrival order are unreliable, so the server should assign room-local seq
B. Because clients cannot display timestamps
C. Because every message needs one global order across the whole product

🤔 Which mechanisms usually combine to avoid losing messages while avoiding duplicate display?

A. WebSocket alone is enough
B. Persist first, ack, retry, and idempotent dedupe by message_id / seq
C. Push notification alone

🤔 Why is last-save-wins wrong for co-editing meeting notes?

A. Because saving whole text uses too much disk
B. Because concurrent editing loses other people's editing intent; use operation logs + OT / CRDT merge instead
C. Because documents can only be edited by one person

🤔 How should presence data such as online status, typing, and cursor position be handled?

A. Make it strongly consistent and durable like messages
B. Store it in expiring fast state, rate-limit broadcasts, and tolerate short inaccuracies
C. Broadcast every change to everyone with no limits

Case summary

Realtime is not the core; trusted collaboration history is. WebSocket is only the channel. Server seq, persistence, ack, catch-up, and dedupe make chat reliable.
Chat and collaborative documents are different problems. Chat appends history and needs ordered delivery; documents mutate shared state and need OT / CRDT merge and convergence.
Offline catch-up cannot rely on Push. Push wakes the app; actual catch-up uses server history and client cursors.
Multi-device state needs a server convergence point. Unread / read state cannot be calculated independently on each client forever.
Presence can be relaxed. Online, typing, and cursor position may be briefly inaccurate; they need rate limits and degradation, not competition with the core path.
Operation logs fit collaboration better than final text. They support merge, replay, version history, and audit.

Bridge forward: this case combines Realtime Chat, Collaborative Doc, and Notification System inside one product. If the next case moves into content feed / video distribution, the pressure changes again: fanout amplification, hot content, recommendation, search, and CDN delivery.

Template cross-check: Realtime Chat · Collaborative Doc · Notification System
Methodology: 02 · The architect's thinking framework · 07 · Designing from 0 to 1 · 08 · ADRs & evolution
Hard parts: 10 · Distributed systems: hard truths · 11 · Data consistency engineering · 12 · Designing for failure
