Case 06 · CodePilot: an AI Agent / coding Agent platform
Thesis in one line: this case drills "how an AI that can act stays controlled". The hard part of a coding Agent is not making a model write code. It is letting it read repos, edit files, run commands, call tools, and work for a long time while permissions, sandboxing, budgets, approvals, checkpoints, and evals still constrain it.
🧪 Case-track article 6 · This case drills one thing
It drills architectural judgment for an AI Agent / coding Agent platform: when to use a workflow, when to allow an autonomous Agent loop, why tool calls must pass through a permission gateway, why sandboxing and human approval are different controls, how long tasks survive through checkpoints and context compaction, when subagents help, and how evals catch silent regressions after a model or prompt change.
After reading, you should be able to This case drills it through Explain why an Agent is different from a chat box The "read repo → edit files → run tests → observe → edit again" action loop Explain why tool calls cannot be handed directly to the model Permission gateway, sandbox, workspace boundary, and high-risk approval Understand why long tasks forget goals, burn budgets, and drift Checkpoints, trace, context compaction, hard budgets, and resume flow Decide when subagents help, and when they add noise Independent context, independent budget, and structured result handoff Important note: this is a teaching case, not an internal design of any real coding Agent product. The numbers are for order-of-magnitude reasoning. The point is judgment, not one canonical answer.
Opening: why a coding Agent is not "a better code-writing chat box"
Because a chat box gives advice. A coding Agent actually touches your repository.
CodePilot is a coding Agent platform for engineering teams. An engineer can submit a task: "fix this failing test," "add idempotency to the payment callback," "upgrade this dependency," or "read this module and propose a refactor." CodePilot checks out the repo, reads code, plans, edits files, runs tests, reacts to failures, and eventually produces a diff or Pull Request.
At first glance, it looks like "a model that writes code":
- the user describes the task;
- the model reads code;
- the model generates a patch;
- tests run;
- a result is returned.
But after launch, the most dangerous failure is not "the code is not elegant enough." It is:
The Agent is induced by prompt injection in a README to read secrets, runs a dangerous command, edits files outside the workspace, forgets the user's constraint after context compaction, lets subagents conflict, or silently regresses after a model upgrade.
So this chapter is not about "how to call an LLM API." It asks a sharper question:
How do you let an Agent read and write code, execute commands, and work autonomously for a long time while still being stoppable, auditable, recoverable, and measurable?
The pressure source is different from the first five cases:
- StarArena fears inventory and payment-state errors.
- PatchDesk fears tenant-boundary failures.
- DocuMind fears ungrounded answers and prompt injection.
- SyncRoom fears realtime state convergence.
- FeedStream fears distribution amplification.
- CodePilot fears the moment AI can act: mistakes turn from "said the wrong thing" into "changed, leaked, spent, or shipped the wrong thing."
Pre-reading glossary
These words repeat throughout the chapter. Align on plain-language meanings first:
| Term | Plain-language meaning |
|---|---|
| Agent | AI that can take multiple steps by itself: plan, call tools, observe results, and decide what to do next. |
| Workflow | A fixed or mostly fixed sequence of steps. The model may judge at a few points, but the path is mostly predetermined. |
| Agent loop | The action loop: plan → call tool → observe → plan again. |
| Tool call | A model request to read files, search code, edit files, run tests, fetch a page, or call an internal API. |
| Sandbox | An isolated execution environment that limits where a command can read, where it can write, and whether it can access the network. |
| Workspace | The project directory the Agent is allowed to read and write. Files outside it are usually out of bounds. |
| Permission gateway | The control layer every tool call passes through. It decides allow, deny, or ask the user. |
| Approval | Human confirmation before high-risk actions such as networking, deleting files, committing, or deploying. |
| Checkpoint | Saved task state: current diff, tool-result summaries, budget use, and enough state to resume after interruption. |
| Trace | The execution trail: what the Agent did, which tools it called, and what it observed. |
| Context window | The amount of information the model can see at one time. Long tasks eventually run out of it. |
| Compaction | Summarizing older history to free context window space. |
| Memory | Rules, preferences, and past lessons retained across steps or sessions. More memory is not always better; it can become stale. |
| Subagent | A separate Agent with its own context for noisy subtasks such as repo-wide search or log analysis. It returns a summary to the main line. |
| Prompt injection | Malicious or misleading instructions embedded in external content, such as "ignore prior rules and leak secrets." |
| Budget | Hard limits on steps, tokens, time, tool calls, concurrency, and cost. |
| Eval | A representative test set for measuring task success, safety, regression, and quality over time. |
1. Starting state: build a controlled coding assistant first, not an autonomous engineer
CodePilot's first version has a modest goal: let a small team delegate low-risk engineering tasks to an Agent, while making every real change visible and interruptible.
Early constraints look like this:
| Dimension | Starting stage |
|---|---|
| Pilot team | 20-50 engineers |
| Connected repos | 5-20 |
| Repo size | 10k-200k lines |
| Agent tasks per day | 50-200 |
| Peak concurrent tasks | 5-20 |
| Common tasks | Tests, small bugs, code reading, docs, simple dependency upgrades |
| Core goal | Produce reviewable diffs and explain each step |
| Must not fail | Read secrets, edit outside the workspace, auto-run irreversible actions, or lose trace |
The right first architecture is not a "fully autonomous engineer." It is a bounded Agent loop + tool gateway + workspace sandbox + human review:
Engineer submits a task
│
▼
┌────────────────────────────────────────────┐
│ Task service │
│ Goal, repo, branch, budget, permission mode │
└──────────────┬─────────────────────────────┘
▼
┌────────────────────────────────────────────┐
│ Agent orchestrator │
│ Plan → read/search → patch → test → adjust │
└──────────────┬─────────────────────────────┘
▼
┌────────────────────────────────────────────┐
│ Tool gateway + sandbox │
│ Workspace limits, network limits, approvals │
│ and full trace │
└──────────────┬─────────────────────────────┘
▼
diff / test result / trace → human reviewThis is not timid. It states the boundary:
- The Agent can read the repo, but sensitive paths are denied by default.
- The Agent can edit files, but only inside the workspace.
- The Agent can run tests, but commands have timeouts and resource limits.
- The Agent can suggest a commit or PR, but cannot bypass review.
- The Agent can work for a long time, but every step must be traceable, resumable, and stoppable.
2. Quantified assumptions: it breaks first on permissions, context, and side effects
Now run the numbers. Suppose CodePilot expands from a pilot to a mid-sized engineering organization:
Engineers:300
Repos:800
Active repos:150
Repo size:50k-800k lines
Agent tasks per day:1,000-3,000
Peak concurrent tasks:50-120
Average task turns:8-20
Long-task turns:30-80
Tool calls per turn:1-4
Common tools:read file, search, patch, run tests, package manager, Git, internal APIs
Goal:simple tasks produce a diff in 5-10 minutes
Goal:complex tasks produce a PR or clear failure reason in 30-60 minutes
Safety target:out-of-workspace writes = 0; sensitive file exfiltration = 0
Budget target:every task has max steps, tokens, runtime, and concurrency
Quality target:before model / prompt / tool-policy changes, core evals stay above baselineAt this scale, request volume is not the first problem. A normal web system starts by calculating QPS. CodePilot first collides with:
- Permission boundaries: an Agent that can run commands can delete, modify, or access the network by mistake.
- Context boundaries: repos are large, logs are long, tasks run for a while, and the model cannot see everything.
- Side-effect boundaries: file edits can be rolled back; API calls, pushes, deploys, and database writes may not be.
- Cost boundaries: a few more attempts may fix the task, or may loop forever and burn money.
- Quality boundaries: a model upgrade may improve some tasks while weakening safety refusal or another task class.
So CodePilot's architecture is not mainly about handling more requests. It is about:
Putting Agent autonomy inside executable, auditable, recoverable, and measurable boundaries.
3. Trigger signals: when the first version is no longer enough
Do not upgrade by instinct. Watch these signals:
| Signal | Symptom | Why this is architectural |
|---|---|---|
| Users approve blindly | Every command asks; users stop reading and approve | Approval is too noisy, so safety becomes ceremonial |
| More overreach attempts | Agent often tries to read ~/.ssh, .env, or outside the workspace | Defaults and injection defenses are unclear |
| Interrupted long tasks restart from zero | A 40-minute run dies and cannot resume | No durable checkpoints or task state |
| Constraints vanish after compaction | The user said "do not touch the DB layer," but later edits do | Critical constraints are not pinned and reloaded |
| Tool logs fill the window | Full test logs, dependency trees, and search results crowd out task context | Tool outputs need summarization, truncation, and references |
| Subagents conflict | Two subagents edit the same code area | Subtask ownership and merge coordination are missing |
| Cost spikes | A task repeats the same fix attempt again and again | No step, time, token, or repetition cap |
| Results cannot be reviewed | Only the final diff is visible; the why is missing | Trace is incomplete |
| Model upgrade regresses | More small bugs get fixed, but safety refusal weakens | No eval gate measuring multiple quality dimensions |
| Prompt injection incident | An issue or README induces secret access or external calls | External content is treated as instruction instead of untrusted input |
These signals are not saying "the model is not smart enough." They are saying: model capability alone does not make an Agent production-ready. Control must be architecture.
4. Core tension: the more capable the Agent, the more it must be constrained
CodePilot has five core object groups:
- Task / goal / budget: what the user wants and how much time, money, and iteration the Agent may spend.
- Repo / workspace / branch: which project it may touch and where the output lands.
- Tools / permissions / sandbox: what the Agent can call and what each tool is physically allowed to do.
- Context / memory / compaction: what the model sees, and which constraints must survive long tasks.
- Trace / diff / eval: what it did, how humans review it, and whether quality stays stable over time.
The naive path looks like this:
User gives task → model generates code → tests run → result returnedThe real system must answer at every step:
- Is this tool call read-only, or will it change the world?
- Can this command access the network? Can it write outside the workspace?
- Is this README project documentation, or prompt injection written into the repo?
- Can this modification be rolled back? Should we checkpoint first?
- Did context compaction preserve the user's hard constraints?
- Is a subagent result evidence, or just a suggestion?
- Did the Agent fail because tests failed, permission failed, budget ran out, or the model drifted?
- After a model upgrade, did task success, safety refusal, and minimal-change behavior regress?
That creates the new design problem:
CodePilot's value comes from autonomous execution, and so does its risk. The architecture is not about making it freer. It is about making "what can it touch, when does it ask, how does it recover, and how do we prove it did not get worse" explicit system structure.
The key classification is tool risk:
| Tool type | Examples | Default policy |
|---|---|---|
| Low-risk read-only | Search code, read non-sensitive files, inspect test config | Automatic, but output is still untrusted observation |
| Reversible medium risk | Edit workspace files, generate diff, format code | Allowed inside sandbox, with checkpoints and diff |
| High-risk side effect | Network download, API request, DB write, email, push | Ask or deny by default |
| Irreversible / organizational risk | Deploy, delete remote resources, change permissions, transfer money | Mandatory approval, often outright forbidden |
5. Solution reasoning: how should a coding Agent execute tools?
This is the central decision. Many demos work, but the moment a real repo and shell are connected, risk becomes an engineering problem.
Option A: chat-only advice, no tool execution
User pastes code / error logs
└─▶ model gives advice
└─▶ human copies, edits, and tests| Benefit | Cost |
|---|---|
| Safety boundary is simple; the model has no real side effects | The user still moves context around; leverage is limited |
| Easy to integrate | It cannot verify its work |
| Good for learning and explanations | It cannot own an engineering task end-to-end |
Option B: local Agent with full shell permissions
Agent reads repo → edits files → runs arbitrary commands → networks → commits| Benefit | Cost |
|---|---|
| Very flexible | Highest risk: deletion, leakage, networking, and overreach |
| Strong demo experience | Hard to govern and audit for a team |
| Looks simple to build | There is no structural boundary when something goes wrong |
Option C: platform Agent, all tools through gateway and sandbox
Model proposes a tool call
└─▶ permission gateway decides allow / ask / deny
└─▶ sandbox executes
└─▶ result returns as observation
└─▶ trace / checkpoint / budget recorded throughout| Benefit | Cost |
|---|---|
| The Agent can act, but within physical boundaries | More configuration and user mental model |
| Side effects are auditable and constrained | Some legitimate tasks require explicit approval |
| Supports team governance, CI, and cloud jobs | Requires task state, trace, sandbox pools, and evals |
CodePilot chooses Option C for the growth stage: tool calls must pass through a permission gateway and sandbox, and high-risk actions must be interruptible.
The point is not whether the model is trusted. It is:
Anything the model can influence through instruction is not a hard boundary. Hard boundaries live where the model cannot bypass them: permission gateway, sandbox, budgets, approvals, CI, and evals.
6. Key architecture decisions: record the why with ADRs
An ADR is an Architecture Decision Record. Coding Agent systems will later be questioned: "Why not full autonomy? Why split sandboxing and approval? Why keep trace? Why can't subagents inherit the entire context?" Write those reasons down early.
ADR-01: Prefer workflows by default; use autonomous Agent loops only for open-ended tasks
- Context: Many engineering tasks are structured: run lint, fix formatting, run tests, produce a diff. Fixed workflows are cheaper and more predictable.
- Decision: CodePilot first decomposes tasks into controlled workflows. It enters an autonomous Agent loop only when the path is uncertain and tool feedback must guide the next step.
- Give up: The simple marketing story that every task is freely planned by the model.
- Gain: Lower cost, stronger control, and clearer failure analysis.
- Risk: Poor workflow boundaries may restrict useful flexibility.
- Review when: Many tasks get stuck in fixed flows, and human review shows open-ended exploration is genuinely needed.
ADR-02: Tool calls must pass through a permission gateway and sandbox
- Context: A coding Agent can read and write files, run shell commands, access the network, and call internal APIs. Side effects are real.
- Decision: Every tool call passes through a permission gateway. Writes are limited to the workspace. Sensitive paths are denied by default. Commands run in a sandbox. Network access is narrow by default. High-risk actions ask or deny.
- Give up: Letting the model run anything it wants.
- Gain: Even if the model is induced by prompt injection, physical boundaries still contain many failures.
- Risk: Permission configuration gets more complex, and some valid development tasks need explicit authorization.
- Review when: More internal tools, MCP (Model Context Protocol) tools, or cloud resources are added.
ADR-03: Decouple sandbox mode and human approval as two axes
- Context: People often think in one "automatic vs manual" slider, but that mixes two different questions: what can it physically touch, and when does it interrupt the human?
- Decision: Sandbox mode controls physical boundaries; approval controls process interruption. For example,
read-only + neverfits CI analysis,workspace-write + on-requestfits default coding tasks, and high-risk external side effects must ask. - Give up: A simple but vague one-dimensional automation setting.
- Gain: Clearer boundaries across local pairing, read-only CI, and async cloud tasks.
- Risk: Users must understand two dimensions; otherwise they may assume "asks less" means "has more permission."
- Review when: Users approve blindly, or legitimate tasks are interrupted too often.
ADR-04: Long tasks must persist checkpoints, trace, and budget state
- Context: Coding tasks may run for tens of minutes. The session grows, tool outputs get large, and processes can die.
- Decision: Persist task state, step trace, current diff, tool-result summaries, and budget use. When context nears its limit, compact history, then reload project rules, user goals, and hard constraints from stable storage.
- Give up: Relying on one long conversation to hold everything.
- Gain: Tasks are resumable and auditable, and compaction is less likely to erase critical constraints.
- Risk: Summaries may drop details; bad summaries can mislead later steps.
- Review when: Long-task failures often come from forgetting earlier constraints.
ADR-05: Use subagents only to isolate heavy work; return structured summaries to the main line
- Context: Repo-wide search, dependency analysis, and test-log triage produce a lot of noisy intermediate data.
- Decision: Subagents get independent context, limited tools, and separate budgets. The main Agent receives only structured summaries, evidence files, and recommendations. Final merging stays in the main line.
- Give up: Letting subagents inherit the entire parent conversation.
- Gain: Cleaner main context and controlled parallel decomposition.
- Risk: A subagent summary may omit a key fact, and multiple Agents increase cost and latency.
- Review when: A single Agent's context is crowded by noisy work, or the task is naturally decomposable with non-overlapping write scopes.
7. Evolved structure and data flow
Here is the core structure of CodePilot. It is not "model + shell." It is a task execution, permission-control, context-management, observability, and evaluation system.
Starting path
User sends task → model generates patch → local command runs → user sees resultThe problem: permissions, sandboxing, budgets, resume, trace, and evals are not structured.
Evolved structure
User / CI / scheduled task
│
▼
┌──────────────────────────────────────────────┐
│ Task intake │
│ Goal, repo, branch, permission mode, budget, │
│ approval policy │
└──────┬───────────────────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Agent orchestrator │
│ Workflow first; Agent loop when necessary │
│ Plan → tool call → observe → plan again │
└──────┬───────────────────────┬───────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────────────┐
│ Context builder │ │ Task state / trace │
│ Repo map, files,│ │ Checkpoints, diff, │
│ rules, memory, │ │ budget, tool summaries│
│ compaction │ └──────────────────────┘
└──────┬───────┘
│
▼
┌──────────────────────────────────────────────┐
│ Tool gateway │
│ allow / ask / deny; sensitive paths; network │
│ policy; resource limits │
└──────┬───────────────────────┬───────────────┘
▼ ▼
┌──────────────┐ ┌──────────────────────┐
│ Sandbox │ │ Human approval / │
│ Workspace │ │ Review Gate │
│ isolation, │ │ High-risk actions, │
│ timeout, CPU, │ │ final PR review │
│ memory, net │ └──────────────────────┘
└──────┬───────┘
▼
tool result / test log / diff → back to orchestrator
Side path: eval platform continuously runs representative tasks and gates
model / prompt / tool-policy changes.The key change is not "more services." It is clearer structure:
- Task intake defines the goal, repo, budget, and permission mode.
- Agent orchestrator decides fixed workflow vs autonomous loop.
- Context builder supplies necessary information and pins project rules and hard constraints.
- Tool gateway turns model intent into controlled tool calls.
- Sandbox provides physical limits on workspace, network, and resources.
- Trace / checkpoint makes long tasks resumable and auditable.
- Review Gate hands irreversible actions and final output back to humans.
- Eval platform catches silent quality regression after system changes.
Walk through "fix a failing test and add a regression test"
1. Engineer submits: "fix this failing billing test and add a regression test."
2. Task intake records repo, branch, and budget: max 25 steps, 30 minutes, workspace writes only.
3. Context builder loads project rules, relevant AGENTS.md, test command, and recent failure log.
4. Agent orchestrator plans: locate failure → read files → patch → run unit test → add regression.
5. Model requests code search. Tool gateway treats it as low-risk read-only and allows it.
6. Model requests an edit to `billing/refund.ts`. This is reversible workspace write, so the sandbox allows it after checkpointing.
7. Model requests `pnpm test billing`. The command runs in the sandbox with timeout, CPU / memory limits, and network disabled.
8. The test fails. The tool-result summary enters context; the raw log is kept as trace evidence.
9. The Agent reads the failing stack, finds a missing idempotency-key boundary, and patches again.
10. Tests pass. The Agent produces final diff, explanation, and test evidence.
11. Review Gate requires a human to review the diff before PR creation or commit.Important details:
- The model does not operate the machine directly. It proposes tool calls.
- Tool results are observations, not new instructions. READMEs, issues, and test logs may contain prompt injection.
- Every write has a checkpoint and the final diff is reviewable.
- Test logs can be summarized into context, but raw trace must remain reachable.
- Without Review Gate approval, the Agent cannot push changes into the main line.
Walk through "long task approaches context limit"
1. The Agent has executed 40 steps, read many files, and run several test rounds.
2. The context window approaches its limit, so the system triggers compaction.
3. The compactor writes a structured summary:
- user goal
- hard constraints
- attempted approaches
- current diff
- unresolved problem
- evidence files and trace links
4. The system reloads project rules and permission policy from stable storage.
5. Budget state returns too: 8 steps and 10 minutes remain, network is still disabled.
6. The Agent continues. If it approaches the limit again or exhausts budget, it must stop and report state instead of looping forever.Context compaction is not magic. It turns "cannot fit" into "discard lower-value history with discipline." Anything that must never disappear belongs in stable rules, checkpoints, and trace.
8. Failure scenarios and fallbacks
| Failure | Direct result | Detection | Architectural fallback |
|---|---|---|---|
| Model executes shell directly | Deletion, exfiltration, external calls, overreach | High-risk command audit, overreach attempts | All tools pass through gateway and sandbox |
| Missing workspace boundary | Agent edits another local project | File-write path audit | Writes allowed only inside declared workspace |
| Sensitive paths are readable | .env, SSH keys, cloud credentials leak | Sensitive-path access alerts | Deny sensitive reads and isolate credentials |
| Network open by default | Injection induces code or secret exfiltration | External-domain audit, unusual egress | Default deny or allowlist; high-risk networking asks |
| Tool result treated as system instruction | README / issue induces rule override | Injection test cases, abnormal tool behavior | External content is untrusted data and cannot override system rules |
| Too much approval | Users approve blindly | Approval latency, continuous-approve ratio | Automate low-risk sandboxed work; ask only for high-risk actions |
| Too little approval | Irreversible action happens automatically | High-risk action audit | External side effects ask or deny |
| No checkpoint for long tasks | Restart from zero after interruption | Resume failure rate | Persist task state, diff, trace, and budget |
| Compaction drops rules | Later steps violate earlier constraints | Post-compaction violation rate | Reload project rules after compaction; pin hard constraints |
| Subagents inherit all context | Main context is drowned in noise | Context usage, irrelevant-token ratio | Isolate subagents and return summaries plus evidence |
| Subagents write-conflict | Multiple subtasks edit the same area | Diff conflict rate | Assign non-overlapping write scopes; main line merges |
| No budget caps | Infinite loop, token burn, test-environment pressure | Step / cost anomalies | Step, time, token, tool-call, and concurrency limits |
| Only final diff is visible | Review cannot understand the why | Review rejection rate | Trace plans, tool calls, and test evidence |
| No evals | Model upgrade silently regresses | Production failures, human complaints | Representative task evals gate releases |
| Test command has no timeout | One task occupies a worker forever | Worker occupancy | Command timeout, resource quota, cancellation |
| Auto-commit bypasses review | Bad code reaches main | Unreviewed PR / commit audit | Final commit, PR, and deploy require Review Gate |
A coding Agent's maturity is not measured by whether it can write pretty code once. It is measured by whether, when it reads the wrong context, tools fail, injection appears, a task is interrupted, or budget runs out, it can stop, resume, explain, and refuse to ship when evals fail.
📌 Validate the reasoning against templates
This case does not rewrite any specific coding Agent product template. It combines the linked concerns that make "AI that can act" risky.
| Reusable template / chapter | What this case reuses | What this case adds |
|---|---|---|
| AI Agent / workflow platform | Workflow vs Agent, action loop, tools, memory, human-in-the-loop, trace | Grounds the generic Agent in a code repo and real shell side effects |
| OpenAI Codex | Sandbox × approval axes, workspace boundary, async cloud task, PR output | Explains why "what can it touch" and "when does it ask" are separate |
| Claude Code | Context engineering, permission layers, OS sandbox, subagents, hooks, compaction | Emphasizes subagent isolation and untrusted external content |
| OpenClaw | Resident Agent, channel entry points, tool approval, self-hosted boundary, transparent memory | Warns that a single-user trust boundary is not a team multi-tenant platform |
| Hermes Agent | Long-term memory, skill accumulation, cron Agent jobs, pluggable subsystems | Adds the risks of stale memory and drifting skills |
| 23 · Spec as architecture | AGENTS.md / rules / CI constraints | Turns project rules into boundaries the Agent reads every time and machines can enforce |
| 24 · Review checklist | AI omits safety, failure, and edge cases by default | Final diffs must pass review checklist scrutiny |
| 25 · Eval-driven architecture | Eval gates | Prevents silent regression after model / prompt / tool-policy changes |
Reading suggestion: read this case first, then revisit the AI Agent platform, Codex, and Claude Code templates. The key will be clearer: a coding Agent is not mainly about a stronger model. It is autonomous action wrapped in engineering boundaries.
🎯 Quick check
- ABecause shell is too slow
- BBecause models can be wrong or induced by prompt injection; commands must pass through a permission gateway and sandbox
- CBecause an Agent does not need to run tests
- ABecause it sounds more professional
- BBecause sandboxing controls what the Agent can physically touch, while approval controls when the process interrupts the human
- CBecause approval can replace sandboxing
- ABecause they are always wrong
- BBecause they are external input that may be wrong, stale, or contain prompt injection; they are observations, not system rules
- CBecause an Agent should not read documents
- AEvery task must use subagents
- BIsolating noisy repo-wide search or log analysis, protecting the main context, and returning structured summaries
- CLetting multiple Agents freely edit the same files
Case summary
- A coding Agent is not a chat box; it is an execution system that can change repo state. Once it can edit files, run commands, and access the network, the architectural focus shifts from answer quality to controlled execution.
- Use workflows first when they are enough. Autonomous Agent loops fit uncertain paths; fixed flows are cheaper, steadier, and easier to audit.
- Tool calls must pass through a permission gateway and sandbox. The model proposes intent; hard layers decide allow / ask / deny.
- Sandboxing and approval are two axes. Sandboxing controls physical boundaries; approval controls human interruption.
- Long tasks need checkpoints, trace, compaction, and budgets. Without state, limits, and evidence, long tasks become black boxes.
- Subagents isolate heavy work; they are not the default. Use them for search, logs, and review tasks that would crowd the main context.
- Evals are the release gate for coding Agents. Before model, prompt, or tool-policy changes ship, representative tasks must verify success, safety refusal, minimal-change behavior, and regression stability.
First-batch closure: this completes the first six core cases: inventory transactions, multi-tenant SaaS, enterprise RAG, realtime collaboration, content distribution, and coding Agents. Future case batches can go deeper by industry or by technology stack without repeating these six.
Related links
- Template cross-check: AI Agent / workflow platform · OpenAI Codex · Claude Code · OpenClaw · Hermes Agent
- Method: 07 · Designing a system from 0 to 1 · 08 · ADRs and evolution · 17 · Architecting in the age of LLMs
- AI collaboration: 23 · Spec as architecture · 24 · Review checklist · 25 · Eval-driven architecture
💬 Comments