Skip to content

Case 06 · CodePilot: an AI Agent / coding Agent platform

Thesis in one line: this case drills "how an AI that can act stays controlled". The hard part of a coding Agent is not making a model write code. It is letting it read repos, edit files, run commands, call tools, and work for a long time while permissions, sandboxing, budgets, approvals, checkpoints, and evals still constrain it.


🧪 Case-track article 6 · This case drills one thing

It drills architectural judgment for an AI Agent / coding Agent platform: when to use a workflow, when to allow an autonomous Agent loop, why tool calls must pass through a permission gateway, why sandboxing and human approval are different controls, how long tasks survive through checkpoints and context compaction, when subagents help, and how evals catch silent regressions after a model or prompt change.

After reading, you should be able toThis case drills it through
Explain why an Agent is different from a chat boxThe "read repo → edit files → run tests → observe → edit again" action loop
Explain why tool calls cannot be handed directly to the modelPermission gateway, sandbox, workspace boundary, and high-risk approval
Understand why long tasks forget goals, burn budgets, and driftCheckpoints, trace, context compaction, hard budgets, and resume flow
Decide when subagents help, and when they add noiseIndependent context, independent budget, and structured result handoff

Important note: this is a teaching case, not an internal design of any real coding Agent product. The numbers are for order-of-magnitude reasoning. The point is judgment, not one canonical answer.


Opening: why a coding Agent is not "a better code-writing chat box"

Because a chat box gives advice. A coding Agent actually touches your repository.

CodePilot is a coding Agent platform for engineering teams. An engineer can submit a task: "fix this failing test," "add idempotency to the payment callback," "upgrade this dependency," or "read this module and propose a refactor." CodePilot checks out the repo, reads code, plans, edits files, runs tests, reacts to failures, and eventually produces a diff or Pull Request.

At first glance, it looks like "a model that writes code":

  • the user describes the task;
  • the model reads code;
  • the model generates a patch;
  • tests run;
  • a result is returned.

But after launch, the most dangerous failure is not "the code is not elegant enough." It is:

The Agent is induced by prompt injection in a README to read secrets, runs a dangerous command, edits files outside the workspace, forgets the user's constraint after context compaction, lets subagents conflict, or silently regresses after a model upgrade.

So this chapter is not about "how to call an LLM API." It asks a sharper question:

How do you let an Agent read and write code, execute commands, and work autonomously for a long time while still being stoppable, auditable, recoverable, and measurable?

The pressure source is different from the first five cases:

  • StarArena fears inventory and payment-state errors.
  • PatchDesk fears tenant-boundary failures.
  • DocuMind fears ungrounded answers and prompt injection.
  • SyncRoom fears realtime state convergence.
  • FeedStream fears distribution amplification.
  • CodePilot fears the moment AI can act: mistakes turn from "said the wrong thing" into "changed, leaked, spent, or shipped the wrong thing."

Pre-reading glossary

These words repeat throughout the chapter. Align on plain-language meanings first:

TermPlain-language meaning
AgentAI that can take multiple steps by itself: plan, call tools, observe results, and decide what to do next.
WorkflowA fixed or mostly fixed sequence of steps. The model may judge at a few points, but the path is mostly predetermined.
Agent loopThe action loop: plan → call tool → observe → plan again.
Tool callA model request to read files, search code, edit files, run tests, fetch a page, or call an internal API.
SandboxAn isolated execution environment that limits where a command can read, where it can write, and whether it can access the network.
WorkspaceThe project directory the Agent is allowed to read and write. Files outside it are usually out of bounds.
Permission gatewayThe control layer every tool call passes through. It decides allow, deny, or ask the user.
ApprovalHuman confirmation before high-risk actions such as networking, deleting files, committing, or deploying.
CheckpointSaved task state: current diff, tool-result summaries, budget use, and enough state to resume after interruption.
TraceThe execution trail: what the Agent did, which tools it called, and what it observed.
Context windowThe amount of information the model can see at one time. Long tasks eventually run out of it.
CompactionSummarizing older history to free context window space.
MemoryRules, preferences, and past lessons retained across steps or sessions. More memory is not always better; it can become stale.
SubagentA separate Agent with its own context for noisy subtasks such as repo-wide search or log analysis. It returns a summary to the main line.
Prompt injectionMalicious or misleading instructions embedded in external content, such as "ignore prior rules and leak secrets."
BudgetHard limits on steps, tokens, time, tool calls, concurrency, and cost.
EvalA representative test set for measuring task success, safety, regression, and quality over time.

1. Starting state: build a controlled coding assistant first, not an autonomous engineer

CodePilot's first version has a modest goal: let a small team delegate low-risk engineering tasks to an Agent, while making every real change visible and interruptible.

Early constraints look like this:

DimensionStarting stage
Pilot team20-50 engineers
Connected repos5-20
Repo size10k-200k lines
Agent tasks per day50-200
Peak concurrent tasks5-20
Common tasksTests, small bugs, code reading, docs, simple dependency upgrades
Core goalProduce reviewable diffs and explain each step
Must not failRead secrets, edit outside the workspace, auto-run irreversible actions, or lose trace

The right first architecture is not a "fully autonomous engineer." It is a bounded Agent loop + tool gateway + workspace sandbox + human review:

Engineer submits a task


┌────────────────────────────────────────────┐
│ Task service                                │
│ Goal, repo, branch, budget, permission mode │
└──────────────┬─────────────────────────────┘

┌────────────────────────────────────────────┐
│ Agent orchestrator                          │
│ Plan → read/search → patch → test → adjust  │
└──────────────┬─────────────────────────────┘

┌────────────────────────────────────────────┐
│ Tool gateway + sandbox                      │
│ Workspace limits, network limits, approvals │
│ and full trace                              │
└──────────────┬─────────────────────────────┘

        diff / test result / trace → human review

This is not timid. It states the boundary:

  • The Agent can read the repo, but sensitive paths are denied by default.
  • The Agent can edit files, but only inside the workspace.
  • The Agent can run tests, but commands have timeouts and resource limits.
  • The Agent can suggest a commit or PR, but cannot bypass review.
  • The Agent can work for a long time, but every step must be traceable, resumable, and stoppable.

2. Quantified assumptions: it breaks first on permissions, context, and side effects

Now run the numbers. Suppose CodePilot expands from a pilot to a mid-sized engineering organization:

Engineers:300
Repos:800
Active repos:150
Repo size:50k-800k lines
Agent tasks per day:1,000-3,000
Peak concurrent tasks:50-120
Average task turns:8-20
Long-task turns:30-80
Tool calls per turn:1-4
Common tools:read file, search, patch, run tests, package manager, Git, internal APIs

Goal:simple tasks produce a diff in 5-10 minutes
Goal:complex tasks produce a PR or clear failure reason in 30-60 minutes
Safety target:out-of-workspace writes = 0; sensitive file exfiltration = 0
Budget target:every task has max steps, tokens, runtime, and concurrency
Quality target:before model / prompt / tool-policy changes, core evals stay above baseline

At this scale, request volume is not the first problem. A normal web system starts by calculating QPS. CodePilot first collides with:

  1. Permission boundaries: an Agent that can run commands can delete, modify, or access the network by mistake.
  2. Context boundaries: repos are large, logs are long, tasks run for a while, and the model cannot see everything.
  3. Side-effect boundaries: file edits can be rolled back; API calls, pushes, deploys, and database writes may not be.
  4. Cost boundaries: a few more attempts may fix the task, or may loop forever and burn money.
  5. Quality boundaries: a model upgrade may improve some tasks while weakening safety refusal or another task class.

So CodePilot's architecture is not mainly about handling more requests. It is about:

Putting Agent autonomy inside executable, auditable, recoverable, and measurable boundaries.


3. Trigger signals: when the first version is no longer enough

Do not upgrade by instinct. Watch these signals:

SignalSymptomWhy this is architectural
Users approve blindlyEvery command asks; users stop reading and approveApproval is too noisy, so safety becomes ceremonial
More overreach attemptsAgent often tries to read ~/.ssh, .env, or outside the workspaceDefaults and injection defenses are unclear
Interrupted long tasks restart from zeroA 40-minute run dies and cannot resumeNo durable checkpoints or task state
Constraints vanish after compactionThe user said "do not touch the DB layer," but later edits doCritical constraints are not pinned and reloaded
Tool logs fill the windowFull test logs, dependency trees, and search results crowd out task contextTool outputs need summarization, truncation, and references
Subagents conflictTwo subagents edit the same code areaSubtask ownership and merge coordination are missing
Cost spikesA task repeats the same fix attempt again and againNo step, time, token, or repetition cap
Results cannot be reviewedOnly the final diff is visible; the why is missingTrace is incomplete
Model upgrade regressesMore small bugs get fixed, but safety refusal weakensNo eval gate measuring multiple quality dimensions
Prompt injection incidentAn issue or README induces secret access or external callsExternal content is treated as instruction instead of untrusted input

These signals are not saying "the model is not smart enough." They are saying: model capability alone does not make an Agent production-ready. Control must be architecture.


4. Core tension: the more capable the Agent, the more it must be constrained

CodePilot has five core object groups:

  1. Task / goal / budget: what the user wants and how much time, money, and iteration the Agent may spend.
  2. Repo / workspace / branch: which project it may touch and where the output lands.
  3. Tools / permissions / sandbox: what the Agent can call and what each tool is physically allowed to do.
  4. Context / memory / compaction: what the model sees, and which constraints must survive long tasks.
  5. Trace / diff / eval: what it did, how humans review it, and whether quality stays stable over time.

The naive path looks like this:

User gives task → model generates code → tests run → result returned

The real system must answer at every step:

  • Is this tool call read-only, or will it change the world?
  • Can this command access the network? Can it write outside the workspace?
  • Is this README project documentation, or prompt injection written into the repo?
  • Can this modification be rolled back? Should we checkpoint first?
  • Did context compaction preserve the user's hard constraints?
  • Is a subagent result evidence, or just a suggestion?
  • Did the Agent fail because tests failed, permission failed, budget ran out, or the model drifted?
  • After a model upgrade, did task success, safety refusal, and minimal-change behavior regress?

That creates the new design problem:

CodePilot's value comes from autonomous execution, and so does its risk. The architecture is not about making it freer. It is about making "what can it touch, when does it ask, how does it recover, and how do we prove it did not get worse" explicit system structure.

The key classification is tool risk:

Tool typeExamplesDefault policy
Low-risk read-onlySearch code, read non-sensitive files, inspect test configAutomatic, but output is still untrusted observation
Reversible medium riskEdit workspace files, generate diff, format codeAllowed inside sandbox, with checkpoints and diff
High-risk side effectNetwork download, API request, DB write, email, pushAsk or deny by default
Irreversible / organizational riskDeploy, delete remote resources, change permissions, transfer moneyMandatory approval, often outright forbidden

5. Solution reasoning: how should a coding Agent execute tools?

This is the central decision. Many demos work, but the moment a real repo and shell are connected, risk becomes an engineering problem.

Option A: chat-only advice, no tool execution

User pastes code / error logs
  └─▶ model gives advice
      └─▶ human copies, edits, and tests
BenefitCost
Safety boundary is simple; the model has no real side effectsThe user still moves context around; leverage is limited
Easy to integrateIt cannot verify its work
Good for learning and explanationsIt cannot own an engineering task end-to-end

Option B: local Agent with full shell permissions

Agent reads repo → edits files → runs arbitrary commands → networks → commits
BenefitCost
Very flexibleHighest risk: deletion, leakage, networking, and overreach
Strong demo experienceHard to govern and audit for a team
Looks simple to buildThere is no structural boundary when something goes wrong

Option C: platform Agent, all tools through gateway and sandbox

Model proposes a tool call
  └─▶ permission gateway decides allow / ask / deny
      └─▶ sandbox executes
          └─▶ result returns as observation
              └─▶ trace / checkpoint / budget recorded throughout
BenefitCost
The Agent can act, but within physical boundariesMore configuration and user mental model
Side effects are auditable and constrainedSome legitimate tasks require explicit approval
Supports team governance, CI, and cloud jobsRequires task state, trace, sandbox pools, and evals

CodePilot chooses Option C for the growth stage: tool calls must pass through a permission gateway and sandbox, and high-risk actions must be interruptible.

The point is not whether the model is trusted. It is:

Anything the model can influence through instruction is not a hard boundary. Hard boundaries live where the model cannot bypass them: permission gateway, sandbox, budgets, approvals, CI, and evals.


6. Key architecture decisions: record the why with ADRs

An ADR is an Architecture Decision Record. Coding Agent systems will later be questioned: "Why not full autonomy? Why split sandboxing and approval? Why keep trace? Why can't subagents inherit the entire context?" Write those reasons down early.

ADR-01: Prefer workflows by default; use autonomous Agent loops only for open-ended tasks

  • Context: Many engineering tasks are structured: run lint, fix formatting, run tests, produce a diff. Fixed workflows are cheaper and more predictable.
  • Decision: CodePilot first decomposes tasks into controlled workflows. It enters an autonomous Agent loop only when the path is uncertain and tool feedback must guide the next step.
  • Give up: The simple marketing story that every task is freely planned by the model.
  • Gain: Lower cost, stronger control, and clearer failure analysis.
  • Risk: Poor workflow boundaries may restrict useful flexibility.
  • Review when: Many tasks get stuck in fixed flows, and human review shows open-ended exploration is genuinely needed.

ADR-02: Tool calls must pass through a permission gateway and sandbox

  • Context: A coding Agent can read and write files, run shell commands, access the network, and call internal APIs. Side effects are real.
  • Decision: Every tool call passes through a permission gateway. Writes are limited to the workspace. Sensitive paths are denied by default. Commands run in a sandbox. Network access is narrow by default. High-risk actions ask or deny.
  • Give up: Letting the model run anything it wants.
  • Gain: Even if the model is induced by prompt injection, physical boundaries still contain many failures.
  • Risk: Permission configuration gets more complex, and some valid development tasks need explicit authorization.
  • Review when: More internal tools, MCP (Model Context Protocol) tools, or cloud resources are added.

ADR-03: Decouple sandbox mode and human approval as two axes

  • Context: People often think in one "automatic vs manual" slider, but that mixes two different questions: what can it physically touch, and when does it interrupt the human?
  • Decision: Sandbox mode controls physical boundaries; approval controls process interruption. For example, read-only + never fits CI analysis, workspace-write + on-request fits default coding tasks, and high-risk external side effects must ask.
  • Give up: A simple but vague one-dimensional automation setting.
  • Gain: Clearer boundaries across local pairing, read-only CI, and async cloud tasks.
  • Risk: Users must understand two dimensions; otherwise they may assume "asks less" means "has more permission."
  • Review when: Users approve blindly, or legitimate tasks are interrupted too often.

ADR-04: Long tasks must persist checkpoints, trace, and budget state

  • Context: Coding tasks may run for tens of minutes. The session grows, tool outputs get large, and processes can die.
  • Decision: Persist task state, step trace, current diff, tool-result summaries, and budget use. When context nears its limit, compact history, then reload project rules, user goals, and hard constraints from stable storage.
  • Give up: Relying on one long conversation to hold everything.
  • Gain: Tasks are resumable and auditable, and compaction is less likely to erase critical constraints.
  • Risk: Summaries may drop details; bad summaries can mislead later steps.
  • Review when: Long-task failures often come from forgetting earlier constraints.

ADR-05: Use subagents only to isolate heavy work; return structured summaries to the main line

  • Context: Repo-wide search, dependency analysis, and test-log triage produce a lot of noisy intermediate data.
  • Decision: Subagents get independent context, limited tools, and separate budgets. The main Agent receives only structured summaries, evidence files, and recommendations. Final merging stays in the main line.
  • Give up: Letting subagents inherit the entire parent conversation.
  • Gain: Cleaner main context and controlled parallel decomposition.
  • Risk: A subagent summary may omit a key fact, and multiple Agents increase cost and latency.
  • Review when: A single Agent's context is crowded by noisy work, or the task is naturally decomposable with non-overlapping write scopes.

7. Evolved structure and data flow

Here is the core structure of CodePilot. It is not "model + shell." It is a task execution, permission-control, context-management, observability, and evaluation system.

Starting path

User sends task → model generates patch → local command runs → user sees result

The problem: permissions, sandboxing, budgets, resume, trace, and evals are not structured.

Evolved structure

User / CI / scheduled task


┌──────────────────────────────────────────────┐
│ Task intake                                   │
│ Goal, repo, branch, permission mode, budget,  │
│ approval policy                               │
└──────┬───────────────────────────────────────┘

┌──────────────────────────────────────────────┐
│ Agent orchestrator                            │
│ Workflow first; Agent loop when necessary      │
│ Plan → tool call → observe → plan again        │
└──────┬───────────────────────┬───────────────┘
       │                       │
       ▼                       ▼
┌──────────────┐       ┌──────────────────────┐
│ Context builder │     │ Task state / trace    │
│ Repo map, files,│     │ Checkpoints, diff,    │
│ rules, memory,  │     │ budget, tool summaries│
│ compaction      │     └──────────────────────┘
└──────┬───────┘


┌──────────────────────────────────────────────┐
│ Tool gateway                                  │
│ allow / ask / deny; sensitive paths; network  │
│ policy; resource limits                       │
└──────┬───────────────────────┬───────────────┘
       ▼                       ▼
┌──────────────┐       ┌──────────────────────┐
│ Sandbox       │       │ Human approval /      │
│ Workspace     │       │ Review Gate           │
│ isolation,    │       │ High-risk actions,    │
│ timeout, CPU, │       │ final PR review        │
│ memory, net   │       └──────────────────────┘
└──────┬───────┘

  tool result / test log / diff → back to orchestrator

Side path: eval platform continuously runs representative tasks and gates
model / prompt / tool-policy changes.

The key change is not "more services." It is clearer structure:

  • Task intake defines the goal, repo, budget, and permission mode.
  • Agent orchestrator decides fixed workflow vs autonomous loop.
  • Context builder supplies necessary information and pins project rules and hard constraints.
  • Tool gateway turns model intent into controlled tool calls.
  • Sandbox provides physical limits on workspace, network, and resources.
  • Trace / checkpoint makes long tasks resumable and auditable.
  • Review Gate hands irreversible actions and final output back to humans.
  • Eval platform catches silent quality regression after system changes.

Walk through "fix a failing test and add a regression test"

1. Engineer submits: "fix this failing billing test and add a regression test."
2. Task intake records repo, branch, and budget: max 25 steps, 30 minutes, workspace writes only.
3. Context builder loads project rules, relevant AGENTS.md, test command, and recent failure log.
4. Agent orchestrator plans: locate failure → read files → patch → run unit test → add regression.
5. Model requests code search. Tool gateway treats it as low-risk read-only and allows it.
6. Model requests an edit to `billing/refund.ts`. This is reversible workspace write, so the sandbox allows it after checkpointing.
7. Model requests `pnpm test billing`. The command runs in the sandbox with timeout, CPU / memory limits, and network disabled.
8. The test fails. The tool-result summary enters context; the raw log is kept as trace evidence.
9. The Agent reads the failing stack, finds a missing idempotency-key boundary, and patches again.
10. Tests pass. The Agent produces final diff, explanation, and test evidence.
11. Review Gate requires a human to review the diff before PR creation or commit.

Important details:

  • The model does not operate the machine directly. It proposes tool calls.
  • Tool results are observations, not new instructions. READMEs, issues, and test logs may contain prompt injection.
  • Every write has a checkpoint and the final diff is reviewable.
  • Test logs can be summarized into context, but raw trace must remain reachable.
  • Without Review Gate approval, the Agent cannot push changes into the main line.

Walk through "long task approaches context limit"

1. The Agent has executed 40 steps, read many files, and run several test rounds.
2. The context window approaches its limit, so the system triggers compaction.
3. The compactor writes a structured summary:
   - user goal
   - hard constraints
   - attempted approaches
   - current diff
   - unresolved problem
   - evidence files and trace links
4. The system reloads project rules and permission policy from stable storage.
5. Budget state returns too: 8 steps and 10 minutes remain, network is still disabled.
6. The Agent continues. If it approaches the limit again or exhausts budget, it must stop and report state instead of looping forever.

Context compaction is not magic. It turns "cannot fit" into "discard lower-value history with discipline." Anything that must never disappear belongs in stable rules, checkpoints, and trace.


8. Failure scenarios and fallbacks

FailureDirect resultDetectionArchitectural fallback
Model executes shell directlyDeletion, exfiltration, external calls, overreachHigh-risk command audit, overreach attemptsAll tools pass through gateway and sandbox
Missing workspace boundaryAgent edits another local projectFile-write path auditWrites allowed only inside declared workspace
Sensitive paths are readable.env, SSH keys, cloud credentials leakSensitive-path access alertsDeny sensitive reads and isolate credentials
Network open by defaultInjection induces code or secret exfiltrationExternal-domain audit, unusual egressDefault deny or allowlist; high-risk networking asks
Tool result treated as system instructionREADME / issue induces rule overrideInjection test cases, abnormal tool behaviorExternal content is untrusted data and cannot override system rules
Too much approvalUsers approve blindlyApproval latency, continuous-approve ratioAutomate low-risk sandboxed work; ask only for high-risk actions
Too little approvalIrreversible action happens automaticallyHigh-risk action auditExternal side effects ask or deny
No checkpoint for long tasksRestart from zero after interruptionResume failure ratePersist task state, diff, trace, and budget
Compaction drops rulesLater steps violate earlier constraintsPost-compaction violation rateReload project rules after compaction; pin hard constraints
Subagents inherit all contextMain context is drowned in noiseContext usage, irrelevant-token ratioIsolate subagents and return summaries plus evidence
Subagents write-conflictMultiple subtasks edit the same areaDiff conflict rateAssign non-overlapping write scopes; main line merges
No budget capsInfinite loop, token burn, test-environment pressureStep / cost anomaliesStep, time, token, tool-call, and concurrency limits
Only final diff is visibleReview cannot understand the whyReview rejection rateTrace plans, tool calls, and test evidence
No evalsModel upgrade silently regressesProduction failures, human complaintsRepresentative task evals gate releases
Test command has no timeoutOne task occupies a worker foreverWorker occupancyCommand timeout, resource quota, cancellation
Auto-commit bypasses reviewBad code reaches mainUnreviewed PR / commit auditFinal commit, PR, and deploy require Review Gate

A coding Agent's maturity is not measured by whether it can write pretty code once. It is measured by whether, when it reads the wrong context, tools fail, injection appears, a task is interrupted, or budget runs out, it can stop, resume, explain, and refuse to ship when evals fail.


📌 Validate the reasoning against templates

This case does not rewrite any specific coding Agent product template. It combines the linked concerns that make "AI that can act" risky.

Reusable template / chapterWhat this case reusesWhat this case adds
AI Agent / workflow platformWorkflow vs Agent, action loop, tools, memory, human-in-the-loop, traceGrounds the generic Agent in a code repo and real shell side effects
OpenAI CodexSandbox × approval axes, workspace boundary, async cloud task, PR outputExplains why "what can it touch" and "when does it ask" are separate
Claude CodeContext engineering, permission layers, OS sandbox, subagents, hooks, compactionEmphasizes subagent isolation and untrusted external content
OpenClawResident Agent, channel entry points, tool approval, self-hosted boundary, transparent memoryWarns that a single-user trust boundary is not a team multi-tenant platform
Hermes AgentLong-term memory, skill accumulation, cron Agent jobs, pluggable subsystemsAdds the risks of stale memory and drifting skills
23 · Spec as architectureAGENTS.md / rules / CI constraintsTurns project rules into boundaries the Agent reads every time and machines can enforce
24 · Review checklistAI omits safety, failure, and edge cases by defaultFinal diffs must pass review checklist scrutiny
25 · Eval-driven architectureEval gatesPrevents silent regression after model / prompt / tool-policy changes

Reading suggestion: read this case first, then revisit the AI Agent platform, Codex, and Claude Code templates. The key will be clearer: a coding Agent is not mainly about a stronger model. It is autonomous action wrapped in engineering boundaries.


🎯 Quick check

🤔Why should CodePilot not let the model execute shell commands directly?
  • ABecause shell is too slow
  • BBecause models can be wrong or induced by prompt injection; commands must pass through a permission gateway and sandbox
  • CBecause an Agent does not need to run tests
🤔Why split sandbox mode and human approval into two axes?
  • ABecause it sounds more professional
  • BBecause sandboxing controls what the Agent can physically touch, while approval controls when the process interrupts the human
  • CBecause approval can replace sandboxing
🤔Why are README files, issues, and test logs returned by tools not fully trusted?
  • ABecause they are always wrong
  • BBecause they are external input that may be wrong, stale, or contain prompt injection; they are observations, not system rules
  • CBecause an Agent should not read documents
🤔What problem are subagents best suited for?
  • AEvery task must use subagents
  • BIsolating noisy repo-wide search or log analysis, protecting the main context, and returning structured summaries
  • CLetting multiple Agents freely edit the same files

Case summary

  • A coding Agent is not a chat box; it is an execution system that can change repo state. Once it can edit files, run commands, and access the network, the architectural focus shifts from answer quality to controlled execution.
  • Use workflows first when they are enough. Autonomous Agent loops fit uncertain paths; fixed flows are cheaper, steadier, and easier to audit.
  • Tool calls must pass through a permission gateway and sandbox. The model proposes intent; hard layers decide allow / ask / deny.
  • Sandboxing and approval are two axes. Sandboxing controls physical boundaries; approval controls human interruption.
  • Long tasks need checkpoints, trace, compaction, and budgets. Without state, limits, and evidence, long tasks become black boxes.
  • Subagents isolate heavy work; they are not the default. Use them for search, logs, and review tasks that would crowd the main context.
  • Evals are the release gate for coding Agents. Before model, prompt, or tool-policy changes ship, representative tasks must verify success, safety refusal, minimal-change behavior, and regression stability.

First-batch closure: this completes the first six core cases: inventory transactions, multi-tenant SaaS, enterprise RAG, realtime collaboration, content distribution, and coding Agents. Future case batches can go deeper by industry or by technology stack without repeating these six.


💬 Comments