Capstone 07 — Agentic Coding Assistant (Claude Code / Cursor / Codex clone)
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 3–5 weeks
Real-world parallel: Claude Code, Cursor Agent, GitHub Copilot Workspace, OpenAI Codex/Operator, Devin, Aider, Continue.dev. The capstone for agent / applied-AI engineer roles at the most-funded AI products of 2025.
Goals
Build an autonomous coding agent that can read a repo, plan changes, edit files, run tests, debug failures, and iterate — all from a natural-language task description. Production targets:
- Tool-using LLM core with strict, validated tool-call schemas (file_read, file_write, run_shell, search_codebase, run_tests, web_fetch).
- Sandboxed execution in a Docker / Firecracker container with resource limits and network egress controls.
- Plan → act → observe → reflect loop with bounded recursion and budget tracking.
- Multi-file, multi-turn edits with diff preview and human approval mode.
- Evals: SWE-bench Lite (real GitHub issues) — your agent must score above the published baseline.
- Production CLI + VS Code extension (or web UI) for actual usability.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ User: "Add pagination to the users API and update tests" │
└─────────────────────────┬──────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────┐
│ Agent Orchestrator (the brain) │
│ while not done and budget_remaining: │
│ plan = LLM(system, history, tools, observations) │
│ if plan.tool_call: │
│ result = sandbox.execute(plan.tool_call) │
│ history.append(plan, result) │
│ elif plan.final_answer: │
│ return plan.final_answer │
│ - Token / wall-clock / tool-call budget │
│ - Reflection step every N turns │
│ - Safety: human-in-the-loop for destructive ops │
└─────────────────────────┬──────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────┐
│ Tool Layer (validated JSON schemas) │
│ ┌──────────────┐ ┌────────────┐ ┌──────────────┐ ┌──────────┐│
│ │ file_read │ │ file_write │ │ search_code │ │ run_tests││
│ │ file_replace │ │ run_shell │ │ list_dir │ │ web_fetch││
│ └──────────────┘ └────────────┘ └──────────────┘ └──────────┘│
└─────────────────────────┬──────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────┐
│ Sandbox (Docker / Firecracker) │
│ - Per-task ephemeral container │
│ - CPU + memory + time limits │
│ - Filesystem snapshot per turn (rollback on error) │
│ - Egress allowlist (no exfiltration) │
│ - Captured stdout/stderr → observation │
└────────────────────────────────────────────────────────────────┘
Frontends: CLI (Aider-like) | VS Code extension | Web UI
Suggested Stack
| Component | Choice |
|---|---|
| LLM | Claude 3.5 Sonnet OR Llama-3.3-70B / Qwen2.5-Coder-32B (local) |
| Tool-call schema | JSON Schema (validated with jsonschema) |
| Sandbox | Docker (easy) or Firecracker (production) |
| Code search | ripgrep + tree-sitter for symbol-aware queries |
| Embeddings (optional) | BAAI/bge-code-v1 for semantic codebase search |
| Diff/patch | unidiff format; auto-apply with conflict detection |
| Test runner | language-detect → pytest / jest / cargo test / go test |
| CLI | typer or click |
| VS Code ext | TypeScript, LanguageClient API, sidebar webview |
| Eval | SWE-bench Lite harness |
| Telemetry | OpenTelemetry traces; per-step token/cost accounting |
Deliverables Checklist
Core Agent
-
agent/loop.py— orchestrator with budgets and termination conditions -
agent/prompts.py— system prompts (planner, executor, reflector) -
agent/tools/— one file per tool, with JSON schema + handler + tests -
agent/sandbox/docker.py— container lifecycle, snapshot, exec, egress filter -
agent/memory.py— bounded scratchpad, file-state tracking, history compaction
Frontends
-
cli/main.py—mycoder "task description"CLI with streaming output -
vscode-ext/— extension scaffold with chat sidebar (or web UI alternative) -
web/— optional FastAPI + React UI
Evaluation
-
eval/swebench/— SWE-bench Lite runner; reproducible scoring -
eval/internal/— 30 hand-built tasks across 3 languages (Python, TS, Go) with golden diffs -
EVAL_REPORT.md— pass@1 on SWE-bench Lite, success rate on internal tasks, cost per task, latency per task
Production
-
Dockerfilefor the agent service -
safety/policies.md— destructive-op allowlist, egress allowlist, max budget -
OBSERVABILITY.md— what you log per request, redaction policy -
WRITEUP.md— failure-mode taxonomy from your evals; what you'd fix next
Resume Bullet Pattern
Built an autonomous coding agent (Claude-Code-style) with tool-validated JSON schemas, Docker-sandboxed execution, plan/act/reflect loop, and per-task budget control. Achieved 24% pass@1 on SWE-bench Lite (above published Aider+Sonnet baseline) with a CLI + VS Code extension front-end. [demo + eval report]
Interview Talking Points
- Tool design as the actual product: schemas are your API to the LLM; sloppy schemas = unreliable agent. Why granular tools (
file_replacenotapply_diff) reduce LLM error rate. - The orchestrator state machine: when to reflect, when to bail, how to compact history when context fills (summarization, sliding window, evicting tool outputs).
- Sandbox security: container escapes, fork bombs, fs snapshots for rollback, egress allowlist (
hosts.deny-style), why Firecracker is overkill for personal but right at scale. - Cost control: per-tool token cost accounting, hard budget gates, cheap model for "navigation" + expensive model for "edit" (model routing).
- Failure modes: getting stuck in loops, fabricating file paths, ignoring tool errors, edit-conflict cascades. Your eval taxonomy.
- Why JSON schemas and not freeform: structured outputs (Anthropic tool_use, OpenAI function-calling) drop hallucinated tools to ~0%.
- Evaluation rigor: SWE-bench Lite vs full SWE-bench; pass@1 vs pass@k; the Aider polyglot benchmark; why your internal eval matters more than public benchmarks.
- Cursor vs Claude Code vs Devin: editor-integrated vs terminal vs autonomous-cloud. Tradeoffs and your design choice.
- Multi-agent: planner / coder / reviewer split — when it helps (complex refactors), when it adds latency without quality gain.
- Human-in-the-loop: opt-in approval for destructive ops; how you UX it without killing flow.
Getting Started
- Define your tool schemas first — write the JSON schemas before any agent code. They're the contract.
- Build the sandbox in Docker. Smoke-test: shell out from container, capture stdout, enforce 10s timeout.
- Single-tool agent: just
file_read+final_answer. Get the LLM to read a file and summarize it. Verify schemas are obeyed. - Add
file_write,run_shell,search_codebaseone at a time. Test each tool in isolation. - Wire the orchestrator loop with a hard 10-step budget. Run on a toy task: "fix the failing test in this 3-file repo".
- Add reflection step every 5 turns: "summarize what you've tried and what's left".
- Run on SWE-bench Lite (300 tasks; ~$50 in API cost with Sonnet). Score yourself. Compare to published.
- Build the failure taxonomy from the SWE-bench traces. Ship 3 specific fixes for the top 3 failure modes.
- Build the CLI (Aider-style: shows diffs, asks for approval). It's mostly UX polish.
- Build the VS Code extension (or web UI). Demo it. Record the demo. Most interviewers will only watch the video.
Stretch Goals
- Local model alternative: switch the LLM backend to a self-hosted Qwen2.5-Coder-32B served by your Capstone-05 mini-vLLM. Now it's 100% in your stack.
- Model routing: route navigation/search calls to Haiku/8B, edits to Sonnet/70B. 5–10× cost reduction at small quality loss.
- Codebase-aware retrieval: index the repo with code embeddings; retrieve top-5 relevant files for each task automatically.
- Multi-repo / monorepo support: cross-package refactors with dependency-graph awareness.
- Long-horizon tasks: tasks spanning days, with checkpointing and resume (Devin-style).
- Multi-agent debate: planner proposes, critic challenges, planner revises. Measurable improvement on hard tasks.
- CI integration: agent triggered by GitHub issue label, opens PR with proposed fix.
What This Capstone Proves About You
You can build the kind of product that defines current AI funding rounds: a real agent that does real work, safely. You understand the unglamorous engineering (sandboxing, schemas, retries, budgets, observability) that separates a demo from a product. You can quote SWE-bench numbers and discuss the failure taxonomy intelligently.
This is the bar for Applied AI Engineer / Agent Engineer / AI Product Engineer roles at Anthropic (Claude Code), Cursor, Cognition (Devin), GitHub (Copilot Workspace), Replit, OpenAI (Codex/Operator), and every well-funded coding-agent startup of 2025–2026.