Codex clone)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 3–5 weeks

Real-world parallel: Claude Code, Cursor Agent, GitHub Copilot Workspace, OpenAI Codex/Operator, Devin, Aider, Continue.dev. The capstone for agent / applied-AI engineer roles at the most-funded AI products of 2025.

Goals

Build an autonomous coding agent that can read a repo, plan changes, edit files, run tests, debug failures, and iterate — all from a natural-language task description. Production targets:

Tool-using LLM core with strict, validated tool-call schemas (file_read, file_write, run_shell, search_codebase, run_tests, web_fetch).
Sandboxed execution in a Docker / Firecracker container with resource limits and network egress controls.
Plan → act → observe → reflect loop with bounded recursion and budget tracking.
Multi-file, multi-turn edits with diff preview and human approval mode.
Evals: SWE-bench Lite (real GitHub issues). Your agent must score above a published baseline that you name in the eval report (for example, an Aider + Sonnet result).
Production CLI + VS Code extension (or web UI) for actual usability.
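A strict tool-call schema could look roughly like the sketch below. The `file_read` schema, its `max_bytes` field, and the `validate_args` helper are illustrative assumptions, and the hand-rolled checker covers only `required`, `type`, and `additionalProperties`; in a real project the `jsonschema` package named in the stack would do this job.

```python
from typing import Any

# Hypothetical schema for the file_read tool; field names are illustrative.
FILE_READ_SCHEMA: dict[str, Any] = {
    "name": "file_read",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "max_bytes": {"type": "integer"},
        },
        "required": ["path"],
        "additionalProperties": False,
    },
}

_JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict, "array": list}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return validation errors; an empty list means the tool call is acceptable."""
    params = schema["parameters"]
    props = params.get("properties", {})
    errors = []
    for key in params.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            if params.get("additionalProperties") is False:
                errors.append(f"unexpected argument: {key}")
            continue
        expected = _JSON_TYPES[props[key]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"{key}: expected {props[key]['type']}")
    return errors
```

Returning the error list to the model as the tool observation, rather than raising, gives it a chance to repair the call on the next turn.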

Architecture

   ┌────────────────────────────────────────────────────────────────┐
   │ User: "Add pagination to the users API and update tests"       │
   └─────────────────────────┬──────────────────────────────────────┘
                             ▼
   ┌────────────────────────────────────────────────────────────────┐
   │ Agent Orchestrator (the brain)                                 │
   │  while not done and budget_remaining:                          │
   │     plan = LLM(system, history, tools, observations)           │
   │     if plan.tool_call:                                         │
   │        result = sandbox.execute(plan.tool_call)                │
   │        history.append((plan, result))                          │
   │     elif plan.final_answer:                                    │
   │        return plan.final_answer                                │
   │  - Token / wall-clock / tool-call budget                       │
   │  - Reflection step every N turns                               │
   │  - Safety: human-in-the-loop for destructive ops               │
   └─────────────────────────┬──────────────────────────────────────┘
                             ▼
   ┌────────────────────────────────────────────────────────────────┐
   │ Tool Layer (validated JSON schemas)                            │
   │  ┌──────────────┐ ┌────────────┐ ┌──────────────┐ ┌──────────┐│
   │  │ file_read    │ │ file_write │ │ search_code  │ │ run_tests││
   │  │ file_replace │ │ run_shell  │ │ list_dir     │ │ web_fetch││
   │  └──────────────┘ └────────────┘ └──────────────┘ └──────────┘│
   └─────────────────────────┬──────────────────────────────────────┘
                             ▼
   ┌────────────────────────────────────────────────────────────────┐
   │ Sandbox (Docker / Firecracker)                                 │
   │  - Per-task ephemeral container                                │
   │  - CPU + memory + time limits                                  │
   │  - Filesystem snapshot per turn (rollback on error)            │
   │  - Egress allowlist (no exfiltration)                          │
   │  - Captured stdout/stderr → observation                        │
   └────────────────────────────────────────────────────────────────┘

   Frontends: CLI (Aider-like) | VS Code extension | Web UI
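The orchestrator loop in the diagram can be sketched as runnable Python. This is a minimal version under stated assumptions: `Step`, `Budget`, and `run_agent` are hypothetical names, the budget counts tool calls only (token and wall-clock limits would be added the same way), and `llm` and `execute` are injected callables standing in for the model client and the sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One model decision: either a tool call or a final answer."""
    tool_call: Optional[dict] = None
    final_answer: Optional[str] = None


@dataclass
class Budget:
    max_tool_calls: int
    tool_calls_used: int = 0

    def remaining(self) -> bool:
        return self.tool_calls_used < self.max_tool_calls


def run_agent(
    llm: Callable[[list], Step],
    execute: Callable[[dict], str],
    budget: Budget,
) -> Optional[str]:
    """Run plan -> act -> observe until a final answer or the budget runs out."""
    history: list = []
    while budget.remaining():
        step = llm(history)
        if step.final_answer is not None:
            return step.final_answer
        if step.tool_call is None:
            # Malformed reply: record it as an observation and charge the budget
            # so a confused model cannot spin forever.
            result = "error: reply had neither tool_call nor final_answer"
        else:
            result = execute(step.tool_call)
        budget.tool_calls_used += 1
        history.append((step, result))
    return None  # budget exhausted without a final answer
```

Returning `None` on budget exhaustion keeps "gave up" distinguishable from "finished", which matters when scoring eval runs.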

Suggested Stack

Component	Choice
LLM	Claude 3.5 Sonnet OR Llama-3.3-70B / Qwen2.5-Coder-32B (local)
Tool-call schema	JSON Schema (validated with `jsonschema`)
Sandbox	Docker (easy) or Firecracker (production)
Code search	ripgrep + tree-sitter for symbol-aware queries
Embeddings (optional)	`BAAI/bge-code-v1` for semantic codebase search
Diff/patch	unified diff format; auto-apply with conflict detection
Test runner	language-detect → pytest / jest / cargo test / go test
CLI	`typer` or `click`
VS Code ext	TypeScript, LanguageClient API, sidebar webview
Eval	SWE-bench Lite harness
Telemetry	OpenTelemetry traces; per-step token/cost accounting
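The "language-detect" step for the test runner can be as simple as looking for marker files at the repo root. The sketch below is one possible approach; the `RUNNERS` table, its ordering, and both function names are assumptions, and a polyglot repo would need something smarter than first-match-wins.

```python
from pathlib import Path
from typing import Iterable, Optional

# Marker file -> test command, checked in order. The mapping is illustrative.
RUNNERS = [
    ("Cargo.toml", ["cargo", "test"]),
    ("go.mod", ["go", "test", "./..."]),
    ("package.json", ["npx", "jest"]),
    ("pyproject.toml", ["pytest", "-q"]),
    ("setup.py", ["pytest", "-q"]),
]


def detect_test_command(filenames: Iterable[str]) -> Optional[list[str]]:
    """Pick a test command from the top-level file names of a repo."""
    present = set(filenames)
    for marker, command in RUNNERS:
        if marker in present:
            return list(command)
    return None  # unknown project type: let the agent ask or inspect further


def detect_for_repo(repo_root: str) -> Optional[list[str]]:
    """Convenience wrapper that lists the repo root on disk."""
    return detect_test_command(p.name for p in Path(repo_root).iterdir())
```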

Deliverables Checklist

Core Agent

agent/loop.py — orchestrator with budgets and termination conditions
agent/prompts.py — system prompts (planner, executor, reflector)
agent/tools/ — one file per tool, with JSON schema + handler + tests
agent/sandbox/docker.py — container lifecycle, snapshot, exec, egress filter
agent/memory.py — bounded scratchpad, file-state tracking, history compaction
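For `agent/sandbox/docker.py`, a reasonable starting point is a function that builds the `docker run` argument list with resource limits applied. The sketch below only constructs the command and does not run it; the function name, the default limits, and the `/workspace` mount point are assumptions. Note that `--network none` disables networking entirely, so a real egress allowlist would need a proxy or firewall rules on top.

```python
from typing import Sequence


def docker_run_argv(
    image: str,
    workdir: str,
    command: Sequence[str],
    *,
    memory: str = "1g",
    cpus: str = "1.0",
    pids_limit: int = 256,
    network: str = "none",
) -> list[str]:
    """Build a `docker run` command for one ephemeral, resource-limited task."""
    return [
        "docker", "run",
        "--rm",                           # remove the container when it exits
        "--network", network,             # "none" means no network access at all
        "--memory", memory,               # hard memory cap
        "--cpus", cpus,                   # CPU quota
        "--pids-limit", str(pids_limit),  # blunt defence against fork bombs
        "--volume", f"{workdir}:/workspace",
        "--workdir", "/workspace",
        image,
        *command,
    ]
```

Keeping command construction separate from execution makes the limits unit-testable without a Docker daemon.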

Frontends

cli/main.py — mycoder "task description" CLI with streaming output
vscode-ext/ — extension scaffold with chat sidebar (or web UI alternative)
web/ — optional FastAPI + React UI

Evaluation

eval/swebench/ — SWE-bench Lite runner; reproducible scoring
eval/internal/ — 30 hand-built tasks across 3 languages (Python, TS, Go) with golden diffs
EVAL_REPORT.md — pass@1 on SWE-bench Lite, success rate on internal tasks, cost per task, latency per task
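The pass@1 figure in the report can be computed with the commonly used unbiased pass@k estimator, which for one attempt per task reduces to the fraction of tasks solved. A small sketch, assuming results are stored as a list of booleans per task (the `results` layout and the function names are illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int = 1) -> float:
    """Average pass@k over tasks; `results` maps task id -> attempt outcomes."""
    scores = [pass_at_k(len(attempts), sum(attempts), k) for attempts in results.values()]
    return sum(scores) / len(scores)
```

Reporting which k and how many attempts per task were used keeps the number comparable with published results.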

Production

Dockerfile for the agent service
safety/policies.md — destructive-op allowlist, egress allowlist, max budget
OBSERVABILITY.md — what you log per request, redaction policy
WRITEUP.md — failure-mode taxonomy from your evals; what you'd fix next
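The policy in `safety/policies.md` eventually has to become a check in code. One possible shape, sketched below, is a default-deny gate: a shell command runs without approval only if its program is on an allowlist and it contains no shell metacharacters. The allowlists and the `needs_approval` name are assumptions for illustration, not a complete policy.

```python
import shlex

# Programs (and git subcommands) assumed safe to run unattended. Illustrative only.
SAFE_PROGRAMS = {"ls", "cat", "rg", "pytest", "git"}
SAFE_GIT_SUBCOMMANDS = {"status", "diff", "log", "show"}
SHELL_METACHARACTERS = set(";|&`$><")


def needs_approval(command: str) -> bool:
    """Return True when a human must approve the command before it runs."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return True  # chaining or redirection could hide a destructive step
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input (e.g. an unclosed quote): be conservative
    if not tokens:
        return True
    program = tokens[0].rsplit("/", 1)[-1]  # treat /usr/bin/git like git
    if program not in SAFE_PROGRAMS:
        return True
    if program == "git":
        return len(tokens) < 2 or tokens[1] not in SAFE_GIT_SUBCOMMANDS
    return False
```

Default-deny means a new or unexpected command costs a prompt instead of a deleted directory.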

Resume Bullet Pattern

Built an autonomous coding agent (Claude-Code-style) with JSON-schema-validated tool calls, Docker-sandboxed execution, a plan/act/reflect loop, and per-task budget control. Achieved 24% pass@1 on SWE-bench Lite (above published Aider+Sonnet baseline) with a CLI + VS Code extension front-end. [demo + eval report]

Interview Talking Points

Tool design as the actual product: schemas are your API to the LLM; sloppy schemas = unreliable agent. Why granular tools (file_replace not apply_diff) reduce LLM error rate.
The orchestrator state machine: when to reflect, when to bail, how to compact history when context fills (summarization, sliding window, evicting tool outputs).
Sandbox security: container escapes, fork bombs, fs snapshots for rollback, egress allowlist (default-deny, hosts.allow-style), why Firecracker is overkill for personal use but right at scale.
Cost control: per-tool token cost accounting, hard budget gates, cheap model for "navigation" + expensive model for "edit" (model routing).
Failure modes: getting stuck in loops, fabricating file paths, ignoring tool errors, edit-conflict cascades. Your eval taxonomy.
Why JSON schemas and not freeform: structured outputs (Anthropic tool_use, OpenAI function calling) cut calls to nonexistent tools to near zero; argument values still need validation.
Evaluation rigor: SWE-bench Lite vs full SWE-bench; pass@1 vs pass@k; the Aider polyglot benchmark; why your internal eval matters more than public benchmarks.
Cursor vs Claude Code vs Devin: editor-integrated vs terminal vs autonomous-cloud. Tradeoffs and your design choice.
Multi-agent: planner / coder / reviewer split — when it helps (complex refactors), when it adds latency without quality gain.
Human-in-the-loop: opt-in approval for destructive ops; how you UX it without killing flow.
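Of the history-compaction options mentioned above, evicting old tool outputs is the simplest to sketch. The version below assumes messages are dicts with `role` and `content` keys and that tool results use the role `"tool"`; those conventions, the function name, and the default limits are all assumptions.

```python
def compact_history(
    messages: list[dict],
    keep_recent: int = 4,
    max_output_chars: int = 200,
) -> list[dict]:
    """Truncate large tool outputs that fall outside the most recent window.

    Returns a new list; the input messages are not mutated.
    """
    cutoff = max(len(messages) - keep_recent, 0)
    compacted = []
    for index, message in enumerate(messages):
        content = message["content"]
        is_old_tool_output = index < cutoff and message["role"] == "tool"
        if is_old_tool_output and len(content) > max_output_chars:
            omitted = len(content) - max_output_chars
            message = {
                **message,
                "content": content[:max_output_chars] + f"\n[... {omitted} chars evicted]",
            }
        compacted.append(message)
    return compacted
```

Leaving an explicit eviction marker tells the model that output existed, so it can re-run the tool instead of guessing what it said.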

Getting Started

Define your tool schemas first — write the JSON schemas before any agent code. They're the contract.
Build the sandbox in Docker. Smoke-test: run a shell command inside the container, capture stdout, and enforce a 10 s timeout.
Single-tool agent: just file_read + final_answer. Get the LLM to read a file and summarize it. Verify schemas are obeyed.
Add file_write, run_shell, search_codebase one at a time. Test each tool in isolation.
Wire the orchestrator loop with a hard 10-step budget. Run on a toy task: "fix the failing test in this 3-file repo".
Add reflection step every 5 turns: "summarize what you've tried and what's left".
Run on SWE-bench Lite (300 tasks; roughly $50 in API cost with Sonnet, depending on your step budget). Score yourself. Compare against published results.
Build the failure taxonomy from the SWE-bench traces. Ship 3 specific fixes for the top 3 failure modes.
Build the CLI (Aider-style: shows diffs, asks for approval). It's mostly UX polish.
Build the VS Code extension (or web UI). Demo it. Record the demo. Most interviewers will only watch the video.
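The sandbox smoke test from the steps above (run a command, capture output, enforce a timeout) can be prototyped with the standard library before Docker is involved. In this sketch the command runs as a local process; in the real sandbox `argv` would be a container invocation. The function names and the result dictionary layout are assumptions.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(argv: Sequence[str], timeout_s: float = 10.0) -> dict:
    """Run a command, capture stdout/stderr, and give up after timeout_s seconds."""
    try:
        proc = subprocess.run(list(argv), capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "timed_out": False,
    }


def smoke_test() -> bool:
    """Check both the happy path and the timeout path using the current interpreter."""
    ok = run_with_timeout([sys.executable, "-c", "print('hi')"])
    slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
    return ok["exit_code"] == 0 and ok["stdout"].strip() == "hi" and slow["timed_out"]
```

The result dictionary doubles as the observation handed back to the model, so a timeout shows up as data rather than as an exception in the orchestrator.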

Stretch Goals

Local model alternative: switch the LLM backend to a self-hosted Qwen2.5-Coder-32B served by your Capstone-05 mini-vLLM. Now it's 100% in your stack.
Model routing: route navigation/search calls to Haiku/8B and edits to Sonnet/70B. Expect roughly a 5–10× cost reduction for a small quality loss; measure both on your evals.
Codebase-aware retrieval: index the repo with code embeddings; retrieve top-5 relevant files for each task automatically.
Multi-repo / monorepo support: cross-package refactors with dependency-graph awareness.
Long-horizon tasks: tasks spanning days, with checkpointing and resume (Devin-style).
Multi-agent debate: planner proposes, critic challenges, planner revises. Measure whether it improves results on hard tasks.
CI integration: agent triggered by GitHub issue label, opens PR with proposed fix.
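The model-routing stretch goal can be prototyped as a lookup plus a cost estimate. Everything concrete in the sketch below is a placeholder: the model names, the phase labels, and the relative prices are hypothetical values chosen to make the arithmetic easy, not real pricing.

```python
# Hypothetical model names and relative prices per 1k tokens; not real pricing.
MODEL_FOR_PHASE = {"navigate": "small-model", "edit": "large-model"}
PRICE_PER_1K_TOKENS = {"small-model": 1.0, "large-model": 10.0}


def route_model(phase: str) -> str:
    """Pick a model for a phase; unknown phases fall back to the large model."""
    return MODEL_FOR_PHASE.get(phase, "large-model")


def estimated_cost(calls: list[tuple[str, int]], routed: bool = True) -> float:
    """Sum the cost of (phase, tokens) calls, with or without routing."""
    total = 0.0
    for phase, tokens in calls:
        model = route_model(phase) if routed else "large-model"
        total += tokens * PRICE_PER_1K_TOKENS[model] / 1000
    return total
```

Running the same trace through `estimated_cost` with `routed=True` and `routed=False` gives a quick estimate of the savings before any quality evaluation is run.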

What This Capstone Proves About You

You can build the kind of product that defines current AI funding rounds: a real agent that does real work, safely. You understand the unglamorous engineering (sandboxing, schemas, retries, budgets, observability) that separates a demo from a product. You can quote SWE-bench numbers and discuss the failure taxonomy intelligently.

This is the bar for Applied AI Engineer / Agent Engineer / AI Product Engineer roles at Anthropic (Claude Code), Cursor, Cognition (Devin), GitHub (Copilot Workspace), Replit, OpenAI (Codex/Operator), and every well-funded coding-agent startup of 2025–2026.
