Capstone 09 — On-Device LLM (Quantize → MLX / llama.cpp / GGUF → Ship)
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 2–3 weeks
Real-world parallel: Apple Intelligence (on-device 3B model), Ollama, LM Studio, GPT4All, Pocket Pal, Microsoft Phi-3.5 on edge, Gemini Nano on Android, llama.cpp ecosystem. The capstone for edge AI / on-device inference roles.
Goals
Take a capable open-source LLM, squeeze it into a laptop or phone, and ship it as a real product. End-to-end:
- Pick a target: Llama-3.2-3B, Phi-3.5-mini-3.8B, or Qwen2.5-3B.
- Quantize to multiple formats: GGUF Q4_K_M (CPU), GGUF Q5_K_M (quality), MLX 4-bit (Apple Silicon), AWQ INT4 (CUDA edge).
- Benchmark each for tokens/sec, RAM, perplexity, eval scores. Pick a Pareto-optimal default.
- Ship a real desktop app (Tauri, or alternatively Electron / Swift / Flutter) with native streaming, model auto-download, and offline operation.
- Mobile bonus: iOS app via MLX-Swift or Android via MediaPipe / llama.cpp JNI.
- Production niceties: model manager, conversation history, system prompt presets, MCP / tool-use hook.
Architecture
┌──────────────────────────────────────────────────────────┐
│ Step 1: Quantization Lab │
│ HF model → GPTQ / AWQ / GGUF / MLX │
│ - PPL on WikiText-2 │
│ - HellaSwag / ARC / MMLU │
│ - tokens/sec on M-series, x86, ARM, CUDA edge │
│ - RAM usage at peak │
└──────────────────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Step 2: Inference Backend │
│ - llama.cpp (Metal / CUDA / Vulkan / CPU) │
│ - MLX (Apple Silicon native) │
│ - MediaPipe LLM (Android / iOS / Web) │
│ - ONNX Runtime mobile (cross-platform fallback) │
└──────────────────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Step 3: Application │
│ Desktop: Tauri (Rust + WebView) — small bundle, native │
│ - Model picker + auto-download w/ resume │
│ - Streaming chat UI │
│ - System prompt presets ("Code Reviewer", "Tutor"…) │
│ - Settings: temperature, top_p, max tokens, n_ctx │
│ - Conversation export (JSON, Markdown) │
│ - Optional: MCP-style tool hooks (browse, run code) │
│ Mobile: native (SwiftUI + MLX, or Compose + MediaPipe) │
└──────────────────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────┐
│ Step 4: Distribution │
│ - GitHub Releases (signed binaries, auto-update) │
│ - Mac: notarized .dmg │
│ - Windows: signed .msi │
│ - Linux: AppImage / .deb │
│ - Mobile (stretch): TestFlight / Play Internal Testing │
└──────────────────────────────────────────────────────────┘
Suggested Stack
| Concern | Choice |
|---|---|
| Base model | Llama-3.2-3B-Instruct, Phi-3.5-mini, Qwen2.5-3B-Instruct |
| Quantization (cross-platform) | GGUF via llama.cpp/convert_hf_to_gguf.py, then llama-quantize |
| Quantization (Apple) | MLX via mlx-lm: mlx_lm.convert --hf-path ... -q --q-bits 4 |
| Quantization (CUDA edge) | AWQ via autoawq |
| Inference engine | llama.cpp (default), mlx-lm (Mac), MediaPipe LLM Inference API (tasks-genai) on mobile |
| Desktop UI | Tauri (Rust + web) for small bundles; alternative: Electron, Flutter |
| iOS | SwiftUI + mlx-swift-examples |
| Android | Kotlin + llama.cpp JNI bindings or MediaPipe LLM |
| Eval | lm-evaluation-harness, custom perplexity script |
| Bench | llama-bench (built into llama.cpp), MLX's mlx_lm.benchmark |
Deliverables Checklist
Quantization & Eval
- `quant/convert_gguf.sh`: script that produces Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
- `quant/convert_mlx.sh`: produces MLX 4-bit and 8-bit
- `quant/convert_awq.py`: AWQ INT4 with calibration set
- `eval/perplexity.py`: WikiText-2 PPL across all variants
- `eval/lm_harness.sh`: HellaSwag, ARC-E, MMLU on each quant
- `bench/run_bench.sh`: tokens/sec on M2/M3 (Mac), x86 laptop CPU, ARM phone, GTX/RTX edge
- `BENCHMARK.md`: Pareto plot (quality vs speed vs RAM); recommended default per platform
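The "Pareto plot" deliverable boils down to discarding every variant that some other variant beats on all axes at once. Below is a minimal sketch of that filter, assuming three axes (perplexity, tokens/sec, peak RAM). The `Variant` class, the `Q4_slow` entry, and all numbers are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    ppl: float        # perplexity, lower is better
    tok_per_s: float  # decode speed, higher is better
    ram_gb: float     # peak RAM, lower is better


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is no worse than b on every axis and strictly better on at least one."""
    no_worse = a.ppl <= b.ppl and a.tok_per_s >= b.tok_per_s and a.ram_gb <= b.ram_gb
    better = a.ppl < b.ppl or a.tok_per_s > b.tok_per_s or a.ram_gb < b.ram_gb
    return no_worse and better


def pareto_front(variants: list[Variant]) -> list[Variant]:
    """Keep only the variants that no other variant dominates."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


# Placeholder numbers for illustration only; replace with your own measurements.
results = [
    Variant("Q3_K_M", ppl=12.0, tok_per_s=40.0, ram_gb=2.0),
    Variant("Q4_K_M", ppl=11.0, tok_per_s=38.0, ram_gb=2.4),
    Variant("Q4_slow", ppl=11.2, tok_per_s=30.0, ram_gb=2.6),  # hypothetical dominated build
    Variant("Q8_0", ppl=10.8, tok_per_s=22.0, ram_gb=3.8),
]
front_names = [v.name for v in pareto_front(results)]
print(front_names)
```

The recommended default per platform is then a judgment call among the surviving variants, for example the fastest one that still meets your quality target.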
Desktop App
- `app/`: Tauri project (or Electron alternative)
- `app/src-tauri/`: Rust backend embedding llama.cpp via the `llama-cpp-rs` crate
- `app/src/`: web UI (SvelteKit / React)
- Model manager with download progress, integrity check (sha256), background loading
- System-prompt presets file (`presets.json`) with at least 6 useful personas
- Streaming chat with stop-token handling, regenerate, edit-and-resubmit
- Persistent conversation storage (SQLite via `rusqlite`)
- Settings UI: model select, temperature, top_p, top_k, repeat_penalty, n_ctx, threads, GPU layers
- Export: Markdown / JSON / share-link
- Signed releases for Mac (notarized) + Windows + Linux
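The model manager's integrity check can be prototyped before any Rust is written. This is a sketch in Python, assuming the expected digest is published alongside the model file. The function names and the tiny stand-in file are illustrative, and the app's Rust backend would do the equivalent with a hashing crate.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a multi-GB model never sits fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> bool:
    """Compare against the published checksum (hex, case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Self-check on a small temporary file standing in for a downloaded model.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "model.gguf"
    fake_model.write_bytes(b"abc")
    expected = hashlib.sha256(b"abc").hexdigest()
    ok = verify_model(fake_model, expected)
    tampered = verify_model(fake_model, "0" * 64)
print(ok, tampered)
```

Running the check before the first load (and refusing to load on mismatch) also catches truncated downloads when resume logic misbehaves.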
Mobile (Stretch)
- iOS app: SwiftUI + MLX-Swift; Q4 model (~1.8 GB) running natively
- Android app: Kotlin + MediaPipe LLM Inference task
Production
- `MODEL_CARD.md` per quantization (quality numbers, intended use, limitations)
- `PRIVACY.md`: explicit "everything stays on device" statement; what telemetry (none, opt-in)
- `WRITEUP.md`: quality cliff (where Q3 fails), platform tradeoffs, what surprised you
- Demo video (Loom)
Performance Targets
| Platform | Model + Quant | Target |
|---|---|---|
| Apple M3 Pro | Llama-3.2-3B Q4_K_M | ≥ 35 tok/s, RAM ≤ 3 GB |
| Apple M3 Pro | Llama-3.2-3B MLX 4-bit | ≥ 60 tok/s, RAM ≤ 2.5 GB |
| Apple M3 Max | Llama-3.2-3B MLX 8-bit | ≥ 50 tok/s |
| x86 laptop CPU (8c/16t) | Q4_K_M | ≥ 12 tok/s |
| RTX 4060 Laptop | Q4_K_M, GPU offload | ≥ 80 tok/s |
| iPhone 15 Pro | MLX 4-bit | ≥ 15 tok/s |
| Quality vs FP16 | Q4_K_M | PPL within 5%, MMLU within 1.5 pts |
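A quick way to sanity-check targets like these is a back-of-envelope bound: at batch size 1, each generated token reads roughly every weight once, so decode speed cannot exceed memory bandwidth divided by model size. The sketch below makes that arithmetic explicit. The parameter count (about 3.2e9), the effective bits per weight (about 4.85 for Q4_K_M, 8.5 for Q8_0), and the 150 GB/s bandwidth figure are assumptions for illustration; substitute values for your own model and hardware.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes); ignores KV cache and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    return bandwidth_gb_s / size_gb


# Assumed inputs, not measurements.
N_PARAMS = 3.2e9
BANDWIDTH_GB_S = 150.0

size_q4 = model_size_gb(N_PARAMS, 4.85)
size_q8 = model_size_gb(N_PARAMS, 8.5)
ceiling_q4 = decode_ceiling_tok_s(BANDWIDTH_GB_S, size_q4)
ceiling_q8 = decode_ceiling_tok_s(BANDWIDTH_GB_S, size_q8)

print(f"Q4_K_M: ~{size_q4:.2f} GB, ceiling ~{ceiling_q4:.0f} tok/s")
print(f"Q8_0:   ~{size_q8:.2f} GB, ceiling ~{ceiling_q8:.0f} tok/s")
```

Measured numbers will land below the ceiling. If a measurement comes out above it, the benchmark is probably timing prompt processing or a cached run rather than steady-state decode.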
Resume Bullet Pattern
Shipped a fully on-device LLM desktop app (Tauri + llama.cpp + MLX) running Llama-3.2-3B at 60 tok/s on M3 Pro with <2.5 GB RAM. Benchmarked 8 quantization variants (5 GGUF from Q3_K_M to Q8_0, MLX 4-bit/8-bit, AWQ INT4) for the Pareto frontier; published model card + signed cross-platform releases. [downloads + benchmarks]
Interview Talking Points
- GGUF format: file layout (header, kv-metadata, tensor data), why it succeeded ggml; advantages over safetensors for inference (single-file, mmap-friendly, embedded vocab).
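- K-quants (Q4_K_M et al.): block-wise quantization in 256-weight super-blocks, with per-sub-block scales (and mins for some types) that are themselves quantized; the _S/_M/_L suffixes mix quant types across tensors (Q4_K_M keeps some attention and feed-forward tensors at Q6_K); why K-quants beat the old Q4_0 by roughly 2% PPL at a similar bit budget.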
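- AWQ vs GPTQ vs RTN: AWQ identifies salient weight channels from activation statistics and scales them up before INT4 rounding, with the inverse scale applied to the activations so the transform is mathematically recoverable. GPTQ uses approximate second-order (Hessian) information to compensate rounding error layer by layer. RTN (round-to-nearest) is the naive baseline.
- MLX vs llama.cpp on Apple Silicon: MLX is built around unified memory and is often faster for batch-1 decode; llama.cpp's Metal backend is more battle-tested and supports all GGUF quants.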
- The Pareto frontier: bits-per-weight vs perplexity is roughly linear above 3 bits and falls off a cliff below; Q4_K_M is the usual sweet spot, but confirm it on your own model.
- Memory bandwidth bound: batch-1 decode on edge devices is almost always memory-bandwidth-bound (low arithmetic intensity), so halving model size roughly doubles decode tokens/sec; prompt processing (prefill) is more compute-bound.
- Apple Intelligence model: ~3B-param model with rank-16 LoRA adapters per task and low-bit weights (Apple describes a mix of 2-bit and 4-bit, averaging under 4 bits per weight), runs on Neural Engine. The architecture you're cloning.
- Privacy story: zero-network operation, no telemetry, sandbox guarantees. The actual product differentiator vs cloud chatbots.
- Battery and thermal: token-rate target needs to match thermal envelope; sustained vs burst tokens/sec.
- Tool-use / MCP on-device: small models struggle with agentic loops; mitigations (constrained decoding, JSON-mode, retrieve-then-answer pattern).
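The "per-block scale + min" idea from the K-quants talking point is easier to explain with a toy version in hand. The sketch below is a simplified asymmetric block quantizer in plain Python, closer in spirit to the legacy Q4_1 layout than to real K-quants (it has no super-blocks and no quantized scales). All function names and the synthetic weights are illustrative assumptions.

```python
import math


def quantize_block(weights, bits=4):
    """Asymmetric quantization of one block: w ~= scale * q + w_min, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:  # constant block: every q is 0, any non-zero scale works
        scale = 1.0
    q = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return q, scale, w_min


def dequantize_block(q, scale, w_min):
    return [scale * v + w_min for v in q]


def quantize_dequantize(weights, block_size, bits=4):
    """Round-trip a flat weight list through independent blocks."""
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored.extend(dequantize_block(*quantize_block(block, bits)))
    return restored


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def bits_per_weight(block_size, bits=4, scale_bits=16, min_bits=16):
    """Storage cost once the per-block scale and min are counted."""
    return bits + (scale_bits + min_bits) / block_size


# Synthetic tensor: a small-magnitude region followed by one 10x larger.
weights = [0.01 * math.sin(i) for i in range(64)] + [0.1 * math.sin(i) for i in range(64)]
small = slice(0, 64)

# One scale for the whole tensor: the large region sets the step size for everything.
err_small_one_block = max_abs_error(weights[small], quantize_dequantize(weights, len(weights))[small])
# Blocks of 32: the small region gets its own, much finer step.
err_small_blocks_32 = max_abs_error(weights[small], quantize_dequantize(weights, 32)[small])

print(err_small_one_block, err_small_blocks_32, bits_per_weight(32))
```

The trade-off is visible in `bits_per_weight`: smaller blocks track local ranges better but spend more bits on metadata. That overhead is what K-quants reduce by quantizing the scales and mins themselves.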
Getting Started
- Pick the model. Llama-3.2-3B is the safest default (license, quality, ecosystem).
- Convert to GGUF with `llama.cpp/convert_hf_to_gguf.py`. Then quantize to Q4_K_M, Q5_K_M, Q8_0.
- Smoke test with `llama-cli -m model.gguf -p "Hello"`. Verify coherent output.
- Run perplexity with `llama-perplexity` on WikiText-2 for each quant. Build the table.
- Run `llama-bench` on every device you can access (yours, friends', cloud Mac instance).
- For Mac users: convert with `mlx_lm.convert --hf-path ... -q --q-bits 4`. Compare MLX speed vs llama.cpp Metal on the same hardware.
- Build the desktop app. Start with Tauri scaffold + `llama-cpp-rs`. Wire streaming first; UI second.
- Add the model manager: download with progress, sha256 verify, mmap-load, swap models without restart.
- Polish UX: presets, regenerate, settings, conversation history. Spend at least a week here — UX is the product on edge.
- Sign and release cross-platform binaries on GitHub. Notarize the Mac build. Demo video. Submit to Hacker News / r/LocalLLaMA — community feedback is interview gold.
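When building the perplexity table in the steps above, it helps to be clear about what the number means and how to turn it into a pass/fail check. The sketch below computes perplexity from per-token natural-log probabilities and applies a relative-degradation gate. The 5% threshold mirrors the "PPL within 5%" target in this brief; the function names and example values are hypothetical.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(ppl_quant: float, ppl_ref: float) -> float:
    """Fractional perplexity increase of a quantized model over the reference."""
    return (ppl_quant - ppl_ref) / ppl_ref


def passes_quality_gate(ppl_quant: float, ppl_ref: float, max_increase: float = 0.05) -> bool:
    return relative_increase(ppl_quant, ppl_ref) <= max_increase


# A model that assigns probability 1/4 to every token has perplexity 4 (up to float rounding).
uniform_ppl = perplexity([math.log(0.25)] * 10)

# Hypothetical values: FP16 reference at 10.0, two quantized variants.
gate_ok = passes_quality_gate(10.4, 10.0)
gate_fail = passes_quality_gate(10.6, 10.0)
print(uniform_ppl, gate_ok, gate_fail)
```

Perplexities are only comparable when the tokenizer, dataset split, and context length are identical across runs, so fix those once and reuse them for every variant.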
Stretch Goals
- iOS app in MLX-Swift. Real "ChatGPT in your pocket" demo.
- MCP (Model Context Protocol) integration: connect to local file system, browser, calendar via MCP servers — fully offline agent.
- LoRA hot-swap: ship base model + 4–6 task adapters (coder, writer, summarizer); switch without reload.
- Speculative decoding with a 0.5B draft (Qwen2.5-0.5B) for the 3B target. Surprisingly effective on M-series.
- RAG built in: drag a PDF into the app → local embeddings → retrieve while chatting. All offline.
- Voice mode: Whisper.cpp for STT + Coqui/Piper for TTS, 100% on-device.
- Web demo in the browser: a `wllama` (WebAssembly) or `web-llm` (WebGPU) port; runs in the browser, zero install.
- Auto-update with delta patches.
What This Capstone Proves About You
You can take a research artifact and turn it into a product normal humans can install and use. You understand the full stack from quantization formats to UI polish, and the platform-specific trade-offs (MLX vs llama.cpp, x86 vs ARM, mobile vs desktop). You can quote tokens/sec and RAM numbers across hardware tiers. You shipped a signed binary that other people use.
This is the bar for On-Device AI Engineer / Edge ML Engineer / AI Product Engineer roles at Apple (Intelligence team), Google (Gemini Nano / MediaPipe), Meta (on-device Llama), Microsoft (Phi on Surface / Windows), Qualcomm (AI Engine), Hugging Face (local-first tooling), Ollama, LM Studio, and any startup building privacy-first AI products. Few candidates have actually shipped a working installable AI app, so having one is a differentiating signal.