Capstone 09 — On-Device LLM (Quantize → MLX / llama.cpp / GGUF → Ship)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 2–3 weeks

Real-world parallel: Apple Intelligence (on-device 3B model), Ollama, LM Studio, GPT4All, Pocket Pal, Microsoft Phi-3.5 on edge, Gemma Nano on Android, llama.cpp ecosystem. The capstone for edge AI / on-device inference roles.


Goals

Take a capable open-source LLM, squeeze it into a laptop or phone, and ship it as a real product. End-to-end:

  1. Pick a target: Llama-3.2-3B, Phi-3.5-mini-3.8B, or Qwen2.5-3B.
  2. Quantize to multiple formats: GGUF Q4_K_M (CPU), GGUF Q5_K_M (quality), MLX 4-bit (Apple Silicon), AWQ INT4 (CUDA edge).
  3. Benchmark each for tokens/sec, RAM, perplexity, eval scores. Pick a Pareto-optimal default.
  4. Ship a real desktop app (Electron + Tauri / Swift / Flutter) with native streaming, model auto-download, and offline operation.
  5. Mobile bonus: iOS app via MLX-Swift or Android via MediaPipe / llama.cpp JNI.
  6. Production niceties: model manager, conversation history, system prompt presets, MCP / tool-use hook.

Architecture

   ┌──────────────────────────────────────────────────────────┐
   │ Step 1: Quantization Lab                                 │
   │  HF model → GPTQ / AWQ / GGUF / MLX                      │
   │   - PPL on WikiText-2                                    │
   │   - HellaSwag / ARC / MMLU                              │
   │   - tokens/sec on M-series, x86, ARM, CUDA edge          │
   │   - RAM usage at peak                                    │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 2: Inference Backend                                │
   │  - llama.cpp (Metal / CUDA / Vulkan / CPU)               │
   │  - MLX (Apple Silicon native)                            │
   │  - MediaPipe LLM (Android / iOS / Web)                   │
   │  - ONNX Runtime mobile (cross-platform fallback)         │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 3: Application                                      │
   │  Desktop: Tauri (Rust + WebView) — small bundle, native │
   │   - Model picker + auto-download w/ resume               │
   │   - Streaming chat UI                                    │
   │   - System prompt presets ("Code Reviewer", "Tutor"…)    │
   │   - Settings: temperature, top_p, max tokens, n_ctx      │
   │   - Conversation export (JSON, Markdown)                 │
   │   - Optional: MCP-style tool hooks (browse, run code)    │
   │  Mobile: native (SwiftUI + MLX, or Compose + MediaPipe) │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 4: Distribution                                     │
   │  - GitHub Releases (signed binaries, auto-update)        │
   │  - Mac: notarized .dmg                                   │
   │  - Windows: signed .msi                                  │
   │  - Linux: AppImage / .deb                                │
   │  - Mobile (stretch): TestFlight / Play Internal Testing  │
   └──────────────────────────────────────────────────────────┘

Suggested Stack

ConcernChoice
Base modelLlama-3.2-3B-Instruct, Phi-3.5-mini, Qwen2.5-3B-Instruct
Quantization (cross-platform)GGUF via llama.cpp/convert_hf_to_gguf.py, then quantize
Quantization (Apple)MLX via mlx-lm, mlx_lm.convert --quantize -q 4
Quantization (CUDA edge)AWQ via autoawq
Inference enginellama.cpp (default), mlx-lm (Mac), mediapipe-tasks-text (mobile)
Desktop UITauri (Rust + web) for small bundles; alternative: Electron, Flutter
iOSSwiftUI + mlx-swift-examples
AndroidKotlin + llama.cpp JNI bindings or MediaPipe LLM
Evallm-evaluation-harness, custom perplexity script
Benchllama-bench (built into llama.cpp), MLX's mlx_lm.benchmark

Deliverables Checklist

Quantization & Eval

  • quant/convert_gguf.sh — script that produces Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
  • quant/convert_mlx.sh — produces MLX 4-bit and 8-bit
  • quant/convert_awq.py — AWQ INT4 with calibration set
  • eval/perplexity.py — WikiText-2 PPL across all variants
  • eval/lm_harness.sh — HellaSwag, ARC-E, MMLU on each quant
  • bench/run_bench.sh — tokens/sec on M2/M3 (Mac), x86 laptop CPU, ARM phone, GTX/RTX edge
  • BENCHMARK.md — Pareto plot (quality vs speed vs RAM); recommended default per platform

Desktop App

  • app/ — Tauri project (or Electron alternative)
  • app/src-tauri/ — Rust backend embedding llama.cpp via llama-cpp-rs crate
  • app/src/ — web UI (SvelteKit / React)
  • Model manager with download progress, integrity check (sha256), background loading
  • System-prompt presets file (presets.json) with at least 6 useful personas
  • Streaming chat with stop-token handling, regenerate, edit-and-resubmit
  • Persistent conversation storage (SQLite via rusqlite)
  • Settings UI: model select, temperature, top_p, top_k, repeat_penalty, n_ctx, threads, GPU layers
  • Export: Markdown / JSON / share-link
  • Signed releases for Mac (notarized) + Windows + Linux

Mobile (Stretch)

  • iOS app: SwiftUI + MLX-Swift; Q4 model (~1.8 GB) running natively
  • Android app: Kotlin + MediaPipe LLM Inference task

Production

  • MODEL_CARD.md per quantization (quality numbers, intended use, limitations)
  • PRIVACY.md — explicit "everything stays on device" statement; what telemetry (none, opt-in)
  • WRITEUP.md — quality cliff (where Q3 fails), platform tradeoffs, what surprised you
  • Demo video (loom)

Performance Targets

PlatformModel + QuantTarget
Apple M3 ProLlama-3.2-3B Q4_K_M≥ 35 tok/s, RAM ≤ 3 GB
Apple M3 ProLlama-3.2-3B MLX 4-bit≥ 60 tok/s, RAM ≤ 2.5 GB
Apple M3 MaxLlama-3.2-3B MLX 8-bit≥ 50 tok/s
x86 laptop CPU (8c/16t)Q4_K_M≥ 12 tok/s
RTX 4060 LaptopQ4_K_M, GPU offload≥ 80 tok/s
iPhone 15 ProMLX 4-bit≥ 15 tok/s
Quality vs FP16Q4_K_MPPL within 5%, MMLU within 1.5 pts

Resume Bullet Pattern

Shipped a fully on-device LLM desktop app (Tauri + llama.cpp + MLX) running Llama-3.2-3B at 60 tok/s on M3 Pro with <2.5 GB RAM. Benchmarked 5 quantization variants (Q3..Q8 GGUF + MLX-4/8 + AWQ) for the Pareto frontier; published model card + signed cross-platform releases. [downloads + benchmarks]


Interview Talking Points

  • GGUF format: file layout (header, kv-metadata, tensor data), why it succeeded ggml; advantages over safetensors for inference (single-file, mmap-friendly, embedded vocab).
  • K-quants (Q4_K_M et al.): block-wise quantization with per-block scale + min, mixed bit-widths within a tensor; why K-quants beat the old Q4_0 by ~2% PPL at the same bit budget.
  • AWQ vs GPTQ vs RTN: AWQ identifies salient channels via activations, scales them up before INT4 (recoverable). GPTQ uses Hessian-aware second-order. RTN is the naive baseline.
  • MLX vs llama.cpp on Apple Silicon: MLX uses unified memory more aggressively, faster for batch-1 decode; llama.cpp's Metal backend is more battle-tested and supports all GGUF quants.
  • The Pareto frontier: bits-per-weight vs perplexity is roughly linear above 3 bits and falls off a cliff below; Q4_K_M is the universal sweet spot.
  • Memory bandwidth bound: edge inference is ~always memory-bound (low arithmetic intensity at batch=1); halving model size doubles tokens/sec almost exactly.
  • Apple Intelligence model: ~3B-param model with rank-2 LoRA adapters per task, 4-bit weights, runs on Neural Engine. The architecture you're cloning.
  • Privacy story: zero-network operation, no telemetry, sandbox guarantees. The actual product differentiator vs cloud chatbots.
  • Battery and thermal: token-rate target needs to match thermal envelope; sustained vs burst tokens/sec.
  • Tool-use / MCP on-device: small models struggle with agentic loops; mitigations (constrained decoding, JSON-mode, retrieve-then-answer pattern).

Getting Started

  1. Pick the model. Llama-3.2-3B is the safest default (license, quality, ecosystem).
  2. Convert to GGUF with llama.cpp/convert_hf_to_gguf.py. Then quantize to Q4_K_M, Q5_K_M, Q8_0.
  3. Smoke test with llama-cli -m model.gguf -p "Hello". Verify coherent output.
  4. Run perplexity with llama-perplexity on WikiText-2 for each quant. Build the table.
  5. Run llama-bench on every device you can access (yours, friends', cloud Mac instance).
  6. For Mac users: convert with mlx_lm.convert --hf-path ... -q --q-bits 4. Compare MLX speed vs llama.cpp Metal on the same hardware.
  7. Build the desktop app. Start with Tauri scaffold + llama-cpp-rs. Wire streaming first; UI second.
  8. Add the model manager: download with progress, sha256 verify, mmap-load, swap models without restart.
  9. Polish UX: presets, regenerate, settings, conversation history. Spend at least a week here — UX is the product on edge.
  10. Sign and release cross-platform binaries on GitHub. Notarize the Mac build. Demo video. Submit to Hacker News / r/LocalLLaMA — community feedback is interview gold.

Stretch Goals

  • iOS app in MLX-Swift. Real "ChatGPT in your pocket" demo.
  • MCP (Model Context Protocol) integration: connect to local file system, browser, calendar via MCP servers — fully offline agent.
  • LoRA hot-swap: ship base model + 4–6 task adapters (coder, writer, summarizer); switch without reload.
  • Speculative decoding with a 0.5B draft (Qwen2.5-0.5B) for the 3B target. Surprisingly effective on M-series.
  • RAG built in: drag a PDF into the app → local embeddings → retrieve while chatting. All offline.
  • Voice mode: Whisper.cpp for STT + Coqui/Piper for TTS, 100% on-device.
  • Web demo via WebGPU: wllama or web-llm port — runs in the browser, zero install.
  • Auto-update with delta patches.

What This Capstone Proves About You

You can take a research artifact and turn it into a product normal humans can install and use. You understand the full stack from quantization formats to UI polish, and the platform-specific trade-offs (MLX vs llama.cpp, x86 vs ARM, mobile vs desktop). You can quote tokens/sec and RAM numbers across hardware tiers. You shipped a signed binary that other people use.

This is the bar for On-Device AI Engineer / Edge ML Engineer / AI Product Engineer roles at Apple (Intelligence team), Google (Gemini Nano / MediaPipe), Meta (on-device Llama), Microsoft (Phi on Surface / Windows), Qualcomm (AI Engine), Hugging Face (local-first tooling), Ollama, LM Studio, and any startup building privacy-first AI products. Few candidates have actually shipped a working installable AI app — having one is differentiating signal.