GGUF → Ship)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 2–3 weeks

Real-world parallel: Apple Intelligence (on-device 3B model), Ollama, LM Studio, GPT4All, Pocket Pal, Microsoft Phi-3.5 on edge, Gemma Nano on Android, llama.cpp ecosystem. The capstone for edge AI / on-device inference roles.

Goals

Take a capable open-source LLM, squeeze it into a laptop or phone, and ship it as a real product. End-to-end:

Pick a target: Llama-3.2-3B, Phi-3.5-mini-3.8B, or Qwen2.5-3B.
Quantize to multiple formats: GGUF Q4_K_M (CPU), GGUF Q5_K_M (quality), MLX 4-bit (Apple Silicon), AWQ INT4 (CUDA edge).
Benchmark each for tokens/sec, RAM, perplexity, eval scores. Pick a Pareto-optimal default.
Ship a real desktop app (Electron + Tauri / Swift / Flutter) with native streaming, model auto-download, and offline operation.
Mobile bonus: iOS app via MLX-Swift or Android via MediaPipe / llama.cpp JNI.
Production niceties: model manager, conversation history, system prompt presets, MCP / tool-use hook.

Architecture

   ┌──────────────────────────────────────────────────────────┐
   │ Step 1: Quantization Lab                                 │
   │  HF model → GPTQ / AWQ / GGUF / MLX                      │
   │   - PPL on WikiText-2                                    │
   │   - HellaSwag / ARC / MMLU                              │
   │   - tokens/sec on M-series, x86, ARM, CUDA edge          │
   │   - RAM usage at peak                                    │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 2: Inference Backend                                │
   │  - llama.cpp (Metal / CUDA / Vulkan / CPU)               │
   │  - MLX (Apple Silicon native)                            │
   │  - MediaPipe LLM (Android / iOS / Web)                   │
   │  - ONNX Runtime mobile (cross-platform fallback)         │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 3: Application                                      │
   │  Desktop: Tauri (Rust + WebView) — small bundle, native │
   │   - Model picker + auto-download w/ resume               │
   │   - Streaming chat UI                                    │
   │   - System prompt presets ("Code Reviewer", "Tutor"…)    │
   │   - Settings: temperature, top_p, max tokens, n_ctx      │
   │   - Conversation export (JSON, Markdown)                 │
   │   - Optional: MCP-style tool hooks (browse, run code)    │
   │  Mobile: native (SwiftUI + MLX, or Compose + MediaPipe) │
   └──────────────────────────┬───────────────────────────────┘
                              ▼
   ┌──────────────────────────────────────────────────────────┐
   │ Step 4: Distribution                                     │
   │  - GitHub Releases (signed binaries, auto-update)        │
   │  - Mac: notarized .dmg                                   │
   │  - Windows: signed .msi                                  │
   │  - Linux: AppImage / .deb                                │
   │  - Mobile (stretch): TestFlight / Play Internal Testing  │
   └──────────────────────────────────────────────────────────┘

Suggested Stack

Concern	Choice
Base model	Llama-3.2-3B-Instruct, Phi-3.5-mini, Qwen2.5-3B-Instruct
Quantization (cross-platform)	GGUF via `llama.cpp/convert_hf_to_gguf.py`, then `quantize`
Quantization (Apple)	MLX via `mlx-lm`, `mlx_lm.convert --quantize -q 4`
Quantization (CUDA edge)	AWQ via `autoawq`
Inference engine	`llama.cpp` (default), `mlx-lm` (Mac), `mediapipe-tasks-text` (mobile)
Desktop UI	Tauri (Rust + web) for small bundles; alternative: Electron, Flutter
iOS	SwiftUI + `mlx-swift-examples`
Android	Kotlin + `llama.cpp` JNI bindings or MediaPipe LLM
Eval	`lm-evaluation-harness`, custom perplexity script
Bench	`llama-bench` (built into llama.cpp), MLX's `mlx_lm.benchmark`

Deliverables Checklist

Quantization & Eval

quant/convert_gguf.sh — script that produces Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
quant/convert_mlx.sh — produces MLX 4-bit and 8-bit
quant/convert_awq.py — AWQ INT4 with calibration set
eval/perplexity.py — WikiText-2 PPL across all variants
eval/lm_harness.sh — HellaSwag, ARC-E, MMLU on each quant
bench/run_bench.sh — tokens/sec on M2/M3 (Mac), x86 laptop CPU, ARM phone, GTX/RTX edge
BENCHMARK.md — Pareto plot (quality vs speed vs RAM); recommended default per platform

Desktop App

app/ — Tauri project (or Electron alternative)
app/src-tauri/ — Rust backend embedding llama.cpp via llama-cpp-rs crate
app/src/ — web UI (SvelteKit / React)
Model manager with download progress, integrity check (sha256), background loading
System-prompt presets file (presets.json) with at least 6 useful personas
Streaming chat with stop-token handling, regenerate, edit-and-resubmit
Persistent conversation storage (SQLite via rusqlite)
Settings UI: model select, temperature, top_p, top_k, repeat_penalty, n_ctx, threads, GPU layers
Export: Markdown / JSON / share-link
Signed releases for Mac (notarized) + Windows + Linux

Mobile (Stretch)

iOS app: SwiftUI + MLX-Swift; Q4 model (~1.8 GB) running natively
Android app: Kotlin + MediaPipe LLM Inference task

Production

MODEL_CARD.md per quantization (quality numbers, intended use, limitations)
PRIVACY.md — explicit "everything stays on device" statement; what telemetry (none, opt-in)
WRITEUP.md — quality cliff (where Q3 fails), platform tradeoffs, what surprised you
Demo video (loom)

Performance Targets

Platform	Model + Quant	Target
Apple M3 Pro	Llama-3.2-3B Q4_K_M	≥ 35 tok/s, RAM ≤ 3 GB
Apple M3 Pro	Llama-3.2-3B MLX 4-bit	≥ 60 tok/s, RAM ≤ 2.5 GB
Apple M3 Max	Llama-3.2-3B MLX 8-bit	≥ 50 tok/s
x86 laptop CPU (8c/16t)	Q4_K_M	≥ 12 tok/s
RTX 4060 Laptop	Q4_K_M, GPU offload	≥ 80 tok/s
iPhone 15 Pro	MLX 4-bit	≥ 15 tok/s
Quality vs FP16	Q4_K_M	PPL within 5%, MMLU within 1.5 pts

Resume Bullet Pattern

Shipped a fully on-device LLM desktop app (Tauri + llama.cpp + MLX) running Llama-3.2-3B at 60 tok/s on M3 Pro with <2.5 GB RAM. Benchmarked 5 quantization variants (Q3..Q8 GGUF + MLX-4/8 + AWQ) for the Pareto frontier; published model card + signed cross-platform releases. [downloads + benchmarks]

Interview Talking Points

GGUF format: file layout (header, kv-metadata, tensor data), why it succeeded ggml; advantages over safetensors for inference (single-file, mmap-friendly, embedded vocab).
K-quants (Q4_K_M et al.): block-wise quantization with per-block scale + min, mixed bit-widths within a tensor; why K-quants beat the old Q4_0 by ~2% PPL at the same bit budget.
AWQ vs GPTQ vs RTN: AWQ identifies salient channels via activations, scales them up before INT4 (recoverable). GPTQ uses Hessian-aware second-order. RTN is the naive baseline.
MLX vs llama.cpp on Apple Silicon: MLX uses unified memory more aggressively, faster for batch-1 decode; llama.cpp's Metal backend is more battle-tested and supports all GGUF quants.
The Pareto frontier: bits-per-weight vs perplexity is roughly linear above 3 bits and falls off a cliff below; Q4_K_M is the universal sweet spot.
Memory bandwidth bound: edge inference is ~always memory-bound (low arithmetic intensity at batch=1); halving model size doubles tokens/sec almost exactly.
Apple Intelligence model: ~3B-param model with rank-2 LoRA adapters per task, 4-bit weights, runs on Neural Engine. The architecture you're cloning.
Privacy story: zero-network operation, no telemetry, sandbox guarantees. The actual product differentiator vs cloud chatbots.
Battery and thermal: token-rate target needs to match thermal envelope; sustained vs burst tokens/sec.
Tool-use / MCP on-device: small models struggle with agentic loops; mitigations (constrained decoding, JSON-mode, retrieve-then-answer pattern).

Getting Started

Pick the model. Llama-3.2-3B is the safest default (license, quality, ecosystem).
Convert to GGUF with llama.cpp/convert_hf_to_gguf.py. Then quantize to Q4_K_M, Q5_K_M, Q8_0.
Smoke test with llama-cli -m model.gguf -p "Hello". Verify coherent output.
Run perplexity with llama-perplexity on WikiText-2 for each quant. Build the table.
Run llama-bench on every device you can access (yours, friends', cloud Mac instance).
For Mac users: convert with mlx_lm.convert --hf-path ... -q --q-bits 4. Compare MLX speed vs llama.cpp Metal on the same hardware.
Build the desktop app. Start with Tauri scaffold + llama-cpp-rs. Wire streaming first; UI second.
Add the model manager: download with progress, sha256 verify, mmap-load, swap models without restart.
Polish UX: presets, regenerate, settings, conversation history. Spend at least a week here — UX is the product on edge.
Sign and release cross-platform binaries on GitHub. Notarize the Mac build. Demo video. Submit to Hacker News / r/LocalLLaMA — community feedback is interview gold.

Stretch Goals

iOS app in MLX-Swift. Real "ChatGPT in your pocket" demo.
MCP (Model Context Protocol) integration: connect to local file system, browser, calendar via MCP servers — fully offline agent.
LoRA hot-swap: ship base model + 4–6 task adapters (coder, writer, summarizer); switch without reload.
Speculative decoding with a 0.5B draft (Qwen2.5-0.5B) for the 3B target. Surprisingly effective on M-series.
RAG built in: drag a PDF into the app → local embeddings → retrieve while chatting. All offline.
Voice mode: Whisper.cpp for STT + Coqui/Piper for TTS, 100% on-device.
Web demo via WebGPU: wllama or web-llm port — runs in the browser, zero install.
Auto-update with delta patches.

What This Capstone Proves About You

You can take a research artifact and turn it into a product normal humans can install and use. You understand the full stack from quantization formats to UI polish, and the platform-specific trade-offs (MLX vs llama.cpp, x86 vs ARM, mobile vs desktop). You can quote tokens/sec and RAM numbers across hardware tiers. You shipped a signed binary that other people use.

This is the bar for On-Device AI Engineer / Edge ML Engineer / AI Product Engineer roles at Apple (Intelligence team), Google (Gemini Nano / MediaPipe), Meta (on-device Llama), Microsoft (Phi on Surface / Windows), Qualcomm (AI Engine), Hugging Face (local-first tooling), Ollama, LM Studio, and any startup building privacy-first AI products. Few candidates have actually shipped a working installable AI app — having one is differentiating signal.

LLM Inference Engineer