Capstone 06 — Multimodal Vision Assistant (LLaVA-style)
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks
Real-world parallel: GPT-4V / GPT-4o vision, Claude 3.5 Sonnet vision, Gemini 1.5, LLaVA, Qwen2-VL, Idefics. The capstone for multimodal foundation model roles.
Goals
Build a vision-language assistant that can answer questions about images, do OCR, describe scenes, and reason multi-step over visual content. Two phases:
- Build LLaVA-style architecture from scratch: SigLIP/CLIP vision encoder + projection MLP + Llama-3-8B language model.
- Two-stage training:
- Stage 1 (alignment): train only the projection MLP on image-caption pairs (LAION/CC3M sample). Vision and LM stay frozen.
- Stage 2 (instruction tuning): unfreeze the LM (LoRA), fine-tune on visual instruction data (LLaVA-1.5 mix or your own).
- Ship it: vLLM-compatible serving, Streamlit UI, OpenAI-compatible API with
image_urlsupport, image upload, evals.
Architecture
Image (any size)
│
▼
┌───────────────────────────┐
│ SigLIP-SO400M-patch14-384 │ (frozen)
│ → 729 patch embeddings │
│ each 1152-dim │
└─────────────┬─────────────┘
│
▼
┌───────────────────────────┐
│ Projection MLP (trained) │
│ Linear(1152 → 4096) │
│ GELU │
│ Linear(4096 → 4096) │
│ → 729 visual tokens │
│ in LM embedding space │
└─────────────┬─────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Llama-3-8B (LM) │
│ Input sequence: │
│ [<system>] [<image_tokens × 729>] [<text query>] │
│ Output: streamed text response │
│ Stage 1: LM frozen | Stage 2: LM via LoRA r=16 │
└────────────────────────────────────────────────────────────┘
Suggested Stack
| Component | Choice |
|---|---|
| Vision encoder | google/siglip-so400m-patch14-384 (best quality) or openai/clip-vit-large-patch14-336 |
| LM | meta-llama/Meta-Llama-3-8B-Instruct or Qwen/Qwen2-7B |
| Stage-1 data | LLaVA-Pretrain (558k image-caption pairs) |
| Stage-2 data | LLaVA-1.5-Instruct (665k visual instructions) |
| Training | transformers + accelerate + peft (LoRA) |
| Serving | vLLM (multi-modal support) or custom (your Capstone-05) |
| API | FastAPI + OpenAI-compatible vision schema |
| UI | Streamlit (drag-drop image upload) |
| Eval | MMMU, MM-Vet, ScienceQA, TextVQA |
Deliverables Checklist
-
model/vision_encoder.py— SigLIP loader with image preprocessing -
model/projector.py— 2-layer MLP, configurable hidden dim -
model/multimodal_llama.py— composes vision + projector + LM, handles<image>token expansion -
data/preprocess.py— image resize/pad to 384×384, tokenization with<image>placeholder -
train/stage1_align.py— train projector only on captioning loss -
train/stage2_instruct.py— LoRA on LM + projector on instruction data -
serve/api.py— OpenAI-compatible/v1/chat/completionsaccepting{"type":"image_url"}content parts -
serve/ui.py— Streamlit drag-drop demo -
eval/mmmu.py— multi-discipline multimodal eval -
eval/mm_vet.py— open-ended VQA judged by GPT-4o -
EVAL_REPORT.md— table vs LLaVA-1.5-7B baseline -
MODEL_CARD.md— limitations (hallucination on unseen domains, OCR weakness, etc.) -
Dockerfile+ compose - Demo video / loom
Resume Bullet Pattern
Built and trained a vision-language assistant from scratch (SigLIP + 2-layer projector + Llama-3-8B with LoRA) using two-stage LLaVA-style training; achieved 38% on MMMU and 51% on MM-Vet (vs LLaVA-1.5-7B at 35.4 / 30.5). Shipped vLLM-served OpenAI-compatible API with Streamlit demo. [demo + repo]
Interview Talking Points
- Why a projector, not cross-attention? LLaVA showed simple MLP projection beats Q-Former on most benchmarks at much lower complexity. Cross-attention (Flamingo) is more parameter-efficient but harder to train.
- Why two stages? Stage 1 aligns the visual features to the LM's token-embedding manifold without disturbing the LM. Stage 2 teaches instruction-following with visual context without losing language ability.
- Why SigLIP over CLIP? Sigmoid loss is more stable at scale and SigLIP-SO400M is the current open SOTA for image features.
- Image token count tradeoff: 729 tokens (SigLIP-384/14) vs 576 (CLIP-336/14) vs higher-res with tiling (LLaVA-NeXT). More tokens → better detail, more KV cache, slower.
- High-resolution strategies: AnyRes (LLaVA-NeXT) tiles the image into multiple 384×384 crops + a global thumbnail; Qwen2-VL uses dynamic resolution with 2D RoPE for vision.
- Hallucination: vision-LMs hallucinate objects that aren't in the image. Mitigations: POPE-style eval, contrastive decoding (VCD), DPO with hallucinated negatives.
- Serving complexity: image preprocessing latency (often dominates TTFT), batching variable-token-count inputs, KV cache implications of 729 prefix tokens.
- OCR limitations: native VLMs are weak at dense text; production systems often pipeline a separate OCR (PaddleOCR / Azure DI) and pass extracted text alongside.
Getting Started
- Verify infra: load SigLIP and Llama-3-8B separately. Confirm forward passes work and you understand the shapes.
- Implement the projector + token splicing. Single hardest engineering bit: replace each
<image>placeholder token in the input with the 729 projected vision tokens, recompute attention masks accordingly. - Smoke-test with random vision features → confirm the LM still generates coherently (it shouldn't suddenly break).
- Stage 1 (small): train projector only on 50k LLaVA-Pretrain samples. Should converge in a few hours on 1× A100. Loss target: ~2.0.
- Sanity check: ask the model to caption an image. Should produce vaguely related text.
- Stage 2: add LoRA to LM (r=16, all linears), train on LLaVA-Instruct sample (50k for first run).
- Eval qualitatively on 20 hand-picked images. Iterate before scaling.
- Scale stage 1 to full 558k, stage 2 to full 665k. ~24 GPU-hours total on 4× A100.
- Run MMMU + MM-Vet. Document gap to LLaVA-1.5-7B (you should be within ±5%).
- Ship: serve via vLLM with
--limit-mm-per-prompt image=1. Build the Streamlit demo. Record video.
Stretch Goals
- AnyRes tiling for high-res inputs (LLaVA-NeXT approach): supports 672×672 and beyond.
- Video understanding: extend to multi-frame inputs (sample 8 frames, pool features). Foundation for VideoLLaVA.
- Function calling with vision: model can call OCR / object-detection tools when needed.
- Multimodal RAG: index image+caption pairs; retrieve relevant images for a text query and feed back into the model.
- DPO on hallucination pairs: generate (faithful, hallucinated) pairs; DPO to suppress hallucination — measurable POPE improvement.
- Quantize and ship to MLX / llama.cpp for on-device (combine with Capstone-09).
What This Capstone Proves About You
You understand multimodal architectures end-to-end — not just "use a VLM API". You can train a non-trivial multi-component model (frozen + adapted modules), debug cross-modal alignment, evaluate against published benchmarks, and ship the result through a production-grade serving stack.
This is the bar for Multimodal Researcher / Engineer roles at Anthropic, OpenAI, Google DeepMind, Meta FAIR, xAI, Adept, Reka, and any startup building visual agents (robotics, autonomy, screen-understanding, design tools).