Capstone 06 — Multimodal Vision Assistant (LLaVA-style)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks

Real-world parallel: GPT-4V / GPT-4o vision, Claude 3.5 Sonnet vision, Gemini 1.5, LLaVA, Qwen2-VL, Idefics. The capstone for multimodal foundation model roles.


Goals

Build a vision-language assistant that can answer questions about images, do OCR, describe scenes, and reason multi-step over visual content. Three parts:

  1. Build a LLaVA-style architecture from scratch: SigLIP/CLIP vision encoder + projection MLP + Llama-3-8B language model.
  2. Two-stage training:
    • Stage 1 (alignment): train only the projection MLP on image-caption pairs (LAION/CC3M sample). Vision and LM stay frozen.
    • Stage 2 (instruction tuning): adapt the LM with LoRA (base LM weights stay frozen, only the adapters train) and keep training the projector; fine-tune on visual instruction data (LLaVA-1.5 mix or your own).
  3. Ship it: vLLM-compatible serving, Streamlit UI, OpenAI-compatible API with image_url support, image upload, evals.
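
The two-stage schedule above comes down to a rule for which parameters receive gradients in each stage. A minimal sketch of that rule follows. The module-name prefixes (`vision_encoder.`, `projector.`, `language_model.`) are hypothetical and should be adapted to your own model; the `lora_` substring assumes the adapter naming used by peft (`lora_A` / `lora_B`).

```python
# Hypothetical module-name prefixes; rename to match your own model class.
# Base weights of the vision encoder and the LM are frozen in both stages;
# the projector trains in both; LoRA adapters train only in stage 2.
STAGE_PLAN = {
    1: {"vision_encoder.": False, "projector.": True, "language_model.": False},
    2: {"vision_encoder.": False, "projector.": True, "language_model.": False},
}


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if this parameter should receive gradients in the given stage."""
    # Assumption: LoRA adapter weights carry "lora_" in their names.
    if "lora_" in param_name:
        return stage == 2
    for prefix, trainable in STAGE_PLAN[stage].items():
        if param_name.startswith(prefix):
            return trainable
    raise ValueError(f"unrecognized parameter name: {param_name}")


if __name__ == "__main__":
    names = [
        "vision_encoder.embeddings.patch_embedding.weight",
        "projector.0.weight",
        "language_model.layers.0.self_attn.q_proj.weight",
        "language_model.layers.0.self_attn.q_proj.lora_A.weight",
    ]
    for stage in (1, 2):
        for name in names:
            print(stage, name, is_trainable(name, stage))
```

In a PyTorch training script this would typically be applied by looping over `model.named_parameters()` and setting `requires_grad` from the returned value.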

Architecture

   Image (any size)
        │
        ▼
   ┌───────────────────────────┐
   │ SigLIP-SO400M-patch14-384 │   (frozen)
   │  → 729 patch embeddings   │
   │    each 1152-dim          │
   └─────────────┬─────────────┘
                 │
                 ▼
   ┌───────────────────────────┐
   │ Projection MLP (trained)  │
   │  Linear(1152 → 4096)      │
   │  GELU                     │
   │  Linear(4096 → 4096)      │
   │  → 729 visual tokens      │
   │    in LM embedding space  │
   └─────────────┬─────────────┘
                 │
                 ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Llama-3-8B (LM)                                            │
   │  Input sequence:                                           │
   │    [<system>]  [<image_tokens × 729>]  [<text query>]      │
   │  Output: streamed text response                            │
   │  Stage 1: LM frozen | Stage 2: LM via LoRA r=16            │
   └────────────────────────────────────────────────────────────┘
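
The numbers in the diagram can be checked with a few lines of arithmetic. The sketch below derives the visual token counts from image and patch size, and estimates the trainable parameter budget for the projector and for LoRA at r=16. It assumes bias terms in both projector layers, LoRA applied to all seven linear layers in each Llama-3-8B decoder block, and the published Llama-3-8B layer shapes (hidden size 4096, KV projection width 1024, MLP width 14336, 32 layers).

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted linear layer."""
    return r * (d_in + d_out)


VISION_DIM, LM_DIM = 1152, 4096

# Linear(1152 -> 4096) + GELU + Linear(4096 -> 4096), biases assumed.
projector_total = linear_params(VISION_DIM, LM_DIM) + linear_params(LM_DIM, LM_DIM)

# (in_features, out_features) of each linear layer in one Llama-3-8B decoder block.
LLAMA3_8B_LINEARS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
LORA_RANK = 16

lora_total = NUM_LAYERS * sum(
    lora_params(d_in, d_out, LORA_RANK) for d_in, d_out in LLAMA3_8B_LINEARS.values()
)

if __name__ == "__main__":
    print("SigLIP-384/14 tokens:", num_patches(384, 14))    # 729
    print("CLIP-336/14 tokens:  ", num_patches(336, 14))    # 576
    print(f"projector params:     {projector_total:,}")     # 21,504,000
    print(f"LoRA r=16 params:     {lora_total:,}")          # 41,943,040
```

Under these assumptions, both trainable components together are well under 1% of the 8B base model.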

Suggested Stack

Component        Choice
Vision encoder   google/siglip-so400m-patch14-384 (best quality) or openai/clip-vit-large-patch14-336
LM               meta-llama/Meta-Llama-3-8B-Instruct or Qwen/Qwen2-7B
Stage-1 data     LLaVA-Pretrain (558k image-caption pairs)
Stage-2 data     LLaVA-1.5-Instruct (665k visual instructions)
Training         transformers + accelerate + peft (LoRA)
Serving          vLLM (multi-modal support) or custom (your Capstone-05)
API              FastAPI + OpenAI-compatible vision schema
UI               Streamlit (drag-drop image upload)
Eval             MMMU, MM-Vet, ScienceQA, TextVQA

Deliverables Checklist

  • model/vision_encoder.py — SigLIP loader with image preprocessing
  • model/projector.py — 2-layer MLP, configurable hidden dim
  • model/multimodal_llama.py — composes vision + projector + LM, handles <image> token expansion
  • data/preprocess.py — image resize/pad to 384×384, tokenization with <image> placeholder
  • train/stage1_align.py — train projector only on captioning loss
  • train/stage2_instruct.py — LoRA on LM + projector on instruction data
  • serve/api.py — OpenAI-compatible /v1/chat/completions accepting {"type":"image_url"} content parts
  • serve/ui.py — Streamlit drag-drop demo
  • eval/mmmu.py — multi-discipline multimodal eval
  • eval/mm_vet.py — open-ended VQA judged by GPT-4o
  • EVAL_REPORT.md — table vs LLaVA-1.5-7B baseline
  • MODEL_CARD.md — limitations (hallucination on unseen domains, OCR weakness, etc.)
  • Dockerfile + compose
  • Demo video (e.g., a Loom recording)
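
For serve/api.py, the core of the request handling is turning OpenAI-style content parts into a prompt string with <image> placeholders plus a list of images. A minimal sketch of that step is below. The function name `split_message_content` is hypothetical, and joining parts with newlines is just one reasonable choice. Remote URLs are returned as strings for the caller to fetch, and only base64 data URLs are decoded here.

```python
import base64
from typing import Any, List, Tuple, Union

IMAGE_PLACEHOLDER = "<image>"


def split_message_content(
    content: Union[str, List[dict]],
) -> Tuple[str, List[Any]]:
    """Convert one message's `content` into (prompt_text, images).

    `images` holds raw bytes for base64 data URLs and the URL string otherwise.
    """
    if isinstance(content, str):
        return content, []

    chunks: List[str] = []
    images: List[Any] = []
    for part in content:
        kind = part.get("type")
        if kind == "text":
            chunks.append(part["text"])
        elif kind == "image_url":
            url = part["image_url"]["url"]
            # Keep the placeholder at the position where the image appeared.
            chunks.append(IMAGE_PLACEHOLDER)
            if url.startswith("data:"):
                header, _, payload = url.partition(",")
                if not header.endswith(";base64"):
                    raise ValueError("only base64-encoded data URLs are supported")
                images.append(base64.b64decode(payload))
            else:
                images.append(url)  # remote URL; fetching is left to the caller
        else:
            raise ValueError(f"unsupported content part type: {kind!r}")
    return "\n".join(chunks), images


if __name__ == "__main__":
    fake_png = base64.b64encode(b"not-a-real-image").decode()
    content = [
        {"type": "text", "text": "What is in this picture?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{fake_png}"}},
    ]
    prompt, imgs = split_message_content(content)
    print(repr(prompt), len(imgs))
```

A real endpoint would also validate image size and format before handing the bytes to the preprocessing step.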

Resume Bullet Pattern

Built and trained a vision-language assistant from scratch (SigLIP + 2-layer projector + Llama-3-8B with LoRA) using two-stage LLaVA-style training; achieved 38% on MMMU and 51% on MM-Vet (vs LLaVA-1.5-7B at 35.4 / 30.5). Shipped vLLM-served OpenAI-compatible API with Streamlit demo. [demo + repo]


Interview Talking Points

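  • Why a projector, not cross-attention? LLaVA-1.5 showed that a simple MLP projector matches or beats Q-Former-style resamplers on most benchmarks at much lower complexity. Cross-attention (Flamingo) keeps image features out of the LM's input sequence, so it is cheaper in context length, but it adds new layers inside the LM and is harder to train.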
  • Why two stages? Stage 1 aligns the visual features to the LM's token-embedding manifold without disturbing the LM. Stage 2 teaches instruction-following with visual context without losing language ability.
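  • Why SigLIP over CLIP? The sigmoid loss scores each image-text pair independently, so it does not need a batch-wide softmax and behaves more stably across batch sizes; SigLIP-SO400M is also among the strongest openly available image encoders.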
  • Image token count tradeoff: 729 tokens (SigLIP-384/14) vs 576 (CLIP-336/14) vs higher-res with tiling (LLaVA-NeXT). More tokens → better detail, more KV cache, slower.
  • High-resolution strategies: AnyRes (LLaVA-NeXT) tiles the image into multiple encoder-resolution crops (384×384 with the SigLIP encoder used here) + a global thumbnail; Qwen2-VL uses dynamic resolution with 2D RoPE for vision.
  • Hallucination: vision-LMs hallucinate objects that aren't in the image. Mitigations: POPE-style eval, contrastive decoding (VCD), DPO with hallucinated negatives.
  • Serving complexity: image preprocessing latency (often dominates TTFT), batching variable-token-count inputs, KV cache implications of 729 prefix tokens.
  • OCR limitations: native VLMs are weak at dense text; production systems often pipeline a separate OCR engine (PaddleOCR / Azure Document Intelligence) and pass the extracted text alongside the image.
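
The KV cache cost of the image prefix mentioned above is easy to quantify. The sketch below assumes the published Llama-3-8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 16-bit cache; other models or cache dtypes change the constants but not the formula.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-8B settings; bytes_per_value=2 corresponds to fp16/bf16.
LLAMA3_8B = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}

per_token = kv_bytes_per_token(**LLAMA3_8B)  # 131072 bytes = 128 KiB

if __name__ == "__main__":
    for encoder, n_tokens in [("SigLIP-384/14", 729), ("CLIP-336/14", 576)]:
        mib = n_tokens * per_token / 2**20
        print(f"{encoder}: {n_tokens} image tokens -> {mib:.1f} MiB of KV cache per image")
```

With these settings every image costs on the order of 90 MiB of cache before any text is processed, which is why image token count directly limits batch size at serving time.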

Getting Started

  1. Verify infra: load SigLIP and Llama-3-8B separately. Confirm forward passes work and you understand the shapes.
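  2. Implement the projector + token splicing. This is the single hardest engineering bit: replace each <image> placeholder token in the input with the 729 projected vision tokens, then extend the attention mask, position ids, and labels to the new sequence length (image positions should be excluded from the loss).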
  3. Smoke-test with random vision features → confirm the LM still generates coherently (it shouldn't suddenly break).
  4. Stage 1 (small): train projector only on 50k LLaVA-Pretrain samples. Should converge in a few hours on 1× A100. Loss target: ~2.0.
  5. Sanity check: ask the model to caption an image. Should produce vaguely related text.
  6. Stage 2: add LoRA to LM (r=16, all linears), train on LLaVA-Instruct sample (50k for first run).
  7. Eval qualitatively on 20 hand-picked images. Iterate before scaling.
  8. Scale stage 1 to full 558k, stage 2 to full 665k. ~24 GPU-hours total on 4× A100.
  9. Run MMMU + MM-Vet. Document the gap to LLaVA-1.5-7B (you should be within about 5 points on each benchmark).
  10. Ship: serve via vLLM with --limit-mm-per-prompt image=1. Build the Streamlit demo. Record video.
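
The token splicing in step 2 can be prototyped on plain Python lists before touching tensors. The sketch below expands each placeholder into a fixed number of slots and builds the matching attention mask and labels. The function name and return layout are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so image positions contribute nothing to the loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def expand_image_placeholders(
    input_ids: List[int],
    labels: List[int],
    image_token_id: int,
    tokens_per_image: int = 729,
) -> Tuple[List[int], List[int], List[int], List[Tuple[int, int]]]:
    """Expand each <image> placeholder into `tokens_per_image` slots.

    Returns (new_ids, attention_mask, new_labels, image_spans), where each span
    is a half-open (start, end) range to be overwritten with projected features.
    """
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")

    new_ids: List[int] = []
    new_labels: List[int] = []
    spans: List[Tuple[int, int]] = []
    for token, label in zip(input_ids, labels):
        if token == image_token_id:
            start = len(new_ids)
            new_ids.extend([image_token_id] * tokens_per_image)
            # Never compute loss on image positions.
            new_labels.extend([IGNORE_INDEX] * tokens_per_image)
            spans.append((start, start + tokens_per_image))
        else:
            new_ids.append(token)
            new_labels.append(label)

    attention_mask = [1] * len(new_ids)  # no padding in a single unbatched example
    return new_ids, attention_mask, new_labels, spans


if __name__ == "__main__":
    # Toy example: token id 99 stands in for <image>, 3 slots instead of 729.
    ids, mask, labs, spans = expand_image_placeholders(
        [1, 99, 5, 6], [IGNORE_INDEX, IGNORE_INDEX, 5, 6], image_token_id=99, tokens_per_image=3
    )
    print(ids, mask, labs, spans)
```

In the model's forward pass, the text embeddings would be looked up for the expanded ids and the rows inside each span replaced by the projector output for the corresponding image. Batching then requires padding the sequences and zeroing the mask over the padded positions.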

Stretch Goals

  • AnyRes tiling for high-res inputs (LLaVA-NeXT approach): supports 672×672 and beyond.
  • Video understanding: extend to multi-frame inputs (sample 8 frames, pool features). Foundation for VideoLLaVA.
  • Function calling with vision: model can call OCR / object-detection tools when needed.
  • Multimodal RAG: index image+caption pairs; retrieve relevant images for a text query and feed back into the model.
  • DPO on hallucination pairs: generate (faithful, hallucinated) pairs; DPO to suppress hallucination — measurable POPE improvement.
  • Quantize and ship to MLX / llama.cpp for on-device (combine with Capstone-09).
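
For the AnyRes stretch goal, the first decision is which tile grid to use for a given image. The sketch below is a simplified grid selector in the spirit of LLaVA-NeXT, not a copy of its implementation: among candidate grids it prefers the one that preserves the most image pixels, breaking ties by least wasted canvas. The candidate grid list and function names are assumptions. The token count ignores unpadding and any row-separator tokens, so treat it as a rough estimate.

```python
from typing import Iterable, Tuple


def fit_inside(orig_w: int, orig_h: int, canvas_w: int, canvas_h: int) -> Tuple[int, int]:
    """Size after an aspect-preserving resize that fits the canvas (integer math)."""
    if orig_w * canvas_h <= orig_h * canvas_w:
        # Image is relatively taller than the canvas: height is the binding side.
        return max(1, orig_w * canvas_h // orig_h), canvas_h
    return canvas_w, max(1, orig_h * canvas_w // orig_w)


def select_grid(
    orig_w: int, orig_h: int, grids: Iterable[Tuple[int, int]], tile: int = 384
) -> Tuple[int, int]:
    """Pick the (cols, rows) grid keeping the most pixels, then wasting the least."""
    best, best_key = None, None
    for cols, rows in grids:
        canvas_w, canvas_h = cols * tile, rows * tile
        w, h = fit_inside(orig_w, orig_h, canvas_w, canvas_h)
        effective = min(w * h, orig_w * orig_h)  # upscaling adds no information
        wasted = canvas_w * canvas_h - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (cols, rows), key
    if best is None:
        raise ValueError("no candidate grids given")
    return best


def visual_token_count(grid: Tuple[int, int], tokens_per_tile: int = 729) -> int:
    """Tokens for all tiles plus one global thumbnail."""
    cols, rows = grid
    return (cols * rows + 1) * tokens_per_tile


if __name__ == "__main__":
    candidate_grids = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed candidates
    for size in [(1000, 500), (800, 800)]:
        grid = select_grid(*size, candidate_grids)
        print(size, "->", grid, visual_token_count(grid), "visual tokens")
```

A wide 1000×500 image selects the 2×1 grid (2,187 tokens) and an 800×800 image selects 2×2 (3,645 tokens), which makes the detail-versus-cost tradeoff of tiling concrete.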

What This Capstone Proves About You

You understand multimodal architectures end-to-end — not just "use a VLM API". You can train a non-trivial multi-component model (frozen + adapted modules), debug cross-modal alignment, evaluate against published benchmarks, and ship the result through a production-grade serving stack.

This is the bar for Multimodal Researcher / Engineer roles at Anthropic, OpenAI, Google DeepMind, Meta FAIR, xAI, Adept, Reka, and any startup building visual agents (robotics, autonomy, screen-understanding, design tools).