Capstone 06 — Multimodal Vision Assistant (LLaVA-style)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks

Real-world parallel: GPT-4V / GPT-4o vision, Claude 3.5 Sonnet vision, Gemini 1.5, LLaVA, Qwen2-VL, Idefics. The capstone for multimodal foundation model roles.

Goals

Build a vision-language assistant that can answer questions about images, do OCR, describe scenes, and reason multi-step over visual content. Two phases:

Build LLaVA-style architecture from scratch: SigLIP/CLIP vision encoder + projection MLP + Llama-3-8B language model.
Two-stage training:
- Stage 1 (alignment): train only the projection MLP on image-caption pairs (LAION/CC3M sample). Vision and LM stay frozen.
- Stage 2 (instruction tuning): unfreeze the LM (LoRA), fine-tune on visual instruction data (LLaVA-1.5 mix or your own).
Ship it: vLLM-compatible serving, Streamlit UI, OpenAI-compatible API with image_url support, image upload, evals.

Architecture

   Image (any size)
        │
        ▼
   ┌───────────────────────────┐
   │ SigLIP-SO400M-patch14-384 │   (frozen)
   │  → 729 patch embeddings   │
   │    each 1152-dim          │
   └─────────────┬─────────────┘
                 │
                 ▼
   ┌───────────────────────────┐
   │ Projection MLP (trained)  │
   │  Linear(1152 → 4096)      │
   │  GELU                     │
   │  Linear(4096 → 4096)      │
   │  → 729 visual tokens      │
   │    in LM embedding space  │
   └─────────────┬─────────────┘
                 │
                 ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Llama-3-8B (LM)                                            │
   │  Input sequence:                                           │
   │    [<system>]  [<image_tokens × 729>]  [<text query>]      │
   │  Output: streamed text response                            │
   │  Stage 1: LM frozen | Stage 2: LM via LoRA r=16            │
   └────────────────────────────────────────────────────────────┘

Suggested Stack

Component	Choice
Vision encoder	`google/siglip-so400m-patch14-384` (best quality) or `openai/clip-vit-large-patch14-336`
LM	`meta-llama/Meta-Llama-3-8B-Instruct` or `Qwen/Qwen2-7B`
Stage-1 data	LLaVA-Pretrain (558k image-caption pairs)
Stage-2 data	LLaVA-1.5-Instruct (665k visual instructions)
Training	`transformers` + `accelerate` + `peft` (LoRA)
Serving	vLLM (multi-modal support) or custom (your Capstone-05)
API	FastAPI + OpenAI-compatible vision schema
UI	Streamlit (drag-drop image upload)
Eval	MMMU, MM-Vet, ScienceQA, TextVQA

Deliverables Checklist

model/vision_encoder.py — SigLIP loader with image preprocessing
model/projector.py — 2-layer MLP, configurable hidden dim
model/multimodal_llama.py — composes vision + projector + LM, handles <image> token expansion
data/preprocess.py — image resize/pad to 384×384, tokenization with <image> placeholder
train/stage1_align.py — train projector only on captioning loss
train/stage2_instruct.py — LoRA on LM + projector on instruction data
serve/api.py — OpenAI-compatible /v1/chat/completions accepting {"type":"image_url"} content parts
serve/ui.py — Streamlit drag-drop demo
eval/mmmu.py — multi-discipline multimodal eval
eval/mm_vet.py — open-ended VQA judged by GPT-4o
EVAL_REPORT.md — table vs LLaVA-1.5-7B baseline
MODEL_CARD.md — limitations (hallucination on unseen domains, OCR weakness, etc.)
Dockerfile + compose
Demo video / loom

Resume Bullet Pattern

Built and trained a vision-language assistant from scratch (SigLIP + 2-layer projector + Llama-3-8B with LoRA) using two-stage LLaVA-style training; achieved 38% on MMMU and 51% on MM-Vet (vs LLaVA-1.5-7B at 35.4 / 30.5). Shipped vLLM-served OpenAI-compatible API with Streamlit demo. [demo + repo]

Interview Talking Points

Why a projector, not cross-attention? LLaVA showed simple MLP projection beats Q-Former on most benchmarks at much lower complexity. Cross-attention (Flamingo) is more parameter-efficient but harder to train.
Why two stages? Stage 1 aligns the visual features to the LM's token-embedding manifold without disturbing the LM. Stage 2 teaches instruction-following with visual context without losing language ability.
Why SigLIP over CLIP? Sigmoid loss is more stable at scale and SigLIP-SO400M is the current open SOTA for image features.
Image token count tradeoff: 729 tokens (SigLIP-384/14) vs 576 (CLIP-336/14) vs higher-res with tiling (LLaVA-NeXT). More tokens → better detail, more KV cache, slower.
High-resolution strategies: AnyRes (LLaVA-NeXT) tiles the image into multiple 384×384 crops + a global thumbnail; Qwen2-VL uses dynamic resolution with 2D RoPE for vision.
Hallucination: vision-LMs hallucinate objects that aren't in the image. Mitigations: POPE-style eval, contrastive decoding (VCD), DPO with hallucinated negatives.
Serving complexity: image preprocessing latency (often dominates TTFT), batching variable-token-count inputs, KV cache implications of 729 prefix tokens.
OCR limitations: native VLMs are weak at dense text; production systems often pipeline a separate OCR (PaddleOCR / Azure DI) and pass extracted text alongside.

Getting Started

Verify infra: load SigLIP and Llama-3-8B separately. Confirm forward passes work and you understand the shapes.
Implement the projector + token splicing. Single hardest engineering bit: replace each <image> placeholder token in the input with the 729 projected vision tokens, recompute attention masks accordingly.
Smoke-test with random vision features → confirm the LM still generates coherently (it shouldn't suddenly break).
Stage 1 (small): train projector only on 50k LLaVA-Pretrain samples. Should converge in a few hours on 1× A100. Loss target: ~2.0.
Sanity check: ask the model to caption an image. Should produce vaguely related text.
Stage 2: add LoRA to LM (r=16, all linears), train on LLaVA-Instruct sample (50k for first run).
Eval qualitatively on 20 hand-picked images. Iterate before scaling.
Scale stage 1 to full 558k, stage 2 to full 665k. ~24 GPU-hours total on 4× A100.
Run MMMU + MM-Vet. Document gap to LLaVA-1.5-7B (you should be within ±5%).
Ship: serve via vLLM with --limit-mm-per-prompt image=1. Build the Streamlit demo. Record video.

Stretch Goals

AnyRes tiling for high-res inputs (LLaVA-NeXT approach): supports 672×672 and beyond.
Video understanding: extend to multi-frame inputs (sample 8 frames, pool features). Foundation for VideoLLaVA.
Function calling with vision: model can call OCR / object-detection tools when needed.
Multimodal RAG: index image+caption pairs; retrieve relevant images for a text query and feed back into the model.
DPO on hallucination pairs: generate (faithful, hallucinated) pairs; DPO to suppress hallucination — measurable POPE improvement.
Quantize and ship to MLX / llama.cpp for on-device (combine with Capstone-09).

What This Capstone Proves About You

You understand multimodal architectures end-to-end — not just "use a VLM API". You can train a non-trivial multi-component model (frozen + adapted modules), debug cross-modal alignment, evaluate against published benchmarks, and ship the result through a production-grade serving stack.

This is the bar for Multimodal Researcher / Engineer roles at Anthropic, OpenAI, Google DeepMind, Meta FAIR, xAI, Adept, Reka, and any startup building visual agents (robotics, autonomy, screen-understanding, design tools).

LLM Inference Engineer