04 — Fine-Tuning Platform

Roles: Post-training Engineer · ML Platform · Foundation Model Engineer

1. Requirements

  • Self-serve fine-tuning for internal users + customers (BYO data)
  • Support: SFT, LoRA, QLoRA, DPO, ORPO; pluggable
  • Job sizes: 1 GPU (LoRA on 7B) → 32 GPUs (full fine-tune of 70B)
  • Eval after every job; gated promotion to serving

2. Architecture

[UI / SDK] → [Control Plane API]
                    │
            ┌───────┼───────────┐
            ▼       ▼           ▼
       [Data svc] [Job svc]  [Model registry]
                    │
                    ▼
            [Scheduler (k8s + Volcano)]
                    │
                    ▼
        [Training pods (FSDP / DeepSpeed)]
                    │
                    ▼
            [Eval pipeline → Registry → Serving]

3. Deep Dives

3.1 Data Validation

  • Schema check; PII scrub; toxicity filter (optional, configurable)
  • Train/val split (or accept user-provided)
  • Token-count estimate → cost estimate before launch

3.2 Job Templates

  • Versioned recipes (yaml + git-pinned image)
  • Each template = (base model, method, hyperparams, hardware spec)
  • Reproducibility: lockfile of every dep + commit hash

3.3 Resource Scheduling

  • Volcano queues per priority (interactive < batch < production)
  • Bin-packing on GPU memory + interconnect
  • Spot fallback with auto-checkpoint/resume

3.4 Eval Gate

  • Run a fixed eval suite (instruction-following, safety, capability)
  • Compare against base model + last accepted checkpoint
  • Auto-block promotion on regression > X%

3.5 Adapter Management

  • LoRA adapters versioned in registry (S3 + metadata)
  • Hot-swap into vLLM at serving time (no model reload)
  • A/B routing in inference gateway

4. Observability

  • Per-step: loss, grad_norm, lr, throughput
  • Per-job: eval scores (before/after), peak memory, total $$
  • Per-tenant: jobs/month, GPU-hours, success rate

5. Failure Modes

  • OOM mid-train → reduce batch_size, retry with gradient_accumulation auto-bumped
  • Diverging loss → early stop, alert
  • Eval regression → quarantine, don't promote

6. Tradeoffs

ChoiceAltWhen
Volcano + k8sSlurmVolcano for cloud-native + multi-tenant; Slurm for HPC purity
LoRA-by-defaultFull fine-tuneLoRA covers 80% of cases at 1% the cost
Sync eval gateAsync monitorSync gate when serving SLO depends on it