04 — Fine-Tuning Platform
Roles: Post-training Engineer · ML Platform · Foundation Model Engineer
1. Requirements
- Self-serve fine-tuning for internal users + customers (BYO data)
- Support: SFT, LoRA, QLoRA, DPO, ORPO; pluggable
- Job sizes: 1 GPU (LoRA on 7B) → 32 GPUs (full fine-tune of 70B)
- Eval after every job; gated promotion to serving
2. Architecture
[UI / SDK] → [Control Plane API]
                     │
         ┌───────────┼───────────┐
         ▼           ▼           ▼
    [Data svc]   [Job svc]   [Model registry]
                     │
                     ▼
        [Scheduler (k8s + Volcano)]
                     │
                     ▼
     [Training pods (FSDP / DeepSpeed)]
                     │
                     ▼
    [Eval pipeline → Registry → Serving]
3. Deep Dives
3.1 Data Validation
- Schema check; PII scrub; toxicity filter (optional, configurable)
- Train/val split (or accept user-provided)
- Token-count estimate → cost estimate before launch
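The token-count to cost-estimate step can be sketched as below. This is a minimal illustration, not the platform's actual pricing logic: `estimate_cost`, `tokens_per_gpu_second`, and `usd_per_gpu_hour` are hypothetical names, the per-example token counts are assumed to come from whatever tokenizer the base model uses, and throughput is assumed to scale linearly with GPU count (optimistic for multi-node jobs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    total_tokens: int
    wall_hours: float
    gpu_hours: float
    usd: float


def estimate_cost(
    tokens_per_example: list[int],
    epochs: int,
    num_gpus: int,
    tokens_per_gpu_second: float,  # assumed: measured per template, not a known constant
    usd_per_gpu_hour: float,       # assumed: looked up from the hardware spec
) -> CostEstimate:
    """Estimate training cost from per-example token counts."""
    if epochs < 1 or num_gpus < 1:
        raise ValueError("epochs and num_gpus must be >= 1")
    if tokens_per_gpu_second <= 0:
        raise ValueError("tokens_per_gpu_second must be positive")

    total_tokens = sum(tokens_per_example) * epochs
    # Wall-clock time under the linear-scaling assumption.
    wall_hours = total_tokens / (tokens_per_gpu_second * num_gpus) / 3600.0
    gpu_hours = wall_hours * num_gpus
    return CostEstimate(total_tokens, wall_hours, gpu_hours, gpu_hours * usd_per_gpu_hour)
```

Showing `wall_hours` next to `usd` lets a user see that adding GPUs shortens the job without (under this simplified model) changing the bill.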
3.2 Job Templates
- Versioned recipes (yaml + git-pinned image)
- Each template = (base model, method, hyperparams, hardware spec)
- Reproducibility: lockfile of every dep + commit hash
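One way to model a template as the tuple above plus its pins is sketched here. The class and field names (`JobTemplate`, `HardwareSpec`, `fingerprint`) are hypothetical; the only idea taken from the notes is that a template bundles base model, method, hyperparameters, and hardware spec with a pinned image and commit.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class HardwareSpec:
    gpu_type: str
    num_gpus: int


@dataclass(frozen=True)
class JobTemplate:
    name: str
    version: str
    base_model: str
    method: str            # e.g. "sft", "lora", "dpo"
    image: str             # container image reference, pinned
    commit: str            # commit hash of the training code
    hardware: HardwareSpec
    hyperparams: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Stable SHA-256 over every field, usable as a reproducibility key."""
        # sort_keys makes the hash independent of dict insertion order.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint with each job record makes it cheap to check whether two runs used an identical recipe.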
3.3 Resource Scheduling
- Volcano queues per priority, lowest to highest: interactive < batch < production
- Bin-packing on GPU memory + interconnect
- Spot fallback with auto-checkpoint/resume
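The queue ordering and bin-packing ideas can be illustrated with a toy placement loop. This is a simplified sketch, not Volcano's algorithm: it packs on free GPU count rather than GPU memory, treats "fits on a single node" as a crude stand-in for interconnect awareness, and all names (`Node`, `Job`, `place_job`, `schedule`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Higher number = higher priority, following interactive < batch < production.
PRIORITY = {"interactive": 0, "batch": 1, "production": 2}


@dataclass
class Node:
    name: str
    free_gpus: int


@dataclass(frozen=True)
class Job:
    job_id: str
    queue: str
    gpus: int


def place_job(nodes: list[Node], gpus_needed: int) -> Optional[str]:
    """Best-fit: choose the node with the fewest free GPUs that still fits."""
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not candidates:
        return None  # no single node fits; a real scheduler would queue or span nodes
    best = min(candidates, key=lambda n: (n.free_gpus, n.name))
    best.free_gpus -= gpus_needed
    return best.name


def schedule(jobs: list[Job], nodes: list[Node]) -> dict[str, Optional[str]]:
    """Place higher-priority queues first, larger jobs first within a queue."""
    ordered = sorted(jobs, key=lambda j: (-PRIORITY[j.queue], -j.gpus))
    return {j.job_id: place_job(nodes, j.gpus) for j in ordered}
```

Best-fit keeps large contiguous blocks of GPUs free for big jobs, which is the usual motivation for bin-packing over spreading.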
3.4 Eval Gate
- Run a fixed eval suite (instruction-following, safety, capability)
- Compare against base model + last accepted checkpoint
- Auto-block promotion on regression > X% (configurable threshold)
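The comparison step can be sketched as a pure function over score dictionaries. This assumes every metric is "higher is better" and measures regression as a relative drop in percent; the function name `eval_gate` and the reference labels are hypothetical, and the threshold stays a caller-supplied parameter.

```python
def eval_gate(
    candidate: dict[str, float],
    references: dict[str, dict[str, float]],
    max_regression_pct: float,
) -> tuple[bool, list[str]]:
    """Return (promote?, reasons) by comparing candidate scores to each reference.

    `references` maps a label such as "base" or "last_accepted" to that
    model's scores. Assumes higher scores are better.
    """
    failures: list[str] = []
    for ref_name, ref_scores in references.items():
        for metric, ref in ref_scores.items():
            if metric not in candidate:
                failures.append(f"{metric}: missing from candidate")
                continue
            if ref <= 0:
                continue  # relative drop is undefined; skipped in this sketch
            drop_pct = (ref - candidate[metric]) / ref * 100.0
            if drop_pct > max_regression_pct:
                failures.append(f"{metric}: {drop_pct:.1f}% below {ref_name}")
    return (not failures, failures)
```

Returning the reasons alongside the verdict gives the user something actionable when a job is blocked.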
3.5 Adapter Management
- LoRA adapters versioned in registry (S3 + metadata)
- Hot-swap into vLLM at serving time (base model stays loaded; no reload)
- A/B routing in inference gateway
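A/B routing in the gateway could be done with deterministic, weight-based hashing, sketched below. This is one possible scheme rather than a description of any particular gateway; `pick_adapter` and the adapter names are hypothetical, and the routing key is assumed to be something stable such as a user or session ID.

```python
import hashlib


def pick_adapter(routing_key: str, weights: dict[str, float]) -> str:
    """Map a stable key to an adapter version according to traffic weights.

    The same key always lands on the same adapter as long as the weights
    do not change, which keeps a user's experience consistent.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    names = sorted(name for name, w in weights.items() if w > 0)
    if not names:
        raise ValueError("at least one adapter needs a positive weight")

    total = sum(weights[name] for name in names)
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # First 8 bytes of the hash give a uniform point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total

    cumulative = 0.0
    for name in names:
        cumulative += weights[name]
        if point < cumulative:
            return name
    return names[-1]  # guard against floating-point rounding at the upper edge
```

For example, `pick_adapter("user-123", {"adapter-v1": 0.9, "adapter-v2": 0.1})` would send roughly one key in ten to the newer adapter.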
4. Observability
- Per-step: loss, grad_norm, lr, throughput
- Per-job: eval scores (before/after), peak memory, total cost ($)
- Per-tenant: jobs/month, GPU-hours, success rate
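The per-tenant roll-up is a plain group-by over finished job records. A minimal sketch, assuming each record already carries its tenant, month, GPU-hours, and outcome (`JobRecord` and `tenant_rollup` are hypothetical names):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    tenant: str
    month: str        # e.g. "2024-05"
    gpu_hours: float
    succeeded: bool


def tenant_rollup(records: list[JobRecord]) -> dict[tuple[str, str], dict[str, float]]:
    """Aggregate job count, GPU-hours, and success rate per (tenant, month)."""
    jobs: dict[tuple[str, str], int] = defaultdict(int)
    hours: dict[tuple[str, str], float] = defaultdict(float)
    ok: dict[tuple[str, str], int] = defaultdict(int)
    for r in records:
        key = (r.tenant, r.month)
        jobs[key] += 1
        hours[key] += r.gpu_hours
        ok[key] += int(r.succeeded)
    return {
        key: {
            "jobs": jobs[key],
            "gpu_hours": hours[key],
            "success_rate": ok[key] / jobs[key],
        }
        for key in jobs
    }
```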
5. Failure Modes
- OOM mid-train → reduce per-device batch_size, auto-bump gradient_accumulation to keep the effective batch size constant, then retry
- Diverging loss → early stop, alert
- Eval regression → quarantine, don't promote
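The first two recovery rules can be sketched as small helpers. Both are illustrative assumptions rather than the platform's actual policy: `shrink_batch` divides the per-device batch by its smallest factor and multiplies accumulation steps by the same factor so the effective batch size is unchanged, and `is_diverging` uses a simple window comparison with hypothetical defaults (`window=5`, `factor=2.0`).

```python
import math


def shrink_batch(micro_batch: int, grad_accum: int) -> tuple[int, int]:
    """Return a smaller (micro_batch, grad_accum) with the same product."""
    if micro_batch <= 1:
        raise RuntimeError("micro_batch is already 1; cannot shrink further")
    # Smallest factor >= 2; the loop always terminates at f == micro_batch.
    for f in range(2, micro_batch + 1):
        if micro_batch % f == 0:
            return micro_batch // f, grad_accum * f
    raise AssertionError("unreachable")


def is_diverging(losses: list[float], window: int = 5, factor: float = 2.0) -> bool:
    """Flag non-finite loss, or a recent mean above `factor` x the previous window's mean."""
    if any(not math.isfinite(x) for x in losses[-window:]):
        return True
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return recent > factor * previous
```

Keeping the product `micro_batch * grad_accum` fixed means the retry should follow the same optimization trajectory, just with lower peak memory and slower steps.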
6. Tradeoffs
| Choice | Alt | When |
|---|---|---|
| Volcano + k8s | Slurm | Volcano for cloud-native + multi-tenant; Slurm for HPC purity |
| LoRA-by-default | Full fine-tune | LoRA covers roughly 80% of cases at about 1% of the cost (rule of thumb) |
| Sync eval gate | Async monitor | Sync gate when serving SLO depends on it |