04 — Fine-Tuning Platform

Roles: Post-training Engineer · ML Platform · Foundation Model Engineer

1. Requirements

Self-serve fine-tuning for internal users + customers (BYO data)
Support: SFT, LoRA, QLoRA, DPO, ORPO; pluggable
Job sizes: 1 GPU (LoRA on 7B) → 32 GPUs (full fine-tune of 70B)
Eval after every job; gated promotion to serving

2. Architecture

[UI / SDK] → [Control Plane API]
                    │
            ┌───────┼───────────┐
            ▼       ▼           ▼
       [Data svc] [Job svc]  [Model registry]
                    │
                    ▼
            [Scheduler (k8s + Volcano)]
                    │
                    ▼
        [Training pods (FSDP / DeepSpeed)]
                    │
                    ▼
            [Eval pipeline → Registry → Serving]

3. Deep Dives

3.1 Data Validation

Schema check; PII scrub; toxicity filter (optional, configurable)
Train/val split (or accept user-provided)
Token-count estimate → cost estimate before launch

3.2 Job Templates

Versioned recipes (yaml + git-pinned image)
Each template = (base model, method, hyperparams, hardware spec)
Reproducibility: lockfile of every dep + commit hash

3.3 Resource Scheduling

Volcano queues per priority (interactive < batch < production)
Bin-packing on GPU memory + interconnect
Spot fallback with auto-checkpoint/resume

3.4 Eval Gate

Run a fixed eval suite (instruction-following, safety, capability)
Compare against base model + last accepted checkpoint
Auto-block promotion on regression > X%

3.5 Adapter Management

LoRA adapters versioned in registry (S3 + metadata)
Hot-swap into vLLM at serving time (no model reload)
A/B routing in inference gateway

4. Observability

Per-step: loss, grad_norm, lr, throughput
Per-job: eval scores (before/after), peak memory, total $$
Per-tenant: jobs/month, GPU-hours, success rate

5. Failure Modes

OOM mid-train → reduce batch_size, retry with gradient_accumulation auto-bumped
Diverging loss → early stop, alert
Eval regression → quarantine, don't promote

6. Tradeoffs

Choice	Alt	When
Volcano + k8s	Slurm	Volcano for cloud-native + multi-tenant; Slurm for HPC purity
LoRA-by-default	Full fine-tune	LoRA covers 80% of cases at 1% the cost
Sync eval gate	Async monitor	Sync gate when serving SLO depends on it

LLM Inference Engineer