Lab 7-01: ONNX Export & Model Optimization

Learning Goals

  • Export PyTorch models to ONNX with proper dynamic axes
  • Inspect and validate ONNX graphs
  • Profile FP32 vs FP16 inference latency
  • Understand quantization (INT8) trade-offs
  • Apply TorchScript as a lightweight alternative

Core Concepts

Why ONNX?

ONNX (Open Neural Network Exchange) is an open format that makes models portable across frameworks and runtimes:

  • Train in PyTorch → deploy in C++, TensorFlow, CoreML, or TensorRT
  • ONNX Runtime (ORT) is often faster than eager PyTorch on CPU, though the gain varies by model and should be measured
  • TensorRT converts ONNX to GPU-optimized engines

Export Mechanics

import torch

model = MyModel()
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image": {0: "batch_size"},     # batch dim is dynamic
        "logits": {0: "batch_size"},
    },
    opset_version=17,
    do_constant_folding=True,           # pre-compute constant subgraphs at export time
)

Graph Validation

import onnx
model_proto = onnx.load("model.onnx")
onnx.checker.check_model(model_proto)   # raises exception if invalid
print(onnx.helper.printable_graph(model_proto.graph))

Inference with ONNX Runtime

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Providers are tried in order: CUDA is used when available, otherwise ORT falls back to CPU
inp = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"image": inp})
print(outputs[0].shape)
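To compare FP32, FP16, and INT8 variants fairly, each one should be timed the same way. The helper below is a minimal sketch of such a harness, assuming wall-clock timing with warm-up runs is good enough for the lab; measure_latency_ms is a hypothetical name, and the workload is a stand-in so the snippet runs on its own.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, iters=100):
    """Return (median, p95) wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up: caches, allocators, lazy init
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


# Stand-in workload; replace with a real inference call.
median_ms, p95_ms = measure_latency_ms(lambda: sum(range(10_000)))
print(f"median {median_ms:.3f} ms, p95 {p95_ms:.3f} ms")
```

With the session above, the callable would be something like lambda: sess.run(None, {"image": inp}). When timing PyTorch on a GPU, call torch.cuda.synchronize() inside the callable, since CUDA kernels launch asynchronously and the timer would otherwise stop too early.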

INT8 Quantization with ONNX Runtime

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic INT8 quantization: weights are quantized ahead of time,
# activation ranges are computed at run time (no calibration data needed).
# Primarily targets CPU inference.
quantize_dynamic("model.onnx", "model_int8.onnx",
    weight_type=QuantType.QInt8)

Latency Math

$$\text{Throughput} = \frac{\text{batch size}}{\text{latency per batch}}$$

Illustrative numbers for a model with 10 ms FP32 latency at batch=1 (actual speedups depend on the model and hardware):

  • FP32: 10 ms → 100 fps
  • FP16: ~5 ms → 200 fps
  • INT8: ~3 ms → 333 fps
  • TensorRT FP16: ~2 ms → 500 fps
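The throughput figures above follow directly from the formula. A small sketch that reproduces them (throughput_per_s is a hypothetical helper name, and the batched case at the end uses made-up numbers):

```python
def throughput_per_s(latency_ms, batch_size=1):
    """Items processed per second when one batch takes latency_ms milliseconds."""
    return batch_size * 1000.0 / latency_ms


for name, latency_ms in [("FP32", 10.0), ("FP16", 5.0), ("INT8", 3.0), ("TensorRT FP16", 2.0)]:
    print(f"{name}: {throughput_per_s(latency_ms):.0f} fps")

# Batching trades latency for throughput: a hypothetical batch of 8 that
# takes 40 ms yields 200 items/s, although each request now waits 40 ms.
print(throughput_per_s(40.0, batch_size=8))
```

Latency and throughput are separate targets. A larger batch usually raises throughput, but every item in the batch pays the full batch latency.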

Interview Questions

Q: What does do_constant_folding=True do?
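A: It pre-computes, at export time, any operations whose inputs are all constants (e.g., arithmetic on fixed weights or shape computations over static dimensions) and stores the results in the model, so those nodes are removed from the inference graph.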

Q: What's a dynamic axis? When do you need one?
A: A tensor dimension that isn't fixed at export time. Batch size is almost always dynamic. If your model handles variable-length sequences or variable image sizes, those dims must also be dynamic.
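One way to picture a dynamic axis: the declared shape carries a name instead of a number in that position, and any size is accepted there. The sketch below mimics that check in plain Python. shape_matches is a hypothetical helper, and the declared shape is written in the form ONNX Runtime reports through sess.get_inputs()[0].shape (ints for fixed dims, strings or None for dynamic ones).

```python
def shape_matches(declared, actual):
    """True if a concrete shape fits a declared shape with dynamic dims."""
    if len(declared) != len(actual):
        return False
    return all(
        not isinstance(d, int) or d == a   # non-int entries are dynamic: any size fits
        for d, a in zip(declared, actual)
    )


declared = ["batch_size", 3, 224, 224]            # dynamic batch axis only
print(shape_matches(declared, (1, 3, 224, 224)))   # True
print(shape_matches(declared, (32, 3, 224, 224)))  # True
print(shape_matches(declared, (1, 3, 256, 256)))   # False: H and W are fixed
```

The last case shows why variable image sizes need their own dynamic axes: with only the batch dimension marked dynamic, a 256×256 input does not fit the declared shape.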

Q: What's the difference between FP16 and INT8?
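A: FP16 is 16-bit floating point (smallest normal value ~6×10⁻⁵, largest 65504). INT8 is 8-bit integer (-128 to 127), paired with a scale (and optionally a zero point) that maps integers back to real values. Converting to FP16 is usually near-lossless, though values outside its range can overflow or underflow. Static INT8 quantization needs calibration data to estimate activation ranges; dynamic quantization computes them at run time. On hardware with INT8 support, INT8 can be up to ~2× faster than FP16, but accuracy may drop by more than 1% without careful calibration or quantization-aware training (QAT).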
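The precision gap between the two formats can be seen with a small round-trip experiment. This is a sketch under stated assumptions: the weights are made-up values, the INT8 scheme is simple symmetric per-tensor quantization clipped to [-127, 127], and to_fp16 and int8_roundtrip are hypothetical helpers built on the standard library only.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def int8_roundtrip(values):
    """Symmetric per-tensor INT8: quantize to [-127, 127], then dequantize."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


weights = [0.8, -0.31, 0.004, 0.12, -0.0007]      # made-up example values
fp16 = [to_fp16(w) for w in weights]
int8, scale = int8_roundtrip(weights)

fp16_err = max(abs(a - b) for a, b in zip(weights, fp16))
int8_err = max(abs(a - b) for a, b in zip(weights, int8))
print(f"max abs error  FP16: {fp16_err:.2e}  INT8: {int8_err:.2e}  (scale={scale:.2e})")
```

Because one scale covers the whole tensor, the largest weight sets the step size and small weights absorb most of the rounding error. This is one reason per-channel scales and calibration matter for INT8.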

Q: When would you use TorchScript instead of ONNX?
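A: TorchScript is a better fit when you are staying within the PyTorch ecosystem (for example, Python-free C++ deployment through LibTorch), or when your model has data-dependent Python control flow (if/else, loops) that a traced ONNX export cannot capture well. Within NVIDIA's ecosystem, ONNX is usually the more direct path, since TensorRT consumes ONNX models.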
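The control-flow point is easiest to see by comparing scripting with tracing on a toy module. The Gate class below is a hypothetical example, not a model from this lab. Scripting compiles the Python source and keeps both branches, while tracing records only the operations executed for the example input.

```python
import torch


class Gate(torch.nn.Module):
    """Toy module whose branch depends on the input values."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:              # data-dependent branch
            return x * 2
        return -x


model = Gate().eval()
pos = torch.ones(3)
neg = -torch.ones(3)

# Scripting preserves the if/else, so the negative input takes the else branch.
scripted = torch.jit.script(model)

# Tracing with a positive example bakes in the "x * 2" path only
# (PyTorch emits a TracerWarning about the data-dependent condition).
traced = torch.jit.trace(model, pos)

print(scripted(neg))   # else branch: -x
print(traced(neg))     # baked-in branch: x * 2
```

A trace-based ONNX export can run into the same limitation: only the path taken by the dummy input is recorded, so models with data-dependent branches need scripting or an explicit rewrite before export.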