Lab 7-01: ONNX Export & Model Optimization
Learning Goals
- Export PyTorch models to ONNX with proper dynamic axes
- Inspect and validate ONNX graphs
- Profile FP32 vs FP16 inference latency
- Understand quantization (INT8) trade-offs
- Apply TorchScript as a lightweight alternative
Core Concepts
Why ONNX?
ONNX (Open Neural Network Exchange) is an open format that makes models portable across frameworks and runtimes:
- Train in PyTorch → deploy in C++, TensorFlow, CoreML, or TensorRT
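- ONNX Runtime (ORT) is often faster than eager-mode PyTorch on CPU, largely because of graph-level optimizations such as operator fusion; benchmark your own model to confirm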
- TensorRT converts ONNX to GPU-optimized engines
Export Mechanics
import torch

# MyModel is a placeholder for your own torch.nn.Module subclass.
model = MyModel()
model.eval()  # inference mode: disables dropout, uses BatchNorm running stats
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image": {0: "batch_size"},   # batch dim is dynamic
        "logits": {0: "batch_size"},
    },
    opset_version=17,
    do_constant_folding=True,  # pre-compute constant-only subgraphs at export time
)
Graph Validation
import onnx
model_proto = onnx.load("model.onnx")
onnx.checker.check_model(model_proto) # raises exception if invalid
print(onnx.helper.printable_graph(model_proto.graph))
Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession(
    "model.onnx",
    # Providers are tried in priority order: CUDA if this ORT build supports it,
    # otherwise the session falls back to the CPU provider.
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
inp = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"image": inp})
print(outputs[0].shape)
INT8 Quantization with ONNX Runtime
from onnxruntime.quantization import quantize_dynamic, QuantType
# Dynamic INT8 quantization: weights are quantized ahead of time and activation
# ranges are computed at run time, so no calibration data is needed.
# It primarily targets CPU inference.
quantize_dynamic("model.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)
Latency Math
$$\text{Throughput} = \frac{1}{\text{latency\_per\_batch}} \times \text{batch\_size}$$
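For a model with 10 ms FP32 latency at batch=1 (the reduced-precision latencies are illustrative, hardware-dependent figures, not measurements):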
- FP32: 10ms → 100 fps
- FP16: ~5ms → 200 fps
- INT8: ~3ms → 333 fps
- TensorRT FP16: ~2ms → 500 fps
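The arithmetic above, and the measurement that feeds it, can be sketched as follows. Both function names are hypothetical helpers introduced for illustration, and the timing harness assumes the callable is synchronous (true for `sess.run` in ONNX Runtime; GPU code in PyTorch would additionally need `torch.cuda.synchronize()` inside the callable).

```python
import statistics
import time
from typing import Callable


def throughput_fps(latency_ms: float, batch_size: int = 1) -> float:
    """Samples per second, given per-batch latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size * 1000.0 / latency_ms


def median_latency_ms(fn: Callable[[], object],
                      warmup: int = 10, iters: int = 50) -> float:
    """Median wall-clock latency of fn() in milliseconds.

    Warm-up calls are discarded so one-time costs (allocation, caching,
    kernel selection) do not skew the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Reproduce the FP32 row of the list above: 10 ms at batch=1.
print(throughput_fps(10.0))
```

To profile the session from the inference section, something like `median_latency_ms(lambda: sess.run(None, {"image": inp}))` gives a per-batch latency you can pass to `throughput_fps`. The median is used here rather than the mean because a few slow outlier runs are common.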
Interview Questions
Q: What does do_constant_folding=True do?
A: Pre-computes, at export time, any subgraph whose inputs are all constants (e.g., arithmetic on weights, or shape calculations when shapes are static) and stores the result as an initializer, removing those ops from the inference graph.
Q: What's a dynamic axis? When do you need one?
A: A tensor dimension that isn't fixed at export time. Batch size is almost always dynamic. If your model handles variable-length sequences or variable image sizes, those dims must also be dynamic.
Q: What's the difference between FP16 and INT8?
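A: FP16 is 16-bit floating point (largest finite value 65504, smallest positive normal value ~6×10⁻⁵). INT8 is an 8-bit integer (-128 to 127) mapped to real values through a scale and zero point. Static INT8 quantization needs calibration data to estimate activation ranges; dynamic quantization computes them at run time. FP16 is not lossless, but the accuracy drop is negligible for most models. INT8 can be faster still on hardware with INT8 support (the ~2× over FP16 figure is a rule of thumb, not a guarantee) and may cost more than 1% accuracy without careful calibration or quantization-aware training (QAT).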
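To make the two number formats concrete, here is a small NumPy sketch. It reads the FP16 limits from `np.finfo` and round-trips a few values through symmetric INT8 quantization. The helper names and the sample array are made up for illustration, and taking the scale from the observed maximum is the simplest possible "calibration", not what any particular toolkit does.

```python
import numpy as np

# FP16 limits as reported by NumPy.
fp16 = np.finfo(np.float16)
fp16_max = float(fp16.max)          # largest finite value
fp16_min_normal = float(fp16.tiny)  # smallest positive normal value


def quantize_int8(x, scale, zero_point=0):
    """Affine quantization: real value -> int8 code."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)


def dequantize_int8(q, scale, zero_point=0):
    """Inverse mapping: int8 code -> approximate real value."""
    return (q.astype(np.float32) - zero_point) * scale


# Hypothetical activations; the scale comes from their observed range.
x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
scale = float(np.abs(x).max()) / 127.0  # symmetric quantization, zero_point = 0
q = quantize_int8(x, scale)
x_hat = dequantize_int8(q, scale)
max_err = float(np.abs(x - x_hat).max())

print(fp16_max, fp16_min_normal)
print(q, max_err)
```

The round-trip error is bounded by about half the scale, which is why the choice of range matters: a single outlier stretches the scale and costs precision for every other value.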
Q: When would you use TorchScript instead of ONNX?
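A: TorchScript is a good fit when you stay within the PyTorch ecosystem (e.g., LibTorch or TorchServe), when you need Python-free C++ deployment without converting to another format, or when your model has data-dependent Python control flow (if/else, loops) that a traced ONNX export would not capture. For NVIDIA's TensorRT stack, ONNX is the usual interchange format.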
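As a rough illustration of the control-flow point, the sketch below scripts a toy module whose output depends on the sign of its input. `SignGate` is a made-up example module, not something from this lab's codebase.

```python
import torch


class SignGate(torch.nn.Module):
    """Toy module with a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A tracing-based export would record only the branch taken
        # for the example input; scripting keeps both branches.
        if x.sum() > 0:
            return x * 2.0
        return -x


scripted = torch.jit.script(SignGate())

pos = scripted(torch.ones(3))            # takes the first branch
neg = scripted(torch.full((3,), -1.0))   # takes the second branch
print(pos, neg)
```

The scripted module can then be written to disk with `scripted.save(...)` and loaded from C++ via LibTorch, which is what makes it usable without a Python interpreter.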