Lab 4-03: TFLite Conversion & Edge Deployment
Learning Objectives
- Convert a Keras model to TFLite FlatBuffer format
- Apply INT8 post-training quantization with a representative dataset
- Benchmark FP32 vs FP16 vs INT8 TFLite latency
- Understand what happens during quantization (weight + activation quantization)
- Compare TFLite vs ONNX Runtime for mobile deployment
- Save benchmark results as pandas CSV
TFLite Conversion Pipeline
Keras Model (.keras)
│
▼ tf.lite.TFLiteConverter.from_keras_model()
TFLiteConverter
│
├── FP32 (no quantization) → baseline accuracy, full size
├── FP16 (float16) → ~2x smaller, ~5% faster on GPU/DSP
└── INT8 (with representative dataset calibration) → ~4x smaller, ~2-4x faster, minimal accuracy drop if calibrated well
│
▼ converter.convert()
.tflite FlatBuffer
│
▼ tf.lite.Interpreter
On-Device Inference
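The branches of the pipeline that need no calibration data can be sketched as follows. This is a minimal sketch, not the lab's reference solution: `convert_variants` and `size_report` are hypothetical helper names, and `model` is assumed to be an already-loaded Keras model. TensorFlow is imported inside the function so the size helper stays usable without it. Full INT8 is left out because it needs the representative dataset.

```python
def convert_variants(model):
    """Convert a Keras model to FP32, FP16 and dynamic-range TFLite FlatBuffers.

    `model` is assumed to be a loaded tf.keras.Model. Returns a dict that
    maps a variant name to the serialized model bytes.
    """
    import tensorflow as tf  # lazy import: only needed for conversion

    variants = {}

    # FP32: no optimizations, baseline size and accuracy.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    variants["fp32"] = converter.convert()

    # FP16: weights are stored as float16.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    variants["fp16"] = converter.convert()

    # Dynamic range: int8 weights, no representative dataset required.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    variants["dynamic_int8"] = converter.convert()

    return variants


def size_report(variants, baseline="fp32"):
    """Return {name: (size_in_bytes, reduction_vs_baseline)} for serialized models."""
    base = len(variants[baseline])
    return {name: (len(blob), base / len(blob)) for name, blob in variants.items()}
```

Each value in the returned dict is a bytes object that can be written to a `.tflite` file with `open(path, "wb")`; `size_report` then gives a quick check of the "~2x" and "~4x" figures on your own model.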
Quantization Types
| Type | Weights | Activations | Calibration Needed | Size Reduction |
|---|---|---|---|---|
| FP32 | float32 | float32 | No | 1x (baseline) |
| FP16 | float16 | float32 | No | ~2x |
| Dynamic INT8 | int8 | float32 | No | ~4x |
| Full INT8 | int8 | int8 | Yes (representative dataset) | ~4x (plus faster inference on INT8 hardware) |
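Speedups depend heavily on the target hardware, so the variants in the table are worth timing on the device you care about. The harness below is one possible sketch: `benchmark` and `make_tflite_invoke` are hypothetical names, the warmup and run counts are arbitrary defaults, and the interpreter wrapper assumes a single-input model whose `sample` already has the right shape and dtype.

```python
import statistics
import time


def benchmark(invoke, warmup=5, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        invoke()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
    }


def make_tflite_invoke(model_path, sample):
    """Build a callable that runs one inference on a .tflite file.

    Assumes a single-input model and that `sample` already matches the
    input tensor's shape and dtype.
    """
    import tensorflow as tf  # lazy import: only needed for real models

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]["index"]

    def invoke():
        interpreter.set_tensor(input_index, sample)
        interpreter.invoke()

    return invoke
```

Collecting one row per variant, for example `{"variant": "int8", **benchmark(invoke)}`, gives a list of dicts that `pandas.DataFrame(rows).to_csv("benchmark.csv", index=False)` can write out. The file name here is only an example.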
Representative Dataset (required for full INT8)
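import numpy as np
import tensorflow as tf

# Assumes `model` (a Keras model) and `val_dataset` (a tf.data.Dataset of (images, labels) batches) already exist.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # required to enable quantization

def representative_dataset():
    # Yield ~100 batches so the converter can calibrate activation ranges.
    for batch in val_dataset.take(100):
        imgs = batch[0]
        yield [imgs.numpy().astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # model accepts uint8 tensors
converter.inference_output_type = tf.uint8  # model returns uint8 tensors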
Interview Questions
Q: What is the difference between dynamic range quantization and full INT8 quantization?
A: Dynamic range quantization converts only the weights to INT8 at conversion time. Activations stay in float between ops and are quantized on the fly at runtime for kernels that support it, with outputs still stored as float. Full INT8 quantizes both weights and activations to INT8 ahead of time, which requires a calibration (representative) dataset to estimate activation ranges. Full INT8 is faster on hardware INT8 accelerators (Edge TPU, DSP) and is required by integer-only accelerators such as the Edge TPU, at the cost of the calibration step.
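To make "computing activation ranges" concrete, here is a simplified per-tensor sketch of the affine mapping `real = scale * (q - zero_point)` used for 8-bit quantization. The function names are hypothetical, and the converter's real calibration (aggregating ranges over many batches, per-channel weight scales) is more involved than this.

```python
def affine_params(rmin, rmax, qmin=-128, qmax=127):
    """Compute (scale, zero_point) mapping the float range [rmin, rmax] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor: any positive scale works
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest representable integer, clipping to [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value of a quantized integer."""
    return (q - zero_point) * scale


# Example: an activation observed in [0, 6] during calibration (e.g. after ReLU6).
scale, zero_point = affine_params(0.0, 6.0)
q = quantize(3.0, scale, zero_point)
approx = dequantize(q, scale, zero_point)  # within scale / 2 of 3.0
```

Values outside the calibrated range are clipped, which is why a representative dataset that misses the real input distribution tends to hurt accuracy.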
Q: What is the FlatBuffer format and why does TFLite use it?
A: FlatBuffers is a zero-copy serialization format: fields are read directly from the serialized buffer, with no parsing or unpacking step. TFLite uses it because an edge device can memory-map the model file and start inference without first copying the whole model into heap memory, which is critical on low-memory devices.
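The memory-mapping idea can be illustrated with the standard library alone. The sketch below maps a file and reads the 4-byte FlatBuffers file identifier, which sits at byte offsets 4 to 8 and is expected to be `TFL3` for TFLite models. `read_flatbuffer_identifier` is a hypothetical helper, and this only peeks at the header; it does not parse the model.

```python
import mmap


def read_flatbuffer_identifier(path):
    """Memory-map a file and return its 4-byte FlatBuffers file identifier.

    FlatBuffers stores the optional identifier at byte offsets 4..8;
    for TFLite models it is expected to be b"TFL3".
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[4:8])
```

Only the pages that are actually touched get read from storage, which is the property that makes the format attractive for large models on small devices.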
Q: When would you choose TFLite over ONNX Runtime for deployment?
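A: Choose TFLite for Android/iOS apps, the Coral Edge TPU, and Raspberry Pi, or when you want tight integration with the TensorFlow ecosystem. Choose ONNX Runtime (ORT) for Windows/Linux servers, for models from diverse sources (PyTorch, scikit-learn), and when you need its wider range of execution providers (CUDA, TensorRT, DirectML). As a rule of thumb: mobile and embedded targets favor TFLite, server targets favor ORT.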