Lab 4-03: TFLite Conversion & Edge Deployment

Learning Objectives

  • Convert a Keras model to TFLite FlatBuffer format
  • Apply INT8 post-training quantization with a representative dataset
  • Benchmark FP32 vs FP16 vs INT8 TFLite latency
  • Understand what happens during quantization (weight + activation quantization)
  • Compare TFLite vs ONNX Runtime for mobile deployment
  • Save benchmark results as pandas CSV

TFLite Conversion Pipeline

Keras Model (.keras)
       │
       ▼  tf.lite.TFLiteConverter.from_keras_model()
  TFLiteConverter
       │
       ├── FP32 (no quantization)     → baseline accuracy, full size
       ├── FP16 (float16)             → ~2x smaller, faster mainly on GPU delegates
       └── INT8 (with representative  → ~4x smaller, ~2-4x faster, minimal
             dataset calibration)       accuracy drop if calibrated well
       │
       ▼  converter.convert()
  .tflite FlatBuffer
       │
       ▼  tf.lite.Interpreter
  On-Device Inference
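
The branches in the diagram that need no calibration data can be sketched as follows. This is a minimal illustration, not the lab's reference solution: `convert_variants` is a hypothetical helper name, and the dynamic-range branch is the "Dynamic INT8" row from the Quantization Types table. The full INT8 branch is left out because it needs the representative dataset covered later.

```python
import tensorflow as tf


def convert_variants(model):
    """Return {name: tflite_bytes} for the conversions that need no calibration."""
    variants = {}

    # FP32 baseline: plain conversion, no optimizations enabled.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    variants["fp32"] = converter.convert()

    # FP16: weights are stored as float16.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    variants["fp16"] = converter.convert()

    # Dynamic-range INT8: Optimize.DEFAULT without a representative dataset.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    variants["dynamic_int8"] = converter.convert()

    return variants
```

Each value is the serialized FlatBuffer as `bytes`, so writing it to disk is a plain `open(path, "wb").write(blob)`. Comparing `len(blob)` across variants gives a first size check, although very small models may not show the ~2x and ~4x ratios.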

Quantization Types

Type          Weights   Activations   Calibration Needed   Size Reduction
FP32          float32   float32       No                   1x (baseline)
FP16          float16   float32       No                   ~2x
Dynamic INT8  int8      float32       No                   ~4x
Full INT8     int8      int8          Yes (rep. dataset)   ~4x + faster
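
What "int8" means in the table can be illustrated with the affine mapping real ≈ scale × (q − zero_point). The sketch below is plain NumPy under simplifying assumptions (one scale and zero point per tensor, range taken from the data's min and max). It is not TFLite's internal implementation, and the function names are hypothetical.

```python
import numpy as np


def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero point that map [x_min, x_max] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # constant all-zero tensor; any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float array -> int8 array."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)


def dequantize(q, scale, zero_point):
    """int8 array -> approximate float32 array."""
    return scale * (q.astype(np.float32) - zero_point)
```

Storing one byte per value instead of four is where the ~4x size reduction comes from. The round trip `dequantize(quantize(x, s, z), s, z)` differs from `x` by at most about one quantization step (`scale`) for values inside the calibrated range, which is why a well-chosen range matters.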

Representative Dataset (required for full INT8)

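import numpy as np
import tensorflow as tf

# Assumes `model` (a trained Keras model) and `val_dataset` (a tf.data.Dataset
# whose elements are (images, labels) batches) are already defined.
def representative_dataset():
    # Calibrate activation ranges on up to 100 validation batches.
    for batch in val_dataset.take(100):
        imgs = batch[0]
        yield [imgs.numpy().astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # required to enable quantization
converter.representative_dataset = representative_dataset
# Restrict to INT8 builtin ops so conversion fails if an op cannot be quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # model input tensor becomes uint8
converter.inference_output_type = tf.uint8  # model output tensor becomes uint8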
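
Because the input and output types are set to uint8, the caller has to quantize inputs and dequantize outputs using the parameters stored in the model. A minimal sketch, assuming a single-input, single-output model; the helper names are hypothetical, while the dictionary keys (`"quantization"`, `"dtype"`, `"index"`) are those returned by `tf.lite.Interpreter.get_input_details()` and `get_output_details()`.

```python
import numpy as np


def to_quantized_input(x, input_detail):
    """Map float data to the dtype the model expects."""
    scale, zero_point = input_detail["quantization"]
    dtype = input_detail["dtype"]
    if scale == 0:  # float model: no quantization parameters stored
        return x.astype(dtype)
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)


def from_quantized_output(y, output_detail):
    """Map the raw model output back to float32."""
    scale, zero_point = output_detail["quantization"]
    if scale == 0:
        return y.astype(np.float32)
    return scale * (y.astype(np.float32) - zero_point)


def run_tflite(interpreter, x):
    """Run one input through an interpreter whose tensors are already allocated."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], to_quantized_input(x, inp))
    interpreter.invoke()
    return from_quantized_output(interpreter.get_tensor(out["index"]), out)
```

With a real model, the interpreter would be created with `tf.lite.Interpreter(model_path=...)` followed by `interpreter.allocate_tensors()`, and `x` must match the input shape reported in the input details.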

Interview Questions

Q: What is the difference between dynamic range quantization and full INT8 quantization?
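A: Dynamic range quantization converts only the weights to INT8 at conversion time. Activations stay float32 in the model; at runtime, kernels that support it quantize them to 8 bits on the fly from the observed range, and the remaining ops dequantize the weights and compute in float. Full INT8 quantizes both weights and activations ahead of time, which requires a representative (calibration) dataset to estimate each activation tensor's range. Full INT8 is what integer-only accelerators (Edge TPU, many DSPs) need and is usually faster there, at the cost of the calibration step.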

Q: What is the FlatBuffer format and why does TFLite use it?
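A: FlatBuffers is a serialization format that can be read in place: fields are accessed directly from the byte buffer, with no parsing or unpacking step. TFLite uses it because the runtime can memory-map the model file and start inference without first copying and deserializing it into separate in-memory structures; the OS pages the weights in on demand. That keeps load time and peak memory low, which is critical on low-memory devices.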
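
The "read in place" idea can be illustrated from Python with the standard library. In practice the memory mapping is done by the TFLite runtime itself; this sketch only shows that header fields can be read straight from a mapped file. It relies on two facts about the layout: the first four bytes of a FlatBuffer hold a little-endian offset to the root table, and the next four hold an optional file identifier, which is "TFL3" for TFLite models. The function name is hypothetical.

```python
import mmap
import struct


def read_flatbuffer_header(path):
    """Return (root_table_offset, file_identifier), read from a memory-mapped file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # Bytes 0-3: little-endian uint32 offset to the root table.
        (root_offset,) = struct.unpack_from("<I", buf, 0)
        # Bytes 4-7: optional 4-character file identifier.
        identifier = bytes(buf[4:8])
    return root_offset, identifier
```

Only the pages actually touched are read from disk, so checking the identifier of a large model costs about the same as checking a small one.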

Q: When would you choose TFLite over ONNX Runtime for deployment?
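A: Choose TFLite for Android/iOS apps, Coral Edge TPU, and Raspberry Pi-class boards, or when the model already lives in the TensorFlow ecosystem. Choose ONNX Runtime (ORT) for Windows/Linux servers, for models from diverse sources (PyTorch, scikit-learn), or when you need its wider set of execution providers (CUDA, TensorRT, DirectML). Rule of thumb: TFLite for mobile and embedded targets, ORT for servers.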
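
A rule of thumb is best checked by measuring on the target device. The sketch below covers the latency benchmark and CSV export named in the learning objectives, assuming each candidate (an FP32, FP16, or INT8 TFLite interpreter, or an ONNX Runtime session) has been wrapped in a zero-argument callable that runs one inference. The function names, column names, and default run counts are illustrative choices, not part of the lab.

```python
import statistics
import time

import pandas as pd


def benchmark(fn, warmup=5, runs=50):
    """Median and 95th-percentile latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are excluded from the timings
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }


def benchmark_to_csv(candidates, csv_path, **kwargs):
    """candidates: {name: zero-arg callable}. Writes one CSV row per candidate."""
    rows = [{"model": name, **benchmark(fn, **kwargs)} for name, fn in candidates.items()]
    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    return df
```

Reporting the median rather than the mean keeps a single slow run, for example one interrupted by the OS scheduler, from skewing the comparison.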