Lab 4-02: tf.data Input Pipelines

Learning Objectives

  • Build high-performance tf.data.Dataset pipelines
  • Use .map(), .cache(), .shuffle(), .batch(), .prefetch()
  • Apply image augmentation within tf.data (TensorFlow native ops)
  • Tune pipeline parallelism and prefetch depth with tf.data.AUTOTUNE (formerly tf.data.experimental.AUTOTUNE)
  • Understand why tf.data pipelines are faster than Python DataLoaders for TPU

Pipeline Building Blocks

Raw files / numpy arrays
    │
    ▼ tf.data.Dataset.from_tensor_slices() / .list_files()
    │
    ▼ .map(parse_fn, num_parallel_calls=AUTOTUNE)   ← decode, resize, normalize
    │
    ▼ .cache()   ← cache after expensive decode (if fits in RAM)
    │
    ▼ .shuffle(buffer_size)   ← randomize order
    │
    ▼ .batch(batch_size, drop_remainder=True)
    │
    ▼ .map(augment_fn)   ← augmentation AFTER batch for efficiency
    │
    ▼ .prefetch(AUTOTUNE)   ← overlap CPU preprocessing with GPU training

Key Rules

Rule                                     Why
.cache() before .shuffle()               Shuffle runs on already-decoded data
.prefetch(AUTOTUNE) last                 Overlaps CPU preprocessing with accelerator compute
num_parallel_calls=AUTOTUNE in .map()    Parallelizes decoding automatically
Augmentation after .batch()              Ops run once per batch (vectorized) instead of once per example
drop_remainder=True                      Fixed batch sizes needed for TPU XLA compilation
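
The drop_remainder rule can be checked from the dataset's static shape. The following is a minimal sketch, assuming TensorFlow 2.x; the toy range dataset and the variable names are illustrative only.

```python
import tensorflow as tf

# 10 elements with batch size 4: the final batch would hold only 2 elements.
ragged = tf.data.Dataset.range(10).batch(4)
fixed = tf.data.Dataset.range(10).batch(4, drop_remainder=True)

# Without drop_remainder the batch dimension is statically unknown (None),
# because the last batch may be smaller than the others.
ragged_shape = ragged.element_spec.shape.as_list()
# With drop_remainder=True every batch has exactly 4 elements.
fixed_shape = fixed.element_spec.shape.as_list()

# Actual batch sizes seen when iterating.
ragged_sizes = [int(batch.shape[0]) for batch in ragged]
fixed_sizes = [int(batch.shape[0]) for batch in fixed]

print("ragged:", ragged_shape, ragged_sizes)
print("fixed: ", fixed_shape, fixed_sizes)
```

A fully static batch dimension is what lets XLA compile a single program for every step; the cost is that the leftover examples (2 here) are skipped in each pass.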

AUTOTUNE

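import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let TF choose parallelism based on hardware

# images, labels, preprocess, and augment are assumed to be defined elsewhere.
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
      .map(preprocess, num_parallel_calls=AUTOTUNE)  # deterministic per-example work
      .cache()                                       # keep decoded examples in memory
      .shuffle(1000)                                 # new order every epoch
      .batch(32, drop_remainder=True)                # fixed batch size (see Key Rules)
      .map(augment, num_parallel_calls=AUTOTUNE)     # random per-batch work, after cache
      .prefetch(AUTOTUNE)                            # overlap input with training
)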
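
The pipeline above leaves images, labels, preprocess, and augment undefined. Here is a self-contained sketch of how they might look, assuming TensorFlow 2.x and NumPy. The synthetic data, the 32x32 target size, and the specific augmentations are assumptions chosen for illustration, not part of the lab.

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-in data (assumption): 100 RGB images of 28x28, 10 classes.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 28, 28, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(100,), dtype=np.int64)

def preprocess(image, label):
    # Deterministic per-example work: resize, then scale to [0, 1].
    image = tf.image.resize(image, [32, 32]) / 255.0
    return image, label

def augment(batch_images, batch_labels):
    # Random work applied to a whole batch; runs after cache() so the
    # random results are not frozen into the cache.
    batch_images = tf.image.random_flip_left_right(batch_images)
    batch_images = tf.image.random_brightness(batch_images, max_delta=0.1)
    batch_images = tf.clip_by_value(batch_images, 0.0, 1.0)
    return batch_images, batch_labels

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
      .map(preprocess, num_parallel_calls=AUTOTUNE)
      .cache()
      .shuffle(100)
      .batch(32, drop_remainder=True)
      .map(augment, num_parallel_calls=AUTOTUNE)
      .prefetch(AUTOTUNE)
)

batches = list(dataset)
first_images, first_labels = batches[0]
print(len(batches), first_images.shape, first_labels.shape)
```

With 100 examples and a batch size of 32, drop_remainder=True yields three full batches and discards the last four examples of each pass.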

Interview Questions

Q: What is the difference between .cache() and .prefetch()?
A: .cache() stores dataset elements in memory (or on disk, if given a filename) as they are produced during the first full pass, so later epochs skip re-decoding and re-preprocessing. .prefetch() runs the data pipeline in the background while the model trains, so the next batch is ready when the current step finishes and the accelerator does not stall between batches. Use both: cache first, prefetch last.
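
The effect of .cache() can be observed by counting how often the map function actually runs. This is a sketch assuming TensorFlow 2.x; the counter, the _expensive and slow_map helpers, and run_epochs are hypothetical names, and tf.py_function is used only so that a Python-side counter sees every call.

```python
import tensorflow as tf

calls = {"n": 0}

def _expensive(x):
    # Stand-in for costly decoding; counts every real invocation.
    calls["n"] += 1
    return x * 2

def slow_map(x):
    # py_function runs the Python code per element, so the counter is accurate.
    y = tf.py_function(_expensive, [x], tf.int64)
    y.set_shape(())
    return y

def run_epochs(ds, epochs=3):
    # Iterate the dataset fully `epochs` times and report map invocations.
    calls["n"] = 0
    for _ in range(epochs):
        for _ in ds:
            pass
    return calls["n"]

base = tf.data.Dataset.range(5).map(slow_map)

uncached_calls = run_epochs(base)
cached_calls = run_epochs(base.cache().prefetch(tf.data.AUTOTUNE))

print("without cache:", uncached_calls)
print("with cache:   ", cached_calls)
```

Without the cache the map function runs for every element in every epoch; with it, only the first pass pays the cost. The cache is only complete once the dataset has been iterated to the end.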

Q: Why must .shuffle() come after .cache() but before .batch()?
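A: After .cache(): shuffle operates on already-decoded examples (fast), and each epoch gets a new order. If shuffle came before cache, the order from the first epoch would be stored in the cache and replayed every epoch. Before .batch(): ensures batches contain mixed examples. If you shuffle after batch, you shuffle whole batches, not individual examples (much weaker randomization).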
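
The difference between the two orderings is easy to see on a tiny dataset. A minimal sketch, assuming TensorFlow 2.x; batches_of is a hypothetical helper and the seed is arbitrary.

```python
import tensorflow as tf

def batches_of(ds):
    # Materialize each batch as a tuple of Python ints for easy inspection.
    return [tuple(batch.numpy().tolist()) for batch in ds]

source = tf.data.Dataset.range(12)

# Shuffle AFTER batch: only the order of whole batches changes;
# each batch still holds the same consecutive examples.
batch_then_shuffle = batches_of(source.batch(4).shuffle(3, seed=0))

# Shuffle BEFORE batch: individual examples can land in any batch.
shuffle_then_batch = batches_of(source.shuffle(12, seed=0).batch(4))

print("batch -> shuffle:", batch_then_shuffle)
print("shuffle -> batch:", shuffle_then_batch)
```

In the first case every batch is always one of (0..3), (4..7), or (8..11), so examples that start out adjacent are always trained on together.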

Q: How large should the shuffle buffer be?
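A: Buffer size controls randomness quality: buffer_size=N maintains a pool of N examples, samples uniformly from it, and refills the chosen slot with the next input example. For a perfect shuffle, buffer_size must be at least the dataset size. In practice, 1000-10000 is a common tradeoff. Too small gives correlated batches, especially when the source is sorted by class. Too large costs memory and delays the start of each epoch while the buffer fills.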
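
The buffer mechanism can be modeled in a few lines of plain Python. This is a simplified sketch of the fill, sample, and replace behavior described above, not TensorFlow's actual implementation; buffered_shuffle is a hypothetical name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    it = iter(items)
    buffer = []

    # 1. Fill the buffer with the first `buffer_size` items.
    for x in it:
        buffer.append(x)
        if len(buffer) == buffer_size:
            break

    out = []
    # 2. Emit a random buffered item, then refill its slot with the next input.
    for x in it:
        i = rng.randrange(len(buffer))
        out.append(buffer[i])
        buffer[i] = x

    # 3. Input exhausted: drain what is left in random order.
    while buffer:
        i = rng.randrange(len(buffer))
        out.append(buffer.pop(i))
    return out

small = buffered_shuffle(range(100), buffer_size=10, seed=0)
identity = buffered_shuffle(range(100), buffer_size=1)

# With buffer_size=10, output position k can only hold inputs 0..k+9,
# so late examples can never reach the front of the epoch.
max_lookahead = max(v - k for k, v in enumerate(small))
print("largest forward jump:", max_lookahead)
```

The model makes the tradeoff concrete: a buffer of 1 does not shuffle at all, and a buffer of N lets an example move forward by fewer than N positions, which is why a small buffer on class-sorted data leaves early batches dominated by one class.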