Lab 4-02: tf.data Input Pipelines

Learning Objectives

Build high-performance tf.data.Dataset pipelines
Use .map(), .cache(), .shuffle(), .batch(), .prefetch()
Apply image augmentation within tf.data (TensorFlow native ops)
Relieve pipeline bottlenecks by tuning parallelism with tf.data.AUTOTUNE (tf.data.experimental.AUTOTUNE in older releases)
Understand why tf.data pipelines are faster than Python DataLoaders for TPU

Pipeline Building Blocks

Raw files / numpy arrays
    │
    ▼ tf.data.Dataset.from_tensor_slices() / .list_files()
    │
    ▼ .map(parse_fn, num_parallel_calls=AUTOTUNE)   ← decode, resize, normalize
    │
    ▼ .cache()   ← cache after expensive decode (if fits in RAM)
    │
    ▼ .shuffle(buffer_size)   ← randomize order
    │
    ▼ .batch(batch_size, drop_remainder=True)
    │
    ▼ .map(augment_fn)   ← augmentation AFTER batch for efficiency
    │
    ▼ .prefetch(AUTOTUNE)   ← overlap CPU preprocessing with GPU training
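To make the last stage of the diagram concrete, the following is a minimal timing sketch, not a real profiler. It assumes a hypothetical slow_load function that sleeps 10 ms per element to stand in for decoding, and a consumer loop that sleeps 10 ms per element to stand in for a training step. The names slow_load, make_dataset, and time_epoch are illustrative and not part of the lab.

```python
import time
import tensorflow as tf

def slow_load(x):
    # Hypothetical stand-in for an expensive decode: 10 ms per element.
    time.sleep(0.01)
    return x

def make_dataset(n=20):
    # py_function lets the Python-side sleep run inside the pipeline.
    return tf.data.Dataset.range(n).map(
        lambda x: tf.py_function(slow_load, [x], tf.int64))

def time_epoch(dataset, step_seconds=0.01):
    # Iterate once over the dataset, sleeping to simulate a training step.
    start = time.perf_counter()
    for _ in dataset:
        time.sleep(step_seconds)
    return time.perf_counter() - start

t_plain = time_epoch(make_dataset())
t_prefetch = time_epoch(make_dataset().prefetch(tf.data.AUTOTUNE))
print(f"no prefetch: {t_plain:.2f}s, with prefetch: {t_prefetch:.2f}s")
```

Without prefetch, loading and the simulated step run back to back; with prefetch, the next element can be loaded while the current step runs, so the second time is typically lower. Exact numbers depend on the machine.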

Key Rules

Rule	Why
`.cache()` before `.shuffle()`	Shuffle runs on already-decoded data, and each epoch still gets a fresh order
`.prefetch(AUTOTUNE)` last	Overlaps CPU preprocessing of the next batch with accelerator work on the current one
`num_parallel_calls=AUTOTUNE` in `.map()`	Parallelizes decoding automatically
Augmentation after `.batch()`	Batched ops are vectorized, so per-element overhead is amortized
`drop_remainder=True`	Fixed batch sizes needed for TPU XLA compilation
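The drop_remainder rule can be checked on a toy dataset. This is a small sketch using tf.data.Dataset.range(10) as a stand-in for real data: it compares the batch sizes and the static shapes with and without drop_remainder=True.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Without drop_remainder, the last batch is short (10 = 3 * 3 + 1).
sizes_keep = [int(b.shape[0]) for b in ds.batch(3)]
# With drop_remainder=True, every batch has exactly 3 elements.
sizes_drop = [int(b.shape[0]) for b in ds.batch(3, drop_remainder=True)]

# The static batch dimension is only known when the remainder is dropped.
spec_keep = ds.batch(3).element_spec.shape
spec_drop = ds.batch(3, drop_remainder=True).element_spec.shape

print(sizes_keep, sizes_drop)
print(spec_keep, spec_drop)
```

The unknown batch dimension in the first spec is what forces shape-specialized compilers such as XLA to handle a variable batch size; dropping the remainder gives a fully static shape.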

AUTOTUNE

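import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data tune parallelism and buffer sizes at runtime

# images, labels, preprocess, and augment are assumed to be defined elsewhere in the lab.
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))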
      .map(preprocess, num_parallel_calls=AUTOTUNE)
      .cache()
      .shuffle(1000)
      .batch(32)
      .map(augment, num_parallel_calls=AUTOTUNE)
      .prefetch(AUTOTUNE)
)
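The pipeline above leaves images, labels, preprocess, and augment undefined. The sketch below fills them in with assumptions so the whole thing runs end to end: the data is random synthetic uint8 images, and preprocess and augment are hypothetical examples built from TensorFlow native ops (tf.image), not the lab's actual functions.

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-ins for the lab's data: 64 RGB images of 40x40 pixels.
images = np.random.randint(0, 256, size=(64, 40, 40, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=(64,), dtype=np.int64)

def preprocess(image, label):
    # Per-example step: resize to a fixed size and scale to [0, 1].
    image = tf.image.resize(image, (32, 32))  # returns float32
    return image / 255.0, label

def augment(image_batch, label_batch):
    # Per-batch step: these tf.image ops accept 4-D [batch, H, W, C] input.
    image_batch = tf.image.random_flip_left_right(image_batch)
    # Note: random_brightness draws one delta per call, so the whole
    # batch shares the same brightness shift.
    image_batch = tf.image.random_brightness(image_batch, max_delta=0.1)
    return tf.clip_by_value(image_batch, 0.0, 1.0), label_batch

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
      .map(preprocess, num_parallel_calls=AUTOTUNE)
      .cache()
      .shuffle(1000)
      .batch(32, drop_remainder=True)
      .map(augment, num_parallel_calls=AUTOTUNE)
      .prefetch(AUTOTUNE)
)

image_batch, label_batch = next(iter(dataset))
print(image_batch.shape, image_batch.dtype, label_batch.shape)
```

Because augment is mapped after cache(), the random ops are re-run every epoch rather than being frozen into the cached data.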

Interview Questions

Q: What is the difference between .cache() and .prefetch()?
A: .cache() stores dataset elements in memory (or in a file, if a filename is given) during the first epoch, so later epochs skip re-decoding and re-preprocessing. .prefetch() prepares upcoming batches in the background while the current one trains, so the training step does not wait on the pipeline. Use both: cache first, prefetch last.
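The caching behavior can be observed by counting how often the mapped function actually runs. This is a minimal sketch under assumptions: expensive is a hypothetical Python function wrapped in tf.py_function so that a plain Python counter can track its calls, and count_calls is an illustrative helper.

```python
import tensorflow as tf

def count_calls(use_cache, epochs=3, n=5):
    calls = {"n": 0}

    def expensive(x):
        # Hypothetical costly step; the counter records each execution.
        calls["n"] += 1
        return x * 2

    ds = tf.data.Dataset.range(n).map(
        lambda x: tf.py_function(expensive, [x], tf.int64))
    if use_cache:
        ds = ds.cache()
    for _ in range(epochs):
        # Each full pass over the dataset is one epoch.
        values = [int(v) for v in ds]
    return calls["n"], values

calls_uncached, _ = count_calls(use_cache=False)
calls_cached, values = count_calls(use_cache=True)
print(calls_uncached, calls_cached, values)
```

Without the cache, the function runs once per element per epoch; with it, the function runs only during the first full pass and later epochs read the stored results.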

Q: Why must .shuffle() come after .cache() but before .batch()?
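A: After .cache(): shuffle operates on already-decoded examples (fast), and because the shuffle itself is not cached, each epoch gets a new order. Caching after the shuffle would freeze one order and replay it every epoch. Before .batch(): ensures batches contain mixed examples. If you shuffle after batch, you shuffle batches, not individual examples (much weaker randomization).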
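The difference between the two orderings is easy to see on a toy dataset. The sketch below uses tf.data.Dataset.range(12) as a stand-in for real examples; the seed values are arbitrary.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(12)

# Shuffle after batch: whole batches are reordered, but each batch
# still holds the same contiguous run of examples.
batch_then_shuffle = [
    b.numpy().tolist() for b in ds.batch(4).shuffle(3, seed=0)
]

# Shuffle before batch: individual examples are mixed across batches.
shuffle_then_batch = [
    b.numpy().tolist() for b in ds.shuffle(12, seed=0).batch(4)
]

print(batch_then_shuffle)
print(shuffle_then_batch)
```

In the first result every batch is one of [0..3], [4..7], or [8..11] in some order; in the second, each batch draws from the whole range.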

Q: How large should the shuffle buffer be?
A: Buffer size controls randomness quality: buffer_size=N maintains a pool of N examples and samples uniformly from it. For a perfect shuffle, buffer_size must be at least the dataset size. When the dataset is too large to buffer fully, 1000-10000 is a common tradeoff. Too small → correlated batches. Too large → high memory usage and a delay at the start of each epoch while the buffer fills.
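The effect of the buffer size can be measured directly. In this sketch (toy data, arbitrary seed, illustrative helper name shuffled), an element can only be emitted once it has entered the buffer, so with buffer size B the value at output position k of a range dataset is at most k + B - 1.

```python
import tensorflow as tf

n = 100
ds = tf.data.Dataset.range(n)

def shuffled(buffer_size):
    # One pass over the dataset with the given shuffle buffer.
    return [int(v) for v in ds.shuffle(buffer_size, seed=42)]

no_shuffle = shuffled(1)   # a buffer of 1 cannot reorder anything
local = shuffled(10)       # elements only move within a sliding window
full = shuffled(n)         # every element is a candidate from the start

# How far ahead of its original position any element was emitted.
max_early = max(v - k for k, v in enumerate(local))
print(no_shuffle[:10])
print(local[:10], max_early)
print(full[:10])
```

With a small buffer, early outputs come only from the beginning of the dataset, which is why data sorted by class yields correlated batches unless the buffer is large enough.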
