Lab 4-02: tf.data Input Pipelines
Learning Objectives
- Build high-performance tf.data.Dataset pipelines
- Use .map(), .cache(), .shuffle(), .batch(), .prefetch()
- Apply image augmentation within tf.data (TensorFlow native ops)
- Profile pipeline bottlenecks and tune parallelism with tf.data.AUTOTUNE (exposed as tf.data.experimental.AUTOTUNE in older releases)
- Understand why tf.data pipelines are faster than Python DataLoaders for TPU
Pipeline Building Blocks
Raw files / numpy arrays
│
▼ tf.data.Dataset.from_tensor_slices() / .list_files()
│
▼ .map(parse_fn, num_parallel_calls=AUTOTUNE) ← decode, resize, normalize
│
▼ .cache() ← cache after expensive decode (if fits in RAM)
│
▼ .shuffle(buffer_size) ← randomize order
│
▼ .batch(batch_size, drop_remainder=True)
│
▼ .map(augment_fn) ← augmentation AFTER batch for efficiency
│
▼ .prefetch(AUTOTUNE) ← overlap CPU preprocessing with GPU training
Key Rules
| Rule | Why |
|---|---|
| .cache() before .shuffle() | Shuffle runs on already-decoded data, and the cache does not freeze one shuffled order |
| .prefetch(AUTOTUNE) last | Always; overlaps CPU preprocessing with accelerator training |
| num_parallel_calls=AUTOTUNE in .map() | Parallelizes decoding automatically |
| Augmentation after .batch() | Batched ops are vectorized, so per-element call overhead is amortized |
| drop_remainder=True | Fixed batch sizes needed for TPU XLA compilation |
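To make the drop_remainder rule concrete, here is a minimal sketch on a toy dataset. The dataset size (10) and batch size (4) are illustrative assumptions, not values from the lab.

```python
import tensorflow as tf

# Toy dataset: 10 integers, batch size 4 (illustrative values only).
ds = tf.data.Dataset.range(10)

ragged = ds.batch(4)                      # keeps the final partial batch
fixed = ds.batch(4, drop_remainder=True)  # drops the final partial batch

ragged_sizes = [int(b.shape[0]) for b in ragged]
fixed_sizes = [int(b.shape[0]) for b in fixed]

print(ragged_sizes)  # [4, 4, 2]
print(fixed_sizes)   # [4, 4]

# The batch dimension is statically known only when the remainder is dropped.
print(ragged.element_spec.shape)  # (None,)
print(fixed.element_spec.shape)   # (4,)
```

The static batch dimension is what lets XLA compile a single program for every step instead of recompiling for the odd-sized last batch.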
AUTOTUNE
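import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let tf.data tune parallelism dynamically at runtime

# images, labels, preprocess, and augment are assumed to be defined elsewhere in the lab.
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(preprocess, num_parallel_calls=AUTOTUNE)  # per-example decode/normalize
    .cache()                                       # keep decoded examples in memory
    .shuffle(1000)                                 # new order on every epoch
    .batch(32)                                     # add drop_remainder=True for TPU
    .map(augment, num_parallel_calls=AUTOTUNE)     # batched augmentation
    .prefetch(AUTOTUNE)                            # overlap with training
)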
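The pipeline above can be run end to end with stand-in data. The following is a sketch under stated assumptions: the synthetic images, the label range, and the bodies of preprocess and augment are hypothetical placeholders chosen for illustration, not the lab's actual functions.

```python
import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-in data (assumed): 256 RGB images of 32x32 pixels, 10 classes.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(256, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(256,), dtype=np.int64)

def preprocess(image, label):
    # Deterministic per-example work: uint8 [0, 255] -> float32 [0, 1].
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, label

def augment(image_batch, label_batch):
    # Random work applied to a whole batch, placed after .cache()
    # so that it is recomputed on every epoch.
    image_batch = tf.image.random_flip_left_right(image_batch)
    image_batch = tf.image.random_brightness(image_batch, max_delta=0.1)
    image_batch = tf.clip_by_value(image_batch, 0.0, 1.0)
    return image_batch, label_batch

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache()
    .shuffle(1000)
    .batch(32, drop_remainder=True)
    .map(augment, num_parallel_calls=AUTOTUNE)
    .prefetch(AUTOTUNE)
)

batch_images, batch_labels = next(iter(dataset))
print(batch_images.shape, batch_labels.shape)  # (32, 32, 32, 3) (32,)
```

Keeping the random augmentation after .cache() matters: anything random placed before the cache would be computed once and then replayed identically in later epochs.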
Interview Questions
Q: What is the difference between .cache() and .prefetch()?
A: .cache() stores dataset elements in memory (or on disk, if a filename is given) during the first epoch, so later epochs skip re-decoding and re-preprocessing. .prefetch() runs the data pipeline in the background while the model trains, so the next batch is ready when the current step finishes and the accelerator does not stall between batches. Use both: cache first, prefetch last.
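One way to observe the difference is to count how often an upstream step actually runs. This is a minimal sketch: the expensive function and the call counter are hypothetical stand-ins for real decoding work, and tf.py_function is used only so that Python can record each call.

```python
import tensorflow as tf

calls = []  # records every time the "expensive" step actually runs

def expensive(x):
    # Hypothetical stand-in for costly decoding or preprocessing.
    calls.append(int(x))
    return x * 2

def make_ds():
    return tf.data.Dataset.range(5).map(
        lambda x: tf.py_function(expensive, [x], tf.int64)
    )

def run_two_epochs(ds):
    calls.clear()
    for _ in range(2):
        for _ in ds:
            pass
    return len(calls)

uncached_calls = run_two_epochs(make_ds())
cached_calls = run_two_epochs(make_ds().cache())
print(uncached_calls, cached_calls)  # 10 5

# prefetch changes when elements are produced, not which elements are produced.
prefetched = [int(v) for v in make_ds().prefetch(tf.data.AUTOTUNE).as_numpy_iterator()]
print(prefetched)  # [0, 2, 4, 6, 8]
```

The cached pipeline runs the expensive step only during the first pass, while prefetching leaves the results unchanged and only moves the work off the training loop's critical path.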
Q: Why must .shuffle() come after .cache() but before .batch()?
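A: After .cache(): shuffle operates on already-decoded examples (fast), and the order is re-randomized every epoch. If .shuffle() came before .cache(), the cache would store the first epoch's order and replay it unchanged. Before .batch(): ensures batches contain mixed examples. If you shuffle after batch, you shuffle whole batches, not individual examples (much weaker randomization).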
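The batch-ordering point can be checked on a toy dataset. This sketch assumes 8 integers and a batch size of 4, chosen only to keep the output readable; the seed is arbitrary.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(8)

# Shuffle AFTER batch: only the order of the two batches can change,
# so each batch always holds the same four neighbors.
batch_then_shuffle = ds.batch(4).shuffle(2, seed=0)
after = [sorted(b.numpy().tolist()) for b in batch_then_shuffle]

# Shuffle BEFORE batch: individual examples can land in either batch.
shuffle_then_batch = ds.shuffle(8, seed=0).batch(4)
before = [b.numpy().tolist() for b in shuffle_then_batch]

print(sorted(after))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(before)         # two batches with elements drawn from the whole range
```

However the first pipeline is shuffled, its batches are always {0, 1, 2, 3} and {4, 5, 6, 7}; only the second pipeline can mix examples across batch boundaries.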
Q: How large should the shuffle buffer be?
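A: Buffer size controls randomness quality: buffer_size=N maintains a pool of N examples, samples uniformly from it, and refills the freed slot with the next input element. For a perfect shuffle, buffer_size >= dataset_size. In practice, 1000-10000 is a reasonable tradeoff when the dataset is too large to buffer whole. Too small → correlated batches, especially if the input is sorted by class. Too large → high memory usage and a delay while the buffer fills before the first batch is produced.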
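The effect of the buffer size can be seen on a small range dataset. In this sketch the dataset size (20), the buffer sizes, and the seed are all illustrative assumptions. It relies on one property of the buffer mechanism: with buffer size B, the output at position i can only come from the first i + B inputs, so no element can appear more than B - 1 positions early.

```python
import tensorflow as tf

N = 20  # illustrative dataset size

def shuffled(buffer_size):
    ds = tf.data.Dataset.range(N).shuffle(buffer_size, seed=42)
    return [int(x) for x in ds.as_numpy_iterator()]

no_shuffle = shuffled(1)  # buffer of 1: the order is unchanged
local = shuffled(3)       # small buffer: an element can jump ahead by at most 2
full = shuffled(N)        # buffer >= dataset size: uniform shuffle of everything

# How far ahead of its original position did any element appear?
max_lookahead = max(value - position for position, value in enumerate(local))

print(no_shuffle == list(range(N)))  # True
print(max_lookahead <= 2)            # True
print(sorted(full) == list(range(N)))  # True
```

With a small buffer, data that arrives sorted by class stays close to sorted, which is why undersized buffers produce correlated batches.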