Lab 01 — Python Advanced Patterns
Phase: 0 — Foundations | Difficulty: ⭐⭐☆☆☆
Concept Overview
1. Generators & Iterators
Why it matters for CV/ML: Large image datasets don't fit in RAM. Generators let you stream data one item or one batch at a time, the same lazy iteration pattern you rely on when looping over a DataLoader.
A generator is a function that yields values lazily. When called, it returns a generator object — an iterator that computes each value on demand.
from pathlib import Path
import cv2

def stream_images(folder):
    for path in Path(folder).rglob("*.jpg"):
        yield cv2.imread(str(path))  # loads ONE image, then pauses
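Under the hood, Python suspends the function's stack frame at each yield and resumes it on the next next() call. The generator holds only the current image, so memory use is bounded by the largest single image rather than by the dataset size, provided the consumer does not keep references to earlier images.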
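The suspend-and-resume behaviour is easy to observe with a toy generator. The sketch below uses a hypothetical countdown function and an events list (both invented for illustration) to record when the body actually runs.

```python
events = []

def countdown(n):
    """Yield n, n-1, ..., 1 and log when the body actually executes."""
    events.append("started")
    while n > 0:
        yield n
        events.append(f"resumed after {n}")
        n -= 1
    events.append("finished")

gen = countdown(2)        # creates the generator object; the body has not run
assert events == []

first = next(gen)         # runs the body up to the first yield
second = next(gen)        # resumes right after that yield
remaining = list(gen)     # drains the rest; the body runs to completion
```

Nothing in the body executes until the first next() call, which is why creating a generator over a huge dataset is essentially free.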
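Iterator protocol: an iterator is any object that implements __iter__ (returning itself) and __next__ (returning the next value or raising StopIteration). The generator object returned by a generator function implements both automatically.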
class InfiniteCounter:
    def __init__(self): self.n = 0
    def __iter__(self): return self
    def __next__(self): self.n += 1; return self.n
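Because InfiniteCounter never raises StopIteration, it has to be consumed with something that stops on its own. A minimal sketch using itertools.islice (the class is repeated so the snippet runs standalone):

```python
from itertools import islice

class InfiniteCounter:
    """Iterator that counts 1, 2, 3, ... without end."""
    def __init__(self):
        self.n = 0
    def __iter__(self):
        return self
    def __next__(self):
        self.n += 1
        return self.n

counter = InfiniteCounter()
first_three = list(islice(counter, 3))   # take a finite slice of an infinite stream
next_value = next(counter)               # the iterator remembers its position
```

Calling list(counter) directly would never return, so an infinite iterator should always be paired with islice, a break, or a similar stopping condition.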
Generator expressions (memory-efficient alternative to list comprehensions):
total_pixels = sum(img.size for img in stream_images("data/"))
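Streaming "one batch at a time" can be sketched with a small generator that groups any iterable into fixed-size chunks. The helper name batched and its batch_size parameter are assumptions made for this example; Python 3.12 also ships itertools.batched, which yields tuples instead of lists.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, pulling lazily from iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))  # pull at most batch_size items
        if not batch:
            return                            # source exhausted
        yield batch

batches = list(batched(range(7), 3))
```

The last batch may be shorter than batch_size; a training loop has to decide whether to keep or drop it.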
2. Decorators
Why it matters: Decorators appear everywhere in ML code — @torch.no_grad(), @tf.function, @app.route, @lru_cache. Understanding them lets you write and debug production ML systems.
A decorator is a higher-order function — it takes a function and returns a new function:
import functools
import time

def timer(func):
    @functools.wraps(func)  # keep func.__name__ and __doc__ on the wrapper
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - t0:.4f}s")
        return result
    return wrapper
@timer
def run_inference(model, image): ...
@timer is syntactic sugar for run_inference = timer(run_inference).
Decorators with arguments require an extra layer of nesting:
import functools

def retry(max_attempts=3):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: re-raise the last error
        return wrapper
    return decorator
@retry(max_attempts=5)
def fetch_batch_from_s3(bucket, key): ...
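functools.wraps copies the original function's __name__, __doc__, and other metadata onto the wrapper. Always use it; without it, help() and any logging that reads __name__ report "wrapper" instead of the decorated function.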
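The effect of functools.wraps is easiest to see side by side. In this sketch the decorators plain and preserving and the functions load_a and load_b are made-up names; both decorators do nothing except forward the call.

```python
import functools

def plain(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def preserving(func):
    @functools.wraps(func)            # copy metadata from func onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@plain
def load_a():
    """Load dataset A."""

@preserving
def load_b():
    """Load dataset B."""

names = (load_a.__name__, load_b.__name__)
docs = (load_a.__doc__, load_b.__doc__)
```

functools.wraps also sets a __wrapped__ attribute on the wrapper, which gives access to the undecorated function when needed, for example in tests.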
3. Dataclasses
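Why it matters: Model configs, training hyperparameters, and dataset metadata need structured containers. A dataclass beats a plain dict: a misspelled field name in the constructor raises TypeError instead of silently creating a new key, IDEs can autocomplete fields, and every field carries a type hint.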
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class TrainingConfig:
    model_name: str = "resnet50"
    num_classes: int = 80
    learning_rate: float = 1e-4
    batch_size: int = 32
    image_size: Tuple[int, int] = (640, 640)
    augmentations: list = field(default_factory=list)
    pretrained_weights: Optional[str] = None

    def __post_init__(self):
        # Raise instead of assert: asserts are stripped under `python -O`
        if self.learning_rate <= 0:
            raise ValueError("LR must be positive")
Generated automatically: __init__, __repr__, __eq__. Optional: __hash__, ordering, frozen (immutable) instances.
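The optional features are switched on through arguments to the decorator. A minimal sketch, using a hypothetical Checkpoint record that is not part of the lab code:

```python
from dataclasses import dataclass, FrozenInstanceError, replace

@dataclass(frozen=True, order=True)
class Checkpoint:
    # Hypothetical record; ordering compares fields in declaration order
    epoch: int
    val_loss: float

a = Checkpoint(epoch=1, val_loss=0.9)
b = Checkpoint(epoch=2, val_loss=0.7)

try:
    a.epoch = 5                       # frozen instances reject assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

latest = max(a, b)                    # order=True generates __lt__ and friends
seen = {a, b, Checkpoint(1, 0.9)}     # frozen + eq makes instances hashable
c = replace(a, val_loss=0.8)          # "modify" by building a changed copy
```

Frozen configs are a reasonable default for hyperparameters, since they cannot be changed by accident halfway through a training run.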
4. __slots__
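Why it matters: CV pipelines process millions of objects (bounding boxes, detections, keypoints). __slots__ eliminates the per-instance __dict__, which in the example below saves roughly 100 bytes per object. The exact saving depends on the CPython version and the number of attributes, but it adds up at scale.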
class BoundingBox:
    __slots__ = ('x1', 'y1', 'x2', 'y2', 'confidence', 'class_id')

    def __init__(self, x1, y1, x2, y2, conf, cls):
        self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2
        self.confidence, self.class_id = conf, cls

# Approximate sizes; they vary by Python version:
# With __slots__: ~56 bytes/instance
# Without: ~152 bytes/instance
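Since the byte counts differ between interpreter versions, it is safer to measure than to memorize them. The sketch below compares two hypothetical classes, SlottedBox and PlainBox, and also shows the trade-off: a slotted instance cannot grow new attributes. sys.getsizeof reports shallow sizes only, so treat the printed numbers as a rough guide.

```python
import sys

class SlottedBox:
    __slots__ = ("x1", "y1", "x2", "y2")
    def __init__(self, x1, y1, x2, y2):
        self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2

class PlainBox:
    def __init__(self, x1, y1, x2, y2):
        self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2

slotted = SlottedBox(0, 0, 10, 10)
plain = PlainBox(0, 0, 10, 10)

has_dict = (hasattr(slotted, "__dict__"), hasattr(plain, "__dict__"))

try:
    slotted.label = "cat"             # 'label' is not listed in __slots__
    extra_attr_allowed = True
except AttributeError:
    extra_attr_allowed = False

# Shallow sizes: the plain instance also pays for its attribute dict
plain_total = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
print(sys.getsizeof(slotted), plain_total)
```

For a whole-program figure, allocating a large list of each kind under tracemalloc gives a more realistic comparison than single-object sizes.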
5. Context Managers
Why it matters: GPU memory, file handles, model inference sessions, database connections — all need deterministic cleanup. Context managers are the Pythonic way.
from contextlib import contextmanager
import torch

@contextmanager
def inference_mode(model):
    """Switch model to eval mode and disable grad tracking."""
    was_training = model.training  # remember the mode we started in
    model.eval()
    try:
        with torch.no_grad():
            yield model
    finally:
        model.train(was_training)  # restore, even if inference raised

with inference_mode(my_model) as model:
    output = model(input_tensor)
# model is back in its original mode here
The __enter__ / __exit__ protocol (class-based) vs @contextmanager (generator-based) — know both.
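For comparison with the generator-based version, here is a minimal class-based sketch. Timer is a made-up example class; it avoids torch so that it runs anywhere.

```python
import time

class Timer:
    """Class-based context manager that records elapsed wall-clock time."""
    def __enter__(self):
        self.start = time.perf_counter()
        return self                       # bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False                      # False means: do not swallow exceptions

with Timer() as t:
    total = sum(range(1000))

# __exit__ still runs when the block raises
try:
    with Timer() as failed:
        raise RuntimeError("inference failed")
except RuntimeError:
    cleanup_ran = hasattr(failed, "elapsed")
```

The class form is the better fit when the manager needs to expose state afterwards (here, t.elapsed); the generator form is shorter when it only wraps setup and teardown.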
6. Type Hints & Protocols
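Why it matters: Type hints are now common across the ML ecosystem (PyTorch 2.x, TF2, sklearn), although annotation coverage varies by library. Static checkers such as mypy or pyright use them to catch bugs before the code runs, and they double as documentation.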
from typing import Union, List, Dict, Callable
from pathlib import Path
import numpy as np
import torch

ImageArray = np.ndarray  # alias; expected shape (H, W, C), dtype uint8

def preprocess(
    image: ImageArray,
    size: tuple[int, int] = (224, 224),
    normalize: bool = True,
) -> torch.Tensor:  # shape (3, H, W), dtype float32
    ...
Protocols (structural subtyping — "duck typing with types"):
from typing import Protocol

class Backbone(Protocol):
    def forward(self, x: "torch.Tensor") -> "torch.Tensor": ...
    def freeze(self) -> None: ...
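Structural subtyping means a class satisfies a Protocol simply by having the right methods. The sketch below uses hypothetical names (Freezable, TinyBackbone, freeze_all) and leaves out torch so it runs standalone; runtime_checkable is optional and only needed for the isinstance checks.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Freezable(Protocol):
    """Anything with a freeze() method; no inheritance required."""
    def freeze(self) -> None: ...

class TinyBackbone:                   # note: does not inherit from Freezable
    def __init__(self) -> None:
        self.frozen = False
    def freeze(self) -> None:
        self.frozen = True

def freeze_all(modules: list[Freezable]) -> int:
    """Freeze every module and return how many were frozen."""
    for m in modules:
        m.freeze()
    return len(modules)

backbone = TinyBackbone()
count = freeze_all([backbone])
matches = isinstance(backbone, Freezable)   # checks method presence only
rejects = isinstance(object(), Freezable)
```

The runtime isinstance check only verifies that the methods exist, not their signatures; signature checking is left to the static type checker.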
7. functools, itertools, collections
These stdlib modules are heavily used in data pipeline code:
from functools import lru_cache, partial, reduce
from itertools import islice, chain, product
from collections import defaultdict, Counter, deque

# Cached class-weight lookup (arguments must be hashable)
@lru_cache(maxsize=1024)
def get_class_weights(dataset_name: str) -> dict: ...

# Sliding window over frame stream
def sliding_window(iterable, n):
    d = deque(maxlen=n)
    for item in iterable:
        d.append(item)
        if len(d) == n:
            yield tuple(d)
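A short usage sketch for several of these tools. The frame names, labels, and the scale helper are made-up stand-ins, and sliding_window is repeated so the snippet runs on its own.

```python
from collections import Counter, defaultdict, deque
from functools import partial
from itertools import chain, islice

def sliding_window(iterable, n):
    """Yield overlapping tuples of n consecutive items."""
    d = deque(maxlen=n)
    for item in iterable:
        d.append(item)
        if len(d) == n:
            yield tuple(d)

frames = ["f0", "f1", "f2", "f3"]                    # stand-ins for video frames
windows = list(sliding_window(frames, 3))

labels = ["cat", "dog", "cat", "bird", "cat"]        # made-up labels
class_counts = Counter(labels)

boxes_by_class = defaultdict(list)                   # no KeyError on first insert
for box_id, label in enumerate(labels):
    boxes_by_class[label].append(box_id)

def scale(value, factor):
    return value * factor

halve = partial(scale, factor=0.5)                   # pre-bind a keyword argument
first_two = list(islice(chain([1, 2], [3, 4]), 2))   # lazily concatenate, then slice
```

deque(maxlen=n) drops the oldest item automatically on append, which is what keeps the window at a fixed size.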
Interview Questions
Q: What is the difference between a generator and a list comprehension? When would you use each?
A: Both produce iterables, but a list comprehension materializes all elements in memory immediately (O(n) space), while a generator produces one element at a time (O(1) space for the generator itself) and can be consumed only once. Use generators when the dataset is large (streaming image batches from disk), when you only need one element at a time, or when the sequence is infinite. Use list comprehensions when you need random access, len(), or multiple passes over the data.
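The trade-off can be checked directly. This sketch compares the container sizes and shows the single-pass behaviour; sys.getsizeof measures only the list or generator object itself, not the integers it refers to.

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]    # all n results exist now
squares_gen = (i * i for i in range(n))     # nothing computed yet

list_size = sys.getsizeof(squares_list)     # grows with n
gen_size = sys.getsizeof(squares_gen)       # small and independent of n

total_first = sum(squares_gen)              # consumes the generator
total_second = sum(squares_gen)             # already exhausted, sums nothing
```

The second sum is a common source of silent bugs: an exhausted generator does not raise, it just yields nothing.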
Q: Explain how @torch.no_grad() works as a decorator AND as a context manager.
A: torch.no_grad is a class that implements __enter__/__exit__ (making it a context manager) and inherits a __call__ that wraps a function so every call runs inside that context (making it a decorator). On entry it records whether gradient tracking is currently enabled and switches it off; on exit it restores the recorded state, so nested uses behave correctly. The flag is thread-local, so disabling gradients in one thread does not affect others.
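The same dual-use pattern can be sketched with the standard library's contextlib.ContextDecorator. This is a toy analogue, not PyTorch's implementation: no_tracking and the module-level _grad_enabled flag are invented for illustration, and unlike the real flag this one is a plain global rather than thread-local.

```python
from contextlib import ContextDecorator

_grad_enabled = True          # toy stand-in for a framework-wide flag

class no_tracking(ContextDecorator):
    """Turn the flag off on entry and restore the previous value on exit."""
    def __enter__(self):
        global _grad_enabled
        self.prev = _grad_enabled
        _grad_enabled = False
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        global _grad_enabled
        _grad_enabled = self.prev
        return False

@no_tracking()                # decorator form: each call runs inside the context
def predict():
    return _grad_enabled

with no_tracking():           # context-manager form
    inside_block = _grad_enabled

inside_call = predict()
after = _grad_enabled
```

Restoring the previous value, rather than unconditionally re-enabling, is what makes nesting safe.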
Q: How would you implement a thread-safe LRU cache for model predictions?
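A: functools.lru_cache is safe to call from multiple threads (its internal bookkeeping stays consistent), but two threads that miss on the same key at the same time may both run the wrapped function. If you need custom eviction or expiry, use a dictionary or a cachetools cache (LRUCache, TTLCache) guarded by a threading.Lock, since cachetools caches are not thread-safe on their own. In asyncio code (for example FastAPI async endpoints), coroutines share one thread, so use asyncio.Lock there; it coordinates coroutines, not threads. Cache keys should be derived from a hash of the input (e.g., MD5 of image bytes) because arrays aren't hashable by default.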
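One way to sketch the lock-protected variant uses OrderedDict for recency tracking. ThreadSafeLRU, image_key, and fake_model are hypothetical names, and the design is deliberately simple: the computation runs outside the lock, so two threads may still compute the same key concurrently.

```python
import hashlib
import threading
from collections import OrderedDict

class ThreadSafeLRU:
    """Small LRU cache guarded by a lock (illustrative, not production-tuned)."""
    def __init__(self, maxsize=128):
        self._data = OrderedDict()
        self._lock = threading.Lock()
        self._maxsize = maxsize

    def get_or_compute(self, key, compute):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)       # mark as most recently used
                return self._data[key]
        value = compute()                         # slow work happens unlocked
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._maxsize:
                self._data.popitem(last=False)    # evict least recently used
        return value

def image_key(image_bytes: bytes) -> str:
    """Hash raw bytes so unhashable inputs can serve as cache keys."""
    return hashlib.md5(image_bytes).hexdigest()

cache = ThreadSafeLRU(maxsize=2)
calls = []

def fake_model(name):
    calls.append(name)                            # record real computations
    return f"pred-{name}"

for name in ["a", "b", "a", "c", "b"]:
    cache.get_or_compute(image_key(name.encode()), lambda n=name: fake_model(n))
```

With maxsize=2, the second request for "a" is a hit, "c" evicts "b", and the final "b" has to be recomputed.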
Common Python Pitfalls in ML Code
# WRONG: mutable default argument, shared across all calls!
def augment(image, transforms=[]):
    transforms.append(Resize(224))  # the same list grows on every call
    ...

# CORRECT: use None sentinel
def augment(image, transforms=None):
    if transforms is None:
        transforms = []  # fresh list per call
    ...

# WRONG: late binding in closures
fns = [lambda x: x * i for i in range(5)]
fns[0](1)  # returns 4, not 0! 'i' is looked up when the lambda runs

# CORRECT: use default argument to capture value
fns = [lambda x, i=i: x * i for i in range(5)]
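
Both pitfalls can be reproduced in a few lines. The functions add_tag_buggy and add_tag_fixed are made-up stand-ins for augment, kept free of external dependencies so the snippet runs as is.

```python
def add_tag_buggy(tag, tags=[]):          # one list shared by every call
    tags.append(tag)
    return tags

def add_tag_fixed(tag, tags=None):
    if tags is None:
        tags = []                         # new list on each call
    tags.append(tag)
    return tags

buggy_first = list(add_tag_buggy("flip"))
buggy_second = list(add_tag_buggy("crop"))    # still contains "flip"
add_tag_fixed("flip")
fixed_second = add_tag_fixed("crop")          # unaffected by the earlier call

late = [lambda x: x * i for i in range(3)]        # every lambda sees the final i
bound = [lambda x, i=i: x * i for i in range(3)]  # i frozen per lambda
late_results = [f(10) for f in late]
bound_results = [f(10) for f in bound]
```

In both cases the root cause is the same: a value that looks per-call or per-iteration is actually created once and shared.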