Lab 3-01: PyTorch Tensors & Autograd
Learning Goals
- Master tensor operations and their CUDA equivalents
- Understand PyTorch's dynamic computation graph
- Use autograd to compute gradients manually
- Avoid common pitfalls: in-place ops, detach, no_grad
- Profile GPU memory usage with torch.cuda.memory_summary
Core Concepts
Tensors
import torch
# Creation
x = torch.tensor([1.0, 2.0, 3.0]) # from Python list
x = torch.zeros(3, 4) # zeros
x = torch.randn(2, 3, requires_grad=True) # Gaussian random, tracks gradients
x = torch.arange(12).reshape(3, 4).float()
# Device placement
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
# or, explicitly
x = x.cuda() # GPU (raises an error if CUDA is unavailable)
x = x.cpu() # back to CPU
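The same tensor operations run unchanged on CPU and GPU; only the placement differs. The following is a minimal sketch of that idea, with illustrative variable names (a, b, c, d, s) that are not part of the lab:

```python
import torch

# Pick the GPU when one is present, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensors directly on the target device to avoid an extra copy
a = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
b = torch.ones(3, 2, device=device)

c = a @ b          # matrix multiply, shape (2, 2)
d = a * 2 + 1      # elementwise ops with scalar broadcasting, shape (2, 3)
s = a.sum(dim=1)   # row sums, shape (2,)

# Operands of one op must live on the same device.
# Move results to the CPU before converting them to NumPy.
print(c.cpu().numpy())
print(s.cpu().numpy())
```

Creating tensors with the device argument, rather than calling .to(device) afterwards, skips a host-to-device copy.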
Computation Graph
PyTorch builds a directed acyclic graph (DAG) dynamically as ops execute.
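Each tensor produced by an operation on a tensor with requires_grad=True records that operation in its grad_fn attribute. Leaf tensors created directly by the user have grad_fn set to None.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1 # y = x² + 3x + 1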
y.backward() # dy/dx = 2x + 3
print(x.grad) # tensor(7.) = 2*2 + 3
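To see what the graph records, it can help to inspect is_leaf and grad_fn on both tensors. This sketch repeats the example above and compares the autograd result with the hand-derived derivative 2x + 3; the name analytic is only illustrative:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1

# x was created by the user: it is a leaf and has no creating op
print(x.is_leaf, x.grad_fn)   # True None
# y was produced by operations, so it carries a backward node
print(y.is_leaf)              # False
print(y.grad_fn)              # an AddBackward0 node (the final "+ 1")

y.backward()

# Hand-derived derivative for comparison: dy/dx = 2x + 3
analytic = 2 * x.detach() + 3
print(x.grad, analytic)       # both are tensor(7.)
```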
Gradient Tape (manual backward)
x = torch.randn(3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(4, requires_grad=True)
# Forward pass
z = W @ x + b
loss = z.pow(2).sum()
# Backward pass — PyTorch computes all gradients
loss.backward()
print(W.grad) # dL/dW, shape (4, 3)
print(x.grad) # dL/dx, shape (3,)
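One way to check that autograd is doing what you expect is to derive the gradients by hand and compare. For loss = sum(z²) with z = Wx + b, the chain rule gives dL/dz = 2z, dL/dW = (dL/dz) xᵀ, dL/dx = Wᵀ (dL/dz), and dL/db = dL/dz. A sketch of that check, using illustrative names dz, dW, dx, and db:

```python
import torch

torch.manual_seed(0)  # reproducible random values
x = torch.randn(3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(4, requires_grad=True)

# Forward and backward pass, as in the example above
z = W @ x + b
loss = z.pow(2).sum()
loss.backward()

# Hand-derived gradients; no_grad keeps these ops out of the graph
with torch.no_grad():
    dz = 2 * z                # dL/dz, shape (4,)
    dW = torch.outer(dz, x)   # dL/dW = dz x^T, shape (4, 3)
    dx = W.T @ dz             # dL/dx = W^T dz, shape (3,)
    db = dz                   # dL/db = dz, shape (4,)

print(torch.allclose(W.grad, dW, atol=1e-5))
print(torch.allclose(x.grad, dx, atol=1e-5))
print(torch.allclose(b.grad, db, atol=1e-5))
```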
torch.no_grad() vs detach()
# no_grad: disable gradient tracking for inference (saves memory, faster)
with torch.no_grad():
    pred = model(x) # model is any torch.nn.Module; no graph is recorded here
# detach: return a tensor cut off from the graph; use it when you want the value without gradients
y = x.detach().cpu().numpy() # convert to NumPy (requires a CPU tensor that does not track gradients)
# grad_fn shows you what op created the tensor
x = torch.randn(3, requires_grad=True)
y = x.sin()
print(y.grad_fn) # <SinBackward0 object>
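The difference between the two is easiest to see by checking requires_grad on the results. The sketch below uses illustrative names (a, b, c, d) and also shows that a detached tensor shares storage with its source:

```python
import torch

x = torch.randn(3, requires_grad=True)

# Inside no_grad, ops are not recorded at all
with torch.no_grad():
    a = x * 2
print(a.requires_grad)                         # False

# detach() cuts one tensor out of an otherwise tracked computation
b = (x * 2).detach()
print(b.requires_grad)                         # False

# Without either, the result is part of the graph
c = x * 2
print(c.requires_grad, c.grad_fn is not None)  # True True

# A detached tensor is a view of the same memory, not a copy
d = x.detach()
print(d.data_ptr() == x.data_ptr())            # True
```

Because the storage is shared, call .clone() after .detach() when you need an independent copy.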
In-Place Operations — Common Pitfall
x = torch.randn(3, requires_grad=True)
# BAD: autograd forbids in-place modification of a leaf tensor that requires grad
x += 1 # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
# GOOD: an out-of-place op creates a new tensor (x is now a non-leaf, so x.grad stays None)
x = x + 1
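In-place updates of a leaf tensor are still legitimate in one common case: a parameter update, which should not be recorded in the graph. A minimal sketch of a hand-written SGD step, assuming a hypothetical learning rate lr and a toy loss chosen only for illustration:

```python
import torch

w = torch.ones(3, requires_grad=True)

# The in-place op on a tracked leaf is rejected before w is modified
try:
    w += 1
    raised = False
except RuntimeError as err:
    raised = True
    print("in-place on leaf failed:", err)

# Toy loss: sum(w^2), so dL/dw = 2w = [2, 2, 2]
loss = (w ** 2).sum()
loss.backward()

lr = 0.1  # hypothetical learning rate
with torch.no_grad():
    w -= lr * w.grad   # allowed: the update is not tracked by autograd
w.grad.zero_()         # clear the accumulated gradient for the next step

print(w)               # values are 1 - 0.1 * 2 = 0.8; w is still a leaf
```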
CUDA Memory Management
# Check memory
print(torch.cuda.memory_allocated() / 1e6, "MB allocated")
print(torch.cuda.max_memory_allocated() / 1e6, "MB peak")
# Free cache
torch.cuda.empty_cache()
# Memory profiling context
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,  # profile host-side ops as well as CUDA kernels
        torch.profiler.ProfilerActivity.CUDA,
    ],
    profile_memory=True,
) as prof:
    y = model(x)
print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
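For a quick look without the profiler, the allocator counters and torch.cuda.memory_summary can be wrapped in a small helper. The sketch below is one possible way to do this; report_cuda_memory is a hypothetical helper name, and everything GPU-specific is guarded so the script also runs on a CPU-only machine:

```python
import torch

def report_cuda_memory(label):
    """Return a one-line memory report, or a notice when no GPU is present."""
    if not torch.cuda.is_available():
        return f"[{label}] CUDA not available"
    allocated = torch.cuda.memory_allocated() / 1e6
    peak = torch.cuda.max_memory_allocated() / 1e6
    return f"[{label}] {allocated:.1f} MB allocated, {peak:.1f} MB peak"

print(report_cuda_memory("start"))

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()           # start peak tracking from here
    big = torch.randn(1024, 1024, device="cuda")   # about 4 MB of float32
    print(report_cuda_memory("after alloc"))

    del big                    # drop the last reference to the tensor
    torch.cuda.empty_cache()   # return cached blocks to the driver
    print(report_cuda_memory("after free"))

    # Full allocator report (allocated, reserved, and more)
    print(torch.cuda.memory_summary(abbreviated=True))
```

empty_cache() does not free memory held by live tensors; it only releases blocks that the caching allocator is holding for reuse.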
Interview Questions
Q: What is the computation graph in PyTorch? How does it differ from TensorFlow 1.x?
A: PyTorch uses a dynamic (define-by-run) computation graph built at runtime during the forward pass. TF1 used a static graph defined before execution. Dynamic graphs enable Python control flow (if/else, loops) in model forward passes and easier debugging.
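A small illustration of define-by-run: the function below (a hypothetical forward, not taken from any library) chooses its ops with ordinary Python control flow, so different inputs produce different graphs and different gradients.

```python
import torch

def forward(x, n_steps):
    """Toy forward pass whose ops depend on the data at runtime."""
    h = x
    for _ in range(n_steps):
        if h.sum() > 0:   # plain Python branch on a tensor value
            h = h * 2
        else:
            h = h - 1
    return h.sum()

# Positive input: every step doubles, so the output is sum(8 * x)
x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x_pos, 3).backward()
print(x_pos.grad)   # tensor([8., 8.])

# Negative input: every step subtracts 1, so the output is sum(x - 3)
x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(x_neg, 3).backward()
print(x_neg.grad)   # tensor([1., 1.])
```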
Q: What does .detach() do?
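A: It returns a new tensor that shares the same underlying data but is excluded from gradient tracking (requires_grad=False, no grad_fn). Because the storage is shared, in-place changes to one tensor are visible in the other. Use it to: (1) convert to numpy; (2) prevent gradients flowing into part of the graph (e.g., frozen encoder); (3) implement stop-gradient operations.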
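A minimal stop-gradient sketch, with illustrative variable names: detaching one factor of x * x makes autograd treat that factor as a constant, so the gradient is halved.

```python
import torch

# Full gradient: d(x * x)/dx = 2x = 6 at x = 3
x_full = torch.tensor(3.0, requires_grad=True)
(x_full * x_full).backward()
print(x_full.grad)   # tensor(6.)

# Stop-gradient: the detached factor is treated as the constant 3,
# so only one path contributes and the gradient is 3
x_stop = torch.tensor(3.0, requires_grad=True)
(x_stop * x_stop.detach()).backward()
print(x_stop.grad)   # tensor(3.)
```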
Q: Why does zero_grad() need to be called before backward()?
A: PyTorch accumulates gradients by default: each backward() call adds to the existing .grad values instead of overwriting them. This is useful for gradient accumulation (simulating larger batch sizes), but otherwise the gradients must be reset once per update step, before that step's backward pass.
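The accumulation is easy to observe on a single scalar. In this sketch the gradient is cleared by hand with w.grad.zero_(); an optimizer's zero_grad() plays the same role for every parameter it manages. The names first, second, and third are illustrative.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # 3.0

(3 * w).backward()       # no reset: the new gradient is added to the old one
second = w.grad.item()   # 6.0

w.grad.zero_()           # reset before the next backward pass
(3 * w).backward()
third = w.grad.item()    # 3.0

print(first, second, third)
```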
Q: What's the difference between model.eval() and torch.no_grad()?
A: model.eval() changes the behavior of layers like BatchNorm (use running stats instead of batch stats) and Dropout (disable it). torch.no_grad() disables gradient computation to save memory and time. For inference, you typically want both.
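The two switches are independent, which a small experiment makes visible. The sketch below assumes a toy model (a Linear layer followed by Dropout) chosen only for illustration:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# train mode + no_grad: dropout is still active, but no graph is built
model.train()
with torch.no_grad():
    train_out = model(x)
print(train_out.requires_grad)                       # False

# eval mode without no_grad: dropout is off, but the graph is still built
model.eval()
eval_out_1 = model(x)
eval_out_2 = model(x)
print(eval_out_1.requires_grad)                      # True
print(torch.allclose(eval_out_1, eval_out_2))        # True: output is deterministic

# Typical inference setup: both together
with torch.no_grad():
    infer_out = model(x)
print(infer_out.requires_grad)                       # False
```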