Lab 3-01: PyTorch Tensors & Autograd

Learning Goals

  • Master tensor operations and their CUDA equivalents
  • Understand PyTorch's dynamic computation graph
  • Use autograd to compute gradients and check them against hand-derived results
  • Avoid common pitfalls: in-place ops, detach, no_grad
  • Profile GPU memory usage with torch.cuda.memory_summary

Core Concepts

Tensors

import torch

# Creation
x = torch.tensor([1.0, 2.0, 3.0])        # from Python list
x = torch.zeros(3, 4)                     # zeros
x = torch.randn(2, 3, requires_grad=True) # Gaussian random, tracks gradients
x = torch.arange(12).reshape(3, 4).float()

# Device placement
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
# or, explicitly
x = x.cuda()   # GPU; raises an error if CUDA is unavailable
x = x.cpu()    # back to CPU

Computation Graph

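PyTorch builds a directed acyclic graph (DAG) dynamically as ops execute. Every tensor produced by an operation on a tensor with requires_grad=True records that operation in its grad_fn attribute. Tensors you create directly (leaf tensors) have grad_fn set to None and receive the accumulated gradient in .grad after backward().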

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1   # y = x² + 3x + 1
y.backward()               # dy/dx = 2x + 3
print(x.grad)              # tensor(7.) = 2*2 + 3
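
To make the leaf versus non-leaf distinction concrete, here is a minimal sketch that inspects the same example. It only uses the standard is_leaf and grad_fn attributes; nothing here is specific to this lab.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1

# x was created directly by the user: it is a leaf and has no grad_fn.
print(x.is_leaf, x.grad_fn)

# y is the result of operations on x: it is not a leaf and records its last op.
print(y.is_leaf, y.grad_fn)

y.backward()
print(x.grad)   # tensor(7.), matching 2*2 + 3
```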

Backward Pass (explicit backward() call)

x = torch.randn(3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(4, requires_grad=True)

# Forward pass
z = W @ x + b
loss = z.pow(2).sum()

# Backward pass — PyTorch computes all gradients
loss.backward()
print(W.grad)   # dL/dW, shape (4, 3)
print(x.grad)   # dL/dx, shape (3,)
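
One way to build trust in autograd is to compare its output with gradients derived by hand. The sketch below assumes the same linear layer and squared-sum loss as above; the names dz, dW, dx, and db are illustrative only. Since L = Σ zᵢ², we have dL/dz = 2z, so dL/dW = (2z) xᵀ, dL/dx = Wᵀ (2z), and dL/db = 2z.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(4, requires_grad=True)

z = W @ x + b
loss = z.pow(2).sum()
loss.backward()

# Hand-derived gradients, computed without recording a graph.
with torch.no_grad():
    dz = 2 * z                 # dL/dz, shape (4,)
    dW = torch.outer(dz, x)    # dL/dW, shape (4, 3)
    dx = W.T @ dz              # dL/dx, shape (3,)
    db = dz                    # dL/db, shape (4,)

print(torch.allclose(W.grad, dW))
print(torch.allclose(x.grad, dx))
print(torch.allclose(b.grad, db))
```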

torch.no_grad() vs detach()

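# no_grad: disable gradient tracking for inference (saves memory, faster)
model = torch.nn.Linear(3, 4)   # placeholder model so the snippet runs
x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    pred = model(x)   # no graph is recorded; pred.requires_grad is False

# detach: return a tensor cut off from the graph; use it when you want the value without gradient
y = x.detach().cpu().numpy()  # convert to numpy (.cpu() is needed for CUDA tensors)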

# grad_fn shows you what op created the tensor
x = torch.randn(3, requires_grad=True)
y = x.sin()
print(y.grad_fn)  # <SinBackward0 object at 0x...>
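
The two mechanisms end in similar-looking tensors but get there differently. The following sketch, using illustrative variable names, shows that both results have requires_grad=False, and that detach() shares storage with its source instead of copying it.

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    a = x * 2            # computed without recording a graph

b = (x * 2).detach()     # graph is recorded for x * 2, then b is cut off from it

print(a.requires_grad, a.grad_fn)   # False None
print(b.requires_grad, b.grad_fn)   # False None

# detach() returns a view of the same storage, not a copy.
d = x.detach()
print(d.data_ptr() == x.data_ptr())   # True

arr = d.numpy()   # works for CPU tensors; call .cpu() first for CUDA tensors
```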

In-Place Operations — Common Pitfall

x = torch.randn(3, requires_grad=True)
# BAD: in-place op on a leaf tensor that requires grad; autograd rejects it
x += 1  # RuntimeError (leaf tensor that requires grad used in an in-place operation)

# GOOD: create a new tensor (the result is a non-leaf, so its .grad is not populated)
x = x + 1
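
In-place errors show up in more than one form. The sketch below is an illustration under simple assumptions (toy tensors, a hand-written update with step size 0.1) and covers three cases: the immediate error on a leaf, the allowed in-place update under no_grad, and the delayed error when an intermediate value needed by backward is overwritten.

```python
import torch

w = torch.ones(3, requires_grad=True)

# 1) In-place on a leaf that requires grad is rejected immediately.
try:
    w += 1
    leaf_error = None
except RuntimeError as err:
    leaf_error = err
print("leaf error raised:", leaf_error is not None)

# 2) The same kind of update is allowed under no_grad; w stays a leaf.
loss = (w ** 2).sum()
loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad      # manual gradient step
w.grad.zero_()
print(w.is_leaf, w)

# 3) Overwriting an intermediate tensor that backward needs fails at backward().
a = torch.ones(3, requires_grad=True)
h = a.exp()     # exp saves its output for the backward pass
h += 1          # in-place change invalidates the saved output
try:
    h.sum().backward()
    backward_error = None
except RuntimeError as err:
    backward_error = err
print("backward error raised:", backward_error is not None)
```

Writing h = h + 1 in case 3 avoids the error, because the original output of exp is left untouched.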

CUDA Memory Management

# Check memory
print(torch.cuda.memory_allocated() / 1e6, "MB allocated")
print(torch.cuda.max_memory_allocated() / 1e6, "MB peak")

# Release unused cached blocks back to the driver (tensors still referenced are not freed)
torch.cuda.empty_cache()

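# Memory profiling context (model and x are assumed to already be on the GPU)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,    # operator-level events
        torch.profiler.ProfilerActivity.CUDA,   # kernel-level events
    ],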
    profile_memory=True,
) as prof:
    y = model(x)

print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
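
The learning goals mention torch.cuda.memory_summary, which prints a table of the caching allocator's statistics. Here is a minimal sketch; the helper name report_cuda_memory and the 1024×1024 workload are assumptions made for illustration, and the function falls back to a plain message on machines without a GPU.

```python
import torch

def report_cuda_memory() -> str:
    """Run a small GPU workload and return the allocator summary as text."""
    if not torch.cuda.is_available():
        return "CUDA not available; nothing to report."
    x = torch.randn(1024, 1024, device="cuda")   # about 4 MB of float32
    y = x @ x                                    # allocates a second 4 MB tensor
    return torch.cuda.memory_summary(abbreviated=True)

print(report_cuda_memory())
```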

Interview Questions

Q: What is the computation graph in PyTorch? How does it differ from TensorFlow 1.x?
A: PyTorch uses a dynamic (define-by-run) computation graph built at runtime during the forward pass. TF1 used a static graph defined before execution. Dynamic graphs enable Python control flow (if/else, loops) in model forward passes and easier debugging.
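
As an illustration of define-by-run, the sketch below uses an ordinary Python if inside a forward function (the function and its branches are made up for this example). The graph that autograd records depends on which branch actually ran.

```python
import torch

def forward(x: torch.Tensor) -> torch.Tensor:
    # Plain Python control flow decides which ops get recorded.
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient: 2x
    return (3 * x).sum()        # gradient: 3

pos = torch.tensor([1.0, 2.0], requires_grad=True)
neg = torch.tensor([-1.0, -2.0], requires_grad=True)

forward(pos).backward()
forward(neg).backward()

print(pos.grad)   # tensor([2., 4.])
print(neg.grad)   # tensor([3., 3.])
```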

Q: What does .detach() do?
A: It returns a new tensor that shares the same underlying data (no copy) but is excluded from gradient tracking. Use it to: (1) convert to numpy; (2) prevent gradients flowing into part of the graph (e.g., frozen encoder); (3) implement stop-gradient operations.
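
A small stop-gradient sketch may help here. It assumes a toy function y = x · x and compares the gradient with and without detaching one factor; the variable names are illustrative.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Both factors tracked: dy/dx = 2x
y = x * x
y.backward()
full_grad = x.grad.clone()

# Second factor detached, so it is treated as a constant: dy/dx = x
x.grad = None
y_sg = x * x.detach()
y_sg.backward()
stopped_grad = x.grad.clone()

print(full_grad, stopped_grad)   # tensor(6.) tensor(3.)
```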

Q: Why does zero_grad() need to be called before backward()?
A: PyTorch accumulates gradients by default. Without zero_grad(), each backward pass adds to the existing values in .grad. This accumulation is useful for simulating larger batch sizes, but the gradients must be reset at the start of each update step.
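
The accumulation is easy to observe directly. This sketch uses a single scalar parameter and resets .grad by hand, which is roughly what an optimizer's zero_grad() does for each parameter (depending on the version, it may set .grad to None instead of zeroing it).

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.clone()     # tensor(2.)

(2 * w).backward()
second = w.grad.clone()    # tensor(4.): the second pass was added to the first

w.grad.zero_()             # reset before the next update step
(2 * w).backward()
third = w.grad.clone()     # tensor(2.)

print(first, second, third)
```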

Q: What's the difference between model.eval() and torch.no_grad()?
A: model.eval() changes the behavior of layers like BatchNorm (use running stats instead of batch stats) and Dropout (disable it). torch.no_grad() disables gradient computation to save memory and time. For inference, you typically want both.
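
The two settings are independent, which the following sketch tries to show. The tiny Sequential model is a stand-in chosen for illustration, not part of the lab code.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

# eval() alone: Dropout is disabled, but a graph is still recorded.
model.eval()
out_eval = model(x)
print(out_eval.requires_grad)    # True

# eval() plus no_grad(): same values, no graph.
with torch.no_grad():
    out_ng = model(x)
print(out_ng.requires_grad)      # False
print(torch.equal(out_eval.detach(), out_ng))   # True

# no_grad() alone: no graph, but Dropout is still active in train mode.
model.train()
with torch.no_grad():
    out_train = model(x)
print(out_train.requires_grad)   # False
```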