Chapter 5 - Transformers & LLMs: Attention Changed Everything
The Crux
For years, sequence modeling meant RNNs: process one word at a time, remember the past. It worked, but it was slow and forgot long-range dependencies. Then transformers arrived: process everything in parallel, use attention to find what matters. This architecture unlocked LLMs, changed NLP, and is spreading to images, video, and more.
Why Attention Beats Recurrence
The RNN Problem
RNNs process sequences step-by-step:
h₁ = f(x₁, h₀)
h₂ = f(x₂, h₁)
h₃ = f(x₃, h₂)
...
Hidden state h carries information forward. To access word 1 when at word 100, information must survive 99 steps of computation. It doesn't.
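The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's RNN cell; the dimensions and tanh nonlinearity are assumptions for the example. Note the loop: step t cannot begin until step t-1 finishes.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only
d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W_x = rng.standard_normal((d_h, d_in)) * 0.1  # input-to-hidden weights
W_h = rng.standard_normal((d_h, d_h)) * 0.1   # hidden-to-hidden weights
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    # h_t = f(x_t, h_{t-1}); tanh is a common choice for f
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Sequential processing: an inherently serial loop
xs = rng.standard_normal((100, d_in))  # a sequence of 100 "words"
h = np.zeros(d_h)
for x_t in xs:
    h = rnn_step(x_t, h)  # info from word 1 must survive 99 of these updates
```

The final `h` is the only channel through which word 1 can influence word 100, which is exactly the bottleneck described next.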
Problems:
- Sequential processing: Can't parallelize. Slow.
- Vanishing gradients: Long-range dependencies get lost.
- Fixed-size bottleneck: h must encode everything.
The Attention Solution
Instead of forcing information through a sequential bottleneck, let every position attend to every other position directly.
Processing word 100? Look back at all 99 previous words, figure out which are relevant, and pull information from them.
Key Idea: Attention is a learned, differentiable lookup table.
- Query: "What am I looking for?"
- Keys: "What does each position offer?"
- Values: "What information does each position have?"
Compute similarity between query and all keys, use that to weight values.
Attention(Q, K, V) = softmax(QKᵀ / √d) V
Intuition:
- Q·Kᵀ measures "how relevant is each position?"
- Softmax converts to probabilities
- Multiply by V to get weighted sum of relevant info
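The formula above fits in a few lines of NumPy. This is a minimal single-head sketch (no masking, no batching); the matrix shapes are assumptions for the example.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # how relevant is each position?
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows are probabilities
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 4  # 5 positions, dimension 4 (illustrative sizes)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = attention(Q, K, V)  # every position attends to every other, in one matmul
```

Notice there is no loop over positions: the whole sequence is handled by two matrix multiplies and a softmax, which is what makes the parallelism argument below work.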
Why It Wins
Parallelization: All attention operations are matrix multiplies. GPUs love this. Training is often 10x-100x faster than comparable RNNs.
Long-range dependencies: Word 100 can directly attend to word 1. No vanishing gradients through 99 steps.
Flexibility: Attention weights are learned. The model decides what's important.