Explain how attention mechanisms work in transformers
The core idea behind self-attention is that every token in a sequence gets to "look at" every other token and decide how much to attend to it. Each token is projected into three vectors: a query, a key, and a value. The attention mechanism then computes a weighted sum of all values, where the weights are obtained by taking the dot products of queries with keys, scaling them, and passing them through a softmax.
```python
# Scaled dot-product attention
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    # Query-key similarity, scaled by sqrt(d_k) to keep the softmax well-behaved
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    # Normalize into attention weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, V)
```
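To make the projection step concrete, here is a minimal usage sketch of the function above: token embeddings (random placeholders here) are mapped into queries, keys, and values with learned linear layers. The shapes and layer names are illustrative assumptions, not part of any particular library's API.

```python
# Minimal usage sketch (shapes and layer names are illustrative assumptions)
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 64
x = torch.randn(batch, seq_len, d_model)  # placeholder token embeddings

# Learned projections from the embedding space into query, key, and value spaces
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)
out = attention(Q, K, V, d_k=d_model)  # (batch, seq_len, d_model)
```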
This mechanism has several key advantages:
- Parallelism — unlike RNNs, all positions are processed simultaneously
- Long-range dependencies — any token can directly attend to any other
- Interpretability — attention weights show what the model focuses on (a short sketch of inspecting the weights follows this list)
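To illustrate the interpretability point, the sketch below reuses the `Q` and `K` from the usage example above and prints one row of the attention matrix, i.e. how strongly the first token attends to every position.

```python
# Sketch: reading attention weights for interpretability
# (continues from the usage sketch above; reuses Q and K)
scores = torch.matmul(Q, K.transpose(-2, -1)) / Q.size(-1) ** 0.5
weights = F.softmax(scores, dim=-1)   # (batch, seq_len, seq_len), each row sums to 1
print(weights[0, 0])                  # how strongly token 0 attends to each position
```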
In practice, transformers use multi-head attention, which runs several attention operations in parallel so the model can capture different types of relationships simultaneously. Each head has its own query, key, and value projections, and the heads' outputs are concatenated and projected back to the model dimension; a minimal sketch is shown below.
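Here is a minimal multi-head attention sketch under the common layout where a single `d_model`-sized projection is reshaped into several smaller heads. The module name, dimensions, and structure are illustrative assumptions, not a specific library's implementation.

```python
# Minimal multi-head attention sketch (dimensions and module layout are
# illustrative assumptions, not a specific library's implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One set of projections; the heads are carved out by reshaping
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # mixes the heads' outputs

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project, then split the model dimension into (num_heads, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # Scaled dot-product attention applied to every head in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        heads = torch.matmul(weights, V)              # (batch, heads, seq_len, d_head)
        # Re-merge the heads and combine them with the output projection
        heads = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.W_o(heads)

mha = MultiHeadAttention()
y = mha(torch.randn(2, 5, 64))  # (2, 5, 64)
```

Splitting one large projection into heads keeps the total parameter count the same as single-head attention while letting each head attend to different positions or patterns.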