Explain how attention mechanisms work in transformers
The core idea behind self-attention is that every token in a sequence gets to "look at" every other token and decide how much to attend to it. Each token is projected into three vectors: a query, a key, and a value. The attention mechanism then computes a weighted sum of all values, where the weights are obtained by taking the dot products of queries with keys, scaling them, and passing them through a softmax.
```python
# Scaled dot-product attention
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    # Query-key similarity, scaled by sqrt(d_k) to keep the softmax well-behaved
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    # Normalize into attention weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, V)
```
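To make the projection step concrete, here is a minimal usage sketch of the function above: token embeddings (random placeholders here) are mapped into queries, keys, and values with learned linear layers. The shapes and layer names are illustrative assumptions, not part of any particular library's API.

```python
# Minimal usage sketch (shapes and layer names are illustrative assumptions)
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 64
x = torch.randn(batch, seq_len, d_model)  # placeholder token embeddings

# Learned projections from the embedding space into query, key, and value spaces
W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)
out = attention(Q, K, V, d_k=d_model)  # (batch, seq_len, d_model)
```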
This mechanism has several key advantages:
- Parallelism — unlike RNNs, all positions are processed simultaneously
- Long-range dependencies — any token can directly attend to any other
- Interpretability — attention weights show what the model focuses on (a short sketch of inspecting the weights follows this list)
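To illustrate the interpretability point, the sketch below reuses the `Q` and `K` from the usage example above and prints one row of the attention matrix, i.e. how strongly the first token attends to every position.

```python
# Sketch: reading attention weights for interpretability
# (continues from the usage sketch above; reuses Q and K)
scores = torch.matmul(Q, K.transpose(-2, -1)) / Q.size(-1) ** 0.5
weights = F.softmax(scores, dim=-1)   # (batch, seq_len, seq_len), each row sums to 1
print(weights[0, 0])                  # how strongly token 0 attends to each position
```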
In practice, transformers use multi-head attention, which runs several attention operations in parallel so the model can capture different types of relationships simultaneously. Each head has its own query, key, and value projections, and the heads' outputs are concatenated and projected back to the model dimension; a minimal sketch is shown below.
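Here is a minimal multi-head attention sketch under the common layout where a single `d_model`-sized projection is reshaped into several smaller heads. The module name, dimensions, and structure are illustrative assumptions, not a specific library's implementation.

```python
# Minimal multi-head attention sketch (dimensions and module layout are
# illustrative assumptions, not a specific library's implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One set of projections; the heads are carved out by reshaping
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # mixes the heads' outputs

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project, then split the model dimension into (num_heads, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # Scaled dot-product attention applied to every head in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        heads = torch.matmul(weights, V)              # (batch, heads, seq_len, d_head)
        # Re-merge the heads and combine them with the output projection
        heads = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.W_o(heads)

mha = MultiHeadAttention()
y = mha(torch.randn(2, 5, 64))  # (2, 5, 64)
```

Splitting one large projection into heads keeps the total parameter count the same as single-head attention while letting each head attend to different positions or patterns.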