asdPrompt

Select, copy, and act on any AI response without touching the mouse.

Try the full workflow right here — no extension needed.

This interactive tutorial requires a keyboard.

Visit this page in a desktop browser to try it out, or go straight to the Chrome Web Store.

claude.ai
You

Explain how attention mechanisms work in transformers

Claude

The core idea behind self-attention is that every token in a sequence gets to "look at" every other token and decide how much to attend to it. Each token is projected into three vectors: a query, a key, and a value. The attention mechanism then computes a weighted sum of all values, where the weights come from the scaled dot products of queries and keys, normalized with a softmax.

# Scaled dot-product attention
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    # Similarity of each query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    # Softmax turns scores into weights that sum to 1 for each query
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.matmul(weights, V)
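
For example, with a batch of 2 sequences of 10 tokens and a head dimension of 64 (shapes chosen purely for illustration):

Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = attention(Q, K, V, d_k=64)  # out has shape (2, 10, 64)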

This mechanism has several key advantages:

  • Parallelism: unlike RNNs, all positions are processed simultaneously
  • Long-range dependencies: any token can directly attend to any other
  • Interpretability: attention weights show what the model focuses on

In practice, transformers use multi-head attention, which runs several attention operations in parallel so the model can capture different types of relationships simultaneously.
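
As a rough sketch, multi-head attention splits the model dimension across several heads, runs scaled dot-product attention on each head independently, and recombines the results (a production implementation such as torch.nn.MultiheadAttention also handles masking, dropout, and cross-attention):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        # Separate projections for queries, keys, values, plus an output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_k)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention runs independently on each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.d_k ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)
        # Concatenate the heads and project back to d_model
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)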

Reply to Claude…
Available in the Chrome Web Store