How Transformers Work
The architecture behind GPT, Claude, and every modern language model. We'll trace a sentence through the full pipeline — from raw text to a probability distribution over the next token — with one key idea at the center: attention.
Why a Transformer?
Before transformers, sequence models like RNNs read text one word at a time, carrying a hidden state. Two big problems: they were slow (sequential by construction) and they forgot distant context. The transformer's answer in "Attention Is All You Need" (2017): drop recurrence entirely, and let every token look directly at every other token in parallel.
RNN — sequential, lossy memory
Token 100 only sees token 1 through a long chain of hidden states. Information leaks at every hop. Training cannot parallelize across the sequence.
Transformer — direct, parallel
Every token computes a weighted average over all other tokens in one matrix multiplication. Long-range context is just one operation away.
1 Tokenization
Models don't see text — they see token IDs (integers). A tokenizer breaks text into pieces (often sub-words) and maps each to an ID in a vocabulary of 30k–200k entries. Common words become single tokens; rare words split into pieces.
Try it
Type a short sentence. We'll split it into a toy vocabulary so you can see the IDs.
For example, a sub-word tokenizer might split "unbelievable" into ["un", "believ", "able"].
Our toy tokenizer here just splits on spaces — enough to illustrate the idea.
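A whitespace tokenizer of this kind can be sketched in a few lines. This is only an illustration: the function names `build_vocab` and `encode`, the `<unk>` fallback token, and the sample sentence are assumptions made for the example, and real tokenizers use sub-word algorithms rather than a split on spaces.

```python
def build_vocab(corpus):
    """Assign an integer ID to each distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words not in the vocabulary
    for word in corpus.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map text to a list of token IDs, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab("the cat sat on the mat")
ids = encode("the cat sat on the sofa", vocab)
print(ids)  # [1, 2, 3, 4, 1, 0]
```

Both occurrences of "the" map to the same ID, and "sofa", which was never seen when building the vocabulary, falls back to ID 0.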
2 Token & Position Embeddings
Each token ID is looked up in a giant table to produce a learned embedding vector
of dimension d_model (e.g. 768 for GPT-2, 12,288 for GPT-3). Similar tokens end up
near each other in this space — that's how the model represents meaning numerically.
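But attention on its own has no built-in notion of order: permute the input tokens and the outputs are permuted the same way (it is permutation-equivariant). So we add a positional embedding \(P[i]\) that depends only on the position \(i\). The sum of token and positional embeddings is what enters the first transformer block.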
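As a minimal sketch of the lookup-and-add step, assuming NumPy and toy sizes: the table names `E` and `P`, the dimensions, and the random initialization are all illustrative (both tables would be learned in a real model, and a learned positional table is only one of several positional schemes in use).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 6, 8, 12  # toy sizes chosen for illustration

E = rng.normal(size=(vocab_size, d_model))  # token embedding table
P = rng.normal(size=(max_len, d_model))     # positional embedding table

token_ids = np.array([1, 2, 3, 4, 1, 5])    # e.g. "the cat sat on the mat"
positions = np.arange(len(token_ids))

# Row i of x is the token embedding plus the embedding of position i.
x = E[token_ids] + P[positions]
print(x.shape)  # (6, 12)
```

Token ID 1 appears at positions 0 and 4. Its token embedding is identical in both places, but the two rows of `x` differ because a different positional vector was added to each.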
Visualize embeddings
Each row below is a 12-dim "embedding" for one token (random fixed values for the demo). Red = positive, blue = negative.
In trained embedding spaces, directions can carry meaning. The classic word-vector example is king − man + woman ≈ queen.
3 Self-Attention: The Core Idea
Self-attention answers the question: "For each token, which other tokens should I pay attention to, and how should I combine their information?" Every token gets to look up information from every other token, with learned weights.
Each token's embedding \(x_i\) is projected three different ways:
- Query \(q_i = x_i W_Q\) — "what am I looking for?"
- Key \(k_i = x_i W_K\) — "what do I represent / advertise?"
- Value \(v_i = x_i W_V\) — "what content do I contribute if attended to?"
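Attention scores between positions \(i\) and \(j\) come from the dot product \(q_i \cdot k_j\), which is high when the query "matches" the key. We divide by \(\sqrt{d_k}\) (so large dot products don't saturate the softmax and shrink its gradients) and apply a softmax across each row so the weights for token \(i\) sum to 1:
\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
Here the rows of \(Q\), \(K\), and \(V\) are the vectors \(q_i\), \(k_i\), and \(v_i\).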
The output for token \(i\) is a weighted average of all values, where the weights come from the query–key compatibility. Tokens that "matter" to \(i\) contribute more.
Live attention matrix
Pick a sentence. The matrix shows attention weights: each row is a query token, each column a key token. Bright cells = strong attention. We use a causal mask (upper triangle = 0) so each token only attends to itself and earlier tokens — that's how decoder-style LMs are trained.
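Causal self-attention for a single head can be sketched as follows, assuming NumPy. The function names, the toy shapes, and the random weights are assumptions for illustration; this is not a trained model.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(x, W_Q, W_K, W_V):
    """Single-head attention in which token i attends only to tokens 0..i."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)        # masked cells get zero weight
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = causal_self_attention(x, W_Q, W_K, W_V)
print(weights.round(2))  # lower-triangular matrix of attention weights
```

The first row of `weights` puts all of its mass on the first token, since that token has nothing earlier to attend to.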
3b Multi-Head Attention
A single attention head produces just one set of weights per token, so it tends to capture one kind of relationship at a time. Real transformers run many heads in parallel, each with its own \(W_Q, W_K, W_V\), and concatenate the outputs. Different heads end up specializing: one might track syntax, another coreference, another positional offsets.
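A rough sketch of running several heads and concatenating their outputs, assuming NumPy. The output projection `W_O` is not mentioned above; it is the usual final step and is included here as an assumption. The causal mask is left out to keep the example short, and all shapes and weights are toy values.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outputs.append(weights @ V)             # (seq_len, d_head)
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_O                         # back to (seq_len, d_model)


rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(n_heads * d_head, d_model))
x = rng.normal(size=(seq_len, d_model))

y = multi_head_attention(x, heads, W_O)
print(y.shape)  # (5, 16)
```

Each head works in a smaller `d_head`-dimensional space, so the total cost stays close to that of one full-width head.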
Head 1 — "previous token"
Strong attention to \(i-1\). Useful for local syntactic patterns.
Head 2 — "subject of verb"
"sat" attends back to "cat". Useful for resolving who-did-what.
4 The Transformer Block
One layer of a transformer wraps multi-head attention plus a per-token MLP, with two pieces of glue that turn out to be essential: residual connections and layer norm.
- Residuals (the "x +" part: each sub-layer's output is added back to its input) let gradients flow through dozens of layers without vanishing.
- LayerNorm stabilizes the scale of activations across the hidden dimension.
- FFN is a 2-layer MLP applied independently to each position: \( \text{FFN}(h) = \sigma(hW_1)W_2 \). The hidden layer typically expands to \(4 \cdot d_\text{model}\).
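The three pieces listed above can be wired together as in the sketch below, assuming NumPy. Several choices here are assumptions rather than anything the text specifies: the pre-norm ordering (LayerNorm applied before each sub-layer), ReLU standing in for \(\sigma\), LayerNorm without learned gain and bias, and a placeholder `toy_attn` function in place of real multi-head attention.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    # Normalize each position across the hidden dimension (learned gain/bias omitted).
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def ffn(h, W1, W2):
    return np.maximum(h @ W1, 0.0) @ W2  # ReLU stands in for sigma


def transformer_block(x, attn, W1, W2):
    h = x + attn(layer_norm(x))            # residual around attention
    return h + ffn(layer_norm(h), W1, W2)  # residual around the FFN


def toy_attn(h):
    # Placeholder attention: uniform causal averaging over current and earlier tokens.
    n = h.shape[0]
    w = np.tril(np.ones((n, n)))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ h


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
x = rng.normal(size=(seq_len, d_model))

y = transformer_block(x, toy_attn, W1, W2)
print(y.shape)  # (4, 8)
```

Because of the residuals, a block whose sub-layers output zeros passes its input through unchanged, which is part of why deep stacks remain trainable.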
A real model — GPT-2 small (12 layers), Llama-3-8B (32 layers), GPT-4 (rumored ~120) — stacks this same block over and over. Each block refines the per-token representation a bit more, mixing in more context.
Where do the parameters live?
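For a typical layer with \(d_\text{model}=768\), ignoring biases and LayerNorm parameters:
- Attention (\(W_Q, W_K, W_V\), plus the output projection): \(4 \cdot 768^2 \approx 2.4\)M weights.
- FFN (\(W_1, W_2\) with a \(4 \cdot d_\text{model}\) hidden layer): \(8 \cdot 768^2 \approx 4.7\)M weights.
Despite all the focus on attention, about two-thirds of each block's weights sit in those boring feed-forward MLPs.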
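The split between attention and feed-forward weights can be checked with a few lines of arithmetic. The helper name `block_weight_counts` is hypothetical, and the count assumes four square attention matrices (query, key, value, and an output projection), a 4x FFN expansion, and no biases or LayerNorm parameters.

```python
def block_weight_counts(d_model, ffn_mult=4):
    """Return (attention_weights, ffn_weights) for one block, ignoring biases."""
    attention = 4 * d_model * d_model         # W_Q, W_K, W_V and the output projection
    ffn = 2 * d_model * (ffn_mult * d_model)  # W_1 and W_2
    return attention, ffn


attn, ffn = block_weight_counts(768)
print(attn, ffn)                     # 2359296 4718592
print(round(ffn / (attn + ffn), 3))  # 0.667
```

The two-to-one ratio does not depend on `d_model`. It follows directly from the 4x expansion in the hidden layer.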
5 Generating the Next Token
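After the final block, we have a context-aware vector \(h_n\) at the last position. Project it onto the vocabulary with the (often shared) embedding matrix to get logits, then softmax to get a probability over every token in the vocabulary:
\[ p(t \mid \text{context}) = \text{softmax}(h_n E^\top)_t \]
where \(E\) is the embedding matrix, with one row per vocabulary token.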
Sample one token, append it to the input, and run the whole forward pass again. That's it — that's how ChatGPT writes a paragraph. One token at a time, each one conditioned on everything before.
Toy next-token distribution
For the prompt "The cat sat on the ___", here's what a (toy) model might predict. Adjust the temperature to see how the distribution sharpens or flattens.
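The effect of temperature can be reproduced with a small sketch, assuming NumPy. The candidate tokens and their logits are made up for illustration and do not come from a real model.

```python
import numpy as np


def next_token_probs(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()


tokens = ["mat", "sofa", "floor", "roof"]  # hypothetical candidates
logits = [3.0, 2.0, 1.5, 0.5]              # made-up scores

for T in (0.5, 1.0, 2.0):
    p = next_token_probs(logits, T)
    print(T, dict(zip(tokens, p.round(3))))
```

A temperature below 1 concentrates probability on the top-scoring token, and a temperature above 1 spreads it more evenly across the candidates.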
Putting It All Together
A modern decoder-only LLM (GPT, Claude, Llama) is just this loop:
- Tokenize the prompt → integer IDs.
- Embed: ID → vector, plus a positional signal.
- Run \(L\) transformer blocks. Each block lets every token look at every prior token (attention) then transforms each token independently (FFN).
- Project the final-position vector onto the vocabulary → next-token probabilities.
- Sample, append, repeat.
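The steps above can be sketched as a sampling loop, assuming NumPy. The `generate` function and its parameters are illustrative names, and `toy_logits` is a stand-in "model" that looks only at the last token through a random table; a real model would run the full stack of transformer blocks over the whole sequence.

```python
import numpy as np


def generate(logits_fn, prompt_ids, n_new, temperature=1.0, seed=0):
    """Repeatedly sample a next token and append it to the sequence."""
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = logits_fn(ids) / temperature  # one forward pass over the sequence
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # softmax over the vocabulary
        next_id = int(rng.choice(len(probs), p=probs))
        ids.append(next_id)                    # the new token becomes part of the input
    return ids


vocab_size = 6
table = np.random.default_rng(1).normal(size=(vocab_size, vocab_size))


def toy_logits(ids):
    # Stand-in model: logits depend only on the most recent token.
    return table[ids[-1]]


out = generate(toy_logits, [1, 2, 3], n_new=5)
print(out)  # the 3 prompt IDs followed by 5 sampled IDs
```

Fixing the seed makes the sampling reproducible, which is convenient when comparing temperatures or prompts.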
Next steps: RLHF and GRPO show how a raw next-token predictor becomes a helpful assistant.