Architectures Behind LLMs

How Transformers Work

The architecture behind GPT, Claude, and every modern language model. We'll trace a sentence through the full pipeline — from raw text to a probability distribution over the next token — with one key idea at the center: attention.

Step 1: Tokenize (Text → IDs)
Step 2: Embed (IDs → vectors)
Step 3: Attention (Mix tokens)
Step 4: FFN (Per-token MLP)
Step 5: Predict (Next token)

Why a Transformer?

Before transformers, sequence models like RNNs read text one word at a time, carrying a hidden state. Two big problems: they were slow (sequential by construction) and they forgot distant context. The transformer's answer in "Attention Is All You Need" (2017): drop recurrence entirely, and let every token look directly at every other token in parallel.

RNN — sequential, lossy memory

Token 100 only sees token 1 through a long chain of hidden states. Information degrades at every hop. Training cannot parallelize across the sequence.

Transformer — direct, parallel

Every token computes a weighted average over all tokens (itself included) using a handful of matrix multiplications. Long-range context is just one operation away.

What a transformer actually does: given a sequence of tokens, it produces — for each position — a vector that summarizes "what this token means in this context." Stack many of these layers and you get rich, context-aware representations that you can decode into a next-token prediction.

1 Tokenization

Models don't see text — they see token IDs (integers). A tokenizer breaks text into pieces (often sub-words) and maps each to an ID in a vocabulary of 30k–200k entries. Common words become single tokens; rare words split into pieces.

Try it

Type a short sentence. We'll split it into a toy vocabulary so you can see the IDs.

Real tokenizers (BPE, WordPiece, SentencePiece) are learned from data. For example, GPT-style BPE might split "unbelievable" into ["un", "believ", "able"]. Our toy tokenizer here just splits on spaces — enough to illustrate the idea.
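To make the idea concrete, here is a minimal sketch of the space-splitting toy tokenizer described above. The function names `build_vocab` and `tokenize`, the lowercasing, and the reserved `<unk>` ID are assumptions made for illustration; a real BPE or WordPiece tokenizer works on learned sub-word pieces instead.

```python
def build_vocab(corpus):
    """Assign an integer ID to each distinct lowercase word, in order of first appearance."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words we have never seen
    for word in corpus.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def tokenize(text, vocab):
    """Split on whitespace and map each piece to its ID (0 if unseen)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab("the cat sat on the mat")
ids = tokenize("The cat sat on the sofa", vocab)
print(ids)  # [1, 2, 3, 4, 1, 0]
```

Note that "the" maps to the same ID both times, and "sofa" falls back to the unknown ID because it was not in the toy vocabulary.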

2 Token & Position Embeddings

Each token ID is looked up in a giant table to produce a learned embedding vector of dimension d_model (e.g. 768 for GPT-2 small, 12,288 for the largest GPT-3). Similar tokens end up near each other in this space; that's how the model represents meaning numerically.

\( x_i \;=\; E[\text{token}_i] \;+\; P[i] \)

But self-attention is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, because nothing in the computation depends on order. So we add a positional embedding \(P[i]\) that depends only on the position. The sum is what enters the first transformer block.
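The lookup-and-add in the formula above can be sketched in a few lines. The table sizes and the random initial values here are toy assumptions; in a trained model both \(E\) and \(P\) would hold learned values.

```python
import random

random.seed(0)
vocab_size, max_len, d_model = 6, 8, 4  # toy sizes, chosen only for illustration

# E has one row per token ID, P has one row per position. Both start random.
E = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
P = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(max_len)]


def embed(token_ids):
    """Return x_i = E[token_i] + P[i] for every position i."""
    return [
        [e + p for e, p in zip(E[tok], P[i])]
        for i, tok in enumerate(token_ids)
    ]


x = embed([1, 2, 3, 4, 1, 5])
# Token 1 appears at positions 0 and 4: same row of E, different row of P.
print(x[0] != x[4])  # True
```

The same token at two positions gets two different input vectors, which is exactly what lets later layers tell the positions apart.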

Visualize embeddings

Each row below is a 12-dim "embedding" for one token (random fixed values for the demo). Red = positive, blue = negative.

Why "learned"? The embedding table starts random and is updated by gradient descent along with everything else. After training, the geometry encodes semantic relationships; the classic illustration from word-embedding models is king − man + woman ≈ queen.

3 Self-Attention: The Core Idea

Self-attention answers the question: "For each token, which other tokens should I pay attention to, and how should I combine their information?" Every token gets to look up information from every other token, with learned weights.

Each token's embedding \(x_i\) is projected three different ways:

\( Q = xW_Q \qquad K = xW_K \qquad V = xW_V \)

Attention scores between positions \(i\) and \(j\) come from the dot product \(q_i \cdot k_j\), which is high when the query "matches" the key. We divide by \(\sqrt{d_k}\) (so the scores don't grow with the key dimension and saturate the softmax, which would leave vanishingly small gradients) and apply softmax across each row so the weights for token \(i\) sum to 1:

\( \text{Attention}(Q,K,V) \;=\; \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \)

The output for token \(i\) is a weighted average of all values, where the weights come from the query–key compatibility. Tokens that "matter" to \(i\) contribute more.
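Here is a minimal pure-Python sketch of the attention formula above, written with nested lists so it needs nothing beyond the standard library. The helper names (`matmul`, `softmax`, `attention`) and the tiny hand-picked \(Q\), \(K\), \(V\) are assumptions for illustration; real implementations use batched tensor libraries.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the weight matrix."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # one row per query, one column per key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print([round(w, 2) for w in weights[0]])  # [0.67, 0.33]
```

Query 0 matches key 0 best, so value 0 dominates its output, but the other value still contributes because softmax never assigns exactly zero weight.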

Live attention matrix

Pick a sentence. The matrix shows attention weights: each row is a query token, each column a key token. Bright cells = strong attention. We use a causal mask (upper triangle = 0) so each token only attends to itself and earlier tokens — that's how decoder-style LMs are trained.

Causal mask: when training a model to predict the next token, position \(i\) must not peek at positions \(>i\) — that would be cheating. Setting those scores to \(-\infty\) before softmax forces the upper triangle to zero.
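The masking step can be sketched as follows. The function name `causal_softmax` and the all-zero example scores are assumptions for illustration; the point is only that \(-\infty\) scores become exactly zero weight after softmax.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # always finite, because the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # equal raw scores everywhere
for row in causal_softmax(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

With equal raw scores, each row spreads its weight evenly over the positions it is allowed to see, and the upper triangle stays at zero.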

3b Multi-Head Attention

A single attention head produces one set of weights per token, so it tends to capture one kind of relationship at a time. Real transformers run many heads in parallel, each with its own \(W_Q, W_K, W_V\), and concatenate the outputs. Different heads often end up specializing: one might track syntax, another coreference, another positional offsets.

\( \text{MHA}(x) \;=\; \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O \)

Head 1 — "previous token"

Strong attention to \(i-1\). Useful for local syntactic patterns.

Head 2 — "subject of verb"

"sat" attends back to "cat". Useful for resolving who-did-what.

Cost. If \(d_\text{model}=768\) and \(h=12\), each head works in a 64-dim subspace (768 / 12). Total parameters per layer for QKV projections: \(3 \cdot d_\text{model}^2\). Attention itself costs \(O(n^2 \cdot d)\) time for a sequence of \(n\) tokens; that quadratic growth in sequence length is the reason long context is expensive.
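A rough sketch of multi-head attention under toy assumptions: \(d_\text{model}=8\) and \(h=2\), so each head works in a 4-dim subspace, with small random weights standing in for learned ones. The causal mask is left out to keep the example short, and all helper names are made up for illustration.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = [
        attention(matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate along the feature axis: n rows, h * d_head columns.
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(x))]
    return matmul(concat, W_O)


rng = random.Random(0)
n, d_model, h = 3, 8, 2
d_head = d_model // h  # 4 dimensions per head
x = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(d_model, d_model, rng)
y = multi_head_attention(x, heads, W_O)
print(len(y), len(y[0]))  # 3 8
```

The output has the same shape as the input (one \(d_\text{model}\)-dim vector per token), which is what allows the result to be added back to \(x\) as a residual.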

4 The Transformer Block

One layer of a transformer wraps multi-head attention plus a per-token MLP, with two pieces of glue that turn out to be essential: residual connections and layer norm.

\( z = x + \text{MHA}(\text{LN}(x)) \qquad y = z + \text{FFN}(\text{LN}(z)) \)
Data flow: x → LN → Multi-Head Attn → (+ x) → z → LN → FFN → (+ z) → y

A real model — GPT-2 small (12 layers), Llama-3-8B (32 layers), GPT-4 (rumored ~120) — stacks this same block over and over. Each block refines the per-token representation a bit more, mixing in more context.
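The wiring of one block can be sketched independently of what the sublayers compute. In this sketch `mha` and `ffn` are passed in as plain functions, the layer norm omits its learned scale and shift, and `zero_sublayer` is a hypothetical stand-in used only to show what the residual connections do.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transformer_block(x, mha, ffn):
    """Pre-LN block: z = x + MHA(LN(x)), then y = z + FFN(LN(z))."""
    z = add(x, mha([layer_norm(row) for row in x]))
    y = add(z, ffn([layer_norm(row) for row in z]))
    return y


def zero_sublayer(rows):
    # Hypothetical stand-in: a sublayer that contributes nothing.
    return [[0.0] * len(r) for r in rows]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(transformer_block(x, zero_sublayer, zero_sublayer) == x)  # True
```

When both sublayers output zeros, the block passes its input through unchanged. That is the point of the residual path: each sublayer only has to learn an adjustment to the representation, not rebuild it.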

Where do the parameters live?

For a typical layer with \(d_\text{model}=768\):

QKV projections: 3 × 768 × 768 ≈ 1.77M params
Output projection: 768 × 768 ≈ 0.59M params
FFN (768 → 3072 → 768): 2 × 768 × 3072 ≈ 4.72M params (about two-thirds of the layer!)
LayerNorms: ~3K params, negligible

Despite all the focus on attention, the bulk of a transformer's weights are in those boring feed-forward MLPs.
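The arithmetic behind that breakdown is easy to check. This sketch counts weight matrices only and ignores bias terms, which is an assumption made to keep the numbers simple; the variable names are made up for illustration.

```python
d_model = 768
d_ff = 4 * d_model  # 3072, the hidden width of the FFN

qkv = 3 * d_model * d_model    # W_Q, W_K, W_V
out_proj = d_model * d_model   # W_O
ffn = 2 * d_model * d_ff       # up-projection plus down-projection
layer_norms = 2 * 2 * d_model  # two LayerNorms, each with a scale and a shift

total = qkv + out_proj + ffn + layer_norms
print(total)                  # 7080960
print(round(ffn / total, 3))  # 0.666
```

So the feed-forward network holds roughly two-thirds of a layer's roughly 7.1M weights, twice as much as all the attention projections combined.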

5 Generating the Next Token

After the final block, we have a context-aware vector \(h_n\) at the last position. Project it onto the vocabulary with the (often shared) embedding matrix to get logits, then apply softmax to get a probability distribution over every token in the vocabulary:

\( P(\text{next} \mid \text{context}) \;=\; \text{softmax}(h_n \, E^\top) \)

Sample one token, append it to the input, and run the whole forward pass again. That's it — that's how ChatGPT writes a paragraph. One token at a time, each one conditioned on everything before.
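The sample-append-repeat loop can be sketched like this. Everything model-specific is an assumption: `final_hidden` is a hypothetical stand-in for the whole transformer stack (here it just reuses the last token's embedding), and \(E\) is a random toy table with six vocabulary entries.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(h_n, E):
    """softmax(h_n E^T): one logit per vocabulary row of E."""
    logits = [sum(h * e for h, e in zip(h_n, row)) for row in E]
    return softmax(logits)


def generate(token_ids, final_hidden, E, steps, rng):
    """Sample, append, repeat. final_hidden(ids) plays the role of the model."""
    ids = list(token_ids)
    for _ in range(steps):
        probs = next_token_probs(final_hidden(ids), E)
        next_id = rng.choices(range(len(E)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


rng = random.Random(0)
E = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # toy 6-token vocabulary


def final_hidden(ids):
    # Hypothetical stand-in for the real model: reuse the last token's embedding.
    return E[ids[-1]]


out = generate([1, 2], final_hidden, E, steps=5, rng=rng)
print(len(out))  # 7
```

The prompt stays at the front of the sequence and one new ID is appended per step. A real model would recompute (or cache) the full forward pass inside `final_hidden` each time.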

Toy next-token distribution

For the prompt "The cat sat on the ___", here's what a (toy) model might predict. Adjust the temperature to see how the distribution sharpens or flattens.

Temperature divides logits before softmax. \(T \to 0\) is greedy (always pick the max); \(T \to \infty\) is uniformly random; \(T = 1\) leaves probabilities unchanged.
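A small sketch of temperature scaling, using made-up logits for three candidate tokens. The function name `softmax_with_temperature` is an assumption for illustration.

```python
import math


def softmax_with_temperature(logits, T):
    """Divide the logits by T, then apply a numerically stable softmax."""
    scaled = [v / T for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for T in (0.1, 1.0, 10.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

At \(T = 0.1\) nearly all the probability lands on the top-scoring token, while at \(T = 10\) the three probabilities are close to equal. \(T = 0\) itself cannot be plugged in (it would divide by zero), so greedy decoding is usually implemented as a plain argmax.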

Putting It All Together

A modern decoder-only LLM (GPT, Claude, Llama) is just this loop:

  1. Tokenize the prompt → integer IDs.
  2. Embed: ID → vector, plus a positional signal.
  3. Run \(L\) transformer blocks. Each block lets every token look at itself and every prior token (attention), then transforms each token independently (FFN).
  4. Project the final-position vector onto the vocabulary → next-token probabilities.
  5. Sample, append, repeat.
That's all there is. Scaling this exact recipe to billions of parameters and trillions of tokens is what produced GPT-4 and Claude. The architecture itself is shockingly simple — most of the magic is in the data, scale, and post-training (SFT / RLHF / GRPO — see those demos).

Next steps: RLHF and GRPO show how a raw next-token predictor becomes a helpful assistant.