Tokens to Vectors: How LLMs Process Language
A visual guide to the mathematics behind attention and summarisation
Stage 1 — Tokenisation to Embedding
Words don't enter the model as text. The input is first split into tokens (whole words or subword pieces), and each token is immediately converted into a vector — a list of thousands of numbers representing its position in high-dimensional space. These numbers are learned during training and encode relationships between concepts.
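A minimal sketch of that lookup, using a toy three-word vocabulary and 4-dimensional vectors (real models use vocabularies of tens of thousands of tokens and thousands of dimensions; the vocabulary and random table here are illustrative, not from any real model):

```python
import numpy as np

# Toy vocabulary and embedding table. In a trained model the table's
# values are learned; here they are random placeholders.
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(text: str) -> np.ndarray:
    """Split text into tokens, then look up each token's vector."""
    token_ids = [vocab[word] for word in text.split()]
    return embedding_table[token_ids]

vectors = embed("the cat sat")
print(vectors.shape)  # → (3, 4): one 4-dimensional vector per token
```

From this point on, the model only ever manipulates these rows of numbers; the original strings are gone.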
Stage 2 — Semantic Vector Space
Similar concepts cluster together in vector space. "Cat" and "dog" are close; "cat" and "democracy" are far apart. Ambiguous words like "bank" sit between their possible meanings — until context pulls them toward one cluster or another.
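"Close" and "far" here are usually measured with cosine similarity — the angle between two vectors. A sketch with hand-picked illustrative vectors (not real embeddings), chosen so the animal words share a direction and "democracy" points elsewhere:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors for illustration only.
cat       = np.array([0.9, 0.8, 0.1])
dog       = np.array([0.8, 0.9, 0.2])
democracy = np.array([0.1, 0.0, 0.95])

print(round(cosine(cat, dog), 2))        # high: same neighbourhood
print(round(cosine(cat, democracy), 2))  # low: far apart
```

An ambiguous word like "bank" would sit with moderate similarity to both a finance-like and a nature-like direction — which is exactly the tension the next stage resolves.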
Stage 3 — Attention Mechanism
When processing "bank," the attention mechanism computes similarity between bank's Query vector and every other token's Key vector. The similarities are passed through a softmax, so they become weights that sum to 1: high similarity means a high attention weight. "River" gets 0.66 attention; "the" gets almost none. These weights determine how much each token's Value vector contributes to bank's updated representation.
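The weight computation can be sketched in a few lines. The Key vectors below are illustrative stand-ins (a trained model produces them from the embeddings via learned projection matrices), constructed so "river" aligns with bank's Query:

```python
import numpy as np

def attention_weights(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Softmax over scaled dot products: one Query against every Key."""
    scores = keys @ query / np.sqrt(len(query))  # similarity per token
    exp = np.exp(scores - scores.max())          # subtract max for stability
    return exp / exp.sum()                       # weights sum to 1

# Illustrative Key vectors, not taken from a real model.
tokens = ["the", "river", "fish", "swam"]
keys = np.array([
    [0.0, 0.1, 0.0, 0.0],   # "the"   — nearly orthogonal to the query
    [1.2, 1.1, 1.0, 0.9],   # "river" — strongly aligned
    [0.6, 0.5, 0.4, 0.3],   # "fish"
    [0.5, 0.4, 0.3, 0.2],   # "swam"
])
bank_query = np.array([1.0, 1.0, 1.0, 1.0])

weights = attention_weights(bank_query, keys)
for token, w in zip(tokens, weights):
    print(f"{token:>6}: {w:.2f}")
```

With these toy numbers "river" dominates and "the" gets almost nothing — the same shape of distribution the 0.66 figure above describes.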
Stage 4 — Contextual Vector Shift
After attention, "bank" isn't in the same place in vector space. The weighted contributions of "river" (and "fish," "swam") have pulled the vector toward the nature cluster. No information was bolted on as a separate packet — the vector itself was warped, as if by gravitational pull from the contextually relevant tokens.
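That shift can be made concrete by comparing "bank" to a nature-like direction before and after the update. Everything below is illustrative: 2-dimensional toy vectors, and the update simplified to a pure weighted sum of Value vectors (real transformers also add this to the token's existing vector via a residual connection, among other steps):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

nature_direction = np.array([1.0, 0.0])   # stand-in for the nature cluster
bank_before      = np.array([0.5, 0.5])   # ambiguous: between both senses

# Toy Value vectors for the context tokens, and the attention weights
# from Stage 3 ("river" at 0.66, "the" near zero).
values = np.array([
    [0.1, 0.1],   # "the"
    [1.0, 0.1],   # "river"
    [0.9, 0.2],   # "fish"
    [0.8, 0.2],   # "swam"
])
weights = np.array([0.05, 0.66, 0.15, 0.14])

# Updated representation: attention-weighted sum of the Value vectors.
bank_after = weights @ values

print(round(cosine(bank_before, nature_direction), 2))  # → 0.71
print(round(cosine(bank_after,  nature_direction), 2))  # → 0.99
```

The vector has moved decisively into the nature cluster — the geometric fingerprint of "bank" having been disambiguated by its context.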