Track 0 · Foundations · Lesson 4

From text to numbers: tokens & embeddings

After this lesson you can explain how a string of text becomes the numbers a neural network processes: tokenization into subwords, integer IDs, and learned embedding vectors — and what the context window limits.

Level: beginner · Read time: ~9 min · Prerequisites: Neural networks in one page

A neural network multiplies and adds numbers. Text is a sequence of characters. Bridging that gap takes three steps that every language model performs before it can do anything else: tokenize the text into pieces, map each piece to an integer ID, then embed each ID as a vector of numbers.

Step 1: tokenization

A tokenizer chops text into tokens. Tokens are usually subwords, not whole words and not single characters. For example, a tokenizer might split "tokenization" into ["token", "ization"], and "BrewSLM" into ["Brew", "SL", "M"]. Common words are often a single token; rare or novel words break into smaller, reusable pieces.
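To make the splitting concrete, here is a minimal sketch that segments a string by greedy longest match against a fixed vocabulary. The function name `split_into_subwords` and the toy vocabulary are made up for illustration, and real tokenizers use more elaborate rules, so treat this as a picture of the idea rather than any library's actual behavior.

```python
def split_into_subwords(text, vocab):
    """Greedy longest-match segmentation against a fixed vocabulary."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit the single character and move on.
            pieces.append(text[i])
            i += 1
    return pieces

# Hypothetical toy vocabulary, chosen to reproduce the splits in the text.
TOY_VOCAB = {"token", "ization", "Brew", "SL", "M", "the", "cat"}

print(split_into_subwords("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(split_into_subwords("BrewSLM", TOY_VOCAB))       # ['Brew', 'SL', 'M']
print(split_into_subwords("cat", TOY_VOCAB))           # ['cat']
```

The common word "cat" comes out as one token, while the novel word "BrewSLM" falls apart into smaller pieces the vocabulary already contains.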

Why subwords rather than words? Two reasons. A whole-word vocabulary would be enormous and would still fail on any word it had never seen. Single characters, at the other extreme, make sequences painfully long. Subwords are the sweet spot: a fixed-size vocabulary that can spell out any string by combining pieces. A widely used algorithm for building this vocabulary is Byte-Pair Encoding (BPE), which starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new token until the vocabulary reaches the desired size.
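The merging idea behind BPE can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on whole words split into characters, breaks ties by first occurrence, and skips the byte-level handling and pre-tokenization that production tokenizers add. The function name `learn_bpe_merges` and the tiny corpus are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Start with every word as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent stem "low" has become a single symbol, which is how common words end up as one token while rarer endings such as "est" stay in smaller pieces.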

Step 2: tokens become IDs

The tokenizer holds a fixed vocabulary — a list of every token it knows, perhaps 32,000 or 150,000 entries — and each token has an integer token ID (its index in that list). So tokenizing produces a list of integers:

"the cat sat" → ["the", " cat", " sat"] → [578, 8415, 7937]

(The leading spaces are part of the tokens: many tokenizers attach the space before a word to the word itself.) Those integers are what actually enters the model. When the model later outputs an ID, the tokenizer maps it back to text. Two practical consequences: the model measures your text in tokens, not characters; and a model can only ever produce tokens that exist in its vocabulary.
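The round trip between tokens and IDs is just a pair of lookups. The sketch below uses a hypothetical five-entry vocabulary, so its IDs are small and do not match the illustrative numbers above; the `<unk>` fallback at index 0 is also an assumption for this example, not something every tokenizer has.

```python
# Hypothetical toy vocabulary: a token's ID is its index in this list.
VOCAB = ["<unk>", "the", " cat", " sat", " mat"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}

def encode(tokens):
    """Map token strings to integer IDs, falling back to <unk> (ID 0)."""
    return [TOKEN_TO_ID.get(token, 0) for token in tokens]

def decode(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(VOCAB[i] for i in ids)

ids = encode(["the", " cat", " sat"])
print(ids)          # [1, 2, 3]
print(decode(ids))  # the cat sat
```

Because the leading spaces live inside the tokens, decoding needs no separator: joining the pieces restores the original string.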

Step 3: IDs become embeddings

An integer ID still isn't useful for arithmetic — ID 8415 isn't "bigger" than ID 578 in any meaningful sense. So the model's first layer is an embedding table: a big lookup that maps each token ID to a learned vector of numbers (say 768 of them). That vector is the token's embedding — the model's numeric representation of it.
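As a rough sketch, an embedding table is a list of rows, and embedding a token means picking the row at its ID. The sizes here are toy values and the numbers are random placeholders; in a real model they are parameters set by training.

```python
import random

VOCAB_SIZE = 5      # toy value; real vocabularies have tens of thousands of entries
EMBEDDING_DIM = 4   # toy value; the example in the text uses 768

# The embedding table: one row of EMBEDDING_DIM numbers per token ID.
# Random here, standing in for values a real model would learn.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one vector per token ID. The ID is used only as a row index."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))  # 3 4
```

Note that no arithmetic is done on the IDs themselves, which is why it does not matter that one ID is numerically larger than another.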

Crucially, the embedding table is made of parameters, so it is learned during training. Tokens that behave similarly end up with similar vectors. After training, the embeddings for "cat" and "dog" sit close together; "cat" and "thermodynamics" sit far apart. The model discovers this geometry on its own from the predict-next-token objective.
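One common way to measure how close two embeddings sit is cosine similarity. The vectors below are made up by hand purely to illustrate the comparison; they are not taken from any trained model, and real embeddings have hundreds of dimensions rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors for illustration only, not real embeddings.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
thermodynamics = [-0.7, 0.1, 0.9]

print(cosine_similarity(cat, dog) > cosine_similarity(cat, thermodynamics))  # True
```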

Figure: "the cat sat" (raw text) → the · cat · sat (tokens) → 578, 8415, 7937 (token IDs) → [0.12, -0.4, …] [-0.9, 0.3, …] [0.05, 0.7, …] (embedding vectors, learned) → into the network
Three stages every prompt passes through before the first matrix multiply. The embedding table is made of parameters, so the mapping from ID to vector is learned during training.

Special tokens and the context window

Vocabularies include a few special tokens that aren't ordinary text: markers for the start/end of a sequence, a padding token to make batches rectangular, and — important for fine-tuning — tokens that delimit roles in a conversation (system / user / assistant). Getting these exactly right is a correctness issue we'll dwell on in Track 1's lesson on chat templates.
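Padding is the easiest special token to see in action. The sketch below right-pads a batch of token ID lists to the same length. The padding ID of 0 is a hypothetical choice, and the accompanying mask (1 for real tokens, 0 for padding) is a common companion that this lesson has not introduced, so read it as an assumption.

```python
PAD_ID = 0  # hypothetical padding token ID; real tokenizers define their own

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to equal length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

padded, mask = pad_batch([[5, 6, 7], [8, 9]])
print(padded)  # [[5, 6, 7], [8, 9, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0]]
```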

Finally, a model can only attend to a bounded number of tokens at once: its context window (e.g., 2,048, 8,192, or far more). Everything — your instructions, any retrieved documents, the conversation so far, and the room left to generate — must fit inside it. When people say "the model ran out of context," they mean the token budget was exceeded.
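The budget arithmetic is simple enough to write down. In this sketch the function names are hypothetical, and dropping the oldest messages first is just one possible policy for making a conversation fit, not a standard the lesson prescribes.

```python
def fits_in_context(prompt_tokens, max_new_tokens, context_window):
    """True if the prompt plus the room reserved for generation fits."""
    return prompt_tokens + max_new_tokens <= context_window

def trim_oldest(message_token_counts, max_new_tokens, context_window):
    """Drop the oldest messages until the rest fits; return the kept counts."""
    kept = list(message_token_counts)
    while kept and not fits_in_context(sum(kept), max_new_tokens, context_window):
        kept.pop(0)
    return kept

print(fits_in_context(1800, 512, 2048))         # False: 1800 + 512 = 2312 > 2048
print(trim_oldest([900, 600, 300], 512, 2048))  # [600, 300]
```

The key point is that the room left to generate counts against the same budget as the prompt, so a prompt that fits on its own can still overflow the window.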

Practical note

Cost and speed are usually measured per token, and limits are in tokens. A rough rule for English: one token ≈ 0.75 words, so 1,000 tokens ≈ 750 words. Always think in tokens, not characters.
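The rule of thumb turns into a one-line estimate. This is only the rough English approximation from the note above; the helper name is made up, and actual counts depend on the tokenizer and the text.

```python
import math

WORDS_PER_TOKEN = 0.75  # the rough English rule of thumb from the text

def estimate_tokens(word_count):
    """Rough token estimate from a word count; real counts vary by tokenizer."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

print(estimate_tokens(750))   # 1000
print(estimate_tokens(3000))  # 4000
```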

Where this leaves us

Text is now a sequence of embedding vectors — exactly the numeric input a neural network wants. The remaining question is what network architecture turns that sequence into a good next-token prediction. The answer, and the reason modern language models work at all, is attention and the Transformer. That's next.

Key terms

Tokenization
Splitting text into tokens (usually subwords) the model can process.
Subword / BPE
Tokens smaller than words; Byte-Pair Encoding builds the vocabulary by merging frequent pairs.
Vocabulary
The fixed set of tokens a tokenizer knows; each has an integer ID.
Token ID
The integer index of a token in the vocabulary — what actually enters the model.
Embedding
A learned vector representing a token; similar tokens get similar vectors.
Special tokens
Non-text markers (start/end, padding, chat roles) in the vocabulary.
Context window
The maximum number of tokens the model can process at once.
