From n-grams to Transformers: a brief history
After this lesson you can name the four major language-model generations (n-gram, neural, recurrent, Transformer) and explain what each one fixed about the previous one — and why the Transformer's structure is what makes modern fine-tuning even worth doing.
Language models did not start with the Transformer. They are the result of a fifty-year lineage in which each new architecture solved one specific thing the previous one could not, then collided with its own bottleneck. Knowing the lineage matters because it tells you what assumptions you are standing on when you fine-tune — and which problems modern models still inherit from their ancestors.
n-gram models: count what you have seen
The earliest practical language models were n-gram models:
you assume the next token depends only on the previous n − 1
tokens (the Markov assumption), then estimate the probability
by counting in a large corpus.
$$P(w_t \mid w_{t-2}, w_{t-1}) = \frac{\operatorname{count}(w_{t-2}, w_{t-1}, w_t)}{\operatorname{count}(w_{t-2}, w_{t-1})}$$
A 3-gram model conditions on the previous two words. Build a giant table of triple-counts and you have the language-model component of a speech recogniser, an autocomplete, or a statistical machine translator; before the 2010s, n-grams were the standard choice for all three. They are fast, interpretable, and they work.
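The counting recipe above fits in a few lines. This is a minimal sketch, assuming a toy whitespace-tokenised corpus; the helper names `train_trigram` and `trigram_prob` are made up for illustration and do not come from any library.

```python
from collections import Counter

def train_trigram(tokens):
    """Count each trigram and the two-token context in front of its last token."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (w1, w2, _), n in trigrams.items():
        contexts[(w1, w2)] += n
    return trigrams, contexts

def trigram_prob(trigrams, contexts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    seen = contexts[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()
trigrams, contexts = train_trigram(corpus)

print(trigram_prob(trigrams, contexts, "the", "cat", "sat"))  # 0.5
print(trigram_prob(trigrams, contexts, "the", "dog", "sat"))  # 0.0
```

"the cat" was followed by "sat" once and "ate" once, so each continuation gets probability 0.5. "the dog" never appeared, so the model has nothing to say about it.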
But they hit two walls hard. Sparsity: raise n
to use more context and the table size explodes; most longer sequences are
never seen and get probability zero. No similarity: the
model has no notion that "the cat sat" and
"the dog sat" share structure — the second is just as
unfamiliar as random gibberish if you never saw it. Smoothing tricks
(Kneser-Ney, back-off) help; they don't fix the underlying problem.
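Both walls show up even on a toy corpus. The sketch below uses add-k smoothing, a much simpler stand-in for Kneser-Ney or back-off, chosen only because it is short; the corpus and the function name `add_k_prob` are illustrative assumptions.

```python
from collections import Counter

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = sorted(set(corpus))

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
contexts = Counter()
for (w1, w2, _), n in trigrams.items():
    contexts[(w1, w2)] += n

# Sparsity: the table has a slot for every possible trigram, almost all empty.
possible = len(vocab) ** 3
observed = len(trigrams)
print(possible, observed)  # 343 9

def add_k_prob(w1, w2, w3, k=1.0):
    """Add-k smoothed estimate: every trigram gets k pseudo-counts."""
    return (trigrams[(w1, w2, w3)] + k) / (contexts[(w1, w2)] + k * len(vocab))

# No similarity: after smoothing, a plausible continuation of the unseen
# context "the dog" scores exactly the same as an implausible one.
plausible = add_k_prob("the", "dog", "sat")
implausible = add_k_prob("the", "dog", "fish")
print(plausible == implausible)  # True
```

Smoothing removes the zeros, but it spreads probability uniformly over the unseen context. Nothing the model learned about "the cat" carries over to "the dog".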
Not gone — still useful
n-grams are still in production: speech recognition decoders
(KenLM), keyword-based retrieval (BM25 scores term counts in a similar spirit),
and on-device autocomplete. We rejected n-grams as the
backbone for general-purpose language; we did not abandon them
as tools.
Neural language models: tokens that share strength
The 2003 breakthrough (Bengio et al.) was simple in hindsight: instead of storing a count for every n-gram, learn a vector for every token and a small neural network that maps a sequence of vectors to a probability distribution over the next token. Train end-to-end with gradient descent.
Two things changed. First, the embedding for "cat" ends up
near the embedding for "dog" because they appear in similar
contexts — so what the model learns about one transfers to the other.
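That is exactly the statistical strength sharing n-grams
could not do. Second, the network still reads a fixed window of previous
tokens, but widening that window grows the parameter count roughly linearly
instead of exploding a count table, so longer contexts become affordable.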
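To make the sharing concrete, here is a minimal sketch of the forward pass of a fixed-window neural language model, in plain Python. The weights are random rather than trained, and the line that places "dog" next to "cat" simply pretends training has already happened; the vocabulary, the sizes, and every name here are illustrative assumptions, not the original 2003 model.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "dog", "sat", "ran"]
dim, hidden, window = 4, 8, 2  # embedding size, hidden units, context length

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# One learned vector per token (random here, since nothing is trained).
embed = {tok: [random.uniform(-0.5, 0.5) for _ in range(dim)] for tok in vocab}
# Assumption: pretend training has already pulled "dog" close to "cat".
embed["dog"] = [x + 0.01 for x in embed["cat"]]

W_hidden = rand_matrix(hidden, window * dim)
W_out = rand_matrix(len(vocab), hidden)

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(context):
    """Concatenate the context embeddings, apply one tanh layer, then softmax."""
    x = [v for tok in context for v in embed[tok]]
    h = [math.tanh(a) for a in matvec(W_hidden, x)]
    return softmax(matvec(W_out, h))

p_cat = next_token_probs(["the", "cat"])
p_dog = next_token_probs(["the", "dog"])
gap = max(abs(a - b) for a, b in zip(p_cat, p_dog))
print(gap)  # small: nearby embeddings give nearly identical predictions
```

Because the prediction is a smooth function of the embeddings, two tokens that sit close together get almost the same next-token distribution, whether or not both contexts were seen in training.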
Word2Vec and GloVe were the famous side-effects of this line: standalone embedding tables trained on enormous corpora that anyone could plug into downstream tasks. They are the ancestors of every embedding model in use today.
RNNs and LSTMs: recurrence over long sequences
Neural LMs in their early form still used a fixed-size window. To handle arbitrary context, recurrent neural networks (RNNs) process tokens one at a time and carry a hidden state forward, so that at each step the state summarises everything seen so far. The Long Short-Term Memory (LSTM) variant, introduced in 1997 and popularised after 2010, added gating mechanisms that mitigated the vanishing gradient problem (gradients tend to shrink to nothing as they pass back through many recurrent steps, so the network struggles to learn from early tokens).
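The recurrence and the vanishing gradient can both be seen in a one-unit toy. This is a sketch of a plain (non-LSTM) RNN with a single scalar hidden state; the weights `w_rec` and `w_in` and the input sequence are arbitrary illustrative choices.

```python
import math

def run_rnn(inputs, w_rec=0.5, w_in=1.0):
    """Scalar RNN: h_t = tanh(w_rec * h_{t-1} + w_in * x_t).

    Also tracks d h_t / d h_0, the sensitivity of the current state to the
    initial state, using the chain rule one step at a time.
    """
    h = 0.0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
        # Derivative of this step with respect to the previous state.
        grad *= w_rec * (1.0 - h * h)
        history.append((h, grad))
    return history

history = run_rnn([0.1] * 20)
for step in (0, 4, 9, 19):
    print(step + 1, history[step][1])
```

Each step multiplies the gradient by a factor no larger than `w_rec` in magnitude, so after 20 steps the sensitivity to the first state has shrunk by at least a factor of 0.5 to the power 20. Gating in an LSTM exists to give the gradient a path that avoids this repeated shrinking.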
LSTMs were the state of the art for much of the 2010s. They drove machine translation, summarisation, and the first wave of practical chatbots. But they had two bottlenecks of their own. Long-range memory was still hard: compressing arbitrary history into one fixed-size vector lost detail. And, fatally, recurrence is sequential: you cannot process token 5 until you have processed token 4. That makes training painfully slow on long sequences, and it makes scaling to bigger datasets run into a brick wall.
2017: Attention is all you need
The Transformer (Vaswani et al., 2017) discarded recurrence entirely. Every token attends directly to every other token through the attention mechanism (Lesson 0.5). Two consequences mattered enormously:
- Long-range dependencies are first-class. Any token can look at any other in one attention step. The model does not have to squeeze information through a single hidden state.
- It parallelises. During training, all positions in a sequence are processed at once as large matrix multiplications, which keeps a GPU saturated. The more compute and data you throw at it, the bigger the model can get, and the better it gets.
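Both consequences are visible in a stripped-down version of the computation. The sketch below is scaled dot-product self-attention in plain Python with one simplifying assumption: the token vectors are used directly as queries, keys, and values, whereas a real Transformer applies learned projections first (and decoder-only models add a causal mask). The four 2-dimensional vectors are made up for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (outputs, weights) for unmasked scaled dot-product attention."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    # Each iteration depends only on `vectors`, never on another iteration's
    # result, which is why real implementations run all positions at once.
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Positions 0 and 3 hold similar vectors; positions 1 and 2 differ from them.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Position 0 puts as much weight on position 3 as on itself, and more than on its immediate neighbour at position 1. Distance in the sequence plays no role; only the match between vectors does.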
Once Transformers could be scaled, the field discovered the scaling laws: predictable improvements with more parameters, more data, and more compute. That is the chain reaction that gave us GPT, Llama, Qwen, SmolLM2 — every model this Academy actually uses.
Key idea
Each generation removed one specific bottleneck. n-grams couldn't generalise between similar tokens; neural LMs fixed that with embeddings. Fixed-window neural LMs couldn't see arbitrary context; RNNs/LSTMs fixed that with recurrence. Recurrence couldn't parallelise and couldn't preserve long-range detail; the Transformer fixed that with attention. Each step earned its reward by absorbing dramatically more data and compute than the one before.
Why this matters for fine-tuning
A modern small language model is a Transformer with on the order of 100 million parameters, pretrained on up to trillions of tokens of text. By the time you touch it, it already encodes vast statistical structure about language: grammar, common phrases, world facts, formats. Supervised fine-tuning (SFT) works precisely because you are not teaching language. You are nudging an existing, well-trained next-token machine to lean toward the specific outputs you want on the specific inputs you provide. The lineage above is what put all that structure inside the model before you arrived.
The next lesson catalogues the three Transformer-family flavours — encoder,
encoder-decoder, decoder-only — so that when you see a Hugging Face model
card that says T5ForConditionalGeneration or
BertForSequenceClassification, you know which lineage you are
looking at and what it is good for.
Key terms
- n-gram model: A language model assuming the next token depends only on the previous n − 1; trained by counting frequencies in a corpus.
- Markov assumption: The simplifying assumption that the future depends only on a fixed-size recent history, not the whole past.
- Sparsity (in n-grams): The problem that most longer n-gram sequences are never observed in any finite corpus, so their probabilities collapse to zero without smoothing.
- Neural language model: A model that maps tokens to learned embeddings and predicts the next token through a neural network, letting similar tokens share statistical strength.
- RNN / LSTM: Recurrent architectures that pass a hidden state from one time step to the next; LSTM added gating to mitigate vanishing gradients over long sequences.
- Vanishing gradient: Gradients shrinking to near-zero as they propagate back through many time steps, preventing the network from learning long-range dependencies.
- Transformer (2017): The architecture that replaced recurrence with attention: parallel across positions, easy to scale, and the basis for every modern LLM.
- Scaling laws: Empirical relationships predicting model quality from parameters, data, and compute; the reason Transformers kept improving past every other architecture.