Continued pretraining: when SFT isn't enough
After this lesson you can tell whether your problem is "the model doesn't speak my domain" or "the model speaks my domain but fails my task," and pick continued pretraining only in the first case. You can also prepare raw-text data with the right recipe (smaller learning rate, longer sequences, conservative vocab decisions) and evaluate honestly by measuring downstream-task SFT performance, not CPT loss.
SFT bends a model toward a task. Continued pretraining (CPT) does something earlier in the pipeline: it continues the original pretraining objective — next-token prediction over raw text — on a new corpus, so the model learns the language of your domain before any task fine-tune is even attempted. It's the right answer to a narrow question: does the base model know my domain's words at all? It's the wrong answer to almost every other question, and it gets picked for the wrong reasons often enough that this lesson spends as much time on when not to use it as on how to use it.
What CPT actually is
The base model was pretrained by reading enormous quantities of generic text and predicting the next token. CPT is exactly the same recipe — same loss, same data shape — applied to your corpus. No prompts, no completions, no chat template, no loss mask. Documents in, next-token loss over every position, gradients back. The output is a new checkpoint with the same architecture and tokenizer (usually) as the base, but with weights nudged toward your domain's distribution. That checkpoint becomes the starting point for SFT in stage two.
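Written out, the objective described above is the usual next-token cross-entropy averaged over every position of a packed sequence. The symbols are notation introduced here for illustration, not the lesson's own: $x_1, \dots, x_T$ is one packed token sequence and $\theta$ stands for the model weights.

```latex
\mathcal{L}_{\text{CPT}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

SFT uses the same per-token term but sums only over completion positions, which is what the loss mask does.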
When CPT actually helps
The honest signal that CPT is the right tool is that the model can't even read your domain fluently. Symptoms:
- Per-token loss on a held-out slice of your raw text is much higher than on generic text — the model is genuinely surprised by your tokens.
- Domain-specific vocabulary is being split into long token streams (medical entity names, legal citation formats, ICD codes, ticker symbols, protein sequences, source code in a niche language).
- SFT on the base produces fluent-looking but factually warped outputs because the model is filling in details from its generic prior, not your corpus.
- You have a lot of unlabeled raw text — millions of tokens minimum, ideally tens of millions — that nobody has labelled into instructions yet.
If those signals are present, CPT is doing real work: it's giving the model exposure to your distribution under the same objective that built its general competence in the first place.
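The first two symptoms can be turned into numbers before any training run. The following is a minimal sketch, not a prescribed diagnostic: `tokens_per_word`, `perplexity`, and `toy_tokenize` are hypothetical helper names, and the toy tokenizer only stands in for the real one so the example runs offline. In practice `tokenize` would plausibly be the base tokenizer's `tokenize` method, and the per-token negative log-likelihoods would come from a forward pass of the base model.

```python
import math
from typing import Callable, Iterable, List, Sequence


def tokens_per_word(tokenize: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / max(n_words, 1)


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into 4-character pieces."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


generic = ["the cat sat on the mat"]
domain = ["acetylcholinesterase inhibitor pharmacokinetics"]

generic_fertility = tokens_per_word(toy_tokenize, generic)
domain_fertility = tokens_per_word(toy_tokenize, domain)
print(generic_fertility, domain_fertility)
```

A large gap between the two fertility numbers, or between held-out domain perplexity and generic perplexity, is the kind of evidence the symptoms describe. What counts as "large" is a judgment call for your corpus.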
When CPT is a misdiagnosis
Most of the time, "my model is bad at my task" is not a CPT problem. Patterns that look like they need CPT but don't:
- The model speaks the language, it just doesn't do the task. If asking the base model to explain a domain term gets a passable answer, the domain isn't unknown — your SFT data, format, or eval is the bottleneck. Fix that.
- You have 500 instruction examples and CPT seems like "more training." CPT is not "SFT with no labels." It optimises a different objective on different data; throwing 500 instructions in raw will just teach the model to autocomplete prompt formats.
- You want the model to follow a new format or refuse a class of prompts. Those are SFT and preference-tuning problems (Lesson 1.2). CPT doesn't teach format compliance — it teaches token distributions.
- You have a small corpus and a big task gap. CPT on 50k tokens is mostly noise. If you don't have the data scale, skip CPT and put the effort into SFT data quality.
Honest beat — CPT is the most over-prescribed step in SLM workflows
Teams reach for continued pretraining because it sounds more serious than "make better SFT data." It almost never beats the alternative of putting that compute into curating, deduping, and gold-setting your task data. If you can't say what specific token-level fluency CPT is supposed to fix, you don't need CPT.
The data shape
CPT data is raw text. Paragraphs, documents, transcripts, source files — whatever your domain produces, deduplicated and lightly cleaned, with documents concatenated and packed into long sequences (typically 2k–8k tokens per sample, depending on the base model's context length). Things CPT data is not:
- JSONL with prompt/completion fields. (That's SFT.)
- Chat-template-formatted messages with role markers. (That's SFT.)
- Instructions and answers. (That's SFT.)
- One paragraph per row, each row short. (Wastes compute on padding tokens; pack into long sequences instead — Lesson 1.6.)
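"Deduplicated and lightly cleaned" can be as simple as the sketch below. It assumes exact-duplicate removal is enough for a first pass; near-duplicate detection is a separate, heavier step. `dedupe_exact` and `normalize_key` are hypothetical names, and the case-insensitive, whitespace-insensitive comparison is an illustrative choice rather than a rule from this lesson.

```python
import hashlib
import re
from typing import Iterable, List


def normalize_key(text: str) -> str:
    """Comparison key: lowercase, with runs of whitespace collapsed."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_exact(docs: Iterable[str], min_chars: int = 1) -> List[str]:
    """Drop empty documents and exact duplicates (after normalization).

    The kept text is only stripped at the ends, so internal newlines and
    indentation (important for source code) are preserved.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = normalize_key(doc)
        if len(key) < min_chars:
            continue  # skip empty or whitespace-only rows
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent document was already kept
        seen.add(digest)
        kept.append(doc.strip())
    return kept


docs = ["Dose:  5 mg\n", "dose: 5 mg", "", "Contraindicated in pregnancy."]
cleaned = dedupe_exact(docs)
print(cleaned)
```

The first occurrence of each document wins, so the output order follows the input order.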
# raw-text packing for CPT: no loss mask, no roles
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
raw = load_dataset("path/to/your/domain/corpus", split="train")  # one doc per row

def to_ids(batch):
    # tokenize without special tokens, then end each document with EOS
    # so that packed documents stay separated
    ids = tok(batch["text"], add_special_tokens=False)["input_ids"]
    return {"ids": [doc + [tok.eos_token_id] for doc in ids]}

ids = raw.map(to_ids, batched=True, remove_columns=raw.column_names)

# pack: concat all docs, split into 2048-token chunks (the ragged tail is dropped)
SEQ = 2048
flat = list(chain.from_iterable(ids["ids"]))  # linear time; sum(lists, []) is quadratic
chunks = [flat[i:i + SEQ] for i in range(0, len(flat) - SEQ + 1, SEQ)]
# each `chunks[i]` is a CPT training example; labels = inputs (no mask)
The recipe (and why it differs from SFT)
CPT is more delicate than SFT because you're touching weights the model uses for everything, not just adding a task-specific behaviour. Conservative defaults:
- Learning rate — smaller than SFT, much smaller than the original pretraining. A typical CPT learning rate for a small base is 1e-5 to 5e-5. Too high and you wreck the broader behaviours faster than catastrophic forgetting in SFT does (Lesson 1.21). Use a short warmup and cosine decay.
- Sequence length — longer than SFT, because pretraining-style learning benefits from cross-sentence context. Pack to the model's full context (2k for SmolLM2-135M, 8k+ for larger bases) if memory allows.
- Epochs — usually one pass over the corpus, sometimes two. Repeating raw text aggressively is a well-known way to make pretraining-style runs worse, not better.
- Mix-back of generic data — exactly the same mitigation as Lesson 1.21: include 5–10% of generic pretraining-style text in the mix so the model doesn't drift wholesale into your domain and lose general competence.
- LoRA vs full — LoRA for CPT is a recent, viable option (LoRA-CPT) and dramatically cheaper. Full CPT moves more weight but risks more forgetting; if you don't have a strong reason for full, start with LoRA.
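Two of the defaults above reduce to small pieces of arithmetic: the warmup-plus-cosine learning-rate schedule and the amount of generic text needed for a given mix-back share. The sketch below is illustrative only. `cpt_lr` and `generic_tokens_needed` are hypothetical helpers, the peak of 2e-5 is picked from inside the 1e-5 to 5e-5 range quoted above, and the 100 warmup steps are an assumption, not a recommendation from this lesson.

```python
import math


def cpt_lr(step: int, total_steps: int, peak_lr: float = 2e-5,
           warmup_steps: int = 100, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def generic_tokens_needed(domain_tokens: int, generic_share: float = 0.1) -> int:
    """Generic tokens to add so they make up `generic_share` of the final mix."""
    return round(domain_tokens * generic_share / (1.0 - generic_share))


for step in (0, 99, 550, 1000):
    print(step, cpt_lr(step, total_steps=1000))
print(generic_tokens_needed(9_000_000, generic_share=0.1))
```

In a real run the schedule would more likely come from the training framework (for example, `get_cosine_schedule_with_warmup` in transformers has the same shape); the hand-written version is here only to make the shape explicit.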
Vocab extension: a separate decision
If your domain has tokens that the base tokenizer splits into long ugly streams (e.g. oxytocin → oxy + tocin across multiple pieces, or domain-specific code identifiers), you may consider extending the tokenizer's vocabulary by adding new tokens before CPT begins. The trade-off is sharp:
- Benefit: fewer tokens per document → cheaper context, faster inference, sometimes faster learning of those terms.
- Cost: the new embeddings start random. Until CPT trains them, they're noise. If you extend vocab and then do only a short CPT run, those embeddings stay nearly random and the model gets worse at those tokens. Vocab extension only pays off if you have enough CPT data to train the new embeddings into useful representations.
- Distillation foot-gun: if you're planning to distil from a teacher later, vocab extension breaks the same-tokenizer assumption. Pick one path.
The conservative default: don't extend the vocab on a first CPT run. Run CPT without it, check the downstream gate score, and reach for vocab extension only if the token-stream cost is a measured bottleneck.
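One way to measure the token-stream cost is to estimate what share of the corpus's tokens would disappear if each candidate term became a single token. This is a rough sketch under stated assumptions: `term_token_cost` is a hypothetical helper, the toy tokenizer stands in for the real one, and counting a term's pieces in isolation only approximates how it tokenizes in context.

```python
from typing import Callable, Dict, List, Sequence


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into 4-character pieces."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


def term_token_cost(tokenize: Callable[[str], Sequence],
                    corpus: Sequence[str],
                    terms: Sequence[str]) -> Dict[str, float]:
    """Fraction of corpus tokens saved if each term were a single token."""
    total_tokens = sum(len(tokenize(doc)) for doc in corpus)
    report = {}
    for term in terms:
        pieces = len(tokenize(term))                     # tokens the term costs today
        occurrences = sum(doc.count(term) for doc in corpus)
        saved = occurrences * (pieces - 1)               # one token would remain per hit
        report[term] = saved / max(total_tokens, 1)
    return report


corpus = ["oxytocin raises oxytocin levels"]
report = term_token_cost(toy_tokenize, corpus, ["oxytocin", "levels"])
print(report)
```

If the summed savings across all candidate terms come to a small fraction of the corpus, the extension is unlikely to repay the cost of training fresh embeddings.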
Evaluating CPT: don't trust the CPT loss alone
The most common honest-metrics failure in CPT projects: the team watches CPT loss drop on the domain corpus and declares victory. Loss on the corpus you are training on will almost always drop; that is what optimisation does, and part of the drop is plain memorisation. It tells you nothing about whether the model is more useful for your downstream task.
The honest evaluation is downstream-task SFT performance, measured with a probe that stays frozen across runs:
- Pin a gold set and an SFT recipe (data, hyperparameters, epochs) — call this the downstream probe.
- Run the probe from the original base. Record the gate score. This is your baseline.
- Run CPT to produce a candidate checkpoint.
- Run the same probe from the CPT'd checkpoint. Record the gate score.
- Compare. CPT earned its keep only if the gate moved in your favour by a meaningful, repeatable margin. If it's a tie or noise, CPT didn't help — and you spent training compute to learn that.
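The comparison step can be made mechanical by running the probe under several seeds for each starting checkpoint and pairing the results. The decision rule below is an assumption made for illustration, not the lesson's definition of "meaningful, repeatable": it requires every seed to improve and the mean gain to clear a margin you choose. `cpt_earned_its_keep` and `min_margin` are hypothetical names.

```python
from statistics import mean
from typing import Sequence


def cpt_earned_its_keep(base_scores: Sequence[float],
                        cpt_scores: Sequence[float],
                        min_margin: float = 0.01) -> bool:
    """Paired per-seed comparison of gate scores (higher is better).

    base_scores[i] and cpt_scores[i] must come from the same probe and seed,
    differing only in the starting checkpoint.
    """
    if len(base_scores) != len(cpt_scores) or len(base_scores) < 2:
        raise ValueError("need paired scores from at least two seeds")
    deltas = [c - b for b, c in zip(base_scores, cpt_scores)]
    repeatable = all(d > 0 for d in deltas)      # every seed moved the right way
    meaningful = mean(deltas) >= min_margin      # average gain clears the margin
    return repeatable and meaningful


clear_win = cpt_earned_its_keep([0.70, 0.71, 0.69], [0.74, 0.75, 0.73])
just_noise = cpt_earned_its_keep([0.70, 0.71, 0.69], [0.71, 0.70, 0.70])
print(clear_win, just_noise)
```

A stricter rule, such as a proper significance test over more seeds, fits the same interface; the point is that the decision is written down before the CPT run, not after.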
The two-stage CPT → SFT pipeline
The shape of a CPT-enabled project is two stages:
Stage 1 — Continued pretraining
base model ──► CPT on domain raw text ──► cpt-checkpoint
(objective: next-token; data: raw text; no labels)
Stage 2 — Supervised fine-tuning
cpt-checkpoint ──► SFT on task instructions ──► task-model
(objective: next-token over completion; data: prompt/completion; loss mask on)
Evaluate task-model on the same gold set you would have used without stage 1.
Two practical notes about this pipeline:
- Stage 2's recipe is unchanged. CPT replaces the starting point, not the SFT method. Same SFT data, same chat template, same LR range, same gold set.
- Catastrophic forgetting (next lesson) is a stage-2 concern that CPT can amplify. Mix-back during SFT becomes more important, not less, because a CPT'd model has already been pulled toward your domain, and narrow SFT pulls it further. The mitigations from that lesson apply.
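The difference between the two stages shows up most concretely in how labels are built. The sketch below assumes the common PyTorch and Hugging Face convention that label value -100 is ignored by the cross-entropy loss; `cpt_labels` and `sft_labels` are hypothetical helper names, and the token ids are made up.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def cpt_labels(input_ids: Sequence[int]) -> List[int]:
    """Stage 1: every position is a training target (no loss mask)."""
    return list(input_ids)


def sft_labels(prompt_ids: Sequence[int], completion_ids: Sequence[int]) -> List[int]:
    """Stage 2: prompt positions are masked, only the completion is scored."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


packed_chunk = [11, 12, 13, 14]          # made-up ids from a packed raw-text chunk
prompt, completion = [21, 22, 23], [31, 32]

stage1 = cpt_labels(packed_chunk)
stage2 = sft_labels(prompt, completion)
print(stage1, stage2)
```

The shift by one position that turns labels into next-token targets is typically handled inside the model's loss computation, so it is left out here.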
Key idea
Continued pretraining is the right tool for one specific problem — the base model doesn't speak your domain at the token level — and the wrong tool for almost everything else. The decision rule: can the base model read your text without surprise? If yes, CPT is wasted compute; improve your SFT data instead. If no, run a small CPT pass with a conservative LR, evaluate by downstream-task SFT gate, and only then commit to the two-stage pipeline.
Catastrophic forgetting (Lesson 1.22) gets sharper after CPT: you've pulled the base toward your domain, and SFT will pull it further. The mitigations there are not optional in a two-stage pipeline.
Key terms
- Continued pretraining (CPT)
- Continuing the next-token objective on raw text from a new domain, before any task SFT. Same loss as pretraining, different data, smaller learning rate.
- Domain-adaptive pretraining (DAPT)
- The research name for CPT applied to a specific domain (medical, legal, code, finance). Same mechanic; the "domain-adaptive" label emphasises the purpose.
- Vocab extension
- Adding new tokens to the tokenizer for domain-specific vocabulary before CPT, so those terms become single tokens. Sharp trade-off: cheaper at inference, but the new embeddings start random and need enough CPT to learn.
- Catastrophic forgetting (CPT view)
- The same forgetting risk as in SFT (Lesson 1.22), but a CPT'd model is already biased toward the domain — narrow SFT on top can compound the drift unless mix-back and broad eval are kept in.
- Two-stage CPT → SFT pipeline
- The standard shape: stage 1 is CPT on raw domain text, stage 2 is SFT on task instructions from the CPT'd checkpoint. Evaluated end-to-end on the downstream task gold set.
- Downstream probe
- A pinned SFT recipe + gold set used to compare a base checkpoint and a CPT'd checkpoint on the metric that actually matters — task performance, not CPT loss.