Track 1 · SFT fundamentals · Lesson 21

Continued pretraining: when SFT isn't enough

After this lesson you can tell whether your problem is "the model doesn't speak my domain" or "the model speaks my domain but fails my task," pick continued pretraining only when it's the first one, prepare raw-text data and the right recipe (smaller learning rate, longer sequences, conservative vocab decisions), and evaluate honestly by measuring downstream-task SFT performance, not CPT loss.

Level: intermediate · Read time: ~10 min · Prerequisites: Dataset formats in the wild

SFT bends a model toward a task. Continued pretraining (CPT) does something earlier in the pipeline: it continues the original pretraining objective — next-token prediction over raw text — on a new corpus, so the model learns the language of your domain before any task fine-tune is even attempted. It's the right answer to a narrow question: does the base model know my domain's words at all? It's the wrong answer to almost every other question, and it gets picked for the wrong reasons often enough that this lesson spends as much time on when not to use it as on how to use it.

What CPT actually is

The base model was pretrained by reading enormous quantities of generic text and predicting the next token. CPT is exactly the same recipe (same loss, same data shape) applied to your corpus. No prompts, no completions, no chat template, no loss mask. Documents in, next-token loss over every position, gradients back. The output is a new checkpoint with the same architecture and (usually) the same tokenizer as the base, but with weights nudged toward your domain's distribution. That checkpoint becomes the starting point for SFT in stage two.
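To make "next-token loss over every position" concrete, here is a minimal sketch in plain Python. The `predict_probs` callable and the uniform toy model are hypothetical stand-ins for a real language model; only the shape of the objective is the point.

```python
import math

def next_token_loss(token_ids, predict_probs):
    """Mean cross-entropy of predicting each token from the tokens before it."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = predict_probs(token_ids[: t + 1])  # distribution over the vocab
        target = token_ids[t + 1]                  # the label is just the next input token
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical stand-in for a model: uniform over a 4-token vocabulary.
VOCAB_SIZE = 4

def uniform_model(prefix):
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(round(loss, 4))
```

Every position contributes to the loss, and nothing distinguishes "prompt" tokens from "answer" tokens. That is the whole difference from the SFT objective.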

When CPT actually helps

The honest signal that CPT is the right tool is that the model can't even read your domain fluently. Symptoms:

If those signals are present, CPT is doing real work: it's giving the model exposure to your distribution under the same objective that built its general competence in the first place.
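One way to put a number on "can't read your domain fluently" is to compare the base model's perplexity on domain text with its perplexity on generic text. The sketch below assumes you already have per-token log-probabilities from the base model; the helper names and the example numbers are made up for illustration, and perplexity is an assumed proxy, not the only possible signal.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def surprise_ratio(domain_logprobs, generic_logprobs):
    """How many times more surprised the model is by domain text than generic text."""
    return perplexity(domain_logprobs) / perplexity(generic_logprobs)

# Illustrative, invented log-probabilities (not measurements).
generic = [math.log(0.5)] * 8
domain = [math.log(0.125)] * 8
ratio = surprise_ratio(domain, generic)
print(ratio)
```

A ratio close to 1 suggests the base model already reads the domain about as well as generic text. What counts as "too high" is a judgment call for your project, not something this sketch decides.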

When CPT is a misdiagnosis

Most of the time, "my model is bad at my task" is not a CPT problem. Patterns that look like they need CPT but don't:

Honest beat — CPT is the most over-prescribed step in SLM workflows

Teams reach for continued pretraining because it sounds more serious than "make better SFT data." It almost never beats the alternative of putting that compute into curating, deduping, and gold-setting your task data. If you can't say what specific token-level fluency CPT is supposed to fix, you don't need CPT.

The data shape

CPT data is raw text. Paragraphs, documents, transcripts, source files — whatever your domain produces, deduplicated and lightly cleaned, with documents concatenated and packed into long sequences (typically 2k–8k tokens per sample, depending on the base model's context length). Things CPT data is not:

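# raw-text packing for CPT: no loss mask, no roles
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
raw = load_dataset("path/to/your/domain/corpus", split="train")   # one doc per row, in a "text" column

def to_ids(batch):
    # tokenize without automatic special tokens, then end each doc with EOS
    # so packed documents stay separated (assumes the tokenizer defines eos_token_id)
    enc = tok(batch["text"], add_special_tokens=False)["input_ids"]
    return {"ids": [doc + [tok.eos_token_id] for doc in enc]}

ids = raw.map(to_ids, batched=True, remove_columns=raw.column_names)

# pack: concat all docs, split into 2048-token chunks (the trailing partial chunk is dropped)
SEQ = 2048
flat = list(chain.from_iterable(ids["ids"]))   # linear time; sum(lists, []) is quadratic
chunks = [flat[i:i+SEQ] for i in range(0, len(flat) - SEQ + 1, SEQ)]
# each `chunks[i]` is a CPT training example; labels = inputs (no mask)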
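The "labels = inputs (no mask)" point is easiest to see next to the SFT case. This is a small sketch with hypothetical helper names. The value -100 is the label that PyTorch's cross-entropy loss ignores by default, which is the usual way an SFT loss mask is expressed.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def cpt_labels(input_ids):
    """CPT: every token is a training target, so labels are a copy of the inputs."""
    return list(input_ids)

def sft_labels(prompt_ids, completion_ids):
    """SFT: prompt positions are masked out; only completion tokens are targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

chunk = [11, 12, 13, 14]
prompt, completion = [11, 12], [13, 14]

print(cpt_labels(chunk))
print(sft_labels(prompt, completion))
```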

The recipe (and why it differs from SFT)

CPT is more delicate than SFT because you're touching weights the model uses for everything, not just adding a task-specific behaviour. Conservative defaults:
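As one illustration of "conservative", the sketch below implements a learning-rate schedule with linear warmup followed by cosine decay. The warmup-plus-cosine shape is an assumption (a common choice, not something this lesson prescribes), and every number in it is a placeholder rather than a recommendation. The only claim carried over from the lesson is that the CPT peak should sit below what you would use for SFT.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to peak_lr * min_ratio."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Placeholder values for illustration only.
CPT_PEAK_LR = 1e-5
TOTAL_STEPS = 100
WARMUP_STEPS = 10

schedule = [lr_at(s, TOTAL_STEPS, CPT_PEAK_LR, WARMUP_STEPS) for s in range(TOTAL_STEPS + 1)]
print(max(schedule), schedule[-1])
```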

Vocab extension: a separate decision

If your domain has tokens that the base tokenizer splits into long ugly streams (e.g. oxytocin → oxy + tocin, or worse, several more pieces; or domain-specific code identifiers), you may consider extending the tokenizer's vocabulary by adding new tokens before CPT begins. The trade-off is sharp:

The conservative default: don't extend the vocab on a first CPT run. Run CPT without it, look at the gate, and only reach for vocab extension if the token-stream cost is a measured bottleneck.
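To decide whether token-stream cost is "a measured bottleneck", you need a measurement. A simple one is average tokens per whitespace-separated word, compared between domain text and generic text. In the sketch below, `tokens_per_word` takes any tokenize callable; `toy_tokenize` is a made-up stand-in that cuts words into fixed-size pieces so the example runs without a real tokenizer.

```python
def tokens_per_word(texts, tokenize):
    """Average number of token pieces per whitespace-separated word."""
    words = pieces = 0
    for text in texts:
        for word in text.split():
            words += 1
            pieces += len(tokenize(word))
    return pieces / words if words else 0.0

def toy_tokenize(word, piece_len=3):
    """Hypothetical tokenizer: fixed-size character pieces, for illustration only."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

domain_texts = ["oxytocin receptor agonist"]
generic_texts = ["the cat sat down"]

domain_cost = tokens_per_word(domain_texts, toy_tokenize)
generic_cost = tokens_per_word(generic_texts, toy_tokenize)
print(domain_cost, generic_cost)
```

With a real tokenizer you would pass something like `lambda w: tok(w, add_special_tokens=False)["input_ids"]`. Tokenizing words one at a time ignores leading-space effects, so treat the result as a rough comparison rather than an exact count.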

Evaluating CPT: don't trust the CPT loss alone

The most common honest-metrics failure in CPT projects: the team watches CPT loss drop on the domain corpus and declares victory. CPT loss almost always drops on in-distribution data; that's just memorisation pressure, and by itself it tells you nothing about whether the model is more useful for your downstream task.

The honest evaluation cadence is downstream-task SFT performance, frozen across runs:

  1. Pin a gold set and an SFT recipe (data, hyperparameters, epochs) — call this the downstream probe.
  2. Run the probe from the original base. Record the gate score. This is your baseline.
  3. Run CPT to produce a candidate checkpoint.
  4. Run the same probe from the CPT'd checkpoint. Record the gate score.
  5. Compare. CPT earned its keep only if the gate moved in your favour by a meaningful, repeatable margin. If it's a tie or noise, CPT didn't help — and you spent training compute to learn that.
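The comparison in step 5 can be sketched as a small check over gate scores from several seeds of the same probe. The function name, the two-times-noise rule, and the example scores are all assumptions for illustration; substitute whatever margin and noise test your project has agreed on.

```python
from statistics import mean, stdev

def cpt_earned_its_keep(base_scores, cpt_scores, min_margin):
    """True if CPT beats the base by min_margin on average and the gap exceeds seed noise."""
    if len(base_scores) < 2 or len(cpt_scores) < 2:
        raise ValueError("need at least two seeds per arm to estimate noise")
    gain = mean(cpt_scores) - mean(base_scores)
    noise = max(stdev(base_scores), stdev(cpt_scores))
    # Heuristic (an assumption): the gain must clear the margin and twice the seed-to-seed spread.
    return gain >= min_margin and gain > 2 * noise

# Invented gate scores from three seeds of the same downstream probe.
base = [0.70, 0.71, 0.69]
cpt_clear_win = [0.78, 0.77, 0.79]
cpt_noise = [0.71, 0.69, 0.72]

print(cpt_earned_its_keep(base, cpt_clear_win, min_margin=0.03))
print(cpt_earned_its_keep(base, cpt_noise, min_margin=0.03))
```

Running the probe with more than one seed is what makes "repeatable" checkable. A single pair of numbers cannot distinguish a real gain from run-to-run variation.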

The two-stage CPT → SFT pipeline

The shape of a CPT-enabled project is two stages:

Stage 1 — Continued pretraining
  base model  ──►  CPT on domain raw text  ──►  cpt-checkpoint
  (objective: next-token; data: raw text; no labels)

Stage 2 — Supervised fine-tuning
  cpt-checkpoint  ──►  SFT on task instructions  ──►  task-model
  (objective: next-token over completion; data: prompt/completion; loss mask on)

Evaluate task-model on the same gold set you would have used without stage 1.

Two practical notes about this pipeline:

Key idea

Continued pretraining is the right tool for one specific problem — the base model doesn't speak your domain at the token level — and the wrong tool for almost everything else. The decision rule: can the base model read your text without surprise? If yes, CPT is wasted compute; improve your SFT data instead. If no, run a small CPT pass with a conservative LR, evaluate by downstream-task SFT gate, and only then commit to the two-stage pipeline.

Catastrophic forgetting (Lesson 1.22) gets sharper after CPT: you've pulled the base toward your domain, and SFT will pull it further. The mitigations there are not optional in a two-stage pipeline.
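As a rough illustration of one such mitigation, the sketch below mixes a fraction of general-purpose examples back into the SFT set. It assumes "mix-back" means blending general data into the task data; Lesson 1.22 defines the actual technique, and the function name and fraction here are hypothetical.

```python
import random

def mix_back(task_examples, general_examples, general_fraction, seed=0):
    """Shuffled list in which about general_fraction of the items are general examples."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed

task = [("task", i) for i in range(80)]
general = [("general", i) for i in range(100)]
mixed = mix_back(task, general, general_fraction=0.2)
print(len(mixed), sum(1 for kind, _ in mixed if kind == "general"))
```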

Key terms

Continued pretraining (CPT)
Continuing the next-token objective on raw text from a new domain, before any task SFT. Same loss as pretraining, different data, smaller learning rate.
Domain-adaptive pretraining (DAPT)
The research name for CPT applied to a specific domain (medical, legal, code, finance). Same mechanic; the "domain-adaptive" label emphasises the purpose.
Vocab extension
Adding new tokens to the tokenizer for domain-specific vocabulary before CPT, so those terms become single tokens. Sharp trade-off: cheaper at inference, but the new embeddings start random and need enough CPT to learn.
Catastrophic forgetting (CPT view)
The same forgetting risk as in SFT (Lesson 1.22), but a CPT'd model is already biased toward the domain — narrow SFT on top can compound the drift unless mix-back and broad eval are kept in.
Two-stage CPT → SFT pipeline
The standard shape: stage 1 is CPT on raw domain text, stage 2 is SFT on task instructions from the CPT'd checkpoint. Evaluated end-to-end on the downstream task gold set.
Downstream probe
A pinned SFT recipe + gold set used to compare a base checkpoint and a CPT'd checkpoint on the metric that actually matters — task performance, not CPT loss.
