BrewSLM Academy · reference

Glossary

Plain-language definitions of the terms used across the Academy. New terms are added as tracks publish; today's set covers the Foundations track.

A–E

Activation function
The nonlinearity (e.g. ReLU) applied after a neuron's weighted sum; what lets stacked layers model non-linear patterns. See Neural networks.
Attention
A learned, content-based mechanism for a token to pull information from other relevant tokens. See Attention & the Transformer.
Autoregressive generation
Producing text by repeatedly predicting the next token and appending it.
Backpropagation
The algorithm that computes the gradient for every parameter in one backward sweep.
BPE (Byte-Pair Encoding)
A method for building a subword vocabulary by merging the most frequent character pairs.
Context window
The maximum number of tokens a model can process at once (prompt + generation).
Continued pretraining
Training a base model further on a large domain corpus to build broad ability; expensive.
Cross-entropy
The training loss for next-token prediction; small when the true next token was given high probability.
Data-centric iteration
Improving results mainly by fixing data, guided by failure analysis, rather than tweaking the model.
Decoding
How the next token is chosen from the probability distribution (greedy, sampling, temperature, top-p).
Deep learning
Neural networks with several hidden layers.
Drift
When live inputs stop resembling training data, degrading quality over time.
Embedding
A learned vector representing a token; similar tokens get similar vectors. See Tokens & embeddings.
Epoch
One complete pass through the training dataset.

F–N

Fine-tuning
Continuing training on your examples to change a model's parameters; see Supervised Fine-Tuning.
Forward pass
Computing outputs from inputs, layer by layer.
Gold set
Curated, known-correct examples a model never trains on; the source of truth for quality. See The SLM project lifecycle.
Gradient
The direction over all parameters in which the loss increases fastest; training steps the opposite way.
Gradient descent
Repeatedly nudging parameters against the gradient to reduce loss. See How models learn.
Inference
Running a trained, frozen model on new inputs to get outputs.
In-context learning
Showing examples inside the prompt so the model imitates them for that call.
Latency
Time to produce output; rises with model size since each token runs the whole network.
Layer / hidden layer
Many neurons in parallel; hidden layers sit between input and output.
Learning rate
The step-size multiplier in gradient descent; too high diverges, too low crawls.
LLM
Large language model — tens to hundreds of billions of parameters.
LM head
The final layer producing a logit for every vocabulary token.
Logits
Raw, unnormalized scores for each possible next token, before softmax.
LoRA
Low-Rank Adaptation: fine-tuning by training small added matrices instead of all parameters, cutting memory and cost. Covered in depth in Track 1.
Loss function
A single number measuring how wrong predictions are; smaller is better.
Minibatch / SGD
Estimating the gradient on a small random batch each step (stochastic gradient descent).
Model
A function with adjustable parameters that maps inputs to outputs. See What is a model?.
Multi-head attention
Several attention computations in parallel, each specializing in a different relationship.
Neuron
A weighted sum of inputs plus a bias, passed through a nonlinear activation.

P–Z

Parameter (weight)
A number inside a model, set by training rather than by hand.
Prompting
Steering a fixed model by writing a better input; no training.
RAG (retrieval-augmented generation)
Retrieving relevant documents at inference and adding them to the prompt; supplies knowledge, not behavior. See The four levers.
Residual connection
Adding a sub-layer's input to its output so information and gradients flow through a deep stack.
Self-attention
Attention where tokens attend to other tokens in the same sequence.
Supervised Fine-Tuning (SFT)
Continuing training of a base model on input→output examples to change its parameters for a task.
SLM
Small language model — relatively few parameters (millions to a few billion); cheaper and faster. See LLMs vs SLMs.
Softmax
Turns logits into a probability distribution that sums to 1.
Special tokens
Non-text markers in the vocabulary (start/end, padding, chat roles).
Temperature / top-p
Decoding knobs that flatten/sharpen the distribution and trim the unlikely tail.
Token
A chunk of text (usually a subword) the model processes; each maps to an integer ID.
Tokenization
Splitting text into tokens the model can process.
Training
Adjusting parameters from example data to reduce a loss.
Transformer block
Self-attention + feed-forward network + residual connections + layer norm; stacked many times.
Train/validation/test split
Partitioning data so you train on one part and measure on data the model hasn't seen.
Vocabulary
The fixed set of tokens a tokenizer knows; each has an integer ID.
VRAM
GPU memory; weights need ≈ params × bytes-per-param, training needs several times more.

Track 0 — Foundations (extended)

Base model
A model straight from pretraining: a next-token continuer over raw text. No chat template, no roles, no refusals. See Base vs instruct models.
Instruct model
A base model further trained (SFT + often preference tuning) to follow instructions inside a chat template. Same architecture, different behaviour.
Alignment training
The post-pretraining training (SFT on instruction data plus optional DPO/RLHF) that turns a base into an instruct model.
Alignment tax
The cost of the instruct model's baked-in choices (refusals, hedges, style) when they don't match your task.
Model license
The licence attached to a base model that travels with every derivative — including your fine-tune. Check before training, not at deploy. See Picking a base model.
Tokenizer family
A shared vocabulary and special-token set across a model family; a hard constraint for offline distillation (teacher and student must share it).
n-gram model
A pre-neural language model that estimates the next token from raw counts over the previous n−1 tokens. Fast, interpretable, and sparse beyond trigrams. Still used in BM25, kenLM, and on-device autocomplete. See From n-grams to Transformers.
Markov assumption
The simplifying assumption that the next token depends only on the previous n−1 tokens, not the full history. Necessary to make n-gram models tractable; broken (deliberately) by RNNs and Transformers.
Sparsity (n-grams)
The problem that as n grows, most n-grams never appear in training data, so the count is zero and the model assigns zero probability. The wall n-gram models hit beyond trigrams.
Neural language model
A language model that predicts the next token from a learned dense embedding of the previous tokens (Bengio 2003). Solves n-gram sparsity by letting similar tokens share strength through their embeddings.
RNN / LSTM
Recurrent neural networks that consume tokens one at a time and carry a hidden state forward. LSTMs add gating to mitigate the vanishing-gradient problem. The dominant LM family from ~2014 until 2017.
Vanishing gradient
The training failure where gradients shrink exponentially as they back-propagate through many time steps, so the network can't learn long-range dependencies. LSTMs partially solve it; attention sidesteps it entirely.
Transformer
The 2017 architecture ("Attention is all you need") that replaced recurrence with self-attention and parallel position embeddings. Every modern LM — encoder, encoder-decoder, decoder-only — is a Transformer variant. See Attention and self-attention.
Scaling laws
Empirical findings (Kaplan 2020, Chinchilla 2022) that pretraining loss falls predictably with more data, more parameters, and more compute — together, not one alone. Why pretraining budgets are spent the way they are.
Encoder-only model
A Transformer that sees the whole input at once (bidirectional attention) and emits one vector per token. Pretrained with masked-language-modelling. Used for classification and embeddings (BERT, BGE, E5). See Architecture taxonomy.
Encoder-decoder model
A Transformer with a bidirectional encoder and a causal decoder, pretrained on span corruption. Built for input-to-output transformation (T5, BART, FLAN-T5).
Decoder-only model
A Transformer with causal (left-to-right) attention, pretrained on next-token prediction. The architecture behind GPT, Llama, Qwen, SmolLM2 — and the one this Academy assumes when it says "SLM".
Masked language modelling (MLM)
The encoder-only pretraining objective: randomly mask ~15% of input tokens and ask the model to predict them from the surrounding context. Drives bidirectional representations.
Span corruption / denoising
The encoder-decoder pretraining objective: drop random contiguous spans from the input, ask the decoder to regenerate them in order. T5's objective.
Causal (autoregressive) attention
An attention mask that prevents each position from looking at later positions. Makes a Transformer next-token predictable, which is what decoder-only chat models do.
In-context learning
The decoder-only property where you can elicit new behaviour by putting examples in the prompt instead of fine-tuning. Why chat models are general-purpose without retraining.

Track 1 — SFT fundamentals

AdamW (optimizer)
The standard optimizer; adapts the step per-parameter for stable training, at the cost of ~8 bytes/param of state.
Batch size
How many examples a step processes at once; larger = smoother gradient but more memory.
Chat template
The tokenizer rule that renders a messages list into the exact formatted string (with role/special tokens) a model expects. See Chat templates.
Checkpoint
Saved parameters (or LoRA adapter) on disk, for resuming and keeping the best version by validation.
Completion
The target output part of an SFT example (the part the loss is computed on).
DPO
Direct Preference Optimization: train on (chosen, rejected) pairs to prefer better responses, no reward model. See Choosing the objective.
Early stopping
Keeping the checkpoint at the validation-loss minimum rather than the final step.
Effective batch size
per_device_batch × gradient_accumulation × devices — the number that governs training dynamics.
F1 (precision, recall)
Precision = right / flagged; recall = caught / should-have-found; F1 is their harmonic mean.
Generalization gap
The distance between training and validation loss; a widening gap signals overfitting.
Gradient accumulation
Summing gradients over several micro-batches before one optimizer step, to simulate a larger batch on a small GPU.
Gradient checkpointing
Re-computing activations during backprop instead of storing them — trades compute for large memory savings.
Loss mask
Computing the loss only on completion tokens so the model learns to produce the answer, not echo the prompt. See Anatomy of an SFT example.
LoRA alpha
Scales the adapter's contribution (~alpha/r); often set to 2×r.
LoRA rank (r)
The adapter's inner dimension / capacity; higher = more room (and more overfit risk). Common: 16.
max_seq_length
The per-example token cap; set it from the data's length distribution so completions are never clipped.
Mixed precision (bf16)
Computing/storing in 16-bit to roughly halve memory with stable training.
ORPO
An objective that combines SFT and preference learning in a single stage.
Overfitting / underfitting
Training loss falls while validation rises (memorizing) / both stay high (hasn't learned enough). See Reading a loss curve.
Perplexity
The exponential of cross-entropy; roughly how many tokens the model is choosing among (1 = certain).
Prompt
The input part of an SFT example (instruction, optionally with context); masked from the loss.
QLoRA
LoRA on a 4-bit quantized frozen base, to fit fine-tuning into far less memory.
Quality gate
A pass threshold set in advance that a model must clear on the gold set to ship.
Quantization
Storing weights in fewer bits (e.g. 4) to save memory.
RLHF
Reinforcement Learning from Human Feedback: train a reward model, then optimize with RL; powerful but complex (DPO is the simpler modern alternative).
Scheduler / warmup
Warmup ramps the LR up over the first steps; the scheduler (e.g. cosine) then decays it toward zero. See Learning rate & schedules.
Sequence packing
Concatenating short examples into one full-length sequence to avoid wasting compute on padding.
Target modules
Which weight matrices get LoRA adapters (commonly the attention query/value projections).
Greedy decoding (T1)
Always take the most-likely next token. Deterministic; the default for classification and evaluation. See Decoding controls.
Sampling
Drawing the next token from the model's distribution (do_sample=True); adds variety, the default for chat/writing tasks.
Top-p (nucleus sampling)
Keep the smallest set of tokens whose cumulative probability ≥ p, then sample. p ≈ 0.9 is a sane default.
Top-k
Keep only the k most likely tokens and sample from that subset.
max_new_tokens
Hard cap on tokens emitted per generation call. Always set it; a missing stop token with no cap can fill the context window with hallucinated conversation.
Stop tokens
Tokens signalling end-of-generation. Instruct models bake one into the chat template; without it the model rambles.
Repetition penalty
A scalar reducing probability of tokens already in the output. Small values (1.05–1.15) break greedy loops without quality loss.
Time to first token (TTFT)
Latency from request send to first generated character; dominates perceived chat UX.
JSONL
One JSON object per line; the standard streamable container for SFT datasets. See Dataset formats in the wild.
Completion format
{prompt, completion}. Simplest SFT shape; what Track 2's by-hand SFT uses.
Chat messages format
A list of {role, content} dicts that the tokenizer's apply_chat_template renders. The standard for chat models and multi-turn data.
Alpaca format
{instruction, input, output}. From Stanford's original instruction-tuning set; many older HF datasets use it.
ShareGPT format
{conversations: [{from, value}, ...]}. Multi-turn, quirky field names — rename fromrole, valuecontent to convert.
Catastrophic forgetting
Degradation of broader skills when a model is fine-tuned on narrow data — the optimizer has no signal to preserve them. See Catastrophic forgetting.
Data mixing (mix-back)
Including a small fraction (5–10%) of general instruction data alongside narrow task data, so the optimizer sees both shapes and forgets less.
Broad eval
A small set of general prompts (instruction-following, refusals, format-switching) evaluated alongside the task eval to detect forgetting.
Hard negative
An example that resembles the other class on the surface; placed on the correct side of the decision boundary to teach the model where the line is.
Ambiguous case
An example a thoughtful human would have to pause on; labelled per a stated rule so the model learns the rule.
Refusal data
Examples the model should not answer the usual way (out-of-scope, harmful, needing clarification), paired with the desired refusal phrasing.
OOD (out of distribution)
Inputs outside the trained scope, labelled with the desired "I don't know" so the model learns its limits.
Continued pretraining (CPT)
Continuing the next-token objective on raw text from a new domain, before any task SFT. Same loss as pretraining, different data, smaller learning rate. The right tool when the base model doesn't speak your domain's language; the wrong tool when it does. See Continued pretraining.
Domain-adaptive pretraining (DAPT)
The research name for CPT applied to a specific domain (medical, legal, code, finance). Same mechanic — the "domain-adaptive" label emphasises the purpose.
Vocab extension
Adding new tokens to the tokenizer for domain-specific vocabulary before CPT, so those terms become single tokens. Sharp trade-off: cheaper at inference, but the new embeddings start random and need enough CPT to learn — and it breaks the same-tokenizer assumption distillation needs.
Catastrophic forgetting (CPT view)
The same risk as in SFT, sharpened: a CPT'd model is already pulled toward the domain, and narrow SFT on top compounds the drift unless mix-back and broad eval are kept in.
Two-stage CPT → SFT pipeline
The standard shape: stage 1 is CPT on raw domain text producing a CPT'd checkpoint, stage 2 is SFT on task instructions from that checkpoint. Evaluated end-to-end on the same gold set you'd use without CPT.
Downstream probe
A pinned SFT recipe + gold set used to compare a base checkpoint and a CPT'd checkpoint on the metric that actually matters — task performance, not CPT loss.

Track 2 — Hands-on

accelerate
Hugging Face library that handles device placement and distributed details under the Trainer.
attention_mask
A per-token 1/0 vector marking real tokens vs padding so the model ignores padded positions.
AutoModelForCausalLM
The Transformers class that loads a decoder-only language model by id. See Load a base model.
AutoTokenizer
Loads the matching tokenizer and chat template for a model id.
DataCollator
Pads a list of examples into a rectangular batch; DataCollatorForSeq2Seq pads labels with -100 so padding is ignored by the loss. See Tokenize & collate.
datasets.Dataset
A fast, memory-mapped table with .map()/.filter()/split that the Trainer consumes.
dtype (bf16)
The weight/compute precision passed to from_pretrained (older API: torch_dtype); bf16 roughly halves memory.
from_pretrained / save_pretrained
Download/instantiate a model or tokenizer, and write a self-contained copy back to disk.
generate()
Runs autoregressive decoding to produce output tokens; do_sample=False gives deterministic greedy decoding for evaluation.
get_peft_model
Wraps a base model with LoRA adapters and freezes the base so only the adapter trains.
GGUF
The llama.cpp weight format for efficient CPU/edge inference; a separate conversion step from your merged model. See Merge & infer.
Greedy decoding
Always taking the highest-probability next token (do_sample=False); deterministic, used for reproducible evaluation.
Ignore index (-100)
The label value PyTorch's cross-entropy skips; how the loss mask and padding are excluded from the loss in code.
input_ids / labels
The token IDs the model reads, and the targets it's scored against (a copy of input_ids with masked positions set to -100).
merge_and_unload
Folds a LoRA adapter into the base weights, returning a standalone model with no adapter overhead.
PeftModel.from_pretrained
Attaches a trained LoRA adapter onto a fresh base model for inference. See Evaluate by hand.
Reports how few parameters LoRA actually trains (typically well under 1%).
safetensors
The safe, fast default weight format save_pretrained writes.
Trainer
The Transformers class that runs the training loop (forward, loss, backward, optimizer step) for you. See A minimal LoRA fine-tune.
TrainingArguments
Holds the run's hyperparameters: learning rate, epochs, batch size, precision, logging and saving cadence.
venv
An isolated Python environment so a project's packages don't collide with the system. See Set up the environment.
TRL
Hugging Face's SFT- and preference-training library. Provides SFTTrainer, DPOTrainer, ORPOTrainer. See SFT with TRL's SFTTrainer.
SFTTrainer
TRL wrapper around HF Trainer that handles chat-template, loss mask, padding, and PEFT for SFT. The 20-line version of Lesson 2.5.
SFTConfig
SFTTrainer's args object; superset of TrainingArguments with SFT-specific fields like max_seq_length, completion_only_loss, packing.
completion_only_loss
SFTConfig field; when True, builds the loss mask so cross-entropy ignores everything except assistant tokens.
peft_config
SFTTrainer argument; pass a LoraConfig and SFTTrainer attaches LoRA internally — no get_peft_model call needed.
packing (SFT)
SFTConfig field; concatenates short examples to max_seq_length to save compute on padding.
bitsandbytes
Library providing 4-bit/8-bit quantization kernels that HF Transformers loads via BitsAndBytesConfig. See QLoRA hands-on.
BitsAndBytesConfig
HF config object that tells from_pretrained to load weights in 4-bit (or 8-bit), with NF4 / double-quant / compute-dtype settings.
NF4 (4-bit Normal Float)
The QLoRA-default 4-bit quantization format, tuned for trained-Transformer weight distributions.
Double quantization
Quantizing the quantization constants themselves to save additional memory in QLoRA.
prepare_model_for_kbit_training
peft helper that enables gradient checkpointing and stable-dtype casts on a quantized base before LoRA is attached. Forgetting it is the most common "my QLoRA isn't training" bug.
classification_report
sklearn function returning per-class precision/recall/F1/support plus macro/weighted averages. The default classification eval.
Confusion matrix
True-vs-predicted matrix; off-diagonal cells show which classes get confused for which.
Macro F1
Mean of per-class F1s with equal weight per class; the honest default on imbalanced data.
Weighted / micro F1
F1 averaged by class support; dominated by the majority class.
HF evaluate
Hugging Face library exposing standard metrics (F1, accuracy, ROUGE, BLEU, WER, seqeval) through one interface; aligns with shared standards used in papers.
Pydantic
Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one. See Structured outputs with pydantic.
Valid-JSON rate
Fraction of model outputs that both parse as JSON and match the Pydantic schema; the first half of the structured-output honest report.
Per-field accuracy
Fraction of times each schema field has the correct value, measured on the parses that succeeded; the second half of the structured-output honest report.
Multi-turn SFT
Fine-tuning on multi-turn conversations with the loss mask applied to every assistant turn, not just the last. See Multi-turn chat SFT.
Role bleed
Loss mask leaking into user turns so the model is supervised on user-voice text; causes the model to mimic the user in its replies.
seqeval
Standard span-extraction metric library (HF evaluate exposes it); produces entity-level precision/recall/F1.
Faithfulness
Whether a generated answer is grounded in supplied context (e.g. RAG passages) rather than hallucinated.
LLM-as-a-judge
Using a strong model as a rubric-driven judge for free-form outputs. The right tool when there's no exact-match metric; not a substitute for one when there is. See LLM-as-a-judge.
Rubric
The written criteria a judge scores against — dimensions (correctness, tone, concision) and a scale. The clearer the rubric, the more consistent the judge.
Pairwise judging (A/B)
Showing a judge two outputs and asking which is better, with mandatory order-swap and "both must agree" win counting. More reliable than absolute scoring.
Judge bias
The three classic LLM-as-judge biases — position (favours first option), length (favours longer answer), style/family (favours own model family). Mitigated by order-swap, concision in the rubric, and judge selection.
Judge calibration
Comparing judge scores against a small human-rated set (50–200 pairs) and reporting Cohen's κ or Spearman ρ. Required to interpret any aggregate judge metric — inter-judge agreement is not a substitute.
lm-evaluation-harness
EleutherAI's open-source benchmark runner — the standard tool used to produce the MMLU / HellaSwag / ARC / GSM8K numbers in model cards. See Public benchmarks & lm-eval-harness.
MMLU
Massive Multitask Language Understanding: ~14k multiple-choice questions across 57 subjects. Usually reported 5-shot. The closest thing to a general "what does this model know?" substrate score.
Benchmark contamination
The benchmark's questions or answers appeared in the model's pretraining; the model memorised rather than reasoned. Detected by a canonical-vs-paraphrased gap. Why your private gold set matters more than any public number.
MLflow
Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, artefacts, and a model registry. The BSD-licensed standard. See Experiment tracking with MLflow & W&B.
Weights & Biases (W&B)
Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals.
Run metadata
The non-metric record of a training run: full config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later — and the piece teams most often skip and most often regret.
Reproducible vs replicable
Reproducible = same inputs → bit-for-bit identical outputs. Replicable = same recipe → equivalent outcome within noise. Experiment tracking targets replicability; chasing bit-equivalence in ML usually isn't worth the cost.

Track 3 — With BrewSLM

Auto-RAG
Builds a BM25 index at training completion and prepends top-K retrieved passages to the prompt at inference. See Auto-RAG & reroute.
BM25
Keyword (lexical) retrieval — fast, dependency-light; the retrieval baseline auto-RAG uses.
Coach Mode
The surface that emits stage suggestions (data / cleaning / gold_set / training / eval) and actions (run_playbook, navigate, augment_from_cluster) across the lifecycle.
Decision engine (post-eval)
Reads eval results + failure clusters and recommends the next move — including retrieval over more fine-tuning when failures are knowledge-bound.
DeploymentVersion
A versioned deployment with promote / reject / rollback / drift-check actions.
Drift check
A scheduled re-run of the gold set against the live endpoint to catch production regressions. See Export, deploy & Coach Mode.
Eval pack
The declared set of metrics and promotion gates evaluated against the held-out eval set; the gates decide promotability. See Eval packs & failure clusters.
Failure cluster
A bucket of evaluation misses that share a pattern, surfaced instead of a flat list so the data gap is obvious.
Introspect
Sampling ~20 rows to propose a task mapping (ranked ShapeHypothesis + a ProposedMapping); never auto-picks below 0.80 confidence.
Lifecycle (11 stages)
Ingest → Introspect → Map (dry-run / commit) → Clean → Prepare → Preflight → Train → Evaluate → Export → Deploy; each has a contract and emits a RunEvent.
Manifest
prepared/manifest.json — the source of truth for downstream stages: counts, schema, task_profile, scoring_mode, paths, hashes. See Clean & prepare.
Job / Notification bell
Long-running work runs as a persisted background Job; the top-bar bell polls /api/jobs/active for progress and outcome. See Training jobs.
Preflight
Pre-run dependency, memory-fit, capability, and gate-policy checks returning pass/fail + blockers + a train plan; surfaced as the trainability forecast.
Promotability gate
A pass threshold in the eval pack a model must clear to be shippable (the platform form of the quality gate).
Recipe
Declarative training config — base model, method, LoRA knobs, optimizer/schedule, batch, precision, checkpoint cadence (your TrainingArguments + LoraConfig as reusable config). See Recipes & handlers.
Reroute-to-RAG
Clones a project into a retrieval-first sibling (rag_first=True) that uses the base model plus retrieval — no LoRA adapter.
RejectedRow / reason code
An input row that failed mapping, tagged with a stable reason code so rejections are grouped and selectable, never silently dropped.
Remediation plan
The recommended fix for a failure cluster — often augment_from_cluster: add similar rows, review, re-prepare, re-train.
RunEvent
An audit row a stage emits via emit_event() (reason code + severity) that feeds the observability timeline, failure clusters, and audit explorer. See The lifecycle.
Source locator
A string naming where rows live for ingest: hf:id:split, jsonl:/path, kaggle:....
Task handler / dispatcher
A general, task-shape-named component (Classification, QA, StructuredExtraction, RAG, Alignment, Seq2Seq, …) that owns tokenization, masking, and score(); the dispatcher routes the manifest's task_profile to it.
brewslm.yaml manifest
The project-as-code schema. api_version: brewslm/v1, kind: Project, strict-extra-forbid. See Training config reference.
Plan profile
training_plan.plan_profile on the manifest: safe / balanced (default) / max_quality. Resolves to a concrete TrainingConfig at apply time.
Training mode
training_plan.training_mode: sft for supervised fine-tuning, kd for knowledge distillation.
Manifest apply
The service that diffs a parsed brewslm.yaml against project state and emits a ManifestApplyPlan with explicit create/update/noop/delete actions.
RunEvent stage
One of nine canonical strings: ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system. See RunEvent & Coach catalogue.
RunEvent severity
info / warning / error / critical. Error and critical must carry a reason code.
Reason code
Lint-gated string from the canonical taxonomy that names a failure mode (e.g. training_oom, deployment_drift_detected); unknown codes are rejected at emit time.
Coach stage
One of five workflow surfaces Coach Mode speaks at: data, cleaning, gold_set, training, eval.
Coach action kind
One of three: navigate (route to a panel), run_playbook (trigger generation), augment_from_cluster (open a failure cluster as a generation source).
Evaluation Contract v2
slm.evaluation-pack/v2 — eval-pack schema with per-task specs, each carrying its own gates and metric schema. See Eval pack reference.
Gate (eval pack)
A scalar comparison: {gate_id, metric_id, operator: gte|lte, threshold, required, source?, weight?}. required: true gates block promotion.
FailureCluster row
Row in failure_clusters uniquely keyed on (project_id, stage, reason_code, signature); carries failure_count, timestamps, and capped exemplar lists.
Cluster signature
Hash of the canonical failure shape (model id + batch + seq length, for OOM; etc.); same signature = same cluster.

Track 4 — Advanced

alpha (KD)
The [0,1] weight in the KD loss trading the gold-label cross-entropy term against the teacher KL term.
AWQ
Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented. See Quantization & compression.
Continuous batching
Swapping finished requests out and new ones in each generation step to keep the GPU saturated.
Curriculum learning
Ordering training data (often easy→hard) to improve convergence and final quality. See Multi-task & curriculum.
Dark knowledge
The relational signal in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
Distillation (KD)
Training a small student to reproduce a larger teacher's outputs, to shrink at near-equal quality. See Distillation I.
DPO
Direct Preference Optimization — learns from (chosen, rejected) pairs directly, no separate reward model. See Preference tuning.
Drift / drift detection
Production quality decay as inputs change over time; detected by re-running the gold set against the live endpoint on a schedule. See Observability & drift.
GPTQ
One-shot, layer-wise post-training quantization that minimizes introduced error; GPU-oriented.
KD loss
alpha·CE + (1−alpha)·T²·KL — hard-label cross-entropy blended with a temperature-softened KL to the teacher. See The KD loss.
KL divergence
A measure of how far the student's distribution is from the teacher's; minimized to transfer knowledge.
KV cache
Stored attention keys/values for prior tokens so each generation step only computes the new token's. See Serving & inference.
Latency vs throughput
Per-request speed vs total tokens/second; larger batches favor throughput over latency.
Multi-task training
Training one model on several tasks at once; positive transfer if they share structure, interference if not.
ORPO
Odds Ratio Preference Optimization — adds a preference term to the SFT loss for single-stage alignment, no reference model.
Post-training quantization
Quantizing a finished model by conversion, with no retraining.
Q4_K_M
A 4-bit mixed-precision k-quant (llama.cpp / GGUF); the common CPU/edge sweet spot.
Quality retained
student_score / teacher_score on the same eval — the fraction of the teacher's quality a distilled student kept. See Quality retained.
Reference model
A frozen copy of the starting model that DPO anchors to, so the policy doesn't drift arbitrarily.
Soft targets
A teacher's full probability distribution over tokens — a richer training signal than a hard label.
Task interference
Negative transfer — unrelated or imbalanced tasks degrading each other in multi-task training.
Teacher / student
The large source model and the small model trained to mimic it in distillation.
Temperature (T)
A divisor on logits before softmax; higher T softens the distribution to expose secondary probabilities for KD.
Top-k logprobs
The k most likely tokens and their log-probabilities per position; a compact offline capture of the teacher's distribution.
vLLM
A high-throughput inference server using paged attention to manage the KV cache.
Production logging
Recording request/response pairs at the inference endpoint into an append-only log (commonly JSONL). See Production feedback loop.
PII redaction (logs)
Replacing personal identifiers in logged text with tokens like [EMAIL] before disk; hashing user ids.
Sampling budget
The rule deciding which inference rows get logged; full logging is too expensive at scale, so keep all interesting rows (low-confidence, flagged, drift-window) and sample the rest.
Log tag
Opaque string on a log row marking it as candidate training data (drift_window, user_negative, operator_review).
Operator correction
The corrected completion an operator (or stronger model) wrote for a tagged bad row; source of the SFT example in the feedback loop.
Retrain cadence
Trigger for retraining: on drift signal, on schedule, or on a failure-cluster count threshold.
Tool call
A structured JSON output naming a tool (function) from a fixed catalogue with validated arguments. See Tool-use fine-tuning.
Function calling
Synonym for tool-use; the vendor term (OpenAI etc.) for structured-JSON output mode.
Tool schema
A description (name + argument schema) of a callable function the model can target; Pydantic models declare them cleanly.
No-tool path
Explicit assistant output ({"tool":"no_tool","reason":"..."}) for when no available tool fits; trained on negative examples.
Valid-tool-call rate
Fraction of outputs that parse as JSON, name a real tool, and have arguments matching that tool's schema.
Argument-match accuracy
On valid parses, the fraction with arguments matching the gold for the same request.
False-tool-call rate
Fraction of inputs where the gold is no_tool but the model called a real tool; the most damaging routing failure mode.
Structured pruning
Physically removing whole attention heads, transformer layers, or MLP channels from a trained model so matrix shapes shrink and wall-clock latency drops. See Structured pruning.
Head / layer / channel pruning
The three granularities of structured pruning: drop a whole attention head, drop a whole transformer block, or drop output channels from an MLP. Head and layer pruning are the simplest; channel pruning gives finer control.
Unstructured pruning
Zeroing out individual weights without shrinking matrix shape. Saves theoretical FLOPs but rarely real wall-clock on commodity GPUs because sparse-matmul kernels aren't standard. Contrast with structured pruning.
Recovery training
The short SFT or distillation pass run after pruning to recover the quality the prune cost. Not optional — aggressive pruning without recovery training typically loses 10–30 points on substrate benchmarks.
Importance scoring
How structured pruning decides what to cut. Three families: magnitude (cheap, misleading), gradient × weight / SNIP-style (practical sweet spot), Hessian-based / OBD / WoodFisher (accurate but expensive).
Speculative decoding
Inference-time speed-up where a small draft model proposes K tokens and the target verifies them in a single forward pass; accepted tokens commit. Wall-clock win depends on acceptance rate and draft-vs-target cost ratio. See Speculative decoding.
Draft model
The smaller model in speculative decoding that proposes candidate tokens. Must share a tokenizer with the target and ideally come from the same model family to maximise the acceptance rate.
Acceptance rate (α)
Probability the target accepts a draft-model token in speculative decoding. The single biggest lever on the achievable speed-up; α near 1 with K=4 yields ~3× throughput, α near 0.5 yields ~1.4× at best.
EAGLE / Medusa
Variants of speculative decoding that raise α without a separate draft model. EAGLE trains a tiny auxiliary head that predicts target embeddings; Medusa adds parallel decoding heads to the target itself and verifies via tree attention. Both ship in vLLM.
Reasoning training
Umbrella term for techniques that produce multi-step traces a verifier can score, not just final answers. Covers chain-of-thought SFT, process supervision via a step-level reward model, and outcome-supervised RL. See Reasoning training.
Chain-of-thought SFT
Supervised fine-tuning where each example is prompt → reasoning trace → final answer, with the loss covering both the trace and the answer. The standard warm-start before any reasoning-flavoured RL step.
Process supervision
Scoring each step of a reasoning trace with a separately-trained reward model (PRM). Higher fidelity than outcome reward; expensive because the labels are step-level human judgments.
Outcome supervision
Scoring only the final answer with a verifier (math executor, code runner, string-match). Cheap data, harder optimisation; the recipe behind R1-style training.
PRM (Process Reward Model)
A reward model trained on step-level labels. Scores each step of a candidate reasoning trace at RL training time.
ORM (Outcome Reward Model)
A reward signal derived from a final-answer verifier — often a deterministic checker rather than a trained model. Reward = 1 if the verifier passes, 0 otherwise.
GRPO (Group Relative Policy Optimisation)
RL fine-tuning algorithm used by DeepSeek-R1 and similar recipes. Drops PPO's value (critic) network; computes advantage group-relative across G samples per prompt — roughly half the memory and no critic to keep stable. The practical RL recipe at SLM scale.

← Back to the Academy