BrewSLM Academy · reference

Glossary

Plain-language definitions of the terms used across the Academy. New terms are added as tracks publish; today's set covers the Foundations track.

A–E

Activation function: The nonlinearity (e.g. ReLU) applied after a neuron's weighted sum; what lets stacked layers model non-linear patterns. See Neural networks.
Attention: A learned, content-based mechanism for a token to pull information from other relevant tokens. See Attention & the Transformer.
Autoregressive generation: Producing text by repeatedly predicting the next token and appending it.
Backpropagation: The algorithm that computes the gradient for every parameter in one backward sweep.
BPE (Byte-Pair Encoding): A method for building a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols, starting from single characters or bytes.
Context window: The maximum number of tokens a model can process at once (prompt + generation).
Continued pretraining: Training a base model further on a large domain corpus to build broad ability; expensive.
Cross-entropy: The training loss for next-token prediction; small when the true next token was given high probability.
Data-centric iteration: Improving results mainly by fixing data, guided by failure analysis, rather than tweaking the model.
Decoding: How the next token is chosen from the probability distribution (greedy, sampling, temperature, top-p).
Deep learning: Neural networks with several hidden layers.
Drift: When live inputs stop resembling training data, degrading quality over time.
Embedding: A learned vector representing a token; similar tokens get similar vectors. See Tokens & embeddings.
Epoch: One complete pass through the training dataset.
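
The BPE entry above describes a merge loop. A minimal sketch of one merge step follows, assuming a toy word-frequency corpus; the corpus and the helper names most_frequent_pair and merge_pair are illustrative, not from any tokenizer library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters mapped to its count.
corpus = {tuple("low"): 5, tuple("lot"): 2, tuple("new"): 4}

best = most_frequent_pair(corpus)        # the pair ('l', 'o') occurs 7 times
corpus_after = merge_pair(corpus, best)  # 'l' + 'o' is now the single symbol 'lo'
```

A real tokenizer repeats this step until the vocabulary reaches a target size and records the merges in order, so the same merges can be replayed on new text.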

F–N

Fine-tuning: Continuing training on your examples to change a model's parameters; see Supervised Fine-Tuning.
Forward pass: Computing outputs from inputs, layer by layer.
Gold set: Curated, known-correct examples a model never trains on; the source of truth for quality. See The SLM project lifecycle.
Gradient: The direction over all parameters in which the loss increases fastest; training steps the opposite way.
Gradient descent: Repeatedly nudging parameters against the gradient to reduce loss. See How models learn.
Inference: Running a trained, frozen model on new inputs to get outputs.
In-context learning: Showing examples inside the prompt so the model imitates them for that call.
Latency: Time to produce output; rises with model size since each token runs the whole network.
Layer / hidden layer: Many neurons in parallel; hidden layers sit between input and output.
Learning rate: The step-size multiplier in gradient descent; too high diverges, too low crawls.
LLM: Large language model — tens to hundreds of billions of parameters.
LM head: The final layer producing a logit for every vocabulary token.
Logits: Raw, unnormalized scores for each possible next token, before softmax.
LoRA: Low-Rank Adaptation: fine-tuning by training small added matrices instead of all parameters, cutting memory and cost. Covered in depth in Track 1.
Loss function: A single number measuring how wrong predictions are; smaller is better.
Minibatch / SGD: Estimating the gradient on a small random batch each step (stochastic gradient descent).
Model: A function with adjustable parameters that maps inputs to outputs. See What is a model?.
Multi-head attention: Several attention computations in parallel, each specializing in a different relationship.
Neuron: A weighted sum of inputs plus a bias, passed through a nonlinear activation.
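
The gradient descent and learning rate entries above can be seen on a one-parameter toy problem. This is a sketch under stated assumptions: the quadratic loss, the starting point, and the three learning rates are chosen only for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def descend(lr, steps=50, w=0.0):
    """Plain gradient descent: step against the gradient, scaled by lr."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

good = descend(lr=0.1)       # converges close to 3
too_high = descend(lr=1.1)   # every step overshoots further: divergence
too_low = descend(lr=0.001)  # moves in the right direction but barely
```

On this loss each step multiplies the distance to the minimum by (1 − 2·lr), which is why 0.1 shrinks it, 1.1 grows it, and 0.001 hardly changes it.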

P–Z

Parameter (weight): A number inside a model, set by training rather than by hand.
Prompting: Steering a fixed model by writing a better input; no training.
RAG (retrieval-augmented generation): Retrieving relevant documents at inference and adding them to the prompt; supplies knowledge, not behavior. See The four levers.
Residual connection: Adding a sub-layer's input to its output so information and gradients flow through a deep stack.
Self-attention: Attention where tokens attend to other tokens in the same sequence.
Supervised Fine-Tuning (SFT): Continuing training of a base model on input→output examples to change its parameters for a task.
SLM: Small language model — relatively few parameters (millions to a few billion); cheaper and faster. See LLMs vs SLMs.
Softmax: Turns logits into a probability distribution that sums to 1.
Special tokens: Non-text markers in the vocabulary (start/end, padding, chat roles).
Temperature / top-p: Decoding knobs that flatten/sharpen the distribution and trim the unlikely tail.
Token: A chunk of text (usually a subword) the model processes; each maps to an integer ID.
Tokenization: Splitting text into tokens the model can process.
Training: Adjusting parameters from example data to reduce a loss.
Transformer block: Self-attention + feed-forward network + residual connections + layer norm; stacked many times.
Train/validation/test split: Partitioning data so you train on one part and measure on data the model hasn't seen.
Vocabulary: The fixed set of tokens a tokenizer knows; each has an integer ID.
VRAM: GPU memory; weights need ≈ params × bytes-per-param, training needs several times more.
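
The softmax and temperature entries above fit in a few lines. A minimal sketch, assuming three made-up logits; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature divides the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]     # hypothetical scores for three candidate tokens

probs = softmax(logits)      # temperature 1: the unmodified distribution
sharp = softmax(logits, 0.5) # below 1: the top token gains probability
flat = softmax(logits, 2.0)  # above 1: probability spreads to the tail
```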

Track 0 — Foundations (extended)

Base model: A model straight from pretraining: a next-token continuer over raw text. No chat template, no roles, no refusals. See Base vs instruct models.
Instruct model: A base model further trained (SFT + often preference tuning) to follow instructions inside a chat template. Same architecture, different behaviour.
Alignment training: The post-pretraining training (SFT on instruction data plus optional DPO/RLHF) that turns a base into an instruct model.
Alignment tax: The cost of the instruct model's baked-in choices (refusals, hedges, style) when they don't match your task.
Model license: The license attached to a base model that travels with every derivative, including your fine-tune. Check before training, not at deploy. See Picking a base model.
Tokenizer family: A shared vocabulary and special-token set across a model family; a hard constraint for offline distillation (teacher and student must share it).
n-gram model: A pre-neural language model that estimates the next token from raw counts over the previous n−1 tokens. Fast and interpretable, but sparse beyond trigrams. Still used in KenLM and on-device autocomplete; BM25 retrieval rests on similar count-based term statistics. See From n-grams to Transformers.
Markov assumption: The simplifying assumption that the next token depends only on the previous n−1 tokens, not the full history. Necessary to make n-gram models tractable; broken (deliberately) by RNNs and Transformers.
Sparsity (n-grams): The problem that as n grows, most n-grams never appear in training data, so the count is zero and the model assigns zero probability. The wall n-gram models hit beyond trigrams.
Neural language model: A language model that predicts the next token from a learned dense embedding of the previous tokens (Bengio 2003). Solves n-gram sparsity by letting similar tokens share strength through their embeddings.
RNN / LSTM: Recurrent neural networks that consume tokens one at a time and carry a hidden state forward. LSTMs add gating to mitigate the vanishing-gradient problem. The dominant LM family from ~2014 until 2017.
Vanishing gradient: The training failure where gradients shrink exponentially as they back-propagate through many time steps, so the network can't learn long-range dependencies. LSTMs partially solve it; attention sidesteps it by connecting distant positions directly instead of through a long chain of time steps.
Transformer: The 2017 architecture ("Attention Is All You Need") that replaced recurrence with self-attention plus positional encodings, so all positions can be processed in parallel. Nearly every modern LM (encoder-only, encoder-decoder, decoder-only) is a Transformer variant. See Attention and self-attention.
Scaling laws: Empirical findings (Kaplan 2020, Chinchilla 2022) that pretraining loss falls predictably with more data, more parameters, and more compute — together, not one alone. Why pretraining budgets are spent the way they are.
Encoder-only model: A Transformer that sees the whole input at once (bidirectional attention) and emits one vector per token. Pretrained with masked-language-modelling. Used for classification and embeddings (BERT, BGE, E5). See Architecture taxonomy.
Encoder-decoder model: A Transformer with a bidirectional encoder and a causal decoder, pretrained with a denoising objective such as span corruption. Built for input-to-output transformation (T5, BART, FLAN-T5).
Decoder-only model: A Transformer with causal (left-to-right) attention, pretrained on next-token prediction. The architecture behind GPT, Llama, Qwen, SmolLM2 — and the one this Academy assumes when it says "SLM".
Masked language modelling (MLM): The encoder-only pretraining objective: randomly mask ~15% of input tokens and ask the model to predict them from the surrounding context. Drives bidirectional representations.
Span corruption / denoising: The encoder-decoder pretraining objective: drop random contiguous spans from the input, ask the decoder to regenerate them in order. T5's objective.
Causal (autoregressive) attention: An attention mask that prevents each position from looking at later positions. This makes a Transformer a next-token predictor, which is how decoder-only chat models operate.
In-context learning: The decoder-only property where you can elicit new behaviour by putting examples in the prompt instead of fine-tuning. Why chat models are general-purpose without retraining.
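
The n-gram, Markov assumption, and sparsity entries above can be illustrated with a bigram model (n = 2). This is a sketch on a made-up nine-word corpus with no smoothing; the function names are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def prob(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts, with no smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)

p_seen = prob(model, "the", "cat")    # "the" is followed by cat, mat, cat: 2/3
p_unseen = prob(model, "the", "dog")  # never observed, so the estimate is zero
```

The zero for an unseen pair is the sparsity problem in miniature: the model cannot tell an impossible continuation from one that simply never appeared in the data.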

Track 1 — SFT fundamentals

AdamW (optimizer): The standard optimizer; adapts the step per-parameter for stable training, at the cost of ~8 bytes/param of state.
Batch size: How many examples a step processes at once; larger = smoother gradient but more memory.
Chat template: The tokenizer rule that renders a messages list into the exact formatted string (with role/special tokens) a model expects. See Chat templates.
Checkpoint: Saved parameters (or LoRA adapter) on disk, for resuming and keeping the best version by validation.
Completion: The target output part of an SFT example (the part the loss is computed on).
DPO: Direct Preference Optimization: train on (chosen, rejected) pairs to prefer better responses, no reward model. See Choosing the objective.
Early stopping: Keeping the checkpoint at the validation-loss minimum rather than the final step.
Effective batch size: per_device_batch × gradient_accumulation × devices — the number that governs training dynamics.
F1 (precision, recall): Precision = right / flagged; recall = caught / should-have-found; F1 is their harmonic mean.
Generalization gap: The distance between training and validation loss; a widening gap signals overfitting.
Gradient accumulation: Summing gradients over several micro-batches before one optimizer step, to simulate a larger batch on a small GPU.
Gradient checkpointing: Re-computing activations during backprop instead of storing them — trades compute for large memory savings.
Loss mask: Computing the loss only on completion tokens so the model learns to produce the answer, not echo the prompt. See Anatomy of an SFT example.
LoRA alpha: Scales the adapter's contribution (~alpha/r); often set to 2×r.
LoRA rank (r): The adapter's inner dimension / capacity; higher = more room (and more overfit risk). Common: 16.
max_seq_length: The per-example token cap; set it from the data's length distribution so completions are never clipped.
Mixed precision (bf16): Computing/storing in 16-bit to roughly halve memory with stable training.
ORPO: An objective that combines SFT and preference learning in a single stage.
Overfitting / underfitting: Training loss falls while validation rises (memorizing) / both stay high (hasn't learned enough). See Reading a loss curve.
Perplexity: The exponential of cross-entropy; roughly how many tokens the model is choosing among (1 = certain).
Prompt: The input part of an SFT example (instruction, optionally with context); masked from the loss.
QLoRA: LoRA on a 4-bit quantized frozen base, to fit fine-tuning into far less memory.
Quality gate: A pass threshold set in advance that a model must clear on the gold set to ship.
Quantization: Storing weights in fewer bits (e.g. 4) to save memory.
RLHF: Reinforcement Learning from Human Feedback: train a reward model, then optimize with RL; powerful but complex (DPO is the simpler modern alternative).
Scheduler / warmup: Warmup ramps the LR up over the first steps; the scheduler (e.g. cosine) then decays it toward zero. See Learning rate & schedules.
Sequence packing: Concatenating short examples into one full-length sequence to avoid wasting compute on padding.
Target modules: Which weight matrices get LoRA adapters (commonly the attention query/value projections).
Greedy decoding (T1): Always take the most-likely next token. Deterministic; the default for classification and evaluation. See Decoding controls.
Sampling: Drawing the next token from the model's distribution (do_sample=True); adds variety, the default for chat/writing tasks.
Top-p (nucleus sampling): Keep the smallest set of tokens whose cumulative probability ≥ p, then sample. p ≈ 0.9 is a sane default.
Top-k: Keep only the k most likely tokens and sample from that subset.
max_new_tokens: Hard cap on tokens emitted per generation call. Always set it; a missing stop token with no cap can fill the context window with hallucinated conversation.
Stop tokens: Tokens signalling end-of-generation. Instruct models bake one into the chat template; without it the model rambles.
Repetition penalty: A scalar reducing the probability of tokens already in the output. Small values (1.05–1.15) usually break greedy loops with little quality loss.
Time to first token (TTFT): Latency from request send to first generated token; dominates perceived chat UX.
JSONL: One JSON object per line; the standard streamable container for SFT datasets. See Dataset formats in the wild.
Completion format: {prompt, completion}. Simplest SFT shape; what Track 2's by-hand SFT uses.
Chat messages format: A list of {role, content} dicts that the tokenizer's apply_chat_template renders. The standard for chat models and multi-turn data.
Alpaca format: {instruction, input, output}. From Stanford's original instruction-tuning set; many older HF datasets use it.
ShareGPT format: {conversations: [{from, value}, ...]}. Multi-turn, quirky field names — rename from→role, value→content to convert.
Catastrophic forgetting: Degradation of broader skills when a model is fine-tuned on narrow data — the optimizer has no signal to preserve them. See Catastrophic forgetting.
Data mixing (mix-back): Including a small fraction (5–10%) of general instruction data alongside narrow task data, so the optimizer sees both shapes and forgets less.
Broad eval: A small set of general prompts (instruction-following, refusals, format-switching) evaluated alongside the task eval to detect forgetting.
Hard negative: An example that resembles the other class on the surface; placed on the correct side of the decision boundary to teach the model where the line is.
Ambiguous case: An example a thoughtful human would have to pause on; labelled per a stated rule so the model learns the rule.
Refusal data: Examples the model should not answer the usual way (out-of-scope, harmful, needing clarification), paired with the desired refusal phrasing.
OOD (out of distribution): Inputs outside the trained scope, labelled with the desired "I don't know" so the model learns its limits.
Continued pretraining (CPT): Continuing the next-token objective on raw text from a new domain, before any task SFT. Same loss as pretraining, different data, smaller learning rate. The right tool when the base model doesn't speak your domain's language; the wrong tool when it does. See Continued pretraining.
Domain-adaptive pretraining (DAPT): The research name for CPT applied to a specific domain (medical, legal, code, finance). Same mechanic — the "domain-adaptive" label emphasises the purpose.
Vocab extension: Adding new tokens to the tokenizer for domain-specific vocabulary before CPT, so those terms become single tokens. Sharp trade-off: cheaper at inference, but the new embeddings start random and need enough CPT to learn — and it breaks the same-tokenizer assumption distillation needs.
Catastrophic forgetting (CPT view): The same risk as in SFT, sharpened: a CPT'd model is already pulled toward the domain, and narrow SFT on top compounds the drift unless mix-back and broad eval are kept in.
Two-stage CPT → SFT pipeline: The standard shape: stage 1 is CPT on raw domain text producing a CPT'd checkpoint, stage 2 is SFT on task instructions from that checkpoint. Evaluated end-to-end on the same gold set you'd use without CPT.
Downstream probe: A pinned SFT recipe + gold set used to compare a base checkpoint and a CPT'd checkpoint on the metric that actually matters — task performance, not CPT loss.
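
The top-p entry above describes a filter that runs before sampling. A minimal sketch, assuming a made-up four-token distribution; real decoders apply the same idea to the full vocabulary, and the function name top_p_filter is illustrative.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution.
dist = {"yes": 0.5, "no": 0.3, "maybe": 0.15, "banana": 0.05}

nucleus = top_p_filter(dist, p=0.9)  # "banana" falls in the trimmed tail

# Sample one token from the nucleus; a seeded generator keeps the run repeatable.
rng = random.Random(0)
choice = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
```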

Track 2 — Hands-on

accelerate: Hugging Face library that handles device placement and distributed details under the Trainer.
attention_mask: A per-token 1/0 vector marking real tokens vs padding so the model ignores padded positions.
AutoModelForCausalLM: The Transformers class that loads a decoder-only language model by id. See Load a base model.
AutoTokenizer: Loads the matching tokenizer and chat template for a model id.
DataCollator: Pads a list of examples into a rectangular batch; DataCollatorForSeq2Seq pads labels with -100 so padding is ignored by the loss. See Tokenize & collate.
datasets.Dataset: A fast, memory-mapped table with .map()/.filter()/split that the Trainer consumes.
dtype (bf16): The weight/compute precision passed to from_pretrained (older API: torch_dtype); bf16 roughly halves memory.
from_pretrained / save_pretrained: Download/instantiate a model or tokenizer, and write a self-contained copy back to disk.
generate(): Runs autoregressive decoding to produce output tokens; do_sample=False gives deterministic greedy decoding for evaluation.
get_peft_model: Wraps a base model with LoRA adapters and freezes the base so only the adapter trains.
GGUF: The llama.cpp weight format for efficient CPU/edge inference; a separate conversion step from your merged model. See Merge & infer.
Greedy decoding: Always taking the highest-probability next token (do_sample=False); deterministic, used for reproducible evaluation.
Ignore index (-100): The label value PyTorch's cross-entropy skips; how the loss mask and padding are excluded from the loss in code.
input_ids / labels: The token IDs the model reads, and the targets it's scored against (a copy of input_ids with masked positions set to -100).
merge_and_unload: Folds a LoRA adapter into the base weights, returning a standalone model with no adapter overhead.
PeftModel.from_pretrained: Attaches a trained LoRA adapter onto a fresh base model for inference. See Evaluate by hand.
print_trainable_parameters: Reports how few parameters LoRA actually trains (typically well under 1%).
safetensors: The safe, fast default weight format save_pretrained writes.
Trainer: The Transformers class that runs the training loop (forward, loss, backward, optimizer step) for you. See A minimal LoRA fine-tune.
TrainingArguments: Holds the run's hyperparameters: learning rate, epochs, batch size, precision, logging and saving cadence.
venv: An isolated Python environment so a project's packages don't collide with the system. See Set up the environment.
TRL: Hugging Face's SFT- and preference-training library. Provides SFTTrainer, DPOTrainer, ORPOTrainer. See SFT with TRL's SFTTrainer.
SFTTrainer: TRL wrapper around HF Trainer that handles chat-template, loss mask, padding, and PEFT for SFT. The 20-line version of Lesson 2.5.
SFTConfig: SFTTrainer's args object; superset of TrainingArguments with SFT-specific fields like max_seq_length, completion_only_loss, packing.
completion_only_loss: SFTConfig field; when True, builds the loss mask so cross-entropy ignores everything except assistant tokens.
peft_config: SFTTrainer argument; pass a LoraConfig and SFTTrainer attaches LoRA internally — no get_peft_model call needed.
packing (SFT): SFTConfig field; concatenates short examples to max_seq_length to save compute on padding.
bitsandbytes: Library providing 4-bit/8-bit quantization kernels that HF Transformers loads via BitsAndBytesConfig. See QLoRA hands-on.
BitsAndBytesConfig: HF config object that tells from_pretrained to load weights in 4-bit (or 8-bit), with NF4 / double-quant / compute-dtype settings.
NF4 (4-bit Normal Float): The QLoRA-default 4-bit quantization format, tuned for trained-Transformer weight distributions.
Double quantization: Quantizing the quantization constants themselves to save additional memory in QLoRA.
prepare_model_for_kbit_training: peft helper that enables gradient checkpointing and stable-dtype casts on a quantized base before LoRA is attached. Forgetting it is the most common "my QLoRA isn't training" bug.
classification_report: sklearn function returning per-class precision/recall/F1/support plus macro/weighted averages. The default classification eval.
Confusion matrix: True-vs-predicted matrix; off-diagonal cells show which classes get confused for which.
Macro F1: Mean of per-class F1s with equal weight per class; the honest default on imbalanced data.
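Weighted / micro F1: Weighted F1 averages per-class F1 by class support; micro F1 pools every prediction into one set of counts (in single-label classification it equals accuracy). Both are dominated by the majority class.
HF evaluate: Hugging Face library exposing standard metrics (F1, accuracy, ROUGE, BLEU, WER, seqeval) through one interface, so scores are computed the same way as in published results.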
Pydantic: Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one. See Structured outputs with pydantic.
Valid-JSON rate: Fraction of model outputs that both parse as JSON and match the Pydantic schema; the first half of the structured-output honest report.
Per-field accuracy: Fraction of times each schema field has the correct value, measured on the parses that succeeded; the second half of the structured-output honest report.
Multi-turn SFT: Fine-tuning on multi-turn conversations with the loss mask applied to every assistant turn, not just the last. See Multi-turn chat SFT.
Role bleed: Loss mask leaking into user turns so the model is supervised on user-voice text; causes the model to mimic the user in its replies.
seqeval: Standard span-extraction metric library (HF evaluate exposes it); produces entity-level precision/recall/F1.
Faithfulness: Whether a generated answer is grounded in supplied context (e.g. RAG passages) rather than hallucinated.
LLM-as-a-judge: Using a strong model as a rubric-driven judge for free-form outputs. The right tool when there's no exact-match metric; not a substitute for one when there is. See LLM-as-a-judge.
Rubric: The written criteria a judge scores against — dimensions (correctness, tone, concision) and a scale. The clearer the rubric, the more consistent the judge.
Pairwise judging (A/B): Showing a judge two outputs and asking which is better, with mandatory order-swap and "both must agree" win counting. More reliable than absolute scoring.
Judge bias: The three classic LLM-as-judge biases — position (favours first option), length (favours longer answer), style/family (favours own model family). Mitigated by order-swap, concision in the rubric, and judge selection.
Judge calibration: Comparing judge scores against a small human-rated set (50–200 pairs) and reporting Cohen's κ or Spearman ρ. Required to interpret any aggregate judge metric — inter-judge agreement is not a substitute.
lm-evaluation-harness: EleutherAI's open-source benchmark runner — the standard tool used to produce the MMLU / HellaSwag / ARC / GSM8K numbers in model cards. See Public benchmarks & lm-eval-harness.
MMLU: Massive Multitask Language Understanding: ~14k multiple-choice questions across 57 subjects. Usually reported 5-shot. The closest thing to a general "what does this model know?" substrate score.
Benchmark contamination: The benchmark's questions or answers appeared in the model's pretraining; the model memorised rather than reasoned. Detected by a canonical-vs-paraphrased gap. Why your private gold set matters more than any public number.
MLflow: Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, artefacts, and a model registry. The Apache-2.0-licensed standard. See Experiment tracking with MLflow & W&B.
Weights & Biases (W&B): Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals.
Run metadata: The non-metric record of a training run: full config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later — and the piece teams most often skip and most often regret.
Reproducible vs replicable: Reproducible = same inputs → bit-for-bit identical outputs. Replicable = same recipe → equivalent outcome within noise. Experiment tracking targets replicability; chasing bit-equivalence in ML usually isn't worth the cost.
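
The attention_mask, ignore index, and input_ids / labels entries above combine into one padded training example. This is a sketch in plain Python lists rather than tensors; the token IDs, pad_id, and the helper name build_example are hypothetical.

```python
IGNORE_INDEX = -100  # the label value PyTorch's cross-entropy skips

def build_example(prompt_ids, completion_ids, max_len, pad_id=0):
    """Build input_ids, attention_mask, and labels for one SFT example.
    Prompt and padding positions are masked out of the loss."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    pad = max_len - len(input_ids)
    if pad < 0:
        # Assumption: fail loudly instead of silently clipping the completion.
        raise ValueError("example is longer than max_len")
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }

# Hypothetical token IDs: a 3-token prompt followed by a 2-token completion.
example = build_example([11, 12, 13], [21, 22], max_len=8)
```

Only the two completion positions carry real labels, so only they contribute to the loss. The labels are not shifted here because Transformers causal-LM models shift them by one position internally when computing the loss.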

Track 3 — With BrewSLM

Auto-RAG: Builds a BM25 index at training completion and prepends top-K retrieved passages to the prompt at inference. See Auto-RAG & reroute.
BM25: Keyword (lexical) retrieval — fast, dependency-light; the retrieval baseline auto-RAG uses.
Coach Mode: The surface that emits stage suggestions (data / cleaning / gold_set / training / eval) and actions (run_playbook, navigate, augment_from_cluster) across the lifecycle.
Decision engine (post-eval): Reads eval results + failure clusters and recommends the next move — including retrieval over more fine-tuning when failures are knowledge-bound.
DeploymentVersion: A versioned deployment with promote / reject / rollback / drift-check actions.
Drift check: A scheduled re-run of the gold set against the live endpoint to catch production regressions. See Export, deploy & Coach Mode.
Eval pack: The declared set of metrics and promotion gates evaluated against the held-out eval set; the gates decide promotability. See Eval packs & failure clusters.
Failure cluster: A bucket of evaluation misses that share a pattern, surfaced instead of a flat list so the data gap is obvious.
Introspect: Sampling ~20 rows to propose a task mapping (ranked ShapeHypothesis + a ProposedMapping); never auto-picks below 0.80 confidence.
Lifecycle (11 stages): Ingest → Introspect → Map (dry-run / commit) → Clean → Prepare → Preflight → Train → Evaluate → Export → Deploy; each has a contract and emits a RunEvent.
Manifest: prepared/manifest.json — the source of truth for downstream stages: counts, schema, task_profile, scoring_mode, paths, hashes. See Clean & prepare.
Job / Notification bell: Long-running work runs as a persisted background Job; the top-bar bell polls /api/jobs/active for progress and outcome. See Training jobs.
Preflight: Pre-run dependency, memory-fit, capability, and gate-policy checks returning pass/fail + blockers + a train plan; surfaced as the trainability forecast.
Promotability gate: A pass threshold in the eval pack a model must clear to be shippable (the platform form of the quality gate).
Recipe: Declarative training config — base model, method, LoRA knobs, optimizer/schedule, batch, precision, checkpoint cadence (your TrainingArguments + LoraConfig as reusable config). See Recipes & handlers.
Reroute-to-RAG: Clones a project into a retrieval-first sibling (rag_first=True) that uses the base model plus retrieval — no LoRA adapter.
RejectedRow / reason code: An input row that failed mapping, tagged with a stable reason code so rejections are grouped and selectable, never silently dropped.
Remediation plan: The recommended fix for a failure cluster — often augment_from_cluster: add similar rows, review, re-prepare, re-train.
RunEvent: An audit row a stage emits via emit_event() (reason code + severity) that feeds the observability timeline, failure clusters, and audit explorer. See The lifecycle.
Source locator: A string naming where rows live for ingest: hf:id:split, jsonl:/path, kaggle:....
Task handler / dispatcher: A general, task-shape-named component (Classification, QA, StructuredExtraction, RAG, Alignment, Seq2Seq, …) that owns tokenization, masking, and score(); the dispatcher routes the manifest's task_profile to it.
brewslm.yaml manifest: The project-as-code schema. api_version: brewslm/v1, kind: Project, strict-extra-forbid. See Training config reference.
Plan profile: training_plan.plan_profile on the manifest: safe / balanced (default) / max_quality. Resolves to a concrete TrainingConfig at apply time.
Training mode: training_plan.training_mode: sft for supervised fine-tuning, kd for knowledge distillation.
Manifest apply: The service that diffs a parsed brewslm.yaml against project state and emits a ManifestApplyPlan with explicit create/update/noop/delete actions.
RunEvent stage: One of nine canonical strings: ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system. See RunEvent & Coach catalogue.
RunEvent severity: info / warning / error / critical. Error and critical must carry a reason code.
Reason code: Lint-gated string from the canonical taxonomy that names a failure mode (e.g. training_oom, deployment_drift_detected); unknown codes are rejected at emit time.
Coach stage: One of five workflow surfaces Coach Mode speaks at: data, cleaning, gold_set, training, eval.
Coach action kind: One of three: navigate (route to a panel), run_playbook (trigger generation), augment_from_cluster (open a failure cluster as a generation source).
Evaluation Contract v2: slm.evaluation-pack/v2 — eval-pack schema with per-task specs, each carrying its own gates and metric schema. See Eval pack reference.
Gate (eval pack): A scalar comparison: {gate_id, metric_id, operator: gte|lte, threshold, required, source?, weight?}. required: true gates block promotion.
FailureCluster row: Row in failure_clusters uniquely keyed on (project_id, stage, reason_code, signature); carries failure_count, timestamps, and capped exemplar lists.
Cluster signature: Hash of the canonical failure shape (model id + batch + seq length, for OOM; etc.); same signature = same cluster.
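
The Gate (eval pack) entry above describes a scalar comparison in which required gates block promotion. The following sketches that logic and is not BrewSLM's implementation. The field names come from the entry; the metric names, thresholds, the evaluate_gates helper, and the choice to fail a gate whose metric is missing are assumptions.

```python
def evaluate_gates(gates, metrics):
    """Check each gate against the metrics and decide promotability.
    Assumption: a gate whose metric is missing counts as failed."""
    results = []
    for gate in gates:
        value = metrics.get(gate["metric_id"])
        if value is None:
            passed = False
        elif gate["operator"] == "gte":
            passed = value >= gate["threshold"]
        elif gate["operator"] == "lte":
            passed = value <= gate["threshold"]
        else:
            raise ValueError(f"unknown operator: {gate['operator']}")
        results.append({
            "gate_id": gate["gate_id"],
            "passed": passed,
            "required": gate.get("required", False),
        })
    # Only required gates block promotion; optional gates are reported only.
    promotable = all(r["passed"] for r in results if r["required"])
    return promotable, results

# Hypothetical gates and metrics, for illustration only.
gates = [
    {"gate_id": "macro_f1_min", "metric_id": "macro_f1",
     "operator": "gte", "threshold": 0.85, "required": True},
    {"gate_id": "latency_p95_max", "metric_id": "latency_p95_ms",
     "operator": "lte", "threshold": 300, "required": False},
]
metrics = {"macro_f1": 0.88, "latency_p95_ms": 420}

promotable, results = evaluate_gates(gates, metrics)
```

Here the optional latency gate fails but the required quality gate passes, so the sketch reports the model as promotable while still surfacing the failed gate.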

Track 4 — Advanced

alpha (KD): The [0,1] weight in the KD loss trading the gold-label cross-entropy term against the teacher KL term.
AWQ: Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented. See Quantization & compression.
Continuous batching: Swapping finished requests out and new ones in each generation step to keep the GPU saturated.
Curriculum learning: Ordering training data (often easy→hard) to improve convergence and final quality. See Multi-task & curriculum.
Dark knowledge: The relational signal in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
Distillation (KD): Training a small student to reproduce a larger teacher's outputs, shrinking the model while keeping near-equal quality. See Distillation I.
DPO: Direct Preference Optimization — learns from (chosen, rejected) pairs directly, no separate reward model. See Preference tuning.
Drift / drift detection: Production quality decay as inputs change over time; detected by re-running the gold set against the live endpoint on a schedule. See Observability & drift.
GPTQ: One-shot, layer-wise post-training quantization that minimizes introduced error; GPU-oriented.
KD loss: alpha·CE + (1−alpha)·T²·KL — hard-label cross-entropy blended with a temperature-softened KL to the teacher. See The KD loss.
KL divergence: A measure of how far the student's distribution is from the teacher's; minimized to transfer knowledge.
KV cache: Stored attention keys/values for prior tokens so each generation step only computes the new token's. See Serving & inference.
Latency vs throughput: Per-request speed vs total tokens/second; larger batches favor throughput over latency.
Multi-task training: Training one model on several tasks at once; positive transfer if they share structure, interference if not.
ORPO: Odds Ratio Preference Optimization — adds a preference term to the SFT loss for single-stage alignment, no reference model.
Post-training quantization: Quantizing a finished model by conversion, with no retraining.
Q4_K_M: A 4-bit mixed-precision k-quant (llama.cpp / GGUF); the common CPU/edge sweet spot.
Quality retained: student_score / teacher_score on the same eval — the fraction of the teacher's quality a distilled student kept. See Quality retained.
Reference model: A frozen copy of the starting model that DPO anchors to, so the policy doesn't drift arbitrarily.
Soft targets: A teacher's full probability distribution over tokens — a richer training signal than a hard label.
Task interference: Negative transfer — unrelated or imbalanced tasks degrading each other in multi-task training.
Teacher / student: The large source model and the small model trained to mimic it in distillation.
Temperature (T): A divisor on logits before softmax; higher T softens the distribution to expose secondary probabilities for KD.
Top-k logprobs: The k most likely tokens and their log-probabilities per position; a compact offline capture of the teacher's distribution.
vLLM: A high-throughput inference server that uses PagedAttention to manage the KV cache.
Production logging: Recording request/response pairs at the inference endpoint into an append-only log (commonly JSONL). See Production feedback loop.
PII redaction (logs): Replacing personal identifiers in logged text with tokens like [EMAIL] before the row is written to disk, and hashing user ids.
Sampling budget: The rule deciding which inference rows get logged; full logging is too expensive at scale, so keep all interesting rows (low-confidence, flagged, drift-window) and sample the rest.
Log tag: Opaque string on a log row marking it as candidate training data (drift_window, user_negative, operator_review).
Operator correction: The corrected completion an operator (or stronger model) wrote for a tagged bad row; source of the SFT example in the feedback loop.
Retrain cadence: Trigger for retraining: on drift signal, on schedule, or on a failure-cluster count threshold.
Tool call: A structured JSON output naming a tool (function) from a fixed catalogue with validated arguments. See Tool-use fine-tuning.
Function calling: Synonym for tool use; the vendor term (OpenAI etc.) for a model emitting structured JSON tool calls.
Tool schema: A description (name + argument schema) of a callable function the model can target; Pydantic models declare them cleanly.
No-tool path: Explicit assistant output ({"tool":"no_tool","reason":"..."}) for when no available tool fits; taught with negative examples, i.e. requests that no tool in the catalogue should handle.
Valid-tool-call rate: Fraction of outputs that parse as JSON, name a real tool, and have arguments matching that tool's schema.
Argument-match accuracy: Among outputs that parse as valid tool calls, the fraction whose arguments match the gold arguments for the same request.
False-tool-call rate: Fraction of inputs where the gold is no_tool but the model called a real tool; the most damaging routing failure mode.
Structured pruning: Physically removing whole attention heads, transformer layers, or MLP channels from a trained model so matrix shapes shrink and wall-clock latency drops. See Structured pruning.
Head / layer / channel pruning: The three granularities of structured pruning: drop a whole attention head, drop a whole transformer block, or drop output channels from an MLP. Head and layer pruning are the simplest; channel pruning gives finer control.
Unstructured pruning: Zeroing out individual weights without shrinking matrix shape. Saves theoretical FLOPs but rarely real wall-clock time on commodity GPUs, because fast sparse-matmul kernels aren't standard. Contrast with structured pruning.
Recovery training: The short SFT or distillation pass run after pruning to recover the quality the prune cost. Not optional — aggressive pruning without recovery training typically loses 10–30 points on substrate benchmarks.
Importance scoring: How structured pruning decides what to cut. Three families: magnitude (cheap, misleading), gradient × weight / SNIP-style (practical sweet spot), Hessian-based / OBD / WoodFisher (accurate but expensive).
Speculative decoding: Inference-time speed-up where a small draft model proposes K tokens and the target verifies them in a single forward pass; accepted tokens commit. Wall-clock win depends on acceptance rate and draft-vs-target cost ratio. See Speculative decoding.
Draft model: The smaller model in speculative decoding that proposes candidate tokens. Must share a tokenizer with the target and ideally come from the same model family to maximise the acceptance rate.
Acceptance rate (α): Probability the target accepts a draft-model token in speculative decoding. The single biggest lever on the achievable speed-up: once the draft model's cost is counted, α near 1 with K=4 yields ~3× throughput, while α near 0.5 yields ~1.4× at best.
EAGLE / Medusa: Variants of speculative decoding that raise α without a separate draft model. EAGLE trains a tiny auxiliary head that predicts the target's next hidden-state features; Medusa adds parallel decoding heads to the target itself and verifies via tree attention. Both ship in vLLM.
Reasoning training: Umbrella term for techniques that produce multi-step traces a verifier can score, not just final answers. Covers chain-of-thought SFT, process supervision via a step-level reward model, and outcome-supervised RL. See Reasoning training.
Chain-of-thought SFT: Supervised fine-tuning where each example is prompt → reasoning trace → final answer, with the loss covering both the trace and the answer. The standard warm-start before any reasoning-flavoured RL step.
Process supervision: Scoring each step of a reasoning trace with a separately trained reward model (PRM). Higher fidelity than outcome reward; expensive because the labels are step-level human judgments.
Outcome supervision: Scoring only the final answer with a verifier (math executor, code runner, string-match). Cheap data, harder optimisation; the recipe behind R1-style training.
PRM (Process Reward Model): A reward model trained on step-level labels. Scores each step of a candidate reasoning trace at RL training time.
ORM (Outcome Reward Model): A reward signal derived from a final-answer verifier — often a deterministic checker rather than a trained model. Reward = 1 if the verifier passes, 0 otherwise.
GRPO (Group Relative Policy Optimisation): RL fine-tuning algorithm used by DeepSeek-R1 and similar recipes. Drops PPO's value (critic) network and instead computes each sample's advantage relative to the other samples in its group of G completions for the same prompt, which means roughly half the memory and no critic to keep stable. The practical RL recipe at SLM scale.
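
The group-relative advantage in the GRPO entry above can be sketched in a few lines. This is a minimal illustration, assuming the commonly described formulation in which each sample's reward is standardised against the mean and standard deviation of its own group; the function name `group_relative_advantages`, the `eps` stabiliser, the use of the population standard deviation, and the sample rewards are assumptions for illustration, not a reference implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against its own group of G samples.

    Hypothetical helper: `eps` guards against a zero standard deviation
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 4 completions for one prompt, scored 1/0 by an outcome verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Because the baseline is the group's own mean, passing samples get positive advantages and failing samples get negative ones without any learned value network. A group in which every sample earns the same reward yields all-zero advantages and so contributes no learning signal.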
