BrewSLM Academy · reference
Glossary
Plain-language definitions of the terms used across the Academy. New terms are added as tracks publish; today's set covers the Foundations track.
A–E
- Activation function
- The nonlinearity (e.g. ReLU) applied after a neuron's weighted sum; what lets stacked layers model non-linear patterns. See Neural networks.
- Attention
- A learned, content-based mechanism for a token to pull information from other relevant tokens. See Attention & the Transformer.
- Autoregressive generation
- Producing text by repeatedly predicting the next token and appending it.
- Backpropagation
- The algorithm that computes the gradient for every parameter in one backward sweep.
- BPE (Byte-Pair Encoding)
- A method for building a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols, starting from single characters or bytes.
- Context window
- The maximum number of tokens a model can process at once (prompt + generation).
- Continued pretraining
- Training a base model further on a large domain corpus to build broad ability; expensive.
- Cross-entropy
- The training loss for next-token prediction; small when the true next token was given high probability.
- Data-centric iteration
- Improving results mainly by fixing data, guided by failure analysis, rather than tweaking the model.
- Decoding
- How the next token is chosen from the probability distribution (greedy, sampling, temperature, top-p).
- Deep learning
- Neural networks with several hidden layers.
- Drift
- When live inputs stop resembling training data, degrading quality over time.
- Embedding
- A learned vector representing a token; similar tokens get similar vectors. See Tokens & embeddings.
- Epoch
- One complete pass through the training dataset.
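To make the BPE entry above concrete, here is a minimal sketch of a single merge step on a toy word-frequency table. The corpus and the helper names `most_frequent_pair` and `merge_pair` are invented for illustration; production tokenizers typically add pre-tokenization and byte-level handling on top of this idea.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): each word is a tuple of single-character symbols.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

pair = most_frequent_pair(corpus)        # ('u', 'g'), seen 20 times in this toy corpus
corpus_after = merge_pair(corpus, pair)  # "ug" is now a single vocabulary symbol
print(pair, corpus_after)
```

Repeating these two steps until the vocabulary reaches a target size is the core loop; the merge order itself becomes the tokenizer's rule list.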
F–N
- Fine-tuning
- Continuing training on your examples to change a model's parameters; see Supervised Fine-Tuning.
- Forward pass
- Computing outputs from inputs, layer by layer.
- Gold set
- Curated, known-correct examples a model never trains on; the source of truth for quality. See The SLM project lifecycle.
- Gradient
- The direction over all parameters in which the loss increases fastest; training steps the opposite way.
- Gradient descent
- Repeatedly nudging parameters against the gradient to reduce loss. See How models learn.
- Inference
- Running a trained, frozen model on new inputs to get outputs.
- In-context learning
- Showing examples inside the prompt so the model imitates them for that call.
- Latency
- Time to produce output; rises with model size since each token runs the whole network.
- Layer / hidden layer
- Many neurons in parallel; hidden layers sit between input and output.
- Learning rate
- The step-size multiplier in gradient descent; too high diverges, too low crawls.
- LLM
- Large language model — tens to hundreds of billions of parameters.
- LM head
- The final layer producing a logit for every vocabulary token.
- Logits
- Raw, unnormalized scores for each possible next token, before softmax.
- LoRA
- Low-Rank Adaptation: fine-tuning by training small added matrices instead of all parameters, cutting memory and cost. Covered in depth in Track 1.
- Loss function
- A single number measuring how wrong predictions are; smaller is better.
- Minibatch / SGD
- Estimating the gradient on a small random batch each step (stochastic gradient descent).
- Model
- A function with adjustable parameters that maps inputs to outputs. See What is a model?.
- Multi-head attention
- Several attention computations in parallel, each specializing in a different relationship.
- Neuron
- A weighted sum of inputs plus a bias, passed through a nonlinear activation.
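The Gradient descent and Learning rate entries above can be illustrated with a one-parameter toy problem. This is a sketch under simple assumptions: the loss `(w - 3)^2`, the step counts, and the two learning rates are chosen only to show convergence versus divergence, not taken from any Academy lesson.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def descend(learning_rate, steps=50, w=0.0):
    """Plain gradient descent: step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_good = descend(learning_rate=0.1)  # shrinks the error by a factor of 0.8 per step
w_bad = descend(learning_rate=1.1)   # overshoots: the error grows by 1.2x per step

print(f"lr=0.1 -> w={w_good:.5f} (target 3.0)")
print(f"lr=1.1 -> w={w_bad:.1f} (diverged)")
```

The same intuition carries over to real training, where the "too high" threshold depends on the model, the batch size, and the optimizer.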
P–Z
- Parameter (weight)
- A number inside a model, set by training rather than by hand.
- Prompting
- Steering a fixed model by writing a better input; no training.
- RAG (retrieval-augmented generation)
- Retrieving relevant documents at inference and adding them to the prompt; supplies knowledge, not behavior. See The four levers.
- Residual connection
- Adding a sub-layer's input to its output so information and gradients flow through a deep stack.
- Self-attention
- Attention where tokens attend to other tokens in the same sequence.
- Supervised Fine-Tuning (SFT)
- Continuing training of a base model on input→output examples to change its parameters for a task.
- SLM
- Small language model — relatively few parameters (millions to a few billion); cheaper and faster. See LLMs vs SLMs.
- Softmax
- Turns logits into a probability distribution that sums to 1.
- Special tokens
- Non-text markers in the vocabulary (start/end, padding, chat roles).
- Temperature / top-p
- Decoding knobs that flatten/sharpen the distribution and trim the unlikely tail.
- Token
- A chunk of text (usually a subword) the model processes; each maps to an integer ID.
- Tokenization
- Splitting text into tokens the model can process.
- Training
- Adjusting parameters from example data to reduce a loss.
- Transformer block
- Self-attention + feed-forward network + residual connections + layer norm; stacked many times.
- Train/validation/test split
- Partitioning data so you train on one part and measure on data the model hasn't seen.
- Vocabulary
- The fixed set of tokens a tokenizer knows; each has an integer ID.
- VRAM
- GPU memory; weights need ≈ params × bytes-per-param, training needs several times more.
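As a small illustration of the Softmax and Temperature entries above, the sketch below turns a list of logits into probabilities and shows how dividing by a temperature sharpens or flattens the result. The logit values are arbitrary, and the function is a plain-Python stand-in for what a framework would do on tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature divides the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # arbitrary example scores

sharp = softmax(logits, temperature=0.5)      # lower temperature: more peaked
base = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=2.0)       # higher temperature: flatter

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```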
Track 0 — Foundations (extended)
- Base model
- A model straight from pretraining: a next-token continuer over raw text. No chat template, no roles, no refusals. See Base vs instruct models.
- Instruct model
- A base model further trained (SFT + often preference tuning) to follow instructions inside a chat template. Same architecture, different behaviour.
- Alignment training
- The post-pretraining training (SFT on instruction data plus optional DPO/RLHF) that turns a base into an instruct model.
- Alignment tax
- The cost of the instruct model's baked-in choices (refusals, hedges, style) when they don't match your task.
- Model license
- The licence attached to a base model that travels with every derivative — including your fine-tune. Check before training, not at deploy. See Picking a base model.
- Tokenizer family
- A shared vocabulary and special-token set across a model family; a hard constraint for offline distillation (teacher and student must share it).
- n-gram model
- A pre-neural language model that estimates the next token from raw counts over the previous n−1 tokens. Fast and interpretable, but sparse beyond trigrams. Still used in KenLM and on-device autocomplete; BM25 retrieval relies on related count-based term statistics. See From n-grams to Transformers.
- Markov assumption
- The simplifying assumption that the next token depends only on the previous n−1 tokens, not the full history. Necessary to make n-gram models tractable; broken (deliberately) by RNNs and Transformers.
- Sparsity (n-grams)
- The problem that as n grows, most n-grams never appear in training data, so the count is zero and the model assigns zero probability. The wall n-gram models hit beyond trigrams.
- Neural language model
- A language model that predicts the next token from a learned dense embedding of the previous tokens (Bengio 2003). Solves n-gram sparsity by letting similar tokens share strength through their embeddings.
- RNN / LSTM
- Recurrent neural networks that consume tokens one at a time and carry a hidden state forward. LSTMs add gating to mitigate the vanishing-gradient problem. The dominant LM family from ~2014 until 2017.
- Vanishing gradient
- The training failure where gradients shrink exponentially as they back-propagate through many time steps, so the network can't learn long-range dependencies. LSTMs partially solve it; attention sidesteps it across time steps by connecting any two positions directly.
- Transformer
- The 2017 architecture ("Attention Is All You Need") that replaced recurrence with self-attention, processing all positions in parallel and using positional encodings to keep track of token order. Nearly every modern LM (encoder-only, encoder-decoder, or decoder-only) is a Transformer variant. See Attention and self-attention.
- Scaling laws
- Empirical findings (Kaplan 2020, Chinchilla 2022) that pretraining loss falls predictably with more data, more parameters, and more compute — together, not one alone. Why pretraining budgets are spent the way they are.
- Encoder-only model
- A Transformer that sees the whole input at once (bidirectional attention) and emits one vector per token. Pretrained with masked-language-modelling. Used for classification and embeddings (BERT, BGE, E5). See Architecture taxonomy.
- Encoder-decoder model
- A Transformer with a bidirectional encoder and a causal decoder, typically pretrained on span corruption or a related denoising objective. Built for input-to-output transformation (T5, BART, FLAN-T5).
- Decoder-only model
- A Transformer with causal (left-to-right) attention, pretrained on next-token prediction. The architecture behind GPT, Llama, Qwen, SmolLM2 — and the one this Academy assumes when it says "SLM".
- Masked language modelling (MLM)
- The encoder-only pretraining objective: randomly mask ~15% of input tokens and ask the model to predict them from the surrounding context. Drives bidirectional representations.
- Span corruption / denoising
- The encoder-decoder pretraining objective: drop random contiguous spans from the input, ask the decoder to regenerate them in order. T5's objective.
- Causal (autoregressive) attention
- An attention mask that prevents each position from looking at later positions. This is what makes a Transformer a next-token predictor, which is how decoder-only chat models work.
- In-context learning
- The decoder-only property where you can elicit new behaviour by putting examples in the prompt instead of fine-tuning. Why chat models are general-purpose without retraining.
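The n-gram model, Markov assumption, and Sparsity entries above can be seen together in a tiny bigram model (n = 2). This is a sketch with a made-up nine-word corpus and no smoothing, so unseen pairs get exactly zero probability; real n-gram toolkits add smoothing to soften that.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def prob(counts, prev, nxt):
    """P(nxt | prev) from raw counts; depends only on the previous token (Markov assumption)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

tokens = "the cat sat on the mat the cat ran".split()  # toy corpus
model = train_bigram(tokens)

p_cat = prob(model, "the", "cat")  # "the" is followed by cat, mat, cat: 2 of 3
p_dog = prob(model, "the", "dog")  # never observed, so the raw-count estimate is 0.0

print(p_cat, p_dog)
```

As n grows, more and more valid continuations fall into the zero-count case, which is the sparsity wall the glossary describes.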
Track 1 — SFT fundamentals
- AdamW (optimizer)
- The standard optimizer; adapts the step per-parameter for stable training, at the cost of ~8 bytes/param of state.
- Batch size
- How many examples a step processes at once; larger = smoother gradient but more memory.
- Chat template
- The tokenizer rule that renders a messages list into the exact formatted string (with role/special tokens) a model expects. See Chat templates.
- Checkpoint
- Saved parameters (or LoRA adapter) on disk, for resuming and keeping the best version by validation.
- Completion
- The target output part of an SFT example (the part the loss is computed on).
- DPO
- Direct Preference Optimization: train on (chosen, rejected) pairs to prefer better responses, no reward model. See Choosing the objective.
- Early stopping
- Keeping the checkpoint at the validation-loss minimum rather than the final step.
- Effective batch size
- per_device_batch × gradient_accumulation × devices — the number that governs training dynamics.
- F1 (precision, recall)
- Precision = right / flagged; recall = caught / should-have-found; F1 is their harmonic mean.
- Generalization gap
- The distance between training and validation loss; a widening gap signals overfitting.
- Gradient accumulation
- Summing gradients over several micro-batches before one optimizer step, to simulate a larger batch on a small GPU.
- Gradient checkpointing
- Re-computing activations during backprop instead of storing them — trades compute for large memory savings.
- Loss mask
- Computing the loss only on completion tokens so the model learns to produce the answer, not echo the prompt. See Anatomy of an SFT example.
- LoRA alpha
- Scales the adapter's contribution (~alpha/r); often set to 2×r.
- LoRA rank (r)
- The adapter's inner dimension / capacity; higher = more room (and more overfit risk). Common: 16.
- max_seq_length
- The per-example token cap; set it from the data's length distribution so completions are never clipped.
- Mixed precision (bf16)
- Computing/storing in 16-bit to roughly halve memory with stable training.
- ORPO
- An objective that combines SFT and preference learning in a single stage.
- Overfitting / underfitting
- Training loss falls while validation rises (memorizing) / both stay high (hasn't learned enough). See Reading a loss curve.
- Perplexity
- The exponential of cross-entropy; roughly how many tokens the model is choosing among (1 = certain).
- Prompt
- The input part of an SFT example (instruction, optionally with context); masked from the loss.
- QLoRA
- LoRA on a 4-bit quantized frozen base, to fit fine-tuning into far less memory.
- Quality gate
- A pass threshold set in advance that a model must clear on the gold set to ship.
- Quantization
- Storing weights in fewer bits (e.g. 4) to save memory.
- RLHF
- Reinforcement Learning from Human Feedback: train a reward model, then optimize with RL; powerful but complex (DPO is the simpler modern alternative).
- Scheduler / warmup
- Warmup ramps the LR up over the first steps; the scheduler (e.g. cosine) then decays it toward zero. See Learning rate & schedules.
- Sequence packing
- Concatenating short examples into one full-length sequence to avoid wasting compute on padding.
- Target modules
- Which weight matrices get LoRA adapters (commonly the attention query/value projections).
- Greedy decoding (T1)
- Always take the most-likely next token. Deterministic; the default for classification and evaluation. See Decoding controls.
- Sampling
- Drawing the next token from the model's distribution (do_sample=True); adds variety, the default for chat/writing tasks.
- Top-p (nucleus sampling)
- Keep the smallest set of tokens whose cumulative probability ≥ p, then sample. p ≈ 0.9 is a sane default.
- Top-k
- Keep only the k most likely tokens and sample from that subset.
- max_new_tokens
- Hard cap on tokens emitted per generation call. Always set it; a missing stop token with no cap can fill the context window with hallucinated conversation.
- Stop tokens
- Tokens signalling end-of-generation. Instruct models bake one into the chat template; without it the model rambles.
- Repetition penalty
- A scalar that reduces the probability of tokens already in the output. Small values (1.05–1.15) usually break greedy loops without noticeable quality loss.
- Time to first token (TTFT)
- Latency from sending the request to receiving the first generated token; dominates perceived chat UX.
- JSONL
- One JSON object per line; the standard streamable container for SFT datasets. See Dataset formats in the wild.
- Completion format
- {prompt, completion}. Simplest SFT shape; what Track 2's by-hand SFT uses.
- Chat messages format
- A list of {role, content} dicts that the tokenizer's apply_chat_template renders. The standard for chat models and multi-turn data.
- Alpaca format
- {instruction, input, output}. From Stanford's original instruction-tuning set; many older HF datasets use it.
- ShareGPT format
- {conversations: [{from, value}, ...]}. Multi-turn, with quirky field names; rename from→role and value→content to convert.
- Catastrophic forgetting
- Degradation of broader skills when a model is fine-tuned on narrow data — the optimizer has no signal to preserve them. See Catastrophic forgetting.
- Data mixing (mix-back)
- Including a small fraction (5–10%) of general instruction data alongside narrow task data, so the optimizer sees both shapes and forgets less.
- Broad eval
- A small set of general prompts (instruction-following, refusals, format-switching) evaluated alongside the task eval to detect forgetting.
- Hard negative
- An example that resembles the other class on the surface; placed on the correct side of the decision boundary to teach the model where the line is.
- Ambiguous case
- An example a thoughtful human would have to pause on; labelled per a stated rule so the model learns the rule.
- Refusal data
- Examples the model should not answer the usual way (out-of-scope, harmful, needing clarification), paired with the desired refusal phrasing.
- OOD (out of distribution)
- Inputs outside the trained scope, labelled with the desired "I don't know" so the model learns its limits.
- Continued pretraining (CPT)
- Continuing the next-token objective on raw text from a new domain, before any task SFT. Same loss as pretraining, different data, smaller learning rate. The right tool when the base model doesn't speak your domain's language; the wrong tool when it does. See Continued pretraining.
- Domain-adaptive pretraining (DAPT)
- The research name for CPT applied to a specific domain (medical, legal, code, finance). Same mechanic — the "domain-adaptive" label emphasises the purpose.
- Vocab extension
- Adding new tokens to the tokenizer for domain-specific vocabulary before CPT, so those terms become single tokens. Sharp trade-off: cheaper at inference, but the new embeddings start random and need enough CPT to learn — and it breaks the same-tokenizer assumption distillation needs.
- Catastrophic forgetting (CPT view)
- The same risk as in SFT, sharpened: a CPT'd model is already pulled toward the domain, and narrow SFT on top compounds the drift unless mix-back and broad eval are kept in.
- Two-stage CPT → SFT pipeline
- The standard shape: stage 1 is CPT on raw domain text producing a CPT'd checkpoint, stage 2 is SFT on task instructions from that checkpoint. Evaluated end-to-end on the same gold set you'd use without CPT.
- Downstream probe
- A pinned SFT recipe + gold set used to compare a base checkpoint and a CPT'd checkpoint on the metric that actually matters — task performance, not CPT loss.
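The Top-p (nucleus sampling) entry above describes a small algorithm, sketched here in plain Python. The token names and probabilities are invented, and `top_p_filter` is a hypothetical helper for illustration rather than any library's API; it only builds the nucleus, and a sampler would then draw from the returned distribution.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:      # nucleus is complete; drop the unlikely tail
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution.
probs = {"yes": 0.5, "no": 0.3, "maybe": 0.15, "banana": 0.05}

nucleus = top_p_filter(probs, p=0.9)
print(nucleus)   # "banana" falls outside the nucleus and can no longer be sampled
```

Top-k differs only in the stopping rule: it keeps a fixed number of tokens instead of a fixed probability mass.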
Track 2 — Hands-on
- accelerate
- Hugging Face library that handles device placement and distributed details under the Trainer.
- attention_mask
- A per-token 1/0 vector marking real tokens vs padding so the model ignores padded positions.
- AutoModelForCausalLM
- The Transformers class that loads a decoder-only language model by id. See Load a base model.
- AutoTokenizer
- Loads the matching tokenizer and chat template for a model id.
- DataCollator
- Pads a list of examples into a rectangular batch; DataCollatorForSeq2Seq pads labels with -100 so padding is ignored by the loss. See Tokenize & collate.
- datasets.Dataset
- A fast, memory-mapped table with .map()/.filter()/split that the Trainer consumes.
- dtype (bf16)
- The weight/compute precision passed to from_pretrained (older API: torch_dtype); bf16 roughly halves memory.
- from_pretrained / save_pretrained
- Download/instantiate a model or tokenizer, and write a self-contained copy back to disk.
- generate()
- Runs autoregressive decoding to produce output tokens; do_sample=False gives deterministic greedy decoding for evaluation.
- get_peft_model
- Wraps a base model with LoRA adapters and freezes the base so only the adapter trains.
- GGUF
- The llama.cpp weight format for efficient CPU/edge inference; a separate conversion step from your merged model. See Merge & infer.
- Greedy decoding
- Always taking the highest-probability next token (do_sample=False); deterministic, used for reproducible evaluation.
- Ignore index (-100)
- The label value PyTorch's cross-entropy skips; how the loss mask and padding are excluded from the loss in code.
- input_ids / labels
- The token IDs the model reads, and the targets it's scored against (a copy of input_ids with masked positions set to -100).
- merge_and_unload
- Folds a LoRA adapter into the base weights, returning a standalone model with no adapter overhead.
- PeftModel.from_pretrained
- Attaches a trained LoRA adapter onto a fresh base model for inference. See Evaluate by hand.
- print_trainable_parameters
- Reports how few parameters LoRA actually trains (typically well under 1%).
- safetensors
- The safe, fast default weight format save_pretrained writes.
- Trainer
- The Transformers class that runs the training loop (forward, loss, backward, optimizer step) for you. See A minimal LoRA fine-tune.
- TrainingArguments
- Holds the run's hyperparameters: learning rate, epochs, batch size, precision, logging and saving cadence.
- venv
- An isolated Python environment so a project's packages don't collide with the system. See Set up the environment.
- TRL
- Hugging Face's SFT- and preference-training library. Provides SFTTrainer, DPOTrainer, ORPOTrainer. See SFT with TRL's SFTTrainer.
- SFTTrainer
- TRL wrapper around HF Trainer that handles chat-template, loss mask, padding, and PEFT for SFT. The 20-line version of Lesson 2.5.
- SFTConfig
- SFTTrainer's args object; superset of TrainingArguments with SFT-specific fields like max_seq_length, completion_only_loss, packing.
- completion_only_loss
- SFTConfig field; when True, builds the loss mask so cross-entropy ignores everything except assistant tokens.
- peft_config
- SFTTrainer argument; pass a LoraConfig and SFTTrainer attaches LoRA internally, so no get_peft_model call is needed.
- packing (SFT)
- SFTConfig field; concatenates short examples to max_seq_length to save compute on padding.
- bitsandbytes
- Library providing 4-bit/8-bit quantization kernels that HF Transformers loads via BitsAndBytesConfig. See QLoRA hands-on.
- BitsAndBytesConfig
- HF config object that tells from_pretrained to load weights in 4-bit (or 8-bit), with NF4 / double-quant / compute-dtype settings.
- NF4 (4-bit Normal Float)
- The QLoRA-default 4-bit quantization format, tuned for trained-Transformer weight distributions.
- Double quantization
- Quantizing the quantization constants themselves to save additional memory in QLoRA.
- prepare_model_for_kbit_training
- peft helper that enables gradient checkpointing and stable-dtype casts on a quantized base before LoRA is attached. Forgetting it is the most common "my QLoRA isn't training" bug.
- classification_report
- sklearn function returning per-class precision/recall/F1/support plus macro/weighted averages. The default classification eval.
- Confusion matrix
- True-vs-predicted matrix; off-diagonal cells show which classes get confused for which.
- Macro F1
- Mean of per-class F1s with equal weight per class; the honest default on imbalanced data.
- Weighted / micro F1
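- Weighted F1 averages per-class F1 by class support; micro F1 pools true positives, false positives, and false negatives across all classes before computing F1. Both are dominated by the majority class.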
- HF evaluate
- Hugging Face library exposing standard metrics (F1, accuracy, ROUGE, BLEU, WER, seqeval) through one interface; aligns with shared standards used in papers.
- Pydantic
- Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one. See Structured outputs with pydantic.
- Valid-JSON rate
- Fraction of model outputs that both parse as JSON and match the Pydantic schema; the first half of the structured-output honest report.
- Per-field accuracy
- Fraction of times each schema field has the correct value, measured on the parses that succeeded; the second half of the structured-output honest report.
- Multi-turn SFT
- Fine-tuning on multi-turn conversations with the loss mask applied to every assistant turn, not just the last. See Multi-turn chat SFT.
- Role bleed
- Loss mask leaking into user turns so the model is supervised on user-voice text; causes the model to mimic the user in its replies.
- seqeval
- Standard span-extraction metric library (HF evaluate exposes it); produces entity-level precision/recall/F1.
- Faithfulness
- Whether a generated answer is grounded in supplied context (e.g. RAG passages) rather than hallucinated.
- LLM-as-a-judge
- Using a strong model as a rubric-driven judge for free-form outputs. The right tool when there's no exact-match metric; not a substitute for one when there is. See LLM-as-a-judge.
- Rubric
- The written criteria a judge scores against — dimensions (correctness, tone, concision) and a scale. The clearer the rubric, the more consistent the judge.
- Pairwise judging (A/B)
- Showing a judge two outputs and asking which is better, with mandatory order-swap and "both must agree" win counting. More reliable than absolute scoring.
- Judge bias
- The three classic LLM-as-judge biases — position (favours first option), length (favours longer answer), style/family (favours own model family). Mitigated by order-swap, concision in the rubric, and judge selection.
- Judge calibration
- Comparing judge scores against a small human-rated set (50–200 pairs) and reporting Cohen's κ or Spearman ρ. Required to interpret any aggregate judge metric — inter-judge agreement is not a substitute.
- lm-evaluation-harness
- EleutherAI's open-source benchmark runner — the standard tool used to produce the MMLU / HellaSwag / ARC / GSM8K numbers in model cards. See Public benchmarks & lm-eval-harness.
- MMLU
- Massive Multitask Language Understanding: ~14k multiple-choice questions across 57 subjects. Usually reported 5-shot. The closest thing to a general "what does this model know?" substrate score.
- Benchmark contamination
- The benchmark's questions or answers appeared in the model's pretraining; the model memorised rather than reasoned. Detected by a canonical-vs-paraphrased gap. Why your private gold set matters more than any public number.
- MLflow
- Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, artefacts, and a model registry. The Apache-2.0-licensed standard. See Experiment tracking with MLflow & W&B.
- Weights & Biases (W&B)
- Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals.
- Run metadata
- The non-metric record of a training run: full config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later — and the piece teams most often skip and most often regret.
- Reproducible vs replicable
- Reproducible = same inputs → bit-for-bit identical outputs. Replicable = same recipe → equivalent outcome within noise. Experiment tracking targets replicability; chasing bit-equivalence in ML usually isn't worth the cost.
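The input_ids / labels, Ignore index (-100), and attention_mask entries above fit together as shown in this sketch. It uses plain Python lists and made-up token IDs in place of a real tokenizer, and the helpers `build_example` and `pad_to` are hypothetical names; a collator such as DataCollatorForSeq2Seq performs the padding part for you in practice.

```python
IGNORE_INDEX = -100  # the label value PyTorch's cross-entropy skips by default

def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask the prompt positions out of the loss."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    attention_mask = [1] * len(input_ids)
    return input_ids, labels, attention_mask

def pad_to(seq, length, pad_value):
    """Right-pad a list to a fixed length."""
    return seq + [pad_value] * (length - len(seq))

# Made-up token IDs standing in for tokenizer output.
prompt_ids = [101, 7, 8, 9]
completion_ids = [42, 43, 2]
PAD_ID = 0  # assumed pad token ID for this sketch

input_ids, labels, attention_mask = build_example(prompt_ids, completion_ids)

# Pad every field to length 10: pad IDs are masked from attention and from the loss.
input_ids = pad_to(input_ids, 10, PAD_ID)
labels = pad_to(labels, 10, IGNORE_INDEX)
attention_mask = pad_to(attention_mask, 10, 0)

print(input_ids)
print(labels)
print(attention_mask)
```

Only the three completion positions carry a real label, so only they contribute to the loss.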
Track 3 — With BrewSLM
- Auto-RAG
- Builds a BM25 index at training completion and prepends top-K retrieved passages to the prompt at inference. See Auto-RAG & reroute.
- BM25
- Keyword (lexical) retrieval — fast, dependency-light; the retrieval baseline auto-RAG uses.
- Coach Mode
- The surface that emits stage suggestions (data / cleaning / gold_set / training / eval) and actions (run_playbook, navigate, augment_from_cluster) across the lifecycle.
- Decision engine (post-eval)
- Reads eval results + failure clusters and recommends the next move — including retrieval over more fine-tuning when failures are knowledge-bound.
- DeploymentVersion
- A versioned deployment with promote / reject / rollback / drift-check actions.
- Drift check
- A scheduled re-run of the gold set against the live endpoint to catch production regressions. See Export, deploy & Coach Mode.
- Eval pack
- The declared set of metrics and promotion gates evaluated against the held-out eval set; the gates decide promotability. See Eval packs & failure clusters.
- Failure cluster
- A bucket of evaluation misses that share a pattern, surfaced instead of a flat list so the data gap is obvious.
- Introspect
- Sampling ~20 rows to propose a task mapping (ranked ShapeHypothesis + a ProposedMapping); never auto-picks below 0.80 confidence.
- Lifecycle (11 stages)
- Ingest → Introspect → Map (dry-run / commit) → Clean → Prepare → Preflight → Train → Evaluate → Export → Deploy; each has a contract and emits a RunEvent.
- Manifest
- prepared/manifest.json: the source of truth for downstream stages, covering counts, schema, task_profile, scoring_mode, paths, hashes. See Clean & prepare.
- Job / Notification bell
- Long-running work runs as a persisted background Job; the top-bar bell polls /api/jobs/active for progress and outcome. See Training jobs.
- Preflight
- Pre-run dependency, memory-fit, capability, and gate-policy checks returning pass/fail + blockers + a train plan; surfaced as the trainability forecast.
- Promotability gate
- A pass threshold in the eval pack a model must clear to be shippable (the platform form of the quality gate).
- Recipe
- Declarative training config — base model, method, LoRA knobs, optimizer/schedule, batch, precision, checkpoint cadence (your TrainingArguments + LoraConfig as reusable config). See Recipes & handlers.
- Reroute-to-RAG
- Clones a project into a retrieval-first sibling (rag_first=True) that uses the base model plus retrieval, with no LoRA adapter.
- RejectedRow / reason code
- An input row that failed mapping, tagged with a stable reason code so rejections are grouped and selectable, never silently dropped.
- Remediation plan
- The recommended fix for a failure cluster, often augment_from_cluster: add similar rows, review, re-prepare, re-train.
- RunEvent
- An audit row a stage emits via emit_event() (reason code + severity) that feeds the observability timeline, failure clusters, and audit explorer. See The lifecycle.
- Source locator
- A string naming where rows live for ingest: hf:id:split, jsonl:/path, kaggle:....
- Task handler / dispatcher
- A general, task-shape-named component (Classification, QA, StructuredExtraction, RAG, Alignment, Seq2Seq, …) that owns tokenization, masking, and score(); the dispatcher routes the manifest's task_profile to it.
- brewslm.yaml manifest
- The project-as-code schema: api_version: brewslm/v1, kind: Project, strict-extra-forbid. See Training config reference.
- Plan profile
- training_plan.plan_profile on the manifest: safe / balanced (default) / max_quality. Resolves to a concrete TrainingConfig at apply time.
- Training mode
- training_plan.training_mode: sft for supervised fine-tuning, kd for knowledge distillation.
- Manifest apply
- The service that diffs a parsed brewslm.yaml against project state and emits a ManifestApplyPlan with explicit create/update/noop/delete actions.
- RunEvent stage
- One of nine canonical strings: ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system. See RunEvent & Coach catalogue.
- RunEvent severity
- info / warning / error / critical. Error and critical must carry a reason code.
- Reason code
- Lint-gated string from the canonical taxonomy that names a failure mode (e.g. training_oom, deployment_drift_detected); unknown codes are rejected at emit time.
- Coach stage
- One of five workflow surfaces Coach Mode speaks at: data, cleaning, gold_set, training, eval.
- Coach action kind
- One of three: navigate (route to a panel), run_playbook (trigger generation), augment_from_cluster (open a failure cluster as a generation source).
- Evaluation Contract v2
- slm.evaluation-pack/v2: eval-pack schema with per-task specs, each carrying its own gates and metric schema. See Eval pack reference.
- Gate (eval pack)
- A scalar comparison: {gate_id, metric_id, operator: gte|lte, threshold, required, source?, weight?}. required: true gates block promotion.
- FailureCluster row
- Row in failure_clusters uniquely keyed on (project_id, stage, reason_code, signature); carries failure_count, timestamps, and capped exemplar lists.
- Cluster signature
- Hash of the canonical failure shape (model id + batch + seq length, for OOM; etc.); same signature = same cluster.
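To show how the Gate (eval pack) and Promotability gate entries above could combine, here is a hedged sketch of gate evaluation. It is not BrewSLM's implementation: the function `evaluate_gates`, the gate IDs, the metric names, and the metric values are all hypothetical, and only the field names (gate_id, metric_id, operator, threshold, required) and the gte/lte operators come from the glossary.

```python
import operator

# Map the two operator strings named in the glossary onto Python comparisons.
OPERATORS = {"gte": operator.ge, "lte": operator.le}

def evaluate_gates(gates, metrics):
    """Check each gate against its metric; only required gates can block promotion."""
    results = []
    for gate in gates:
        value = metrics[gate["metric_id"]]
        passed = OPERATORS[gate["operator"]](value, gate["threshold"])
        results.append({"gate_id": gate["gate_id"], "passed": passed,
                        "required": gate["required"]})
    promotable = all(r["passed"] for r in results if r["required"])
    return promotable, results

# Hypothetical gates and metric values for illustration only.
gates = [
    {"gate_id": "min_macro_f1", "metric_id": "macro_f1",
     "operator": "gte", "threshold": 0.85, "required": True},
    {"gate_id": "max_p95_latency", "metric_id": "p95_latency_ms",
     "operator": "lte", "threshold": 200, "required": False},
]
metrics = {"macro_f1": 0.88, "p95_latency_ms": 240}

promotable, results = evaluate_gates(gates, metrics)
print(promotable)   # the required gate passes; the failed optional gate does not block
print(results)
```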
Track 4 — Advanced
- alpha (KD)
- The [0,1] weight in the KD loss trading the gold-label cross-entropy term against the teacher KL term.
- AWQ
- Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented. See Quantization & compression.
- Continuous batching
- Swapping finished requests out and new ones in each generation step to keep the GPU saturated.
- Curriculum learning
- Ordering training data (often easy→hard) to improve convergence and final quality. See Multi-task & curriculum.
- Dark knowledge
- The relational signal in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
- Distillation (KD)
- Training a small student to reproduce a larger teacher's outputs, shrinking the model while keeping quality close to the teacher's. See Distillation I.
- DPO
- Direct Preference Optimization — learns from (chosen, rejected) pairs directly, no separate reward model. See Preference tuning.
- Drift / drift detection
- Production quality decay as inputs change over time; detected by re-running the gold set against the live endpoint on a schedule. See Observability & drift.
- GPTQ
- One-shot, layer-wise post-training quantization that minimizes introduced error; GPU-oriented.
- KD loss
- alpha·CE + (1−alpha)·T²·KL: hard-label cross-entropy blended with a temperature-softened KL to the teacher. See The KD loss.
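The formula above can be sketched for a single token position in plain Python. This is an illustration under assumptions: the KL direction is taken as KL(teacher ‖ student), the common convention, and `softmax` and `kd_loss` are hypothetical helper names. A real training loop would use a tensor library and average over positions.

```python
import math


def softmax(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, gold_index, alpha=0.5, T=2.0):
    """alpha * CE + (1 - alpha) * T^2 * KL for one token position."""
    # Hard-label cross-entropy uses the student's unsoftened distribution.
    ce = -math.log(softmax(student_logits)[gold_index])
    # KL uses both distributions softened by the same temperature.
    teacher = softmax(teacher_logits, T)
    student = softmax(student_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return alpha * ce + (1 - alpha) * T * T * kl
```

With alpha = 1 the loss reduces to plain cross-entropy; with alpha = 0 only the teacher term remains.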
- KL divergence
- A measure of how far the student's distribution is from the teacher's; minimized to transfer knowledge.
- KV cache
- Stored attention keys/values for prior tokens so each generation step only computes the new token's. See Serving & inference.
- Latency vs throughput
- Per-request speed vs total tokens/second; larger batches favor throughput over latency.
- Multi-task training
- Training one model on several tasks at once; positive transfer if they share structure, interference if not.
- ORPO
- Odds Ratio Preference Optimization — adds a preference term to the SFT loss for single-stage alignment, no reference model.
- Post-training quantization
- Quantizing a finished model by conversion, with no retraining.
- Q4_K_M
- A 4-bit mixed-precision k-quant (llama.cpp / GGUF); the common CPU/edge sweet spot.
- Quality retained
- student_score / teacher_score on the same eval: the fraction of the teacher's quality a distilled student kept. See Quality retained.
- Reference model
- A frozen copy of the starting model that DPO anchors to, so the policy doesn't drift arbitrarily.
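To make the anchoring concrete, here is a sketch of the standard DPO loss for one (chosen, rejected) pair. The inputs are the summed log-probabilities each model assigns to each completion. The function names and the default `beta` are hypothetical choices for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair."""
    # Each term measures how far the policy has moved from the frozen reference.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    margin = chosen_shift - rejected_shift
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-beta * margin)
```

Because only differences from the reference enter the loss, a policy identical to the reference scores log 2 regardless of the absolute log-probabilities.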
- Soft targets
- A teacher's full probability distribution over tokens — a richer training signal than a hard label.
- Task interference
- Negative transfer — unrelated or imbalanced tasks degrading each other in multi-task training.
- Teacher / student
- The large source model and the small model trained to mimic it in distillation.
- Temperature (T)
- A divisor on logits before softmax; higher T softens the distribution to expose secondary probabilities for KD.
- Top-k logprobs
- The k most likely tokens and their log-probabilities per position; a compact offline capture of the teacher's distribution.
- vLLM
- A high-throughput inference server using paged attention to manage the KV cache.
- Production logging
- Recording request/response pairs at the inference endpoint into an append-only log (commonly JSONL). See Production feedback loop.
- PII redaction (logs)
- Replacing personal identifiers in logged text with tokens like [EMAIL] before the row is written to disk, and hashing user ids.
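A minimal sketch of this step, covering email addresses only. The regular expression is deliberately simple and will miss some valid addresses. The row field names, the salt handling, and the helper names are assumptions for illustration, not the Academy's logging schema.

```python
import hashlib
import re

# Simplified pattern; real redaction would cover more identifier types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_text(text):
    """Replace every email address with the [EMAIL] token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def hash_user_id(user_id, salt):
    """One-way hash so rows from the same user stay linkable but not identifiable."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def redact_row(row, salt):
    """Build the row that is safe to append to the log (hypothetical field names)."""
    return {
        "user": hash_user_id(row["user_id"], salt),
        "prompt": redact_text(row["prompt"]),
        "completion": redact_text(row["completion"]),
    }
```

Redaction runs on the in-memory row, so the raw identifiers never reach the append-only log.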
- Sampling budget
- The rule deciding which inference rows get logged; full logging is too expensive at scale, so keep all interesting rows (low-confidence, flagged, drift-window) and sample the rest.
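The rule can be sketched as a single predicate. The row fields (`confidence`, `flagged`, `in_drift_window`), the thresholds, and the function name are all hypothetical; only the keep-interesting, sample-the-rest shape comes from the definition above.

```python
import random


def should_log(row, sample_rate=0.05, confidence_floor=0.6, rng=random):
    """Keep every interesting row; keep a random fraction of the rest."""
    interesting = (
        row.get("confidence", 1.0) < confidence_floor   # low-confidence
        or row.get("flagged", False)                     # flagged
        or row.get("in_drift_window", False)             # drift-window
    )
    # rng.random() returns a float in [0.0, 1.0).
    return interesting or rng.random() < sample_rate
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible in tests.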
- Log tag
- Opaque string on a log row marking it as candidate training data (drift_window, user_negative, operator_review).
- Operator correction
- The corrected completion an operator (or stronger model) wrote for a tagged bad row; source of the SFT example in the feedback loop.
- Retrain cadence
- Trigger for retraining: on drift signal, on schedule, or on a failure-cluster count threshold.
- Tool call
- A structured JSON output naming a tool (function) from a fixed catalogue with validated arguments. See Tool-use fine-tuning.
- Function calling
- Synonym for tool-use; the vendor term (OpenAI etc.) for structured-JSON output mode.
- Tool schema
- A description (name + argument schema) of a callable function the model can target; Pydantic models declare them cleanly.
- No-tool path
- Explicit assistant output ({"tool":"no_tool","reason":"..."}) for when no available tool fits; trained on negative examples.
- Valid-tool-call rate
- Fraction of outputs that parse as JSON, name a real tool, and have arguments matching that tool's schema.
- Argument-match accuracy
- On valid parses, the fraction with arguments matching the gold for the same request.
- False-tool-call rate
- Fraction of inputs where the gold is no_tool but the model called a real tool; the most damaging routing failure mode.
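The three tool-call metrics above can be computed together, as in this sketch. Several choices here are assumptions: the catalogue and its tools are hypothetical, the key-and-type check stands in for real schema validation, a well-formed `no_tool` output is counted as valid, and the false-tool-call rate is taken over all inputs.

```python
import json

# Hypothetical catalogue: tool name -> {argument name: expected type}.
CATALOGUE = {"get_weather": {"city": str}, "set_timer": {"minutes": int}}
NO_TOOL = "no_tool"


def parse_call(raw):
    """Return the parsed call if it is valid, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not isinstance(call.get("tool"), str):
        return None
    if call["tool"] == NO_TOOL:
        return call                      # assumption: the no-tool path counts as valid
    schema = CATALOGUE.get(call["tool"])
    args = call.get("arguments")
    if schema is None or not isinstance(args, dict) or set(args) != set(schema):
        return None
    if not all(isinstance(args[k], t) for k, t in schema.items()):
        return None
    return call


def tool_metrics(outputs, golds):
    """Compute the three rates from raw output strings and gold call dicts."""
    parsed = [parse_call(o) for o in outputs]
    valid = [(p, g) for p, g in zip(parsed, golds) if p is not None]
    matches = sum(
        1 for p, g in valid
        if p["tool"] == g["tool"] and p.get("arguments") == g.get("arguments")
    )
    false_calls = sum(
        1 for p, g in valid if g["tool"] == NO_TOOL and p["tool"] != NO_TOOL
    )
    return {
        "valid_tool_call_rate": len(valid) / len(outputs),
        "argument_match_accuracy": matches / len(valid) if valid else 0.0,
        "false_tool_call_rate": false_calls / len(outputs),
    }
```

Argument-match accuracy is computed only over valid parses, which is why it is reported alongside the valid-tool-call rate rather than instead of it.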
- Structured pruning
- Physically removing whole attention heads, transformer layers, or MLP channels from a trained model so matrix shapes shrink and wall-clock latency drops. See Structured pruning.
- Head / layer / channel pruning
- The three granularities of structured pruning: drop a whole attention head, drop a whole transformer block, or drop output channels from an MLP. Head and layer pruning are the simplest; channel pruning gives finer control.
- Unstructured pruning
- Zeroing out individual weights without shrinking matrix shape. Saves theoretical FLOPs but rarely real wall-clock on commodity GPUs because sparse-matmul kernels aren't standard. Contrast with structured pruning.
- Recovery training
- The short SFT or distillation pass run after pruning to recover the quality the prune cost. Not optional — aggressive pruning without recovery training typically loses 10–30 points on substrate benchmarks.
- Importance scoring
- How structured pruning decides what to cut. Three families: magnitude (cheap, misleading), gradient × weight / SNIP-style (practical sweet spot), Hessian-based / OBD / WoodFisher (accurate but expensive).
- Speculative decoding
- Inference-time speed-up where a small draft model proposes K tokens and the target verifies them in a single forward pass; accepted tokens commit. Wall-clock win depends on acceptance rate and draft-vs-target cost ratio. See Speculative decoding.
- Draft model
- The smaller model in speculative decoding that proposes candidate tokens. Must share a tokenizer with the target and ideally come from the same model family to maximise the acceptance rate.
- Acceptance rate (α)
- Probability the target accepts a draft-model token in speculative decoding. The single biggest lever on the achievable speed-up; α near 1 with K=4 yields ~3× throughput, α near 0.5 yields ~1.4× at best.
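The dependence on α can be made concrete with the usual back-of-envelope model. This sketch assumes each draft token is accepted independently with probability α, and that one draft forward pass costs a fixed fraction `cost_ratio` of a target pass. Both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens committed per verification round with k draft tokens."""
    if alpha >= 1.0:
        return float(k + 1)              # all k drafts accepted, plus one target token
    # Geometric series 1 + alpha + ... + alpha^k.
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha, k, cost_ratio):
    """Tokens per round divided by the round's cost in target-pass units."""
    round_cost = k * cost_ratio + 1      # k draft passes plus one target pass
    return expected_tokens(alpha, k) / round_cost
```

Under these assumptions, K=4 with a cost ratio of 0.1 gives roughly 3.2× at α = 0.95 and roughly 1.4× at α = 0.5; a more expensive draft model lowers both figures.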
- EAGLE / Medusa
- Variants of speculative decoding that raise α without a separate draft model. EAGLE trains a tiny auxiliary head that predicts the target's hidden-state features; Medusa adds parallel decoding heads to the target itself and verifies via tree attention. Both ship in vLLM.
- Reasoning training
- Umbrella term for techniques that produce multi-step traces a verifier can score, not just final answers. Covers chain-of-thought SFT, process supervision via a step-level reward model, and outcome-supervised RL. See Reasoning training.
- Chain-of-thought SFT
- Supervised fine-tuning where each example is prompt → reasoning trace → final answer, with the loss covering both the trace and the answer. The standard warm-start before any reasoning-flavoured RL step.
- Process supervision
- Scoring each step of a reasoning trace with a separately-trained reward model (PRM). Higher fidelity than outcome reward; expensive because the labels are step-level human judgments.
- Outcome supervision
- Scoring only the final answer with a verifier (math executor, code runner, string-match). Cheap data, harder optimisation; the recipe behind R1-style training.
- PRM (Process Reward Model)
- A reward model trained on step-level labels. Scores each step of a candidate reasoning trace at RL training time.
- ORM (Outcome Reward Model)
- A reward signal derived from a final-answer verifier — often a deterministic checker rather than a trained model. Reward = 1 if the verifier passes, 0 otherwise.
- GRPO (Group Relative Policy Optimisation)
- RL fine-tuning algorithm used by DeepSeek-R1 and similar recipes. Drops PPO's value (critic) network; computes advantage group-relative across G samples per prompt — roughly half the memory and no critic to keep stable. The practical RL recipe at SLM scale.
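The group-relative advantage can be sketched in a few lines: each sample's reward is compared with the mean of its own group, which is what replaces the critic. Dividing by the group's standard deviation follows the commonly described formulation; the `eps` guard and the function name are assumptions of this sketch.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise the rewards of G samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std over the group
    # eps avoids division by zero when every sample got the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four samples for one prompt, scored 1 or 0 by an outcome verifier.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.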