Why load the model with dtype=torch.bfloat16?

To halve memory relative to fp32 while keeping training numerically stable

Why set tok.pad_token = tok.eos_token when pad_token is None?

Because batching needs a padding token

Why run a generation on the base model BEFORE fine-tuning?

To establish a baseline to compare against later

Track 2 · Hands-on · Lesson 2

Load a base model and tokenizer

After this lesson you can load a base model and tokenizer, inspect what you loaded, set the pad token, and sanity-check the model with a quick generation.

Level: beginner · Read time: ~9 min · Prerequisites: Set up the environment

Our running task for this track will be a simple one — sentiment classification — and our model will be HuggingFaceTB/SmolLM2-135M-Instruct: small enough to fine-tune on a single GPU in minutes, real enough to be useful. This lesson loads it and looks inside.

Load the tokenizer and model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
# older transformers: from_pretrained(model_id, torch_dtype=torch.bfloat16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

from_pretrained downloads the weights (cached after the first run) and builds the model. We load in bf16 to halve weight memory relative to fp32 (Track 1's mixed-precision lesson), then move the model onto the GPU if one is available and otherwise keep it on the CPU.
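To make the "halve memory" claim concrete, here is a back-of-the-envelope sketch. It assumes a round 135M parameters and counts weights only; activations, gradients, and optimizer state add to this during training. The helper name weight_memory_mb is hypothetical, not part of any library.

```python
# Bytes needed to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_mb(n_params, dtype):
    """Approximate memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6

n_params = 135_000_000  # assumption: round figure taken from the model name

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mb(n_params, dtype):.0f} MB")
# float32: 540 MB
# bfloat16: 270 MB
```

The saving matters far more for larger models, but the ratio is the same: two bytes per parameter instead of four.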

Inspect what you loaded

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params/1e6:.1f}M")

cfg = model.config
print("layers:", cfg.num_hidden_layers)
print("hidden size:", cfg.hidden_size)
print("vocab size:", cfg.vocab_size)
print("context length:", cfg.max_position_embeddings)

You should see ~135M parameters and the architectural numbers from Track 0 made concrete: a stack of Transformer layers, a hidden size (the embedding width), the vocabulary size, and the context window. This is the function whose parameters we'll nudge.
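As a cross-check, the parameter count can be rebuilt from the config numbers by hand. The sketch below assumes a Llama-style decoder: no bias terms, grouped-query attention, a three-matrix gated MLP, two norm vectors per layer, and input/output embeddings that share weights. The config values in the example call are assumptions about this checkpoint; read the real ones from model.config before relying on them.

```python
def count_params(vocab_size, hidden_size, num_layers, intermediate_size,
                 num_heads, num_kv_heads, tie_embeddings=True):
    """Estimate parameters of a Llama-style decoder (assumed layout, no biases)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim          # width of the key/value projections

    embed = vocab_size * hidden_size          # token embedding table
    attn = 2 * hidden_size * hidden_size      # query and output projections
    attn += 2 * hidden_size * kv_dim          # key and value projections
    mlp = 3 * hidden_size * intermediate_size # gate, up, and down projections
    norms = 2 * hidden_size                   # two norm weight vectors per layer

    total = embed + num_layers * (attn + mlp + norms) + hidden_size  # + final norm
    if not tie_embeddings:
        total += vocab_size * hidden_size     # separate output projection
    return total

# Assumed config values for illustration; verify against model.config
n = count_params(vocab_size=49152, hidden_size=576, num_layers=30,
                 intermediate_size=1536, num_heads=9, num_kv_heads=3)
print(f"{n/1e6:.1f}M")  # 134.5M
```

If the estimate and the number printed from model.parameters() disagree, one of the assumptions above (most often weight tying) does not hold for the model you loaded.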

Set the pad token

Batching needs a padding token (Track 1's tokenization lesson). Many base tokenizers don't define one, so set it to the end-of-sequence token:

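# Only needed when the tokenizer ships without a pad token
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
    # Keep the model config in sync so generate() knows the pad id
    model.config.pad_token_id = tok.pad_token_id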
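To see why batching needs a pad token at all, here is a minimal pure-Python sketch of what padding does. The token ids are made up, 0 stands in for the pad id, and pad_batch is a hypothetical helper, not a transformers function.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token ids; 0 stands in for the pad (here: eos) id
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(ids)   # [[11, 12, 13], [21, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

In practice the tokenizer does this for you when called on a list of strings with padding=True. The attention mask is what makes reusing eos as the pad token workable: padded positions are masked out rather than read as real end-of-sequence tokens.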

Sanity-check with a quick generation

Before fine-tuning, confirm the base model runs and see how it handles our task untuned — this is your baseline (Track 1's evaluation lesson).

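messages = [{"role": "user", "content": "Classify the sentiment as positive or negative: I loved this movie."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The template already inserted special tokens, so don't let the tokenizer add them again
inputs = tok(text, return_tensors="pt", add_special_tokens=False).to(device)

# Greedy decoding (do_sample=False) keeps the baseline reproducible
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Decode only the newly generated tokens, not the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))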

Notice we use apply_chat_template with add_generation_prompt=True — exactly the chat-template discipline from Track 1. The untuned model may answer correctly, verbosely, or off-format; that's fine. The point of fine-tuning is to make it answer reliably and concisely in our format. Next, we build the data that will teach it to.
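One generation is a smoke test; a baseline you can compare against later needs a score. The sketch below shows one possible way to turn free-form replies into labels and an accuracy figure. The replies are invented for illustration, not real model output, and parse_label and accuracy are hypothetical helpers.

```python
def parse_label(reply):
    """Map a free-form reply to 'positive', 'negative', or None if unclear."""
    text = reply.lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    if has_pos == has_neg:  # neither or both mentioned: treat as off-format
        return None
    return "positive" if has_pos else "negative"

def accuracy(replies, gold):
    """Fraction of replies whose parsed label matches the gold label."""
    hits = sum(parse_label(r) == g for r, g in zip(replies, gold))
    return hits / len(gold)

# Invented replies standing in for untuned model output
replies = [
    "Positive",
    "The sentiment of this review is negative.",
    "It could be positive or negative, depending on context.",
]
gold = ["positive", "negative", "positive"]

print(f"baseline accuracy: {accuracy(replies, gold):.2f}")  # baseline accuracy: 0.67
```

Counting off-format replies as wrong is a deliberate choice here: it lets the same score capture both correctness and format discipline, which is what fine-tuning is meant to improve.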

Key idea

Loading gives you two objects: a tokenizer (text ↔ token IDs, plus the chat template) and a model (the parameters). Always load both from the same model_id so they match, and always sanity-check the base model before training so you have a baseline.

Key terms

AutoModelForCausalLM: The class that loads a decoder-only language model by id.
AutoTokenizer: Loads the matching tokenizer (and its chat template) for a model id.
from_pretrained: Downloads/caches and instantiates a model or tokenizer.
dtype=torch.bfloat16: Loads weights in bfloat16 (16-bit), halving memory relative to fp32 (older API: torch_dtype).
pad_token: The padding token; set it to eos_token if undefined.
generate: Runs autoregressive decoding to produce output tokens.

Check yourself
