Product · Design

Gamifying a dev tool without making it a toy

Most gamified developer tools feel infantilizing. Confetti animations. Cartoon avatars. "Yay, you opened the app today!" We added a progression layer anyway — here are the design rules we kept, and what we explicitly refused to do.

What we wanted to fix

BrewSLM is a long-tail tool. Sessions are minutes to hours; the path from "first project" to "production model" plays out over weeks. Between meaningful milestones, the UI goes silent. There's no signal that crossing 90% F1 on a real eval is harder and more valuable than running the demo. No incentive to revisit underused capabilities (RAG, DPO, multi-base-model breadth). No recognition that shipping a deploy version means something.

The gap isn't that users don't know they did something good. It's that the tool doesn't acknowledge it.

What we refused to ship

Before designing what to build, we listed what to avoid:

No cartoon animations or confetti. The product is a lab tool, not a mobile game.
No cute names. "Master Wizard," "AI Sensei," "Llama Trainer Pro" — out.
No streaks for opening the app. Showing up isn't an achievement; producing something is.
No leaderboards, no social, no cloud sync. The product is local-first. Gamification stays local.
No friction-inducing achievements. Nothing that makes a user feel "I need this badge before I'm allowed to do X."

What we shipped: the Lab Journal

One persistent chip in the TopBar: ▣ L3 · 1,240 XP. Click it for a drawer (right-side overlay) that lists unlocked achievements with timestamps, locked achievements as next milestones, and hidden Discovery achievements as ▢ ??? until you trip them. Aesthetic: retro CRT terminal, phosphor green on dark, ASCII borders, monospace. Reads as "lab journal," not "kid game."

XP feeds off the audit stream

BrewSLM already emits a RunEvent for every meaningful workflow action (import, clean, train, eval, export, deploy). That stream is the perfect XP feedstock — every event is a real action with a structured payload, so we can both (a) award proportional XP and (b) check whether the event represents a first-time milestone. The gamification service is a single function — process_run_event(db, event) — that dispatches on stage + reason code. It's a best-effort tap; a bug in the progression layer can't break the data write path.

Levels are named after lab roles

L1 Intern → L2 Lab Tech → L3 ML Engineer → L5 Senior → L8 Staff → L10 Principal → L15 Distinguished. The titles set the tone: this is a career-ladder metaphor, not a wizarding-academy metaphor. The XP curve is floor(100 * level^1.5) — fast onboarding, steep mastery. Total to L10 is ~7,000 XP, which translates to roughly twenty meaningful runs.

Achievements track real ML skill

Three tiers:

Onboarding mirrors the existing Guide page checklist. First import, first clean, first train, first eval, first export, first deploy. The tutorial-shaped milestones.
Mastery tracks actual ML thresholds. F1 ≥ 0.80 / 0.90 / 0.95. Trained on three different base models (multi-model breadth). Imported from three different source connectors (multi-dataset breadth). First preference-pair (DPO) run. First RAG pipeline. First compressed export.
Discovery is the easter-egg tier. Hidden until unlocked. First training run between 00:00–05:00 (Night Owl). Recovered from an OOM via autopilot (OOM Survivor). Used --force on a low-confidence dataset import (With Authority — lampshades the architectural rule). Registered a custom mapper plugin (Modder). Triggered the LLM-assisted mapping mode (Phone a Friend). Cleaned a document end-to-end in under 30 seconds (Speedrunner).

Every achievement gets a one-line description and an XP value. Descriptions are dry. From the catalog: "Eval pass rate crossed 95%. Most teams plateau before this." Or: "Used --force on a low-confidence dataset import. You knew what you were doing."

Idempotent unlock, deduped toasts

The unlock path is a set-membership test before grant. The same trigger firing twice never double-pays. For repeated continuous events (dataset_import_run, training completes, eval passes), a 30-second per-(project, reason_code) suppress window means rapid-fire events drip XP silently but only emit one toast. We don't want to spam.

What we got right (we think)

Two design choices we're least likely to walk back:

Anchoring achievements to RunEvents, not pageviews. Showing up isn't an achievement. Producing a model that passes 90% F1 is. The XP feedstock is the audit stream because that's where the real work is.
Hidden Discovery tier. When a user stumbles into Night Owl after a late-night training session, the toast lands as a wink, not a tutorial. The hidden tier earns disproportionate engagement — discovery is a stronger signal of mastery than checking off a list.

What we'd still get wrong

If we shipped this for a team product instead of a single-user tool, we'd need to think harder about social comparison. The current design — local-only, no leaderboards — sidesteps the question entirely. The moment achievements become visible to coworkers, "I have to grind X" becomes a real failure mode. We didn't solve that; we deferred it by scoping local.

The other thing we'd be cautious about: making the gamification layer load-bearing. The XP can't be the reason someone trains a better model. It's a small extra signal, not the contract. If we caught ourselves designing the next phase of the actual product around an achievement we wanted to ship, that would be a sign to delete the gamification.