Product · Design

Gamifying a dev tool without making it a toy

Most gamified developer tools feel infantilizing. Confetti animations. Cartoon avatars. "Yay, you opened the app today!" We added a progression layer anyway — here are the design rules we kept, and what we explicitly refused to do.

What we wanted to fix

BrewSLM is a long-tail tool. Sessions are minutes to hours; the path from "first project" to "production model" plays out over weeks. Between meaningful milestones, the UI goes silent. There's no signal that crossing 90% F1 on a real eval is harder and more valuable than running the demo. No incentive to revisit underused capabilities (RAG, DPO, multi-base-model breadth). No recognition that shipping a deploy version means something.

The gap isn't that users don't know they did something good. It's that the tool doesn't acknowledge it.

What we refused to ship

Before designing what to build, we listed what to avoid:

- Confetti animations and cartoon avatars.
- Rewards for trivial actions, like praise for merely opening the app.
- Leaderboards or any visible social comparison (the tool is single-user; achievements stay local).

What we shipped: the Lab Journal

One persistent chip in the TopBar: ▣ L3 · 1,240 XP. Click it for a drawer (right-side overlay) that lists unlocked achievements with timestamps, locked achievements as next milestones, and hidden Discovery achievements as ▢ ??? until you trip them. Aesthetic: retro CRT terminal, phosphor green on dark, ASCII borders, monospace. Reads as "lab journal," not "kid game."

XP feeds off the audit stream

BrewSLM already emits a RunEvent for every meaningful workflow action (import, clean, train, eval, export, deploy). That stream is the perfect XP feedstock — every event is a real action with a structured payload, so we can both (a) award proportional XP and (b) check whether the event represents a first-time milestone. The gamification service is a single function — process_run_event(db, event) — that dispatches on stage + reason code. It's a best-effort tap; a bug in the progression layer can't break the data write path.
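The dispatch shape is simple enough to sketch. This is a minimal illustration of the pattern described above, not BrewSLM's real implementation: `RunEvent` field names and the XP values are assumptions, and the `try/except` is what makes the tap best-effort.

```python
from dataclasses import dataclass, field

@dataclass
class RunEvent:
    project_id: str
    stage: str        # "import" | "clean" | "train" | "eval" | "export" | "deploy"
    reason_code: str  # structured reason, e.g. "eval_passed" (illustrative)
    payload: dict = field(default_factory=dict)

# XP per (stage, reason_code); the values here are made up for the sketch.
XP_TABLE = {
    ("eval", "eval_passed"): 120,
    ("deploy", "deploy_created"): 200,
}

def process_run_event(db, event: RunEvent) -> None:
    """Best-effort tap on the audit stream: any exception is swallowed,
    so a bug in the progression layer can never break the write path
    that already persisted the RunEvent itself."""
    try:
        xp = XP_TABLE.get((event.stage, event.reason_code), 0)
        if xp:
            db.add_xp(event.project_id, xp)
        # first-time milestone / achievement check would dispatch here
    except Exception:
        pass  # progression is optional; the underlying data write already succeeded
```

The key property is the last two lines: the progression layer observes the stream, it never sits in front of it.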

Levels are named after lab roles

L1 Intern → L2 Lab Tech → L3 ML Engineer → L5 Senior → L8 Staff → L10 Principal → L15 Distinguished. The titles set the tone: this is a career-ladder metaphor, not a wizarding-academy metaphor. The XP curve is floor(100 * level^1.5) — fast onboarding, steep mastery. Total to L10 is ~7,000 XP, which translates to roughly twenty meaningful runs.
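One possible reading of that curve, where `floor(100 * level^1.5)` is the cost of each individual level-up. The accounting details are an assumption on my part, but under this reading the TopBar chip from earlier (L3 at 1,240 XP) is consistent: reaching L3 costs 801 XP cumulative and L4 costs 1,601.

```python
import math

def xp_to_reach(level: int) -> int:
    # Cumulative XP required to hold `level`: sum the per-level cost
    # floor(100 * l^1.5) for every level-up from L1 upward.
    return sum(math.floor(100 * l ** 1.5) for l in range(2, level + 1))

def level_for_xp(xp: int) -> int:
    # Walk upward until the next threshold is out of reach.
    level = 1
    while xp >= xp_to_reach(level + 1):
        level += 1
    return level
```

The shape is the point: early levels cost a few hundred XP each (fast onboarding), while each step past L8 costs thousands (steep mastery).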

Achievements track real ML skill

Achievements come in three tiers, from visible workflow milestones up to the hidden Discovery tier: the ▢ ??? entries in the journal that only reveal themselves when tripped.

Every achievement gets a one-line description and an XP value. Descriptions are dry. From the catalog: "Eval pass rate crossed 95%. Most teams plateau before this." Or: "Used --force on a low-confidence dataset import. You knew what you were doing."
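A catalog entry therefore only needs a code, a dry one-liner, an XP value, and a hidden flag. A minimal sketch of that shape, using the two descriptions quoted above; the codes, XP values, and hidden flags are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Achievement:
    code: str
    description: str   # one dry line, no exclamation points
    xp: int
    hidden: bool = False  # hidden entries render as "▢ ???" until tripped

# Codes and XP values here are hypothetical; the descriptions are from the catalog.
CATALOG = {
    "eval_95": Achievement(
        "eval_95",
        "Eval pass rate crossed 95%. Most teams plateau before this.",
        250,
    ),
    "force_import": Achievement(
        "force_import",
        "Used --force on a low-confidence dataset import. You knew what you were doing.",
        50,
        hidden=True,
    ),
}
```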

Idempotent unlock, deduped toasts

The unlock path is a set-membership test before grant: the same trigger firing twice never double-pays. For high-frequency events (dataset_import_run, training completions, eval passes), a 30-second per-(project, reason_code) suppression window means rapid-fire events still drip XP silently but emit only one toast. Nobody wants toast spam.

What we got right (we think)

Two design choices we're least likely to walk back:

- Feeding XP off the real RunEvent stream as a best-effort tap. XP tracks actual work because every award traces back to a structured audit event, and a progression bug can never corrupt the data write path.
- The dry register. Career-ladder titles, terse one-line descriptions, and the retro-terminal aesthetic are what keep a progression layer from reading as a toy.

What we'd still get wrong

If we shipped this for a team product instead of a single-user tool, we'd need to think harder about social comparison. The current design — local-only, no leaderboards — sidesteps the question entirely. The moment achievements become visible to coworkers, "I have to grind X" becomes a real failure mode. We didn't solve that; we deferred it by scoping local.

The other thing we'd be cautious about: making the gamification layer load-bearing. The XP can't be the reason someone trains a better model. It's a small extra signal, not the contract. If we caught ourselves designing the next phase of the actual product around an achievement we wanted to ship, that would be a sign to delete the gamification.