THE SIGNAL
A. Karpathy dropped microgpt, a 200-line Python script that trains a tiny GPT from scratch on a dataset of names. No PyTorch, no dependencies, just raw autograd and attention. The whole thing fits on your screen.
The WebML Community ported the concept to the browser with WebGPU acceleration. You drag nodes, connect layers, and watch a transformer learn in real time.
The shift: transformers stopped being a black box you read papers about. Now you can break one apart and poke at the pieces while it trains.
AI Agents Are Reading Your Docs. Are You Ready?
Last month, 48% of visitors to documentation sites hosted on Mintlify were AI agents, not humans.
Claude Code, Cursor, and other coding agents are becoming the actual customers reading your docs. And they read everything.
This changes what good documentation means. Humans skim and forgive gaps. Agents methodically check every endpoint, read every guide, and compare you against alternatives with zero fatigue.
Your docs aren't just helping users anymore—they're your product's first interview with the machines deciding whether to recommend you.
That means:
→ Clear schema markup so agents can parse your content
→ Real benchmarks, not marketing fluff
→ Open endpoints agents can actually test
→ Honest comparisons that emphasize strengths without hype
In the agentic world, documentation becomes 10x more important. Companies that make their products machine-understandable will win distribution through AI.

What it does:
A browser-based visual tool where you build, train, and run tiny transformer models by dragging nodes and connecting them. Runs entirely client-side using WebGPU. No server, no install, nothing leaves your machine.
What it replaces:
Reading transformer papers and hoping you understood → Actually building one yourself
Running Karpathy's Python script and staring at terminal output → Watching loss curves, attention patterns, and logits update live
"I think I get attention" → Dragging a query/key/value node and seeing what happens when you remove it
Cost:
Free. Runs on Hugging Face Spaces.
Use it if:
You build with LLMs but have never actually trained one. Or you've read "Attention Is All You Need" three times and still can't explain multi-head attention without checking your notes.
HOW IT WORKS

Build is where you drag transformer primitives onto a canvas: token embeddings, position embeddings, Q/K/V projections, matmul, softmax, RMSNorm, MLP layers. Wire them together however you want. The nodes mirror what Karpathy's 200-line script does under the hood, except you can see the graph instead of reading code.
Train takes a toy dataset (character-level text, like a file of 32K names) and runs forward/backward passes through your graph. You set learning rate, steps, betas. Loss starts around 3.3 (random guessing) and drops toward 2.4 as the model learns character patterns. The whole thing trains in seconds because you're working with maybe 4,000 parameters, not billions.
Run lets you sample from your trained model. Start with a beginning-of-sequence token, autoregress through softmax with temperature control, and watch the model hallucinate plausible names like "karia" or "toran." You can see the logit distribution at each step. The model has never seen these names before. It learned the statistical pattern of what English names look like.
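The Run loop above can be sketched in a few lines of plain Python. Everything here is illustrative: `toy_logits` is a hypothetical stand-in for the trained graph, and the 27-token vocab (token 0 as the beginning/end-of-sequence marker, 1–26 as 'a'–'z') mirrors the character-level setup described above.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Divide by temperature first: low T sharpens, high T flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-in for a trained model: returns fake logits over a
# 27-token vocab (index 0 = BOS/EOS, 1..26 = 'a'..'z').
def toy_logits(context):
    return [0.1 * ((i * 7 + len(context)) % 27) for i in range(27)]

def sample_name(temperature=0.8, max_len=16, seed=0):
    rng = random.Random(seed)
    context, out = [0], []                # start from the BOS token
    for _ in range(max_len):
        probs = softmax(toy_logits(context), temperature)
        tok = rng.choices(range(27), weights=probs)[0]
        if tok == 0:                      # BOS/EOS token ends the name
            break
        out.append(chr(ord('a') + tok - 1))
        context.append(tok)
    return "".join(out)

print(sample_name())
```

Swap `toy_logits` for a real forward pass and this is the whole autoregressive sampling loop: predict, sample, append, repeat.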
EVERY NODE, EXPLAINED

Here's what each draggable piece actually does. Think of it as a factory line for words.
Embeddings: The front door. Takes a token ID (like 'a' = 0) and turns it into a vector of numbers. A second embedding adds position info (where the token sits in the sequence). After this step, each token "knows" what it is and where it is.
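In code, the front door is just two table lookups added together. A minimal sketch with made-up sizes (in the playground these are learned tables, not random ones):

```python
import random

vocab_size, block_size, emb_dim = 27, 16, 4   # tiny, illustrative sizes
rng = random.Random(0)

# Two lookup tables of small random vectors (the real graph learns these).
tok_emb = [[rng.gauss(0, 0.02) for _ in range(emb_dim)] for _ in range(vocab_size)]
pos_emb = [[rng.gauss(0, 0.02) for _ in range(emb_dim)] for _ in range(block_size)]

def embed(token_ids):
    # Each token's vector = what-it-is + where-it-is.
    return [[t + p for t, p in zip(tok_emb[tok], pos_emb[i])]
            for i, tok in enumerate(token_ids)]

x = embed([1, 2, 3])   # e.g. the characters 'a', 'b', 'c'
```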
Projections (wq, wk, wv, wo): Matrix multiplies that create different views of each token. wq: "what am I looking for?" wk: "what do I contain?" wv: "what can I offer?" wo: recombines everything after attention finishes. These are the lenses the model uses to decide what's relevant.
Transpose: Swaps rows and columns so the math lines up for dot products. Boring but necessary.
Matmul: Matrix multiply. The workhorse of the entire graph. Dots Q against K to get similarity scores. Multiplies attention weights by V to blend information. Shows up everywhere.
Scale: Divides scores by the square root of the head dimension. Without this, softmax gets pushed to extremes and gradients vanish. One line of math that keeps training stable.
Softmax: Turns raw scores into probabilities that sum to 1. Input: [2.1, 0.8, -0.3]. Output: [0.73, 0.20, 0.07]. Now you have "how much to attend to each past token" as a real probability distribution.
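The arithmetic is easy to verify yourself; for the scores [2.1, 0.8, -0.3] this prints roughly [0.73, 0.2, 0.07]:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]   # exponentiate...
    total = sum(exps)
    return [e / total for e in exps]       # ...then normalize to sum to 1

probs = softmax([2.1, 0.8, -0.3])
print([round(p, 2) for p in probs])   # -> [0.73, 0.2, 0.07]
```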
Weighted Sum: Multiplies those probabilities by value vectors and sums. The result is a blend of information from past tokens, weighted by how relevant each one was.
ReLU: If the number is positive, keep it. If negative, zero it out. This is the non-linearity in the MLP. Without it, stacking linear layers would collapse into a single linear layer. You need this to learn anything interesting.
RMS Norm: Normalizes vector magnitudes so they don't explode or shrink to nothing during training. Applied before attention and before the MLP. The reason deep networks can train at all.
Attention: The full assembled head. Projects Q, K, V, scores them (matmul, scale, softmax), blends values (weighted sum), projects output. This is how the current token looks backward at everything that came before it.
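Assembled from the nodes above (matmul, scale, softmax, weighted sum), one causal head looks roughly like this in plain Python. The weight matrices are hypothetical placeholders; the playground learns them:

```python
import math

def matmul(A, B):
    # plain triple-loop matrix multiply
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(x, wq, wk, wv, wo):
    """One causal attention head over a sequence of token vectors x."""
    Q, K, V = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # score q against every key up to and including position i (causal mask),
        # scaled by sqrt(head dim) to keep softmax well-behaved
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # weighted sum: blend the value vectors by those attention weights
        blended = [sum(w * V[j][k] for j, w in enumerate(weights))
                   for k in range(len(V[0]))]
        out.append(blended)
    return matmul(out, wo)
```

Note the causal mask is just the `range(i + 1)`: each position only ever scores against keys at or before itself.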
MLP: Two linear layers with ReLU between them. Expands the dimension (usually 4x), runs the non-linearity, shrinks back. This is where per-token "thinking" happens. Attention moves information between tokens; the MLP processes it.
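As code, the MLP is two matrix multiplies with a ReLU in between. A bias-free sketch (weight shapes are illustrative, with the usual 4x expansion):

```python
def relu(v):
    # zero out negatives, keep positives
    return [max(0.0, x) for x in v]

def linear(v, W):
    # v (length n) times W (n x m) -> length m
    return [sum(x * w for x, w in zip(v, col)) for col in zip(*W)]

def mlp(v, w_up, w_down):
    # expand (e.g. 4x wider), apply the non-linearity, shrink back
    return linear(relu(linear(v, w_up)), w_down)
```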
Decoder Layer: The repeating block: RMS Norm, Attention, add the input back (residual connection), RMS Norm again, MLP, add the input back again. Stack these for depth. More layers, more abstraction.
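The wiring of that block, per token vector, with the attention and MLP sublayers passed in as stand-ins (real attention mixes information across tokens; it's simplified to a vector-to-vector function here to keep the residual structure visible):

```python
def rms_norm(v, eps=1e-5):
    # scale the vector so its root-mean-square is ~1
    rms = (sum(x * x for x in v) / len(v) + eps) ** 0.5
    return [x / rms for x in v]

def decoder_layer(v, attn, mlp):
    # norm -> attention -> add the input back (residual)
    v = [a + b for a, b in zip(v, attn(rms_norm(v)))]
    # norm -> MLP -> add the input back again
    v = [a + b for a, b in zip(v, mlp(rms_norm(v)))]
    return v
```

The residual adds are why stacking is safe: even if a sublayer outputs garbage early in training, the input still flows through unchanged.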
LM Head: Final linear projection to vocab-size logits. If your vocab is 27 characters, you get 27 raw scores. Softmax turns those into probabilities, and you sample your next token.
THE KNOBS YOU CAN TURN

Embedding Dim (16-128) — Vector size per token. Bigger means richer representation, slower training.
Intermediate Size (4x embedding dim) — How wide the MLP expands. More room for the model to "think."
Attention Heads (4+) — Splits the embedding into parallel attention streams. Each head can learn a different pattern (one tracks vowels, another tracks position, etc.).
Layers (1-6) — How many decoder blocks you stack. More depth means more abstraction, but diminishing returns on tiny data.
Vocab Size (27+) — Number of unique tokens. For character-level names, that's 26 letters plus a start token.
Block Size (16-64) — Context window. How many past tokens each position can attend to.
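A back-of-the-envelope count shows where the "maybe 4,000 parameters" figure from the Train section comes from. The bookkeeping assumptions here (no biases, untied LM head, two norm gains per layer) are mine, not necessarily the playground's exact graph:

```python
def param_count(vocab=27, emb=16, block=16, layers=1):
    inter = 4 * emb                          # MLP expands 4x
    embeddings = vocab * emb + block * emb   # token + position tables
    attn = 4 * emb * emb                     # wq, wk, wv, wo
    mlp = emb * inter + inter * emb          # up- and down-projection
    norms = 2 * emb                          # two RMSNorm gains per layer
    lm_head = emb * vocab                    # final projection to logits
    return embeddings + layers * (attn + mlp + norms) + lm_head

print(param_count())   # -> 4224
```

Turning the knobs shows why tiny models train in seconds: even doubling the depth only adds ~3,000 parameters at these sizes.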
WHAT HAPPENS WHEN YOU HIT TRAIN
Forward pass: Tokens flow through embeddings, through each decoder layer (norm, attention, residual, norm, MLP, residual), out the LM head. Cross-entropy loss measures how wrong the next-token prediction was.
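The "random guessing starts around 3.3" number falls straight out of cross-entropy over a 27-token vocab: an untrained model assigns 1/27 to everything, and -ln(1/27) = ln 27 ≈ 3.3.

```python
import math

def cross_entropy(probs, target):
    # negative log-probability the model assigned to the correct next token
    return -math.log(probs[target])

uniform = [1 / 27] * 27        # an untrained model: every token equally likely
print(round(cross_entropy(uniform, 0), 2))   # -> 3.3
```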
Backward pass: Starting from the loss, gradients chain backward through every operation. Each node knows its own derivative (ReLU: 1 if positive, 0 if not. Matmul: the other input transposed). The playground visualizes this flow live.
Update (Adam): Each parameter gets nudged: new weight = old weight minus learning rate times gradient, smoothed by momentum terms (betas). Repeat.
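For a single scalar weight, the Adam update looks roughly like this (standard textbook form with bias correction; the playground's exact defaults may differ):

```python
def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of grads
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared grads
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)   # the actual nudge
    return w, m, v
```

The betas are the smoothing knobs exposed in the Train panel: beta1 smooths the gradient direction, beta2 smooths its magnitude.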
Repeat until loss drops. Random guessing starts around 3.3. A trained model gets to about 2.0. You can watch this happen in real time.
Until next week,
@speedy_devv



