Life of π Token
- Authors: Sunil Tiwari (@sunil28071987)
Prologue
This article follows one token's journey through a modern language model.
The token's name is Pi.
Pi is born in a tokenizer, voyages through the 80 layers of a Transformer, washes up in a scratchpad full of intermediate reasoning, and finally, if it survives, reaches shore as part of the user-visible output. Most of what happens to Pi never makes it back to the user. That hidden middle is where the interesting stuff lives.
The title's strikethrough — π — is the visual equivalent of what reasoning models do in their scratchpads. They write things down, cross them out, try again, refine. By the time the user sees the answer, all that crossed-out work has been stripped away. The "scratch" in scratchpad is literally scratch — write, strike, revise.
Inside the model, that strike-out happens with special tokens like <think> / </think> (DeepSeek-R1) or <scratch> / </scratch> (research literature), bracketing reasoning that gets removed before the user sees it.
So: Life of π Token. Three acts. Let's go.
Act 1 — Birth: from text to tokens
Pi enters the world
A user types this into a chat interface:
What is π to 10 digits?
That string is just bytes — UTF-8 encoded characters sitting in a memory buffer. No model on earth understands "characters." Models work on integers. Before anything else can happen, the text needs to be turned into a list of integer IDs.
That's the tokenizer's job.
The tokenizer
Modern LLMs use subword tokenization — most commonly Byte-Pair Encoding (BPE) or one of its variants. Subword tokenizers strike a balance: they're more efficient than character-level (one token per character would be wasteful) and more flexible than word-level (which can't handle "supercalifragilistic" or "GPT-4o").
Here's roughly how What is π to 10 digits? becomes tokens:
| Token | ID (example) | Notes |
|---|---|---|
| `What` | 3923 | Common word, gets its own token |
| ` is` | 374 | Leading space is part of the token |
| `π` | 26341 | The Greek letter, a single subword |
| `to` | 311 | Common |
| `10` | 220 | Numbers often get their own ranges |
| `digits` | 19016 | Common word |
| `?` | 30 | Punctuation |
(The exact IDs depend on the tokenizer. GPT-4o, Llama-3, DeepSeek — all use BPE but with different vocabularies.)
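To make the merge idea concrete, here is a minimal sketch of how BPE applies learned merges to a single word. The merge table `TOY_MERGES` is invented for illustration; real tokenizers learn tens of thousands of merges, operate on bytes, and pre-split text before merging, none of which is modeled here.

```python
def bpe_encode(word, merges):
    """Apply BPE merges to one word, highest-priority (earliest) pair first."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, ordered by priority (not from any real tokenizer).
TOY_MERGES = [("d", "i"), ("g", "i"), ("t", "s"), ("di", "gi"), ("digi", "ts")]

common = bpe_encode("digits", TOY_MERGES)  # fully merged into one token
rarer = bpe_encode("digit", TOY_MERGES)    # falls back to smaller pieces
```

With this toy table, a word the merges fully cover collapses into one token, while a near miss is left as two subword pieces.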
Our protagonist is Pi — the token ID 26341, which represents the character π.
Vocabulary and rare tokens
The full vocabulary of a modern tokenizer is around 100K–200K tokens. Common words get one token each. Rare words get split into multiple subword tokens. Emoji and non-Latin scripts can take several tokens per character, and code identifiers often break into many small pieces.
Example: gpt4-turbo-2024-04-09 might tokenize as:
gpt | 4 | - | turbo | - | 2024 | - | 04 | - | 09
That's 10 tokens for what looks like one identifier.
Special tokens
Here's where things get interesting for us.
A subset of the vocabulary is reserved for special tokens — IDs that don't appear in normal text but mark structural boundaries. Examples:
| Token | Purpose |
|---|---|
| `<\|endoftext\|>` | Marks the end of a document during training |
| `<\|im_start\|>` | "Instruction message start"; wraps a turn in chat models |
| `<\|im_end\|>` | End of a chat turn |
| `<\|user\|>` / `<\|assistant\|>` | Role markers in chat templates |
| `<think>` / `</think>` | DeepSeek-R1's reasoning markers |
| `<scratch>` / `</scratch>` | Generic scratchpad markers from research literature |
These tokens have integer IDs just like normal tokens. The tokenizer is configured to never split them. If your text happens to contain the literal string <|endoftext|>, the tokenizer treats that as one token, not as the individual characters <, |, e, n, etc.
In Hugging Face-style tokenizers, for example, this is handled by the added_tokens entries in the tokenizer configuration. They sit in a separate registry from the BPE merge rules, and they are matched first, so they always win.
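The "special tokens always win" behavior can be sketched as a pre-pass that cuts the text at special-token boundaries before any ordinary tokenization runs. The function name `split_on_special` and the IDs in `SPECIAL_TOKENS` are assumptions made up for this sketch, not taken from a specific tokenizer.

```python
import re

# Illustrative registry; the IDs are placeholders, not real vocabulary entries.
SPECIAL_TOKENS = {
    "<|endoftext|>": 900001,
    "<|im_start|>": 900002,
    "<|im_end|>": 900003,
}


def split_on_special(text, special_tokens):
    """Split text into ("special", s) and ("text", s) pieces, in order."""
    # Longest first so a longer marker is never shadowed by a shorter one.
    names = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue  # re.split yields empty strings between adjacent matches
        kind = "special" if chunk in special_tokens else "text"
        pieces.append((kind, chunk))
    return pieces


pieces = split_on_special("Hello<|im_end|><|endoftext|>", SPECIAL_TOKENS)
```

Only the `"text"` pieces would then go through BPE; each `"special"` piece maps directly to its reserved ID.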
Chat templates wrap the message
Before Pi's input ever reaches the model, a chat template wraps it in role markers. For a model trained with ChatML format, the wrapped input looks like:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is π to 10 digits?<|im_end|>
<|im_start|>assistant
That trailing newline after <|im_start|>assistant is the cue: "your turn — start generating from here."
So Pi's actual token sequence at the model's input is not just [What, is, π, to, 10, digits, ?]. It's something like:
<|im_start|> system \n You are a helpful assistant. <|im_end|>
<|im_start|> user \n What is π to 10 digits? <|im_end|>
<|im_start|> assistant \n
— around 30 tokens, of which our Pi is just one. The model has no idea who's talking unless those role markers are there. They're how the model knows "this is the user; now I respond."
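The wrapping step is plain string formatting. Below is a minimal sketch of a ChatML-style renderer; `render_chatml` is a hypothetical helper written for this article, and the `add_generation_prompt` flag simply borrows the name used by Hugging Face chat templates.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn and stop: the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


MESSAGES = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is π to 10 digits?"},
]

prompt = render_chatml(MESSAGES)
```

The rendered string is what actually gets tokenized, so the role markers become ordinary positions in the sequence, right alongside Pi.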
End of Act 1
Pi is born. Token ID 26341. Floating in a sequence of ~30 sibling tokens, surrounded by special tokens that frame the conversation. Next, Pi gets a body.
Act 2 — The voyage: through the Transformer
Embeddings: Pi gets a body
The first thing the model does with Pi's integer ID is look it up in an embedding table.
The embedding table is a learned matrix of shape (vocab_size, d_model). For Llama-3-70B: (128256, 8192). So Pi becomes an 8192-dimensional vector — a point in a very high-dimensional space where geometric proximity encodes semantic similarity.
Pi's vector is close to other math-related tokens (3.14, circle, radius) and far from unrelated tokens (banana, Tuesday).
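An embedding lookup is just row selection, and "close" usually means high cosine similarity. The sketch below uses a made-up 4-dimensional table; the vectors and the IDs for "circle" and "banana" are invented for illustration, and a real table would have `d_model` columns with learned values.

```python
import math

# Toy embedding table: token ID -> vector. All values are invented.
EMBEDDINGS = {
    26341: [0.9, 0.1, 0.0, 0.2],   # "π" (ID from the example table above)
    1001: [0.8, 0.2, 0.1, 0.3],    # "circle" (hypothetical ID)
    1002: [-0.1, 0.9, 0.7, 0.0],   # "banana" (hypothetical ID)
}


def embed(token_id):
    """Look up the vector for a token ID (one row of the table)."""
    return EMBEDDINGS[token_id]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


pi_vs_circle = cosine(embed(26341), embed(1001))
pi_vs_banana = cosine(embed(26341), embed(1002))
```

In this toy table the π vector points in nearly the same direction as "circle" and is roughly orthogonal to "banana".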
Position matters
A bare embedding tells the model what token Pi is, but not where it sits in the sequence. Modern models add positional information — either with sinusoidal encodings, learned position embeddings, or RoPE (rotary position embeddings), which is now the dominant choice.
Whatever the flavor, the result is the same: Pi now knows both its identity and its position. The vector entering layer 1 of the Transformer encodes both.
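For RoPE specifically, the core idea fits in a few lines: rotate each pair of dimensions by an angle proportional to the token's position. This is a simplified sketch that pairs adjacent dimensions and uses the commonly cited base of 10000; production implementations apply it to query and key vectors inside attention and may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.4, 0.9, 0.1]

# Same relative offset (2) at different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 7), rope(k, 5))
```

Because each pair is a pure rotation, vector length is unchanged, and the query-key dot product depends only on how far apart the two positions are.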
Through the layers
Pi enters a stack of identical Transformer blocks — typically 32 to 96 of them, depending on model size.
Each block does two things:
- Multi-head attention. Pi looks at the tokens before it in the sequence (decoder-only models mask out later positions) and pulls relevant information from them.
- Feedforward (MLP). Pi processes that information through a two-layer fully-connected network, often with a hidden dimension 4× the model dimension.
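The attention half of a block can be sketched with a single head and no learned projections. This is a minimal illustration under those assumptions: it takes query, key, and value vectors as given, assumes a decoder-only model where each position attends only to itself and earlier positions, and leaves out multiple heads, the MLP, residual connections, and normalization.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Position t may only look at positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(t + 1)
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

attended = causal_attention(Q, K, V)
```

The first position has nothing earlier to attend to, so its output is just its own value vector; later positions are weighted blends of everything up to and including themselves.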
We already wrote a whole article on the first part — Attention resolves through DNS. For our purposes here, the key fact is: at every layer, Pi's vector gets updated based on what other tokens contribute.
As a rough intuition (real models are messier): early layers mostly pick up local context such as neighboring tokens, middle layers capture sentence structure and the wider prompt, and late layers shape what the next token should be. The vector keeps getting richer.
The decision: to think or not to think
After processing the prompt, the model reaches the position right after <|im_start|>assistant\n — the cue to start generating.
For a non-reasoning model, the next token is just the first word of the answer. The model emits π or 3.14159... and keeps going until it generates <|im_end|>.
For a reasoning model (o1, R1, Claude with extended thinking), something different happens. The first token the model emits is <think> — a special token that opens the scratchpad.
This isn't a hand-coded rule. The model learned during training that opening a scratchpad before answering produces better outcomes. Through reinforcement learning (R1) or supervised fine-tuning on reasoning traces (early CoT models), the model internalized: "hard problem → start thinking → write reasoning → close thinking → answer."
Inside the scratchpad
Once <think> is emitted, the model is generating tokens inside the scratchpad. These tokens look like any other tokens: they go through the same attention, the same MLP, the same sampling. Nothing architectural changes. The only difference is what training has taught the model to write there, which is reasoning rather than a direct answer.
A scratchpad for our query might look like:
<think>
The user wants π to 10 digits.
I know π = 3.141592653589793...
Let me count: 3.1415926535 — that's 10 digits after the decimal.
Actually wait — "to 10 digits" usually means 10 significant figures.
3.141592654 (rounded at the 10th significant digit).
Let me give both interpretations.
</think>
Notice what's happening to Pi inside this scratchpad. It appears multiple times: as π, as 3.141592653589793, as 3.141592654. Each appearance is tokenized afresh and gets its own embeddings, its own attention pattern, its own MLP output. Pi is being reflected on. The model is doing serial computation that one fixed-depth forward pass cannot do on its own.
This is the entire reason scratchpads work. A single forward pass is a fixed-depth computation. If you need to do 10 sequential reasoning steps, you can't do them all inside one forward pass — but you can spread them across 10 tokens of scratchpad output, with each token getting its own full forward pass that conditions on everything before. The scratchpad turns parallel architecture into serial computation.
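A toy version of that argument: suppose one "forward pass" can perform exactly one addition. The loop below is a deliberately crude stand-in for autoregressive decoding, not a model; `forward_pass` and `generate` are hypothetical names, and the scratchpad is a list of running totals.

```python
def forward_pass(numbers, scratch):
    """Stand-in for one fixed-depth pass: reads all context, does ONE addition."""
    running_total = scratch[-1] if scratch else 0
    next_index = len(scratch)  # how far the scratchpad has progressed
    return running_total + numbers[next_index]


def generate(numbers):
    """Autoregressive loop: each emitted value is appended and re-read."""
    scratch = []
    while len(scratch) < len(numbers):
        scratch.append(forward_pass(numbers, scratch))
    return scratch


scratchpad = generate([3, 1, 4, 1, 5])
final_answer = scratchpad[-1]
```

No single call can sum five numbers, but five calls chained through the scratchpad can. The serial depth comes from the loop, not from the function.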
The model emits </think> and the answer
After however many tokens of reasoning, the model emits </think> to close the scratchpad. The next tokens are the actual answer:
</think>
π to 10 significant digits is 3.141592654 (rounded).
If you meant 10 digits after the decimal, that's 3.1415926536.
Then <|im_end|> signals the end of the assistant's turn.
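The two readings of "10 digits" in that answer can be checked with Python's standard `decimal` module. This sketch hard-codes the first 20 decimal places of π as a string (the constant name `PI_DIGITS` is ours) and rounds half up at two different positions.

```python
from decimal import Decimal, ROUND_HALF_UP

# First 20 decimal places of π, written out as a literal.
PI_DIGITS = Decimal("3.14159265358979323846")

# 10 significant digits = 1 digit before the point + 9 after it.
ten_significant = PI_DIGITS.quantize(Decimal("1e-9"), rounding=ROUND_HALF_UP)

# 10 digits after the decimal point.
ten_decimals = PI_DIGITS.quantize(Decimal("1e-10"), rounding=ROUND_HALF_UP)
```

Both values agree with the model's final answer. The scratchpad's 3.1415926535 is the truncated form of the second one.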
End of Act 2
Pi has voyaged through 80 layers, been embedded as multiple vectors, been reasoned about inside a scratchpad, and now sits as part of the final answer. But the user still hasn't seen anything yet.
Act 3 — The shore: output and stripping
Post-processing
When the model finishes generating, the full token stream looks like this:
<|im_start|>assistant
<think>
The user wants π to 10 digits.
I know π = 3.141592653589793...
... (scratchpad reasoning)
</think>
π to 10 significant digits is 3.141592654 (rounded).
If you meant 10 digits after the decimal, that's 3.1415926536.
<|im_end|>
The API layer takes this raw stream and splits it into two parts:
- Everything between `<think>` and `</think>` → the reasoning content.
- Everything else (between `</think>` and `<|im_end|>`) → the assistant content.
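The split itself needs little more than a regular expression. Here is a minimal sketch, assuming the raw text starts right after the assistant generation prompt and uses `<think>` / `</think>` markers; `split_reasoning` is a hypothetical helper, not any provider's actual implementation.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Return (reasoning, answer) from a raw assistant turn."""
    # Drop the end-of-turn marker and anything after it.
    text = raw.split("<|im_end|>", 1)[0]
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


RAW = (
    "<think>\n"
    "The user wants π to 10 digits.\n"
    "</think>\n"
    "π to 10 significant digits is 3.141592654 (rounded).<|im_end|>"
)

reasoning, answer = split_reasoning(RAW)
```

A production splitter would more likely work on token IDs during streaming than on decoded text, but the partition it produces is the same.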
What the user gets back depends on which API they're using:
| Provider | Reasoning treatment |
|---|---|
| OpenAI (o1, o3) | Raw reasoning is hidden. The user sees only the answer, sometimes with a short summary of the reasoning for transparency. |
| DeepSeek (R1) | Returned as a separate reasoning_content field. Some clients show it. |
| Anthropic (Claude with extended thinking) | Returned as thinking blocks in the response, optionally encrypted to protect IP. |
Despite the policy differences, all three providers strip the reasoning from the assistant's visible content by default. The user-facing answer is the clean part.
What you see vs. what really happened
Pi the token did a lot of work that no one ever saw. The user typed a question and got back:
π to 10 significant digits is 3.141592654 (rounded). If you meant 10 digits after the decimal, that's 3.1415926536.
That's maybe 30 tokens of output. The model may have generated 500 tokens of scratchpad reasoning to produce it. You pay for all 530.
Below the surface
A worked example shows how much gets hidden. Asked for 17 × 23, a reasoning model might wrap the following in <think> / </think> before giving its answer:
Let me break this down.
17 × 23 = 17 × (20 + 3)
= 17 × 20 + 17 × 3
= 340 + 51
= 391
Let me double-check: 17 × 23.
17
× 23
----
51 (17 × 3)
340 (17 × 20)
----
391
Confirmed.
The hidden-to-visible ratio matters. For OpenAI's o1, the API bills you for "reasoning tokens" even though you don't see them. For DeepSeek-R1, the visible reasoning content can easily be 10–100× the size of the final answer.
This is why a "simple" question to a reasoning model can cost 50× a normal completion. The model is doing 50× the work — it's just not showing all of it.
Why scratchpads actually work
The scratchpad trick predates reasoning models. The original paper, "Show Your Work: Scratchpads for Intermediate Computation with Language Models" (Nye et al., 2021), found that letting models emit intermediate steps dramatically improved performance on arithmetic, code execution, and multi-step reasoning, even with no architectural changes. Just let them write the steps down.
Why does this work? Three reasons:
1. Serial depth from parallel architecture. A Transformer's forward pass is fixed in depth: 80 layers means 80 sequential blocks of attention and MLP, and no more. If a problem requires 200 sequential reasoning steps, you can't reliably solve it in one forward pass. But you can solve it in 200 tokens of scratchpad output, where each token gets its own full forward pass conditioned on all the previous tokens. The scratchpad converts the model's parallel architecture into serial computation by using the autoregressive loop as a stand-in for recurrent computation.
2. Each token is a chance to recover. If the model makes a mistake on token 12 of its reasoning, it can correct itself at token 30. The KV cache means the model still "sees" the mistake, but later tokens can write things like "wait, that's wrong, let me redo it." Without a scratchpad, a single forward pass either gets it right or doesn't — there's no recovery path.
3. The training distribution matches. Reasoning models are trained explicitly on traces that include scratchpads. The model learned that scratch tokens are productive for producing correct answers. Without that training signal, asking a base model to "think step by step" gets you reasoning that the model isn't optimized to produce.
The famous "Let's think step by step" finding (Kojima et al., 2022) showed that even prompting a non-reasoning model to verbalize its steps boosts performance. The mechanism is the same: serial computation routed through the autoregressive loop.
The economics of scratch tokens
Scratch tokens are real tokens. They consume compute. They count against context windows. They get billed.
Rough numbers for a modern reasoning model:
| Question type | Visible output | Scratch tokens (typical) | Hidden ratio |
|---|---|---|---|
| Trivia ("Who wrote Hamlet?") | ~10 | ~10 | 1× |
| Math problem | ~50 | ~500–2000 | 10–40× |
| Code generation | ~200 | ~2000–8000 | 10–40× |
| Hard reasoning (IMO, GPQA) | ~100 | ~10,000–50,000 | 100–500× |
This is why o1 is expensive even when it gives short answers. The visible token count and the actual compute are decoupled.
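The arithmetic behind the table is simple enough to script. The sketch below computes billed tokens, hidden ratio, and cost for one request; the function `billed_cost` and the price of 10 USD per million output tokens are placeholders for illustration, not any provider's real pricing.

```python
def billed_cost(visible_tokens, scratch_tokens, usd_per_million_output):
    """Total billed output tokens, hidden-to-visible ratio, and cost in USD."""
    total = visible_tokens + scratch_tokens
    return {
        "billed_tokens": total,
        "hidden_ratio": scratch_tokens / visible_tokens,
        "cost_usd": total * usd_per_million_output / 1_000_000,
    }


# "Math problem" row at its low end: ~50 visible, ~500 scratch tokens.
# The price is a made-up placeholder.
math_problem = billed_cost(50, 500, usd_per_million_output=10.0)

# "Hard reasoning" row at its high end: ~100 visible, ~50,000 scratch tokens.
hard_problem = billed_cost(100, 50_000, usd_per_million_output=10.0)
```

At the same per-token price, the hard-reasoning request costs roughly ninety times as much as the math one, even though its visible answer is only twice as long.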
For application developers, this has real consequences:
- Latency: scratch tokens take wall-clock time. A 30-second response from a reasoning model can be mostly scratch generation.
- Cost: priced per token. The bill is denominated in all tokens, not just the ones you see.
- Context limits: a long scratchpad eats into the same context window as the visible conversation.
Some APIs let you cap or steer scratch tokens, for example through a thinking-token budget (Anthropic's budget_tokens) or an effort setting (OpenAI's reasoning_effort). Useful for budget control, dangerous if you cap too low: the model may not have enough room to actually reason through hard problems.
Hidden vs. visible reasoning
There's a debate happening right now about whether reasoning traces should be shown to users.
Arguments for hiding (OpenAI's stance):
- The raw reasoning may contain wrong intermediate steps that confuse users.
- Reasoning traces are valuable IP — competitors can fine-tune on them.
- Safety: raw reasoning may reveal dangerous content (e.g. how the model arrived at a refusal).
- Users mostly want the answer, not the work.
Arguments for showing (DeepSeek's stance, partially Anthropic's):
- Auditability — you can verify the model isn't making things up.
- Debugging — if the answer is wrong, you can see where it went wrong.
- Trust — users can judge whether the reasoning is sound.
- Research — the community can study reasoning patterns.
Anthropic's middle path: return reasoning as separate thinking blocks, optionally encrypted (you can pass them back on the next turn so the model has continuity, but you can't read them).
There's no consensus yet. The economics, the safety arguments, and the auditability arguments are all real and partially in tension.
Epilogue — Pi reaches shore
Let's trace Pi's full journey one more time:
- Text: Pi appears as `π` in the user's question.
- Tokenization: BPE turns it into integer ID `26341`.
- Chat template: wrapped in `<|im_start|>user ... <|im_end|>` markers.
- Embedding: ID 26341 becomes an 8192-dim vector.
- Position: RoPE adds positional information.
- Forward pass: 80 layers of attention + MLP. Pi's vector evolves at each layer.
- Scratchpad opens: model emits `<think>`. New tokens, including Pi reappearances, are generated.
- Reasoning: model reflects on Pi's value, considers multiple interpretations.
- Scratchpad closes: model emits `</think>`.
- Visible answer: model emits the final response. Pi appears again as `3.141592654`.
- Stripping: API splits the stream. Reasoning goes to one field, answer to another.
- Display: user sees only the answer.
The user reads "3.141592654" and never knows about Pi's voyage. The 500 tokens of scratchpad reasoning, the multiple embeddings, the 80 layers of attention — all of it happened, was billed for, and was thrown away.
But it wasn't wasted. The scratchpad is why the answer is correct.
Takeaways
- Tokens are integers. Tokenizers turn text into integer IDs using BPE. The vocabulary is typically 100K–200K tokens.
- Special tokens are reserved IDs that mark structural boundaries: turn separators, document boundaries, scratchpad markers. They're enforced by the tokenizer's `added_tokens` configuration.
- Chat templates wrap user input in role markers (`<|user|>`, `<|assistant|>`, `<|im_start|>`, `<|im_end|>`) so the model knows whose turn it is.
- Scratchpads turn parallel computation into serial. They give a fixed-depth Transformer the ability to reason through problems that require more steps than its layer count.
- Scratch tokens are real tokens. They cost money, take time, and consume context. The hidden-to-visible ratio for hard problems can be 100× or more.
- Different providers handle visibility differently: OpenAI hides, DeepSeek shows, Anthropic returns thinking blocks separately. There's no consensus yet.
Pi reached shore. The scratchpad made it possible.
