Byte Pair Encoding (BPE) is the algorithm that turns raw text into tokens — the units every LLM reads, reasons in, and is billed by. This post traces how it works, why it was invented, and what understanding it changes about the way you build with language models.
- It Started with a Bill
- The Assumption That Was Wrong
- The Decision to Go Deeper
- The Problem BPE Solves
- The Byte Pair Encoding Algorithm, Step by Step
- Code: Building BPE from Scratch
- The </w> Marker — Why It’s There
- From Training to Encoding
- How Byte Pair Encoding Connects to Real Tokenizers
- Why This Matters — Especially If You’re Optimizing Costs
- Key Takeaways
- Try It Yourself
It Started with a Bill
A few months into building AI agents at work, I got a question from my manager that I wasn’t fully prepared for:
“Why is this costing so much?”
We had built a fairly capable agent system — the kind I documented in my Building AI Agents from Scratch series. It worked well. Users loved it. But the API costs were climbing faster than expected, and the culprit was always the same thing: tokens.
Every LLM API call is billed by tokens. More tokens in your prompt, more tokens in the response, bigger the bill. So we did what any practical engineering team does — we optimized. We built a tool internally (we called it TRON) to compress prompts, trim context, reduce redundancy. It worked. Costs came down.
But here’s the thing that kept nagging at me.
Tokenization Was Hiding in Plain Sight
And TRON wasn’t even the only place tokens were showing up. In Part 9 of the agents series, we built a vector search tool — documents ingested, chunked, and stored as embeddings in pgvector. For that, we used Sentence Transformers, the open-source embedding library. Free to run, no API costs. The embeddings were great.
But Sentence Transformers also tokenize your text before computing the embedding vector. Quietly, under the hood, Sentence Transformers split every sentence into tokens first — using the same kind of subword tokenization — before computing the embedding vector.
I had tokenization happening in two separate parts of my stack. Paid tokenization in the LLM API. Free tokenization in the embedding model. In both cases, I was consuming the output without understanding the process.
I had been managing tokens for months without truly understanding what they were. I knew tokens weren’t exactly words. My context window was measured in tokens, not characters. Sending a large document to an LLM would “use up” tokens. But if you had asked me to explain how a sentence becomes tokens, I would have fumbled.
And that bothered me.
The Assumption That Was Wrong
I had been assuming, without realizing it, that tokens were roughly equivalent to words.
One word, one token. Simple enough.
Then I actually checked.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Words: {len(text.split())}") # 9
print(f"Tokens: {len(tokens)}") # 10
OK, close enough for that sentence. But then I tried something longer with punctuation, numbers, and a technical term:
text = "GPT-4 achieves 96.3% accuracy on HumanEval benchmarks."
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]
print(f"Words: {len(text.split())}") # 7
print(f"Tokens: {len(tokens)}") # 14
print(decoded)
# ['G', 'PT', '-', '4', ' achieves', ' ', '96', '.', '3', '%', ' accuracy', ' on', ' Human', 'Eval', ' benchmarks', '.']
Seven words. Fourteen tokens. GPT-4 alone became four tokens. HumanEval split into two. The percentage 96.3% shattered into fragments.
My mental model was broken. I had been thinking about LLMs all wrong.
The Decision to Go Deeper
This is where my thinking shifted.
For the past year, I had been building on top of LLMs — agents, tools, APIs, orchestration. That work is documented in the previous series. And it’s valuable work. But somewhere along the way I started feeling like I was operating a machine I didn’t fully understand. I could tune it, connect it to things, deploy it. But the inside was a black box.
So I made a decision: I was going to learn how LLMs actually work. From scratch.
Not just read a blog post or two. Actually go deep. I started with the obvious question — what do I need to understand first?
The answer kept pointing to the same stack:
- Neural networks (the foundation)
- The Transformer architecture (the breakthrough)
- The “Attention is All You Need” paper (the original design)
- And then — how to actually build one
I picked up Build a Large Language Model from Scratch by Sebastian Raschka. And the very first real concept it asks you to understand before writing a single line of model code is this:
How does text become tokens?
That question led me to Byte Pair Encoding.
I read about it, implemented it, and immediately thought: I need to write this down while it’s fresh.
So here we are.
The Problem BPE Solves
Before we get to the algorithm, let’s understand the problem it’s solving. Because when I first asked “how does an LLM decide what a token is?”, the answer wasn’t obvious.
If you’re building an LLM, one of the first decisions you have to make is: what is the vocabulary? Every word the model will ever read, every concept it will ever express, has to map to something in that vocabulary. So what should those units be?
Two naive options:
Option 1: Character-level vocabulary Vocab = [a, b, c, ..., z, A, ..., Z, 0, ..., 9, !, ?, ...]
This is tiny (~100 tokens) and handles any word — even typos and new words. But every sentence becomes hundreds of tokens. The model has to learn to compose meaning across many steps. It’s slow, and it’s hard.
Option 2: Word-level vocabulary Vocab = every word in the training corpus
This gives you rich, meaningful tokens — but your vocabulary explodes. English alone has hundreds of thousands of words, and that’s before you add names, code, multiple languages, misspellings, and compound words. You’d need a million-token vocabulary, most of which appear only once. And you can’t handle new words at all.
Byte Pair Encoding (BPE) is the middle ground. It starts with characters and iteratively merges the most common adjacent pairs into new tokens — until you reach a target vocabulary size. The result is a vocabulary where:
- Common words (
the,is,and) are single tokens - Rare words are split into meaningful subword pieces (
tokenization→token+ization) - Nothing is truly “unknown” — you can always fall back to characters
The Byte Pair Encoding Algorithm, Step by Step
When I first read the BPE algorithm description, it sounded almost too simple.
You just merge the most common pairs? That’s it?
Yes. That’s it. But simple doesn’t mean unsurprising. Let me trace through it with a small real English corpus and show you what the algorithm actually discovers.
Corpus: "eat eater eating"
Three related words. Nothing pre-labelled. The algorithm has no idea what English is.
Start: split everything into characters
'e a t </w>' → the word "eat"
'e a t e r </w>' → the word "eater"
'e a t i n g </w>' → the word "eating"
Count every adjacent pair
(e, a) → 3 ← appears in all three words
(a, t) → 3 ← appears in all three words
(t, e) → 1 ← only in "eater"
(t, i) → 1 ← only in "eating"
...
Two pairs are tied at 3. Pick the first: merge (e, a) → ea.
After merge 1:
'ea t </w>'
'ea t e r </w>'
'ea t i n g </w>'
Count again. This time, `(ea, t)` appears 3 times — and wins immediately. Merge to `eat`.
After merge 2:
'eat </w>'
'eat e r </w>'
'eat i n g </w>'
Look what just happened. The algorithm has never seen a dictionary. It has no grammar rules and no concept that “eat” is a verb. It just counted — and ‘eat‘ emerged as a token** because it’s the most frequent substring across all three words.
Everything else (er, ing) appears only once each, so the algorithm stops building here (or needs more data to go further).
That’s Byte Pair Encoding. A blind frequency counter that accidentally discovers meaningful linguistic units — because in real language, meaning and frequency are correlated.
Code: Building BPE from Scratch
Reading the algorithm is one thing. Writing it is where it actually clicks.
I implemented BPE in Python as part of learning — no libraries, no shortcuts. Here’s the full implementation:
from collections import defaultdict
def get_vocab(corpus: str) -> dict:
"""Split corpus into word frequency counts, treating each word as a sequence of chars."""
vocab = defaultdict(int)
for word in corpus.strip().split():
# Add end-of-word marker so the model learns word boundaries
vocab[" ".join(list(word)) + " </w>"] += 1
return dict(vocab)
def get_pairs(vocab: dict) -> dict:
"""Count all adjacent symbol pairs across the vocabulary."""
pairs = defaultdict(int)
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return dict(pairs)
def merge_pair(pair: tuple, vocab: dict) -> dict:
"""Merge a given pair of symbols into a single token across the vocabulary."""
new_vocab = {}
bigram = " ".join(pair)
replacement = "".join(pair)
for word, freq in vocab.items():
new_word = word.replace(bigram, replacement)
new_vocab[new_word] = freq
return new_vocab
def train_bpe(corpus: str, num_merges: int) -> list:
"""Train BPE on a corpus and return the list of merge rules."""
vocab = get_vocab(corpus)
merges = []
print(f"Initial vocab: {vocab}\n")
for i in range(num_merges):
pairs = get_pairs(vocab)
if not pairs:
break
# Pick the most frequent pair (ties broken by first occurrence)
best_pair = max(pairs, key=pairs.get)
vocab = merge_pair(best_pair, vocab)
merges.append(best_pair)
print(f"Merge {i + 1}: {best_pair} → {''.join(best_pair)}")
print(f" Vocab: {vocab}\n")
return merges
# --- Run it ---
corpus = "low lower lowest new newer"
merges = train_bpe(corpus, num_merges=8)
print("=" * 40)
print("Final merge rules:")
for i, merge in enumerate(merges, 1):
print(f" {i}. {merge[0]} + {merge[1]} → {''.join(merge)}")
Sample output:
Initial vocab:
'l o w </w>' : 1
'l o w e r </w>' : 1
'l o w e s t </w>': 1
'n e w </w>' : 1
'n e w e r </w>' : 1
Merge 1: ('e', 'r') → er
Merge 2: ('l', 'o') → lo
Merge 3: ('lo', 'w') → low
Merge 4: ('n', 'e') → ne
Merge 5: ('ne', 'w') → new
Merge 6: ('low', 'e') → lowe
Merge 7: ('lowe', 'r') → lower
Merge 8: ('new', 'e') → newe
A few things in this output stopped me.
er merged before lo — why? Because er appears in two different word families: lower and newer. While lo only appears in the low- family. Cross-word frequency wins. The algorithm rewards patterns that are widely shared, not just locally common.
lowe and newe are not real English words. The algorithm learned them anyway. After building low and new as tokens, it kept merging because lowe (from low+e in lower and lowest) appeared more than once. BPE has no dictionary. It doesn’t know lowe isn’t a word. It just saw a frequent pattern and merged it.
This is the algorithm’s honesty — and its limitation. It discovers real structure (low, new, er) but also captures statistical noise (lowe, newe). At scale, with billions of words, the signal overwhelms the noise and the vocabulary becomes genuinely meaningful. On five words, you see both.
The same process, on billions of real words, merged (r, r) into rr. rr appears in error, correct, arrow, mirror, carrot, arrange, strawberry — constantly. So at some point during GPT-4’s training, (r, r) earned enough frequency to become a single token. That detail has consequences we’ll come back to.
The </w> Marker — Why It’s There
You may have noticed the </w> end-of-word marker in the code above.
Without it, the algorithm couldn’t distinguish a `w` at the end of `new` (a complete word) from a `w` in the middle of `newer`. Consequently, a merge like `(n, e, w)` would bleed across word boundaries into other words. The `</w>` marker makes word endings explicit, so `new</w>` (the standalone word) and `new` (a prefix inside another word) stay as distinct tokens.
Modern tokenizers like tiktoken use a different trick — they pre-split on whitespace using a regex before applying BPE — but the goal is the same: preserve word boundary information.
From Training to Encoding
Once BPE is trained, you have a list of merge rules (ordered by when they were created). To encode new text:
- Split into characters
- Apply merge rules in the same order they were learned, from first to last
- The remaining symbols are your tokens
The ordering matters. Earlier merges (more frequent pairs) take priority. This guarantees consistent tokenization.
def encode(word: str, merges: list) -> list:
"""Encode a single word using learned BPE merge rules."""
tokens = list(word) + ["</w>"]
for pair in merges:
i = 0
new_tokens = []
while i < len(tokens):
# Try to match the current pair
if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
new_tokens.append("".join(pair))
i += 2
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return tokens
# Encode a new word using the merge rules we learned
print(encode("lower", merges)) # → ['lower</w>']
print(encode("lowest", merges)) # → ['low', 'e', 's', 't</w>']
print(encode("lowly", merges)) # → ['low', 'l', 'y</w>']
lower is in the training data so it encodes to a single token. lowest is partially known — low is a learned token but est wasn’t frequent enough to merge. lowly uses low but the rest falls back to characters. This graceful degradation is the whole point.
How Byte Pair Encoding Connects to Real Tokenizers
Once I had the toy implementation working, the next thing I did was run the same kind of inspection on GPT-4’s actual tokenizer. I wanted to see BPE’s fingerprints in the real thing.
The GPT-4 tokenizer (cl100k_base, via the tiktoken library) has a vocabulary of 100,277 tokens, built by running BPE on a massive multilingual corpus — billions of words across dozens of languages, code repositories, books, and web pages. Here’s what it looks like in practice:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
examples = [
"Hello, world!",
"tokenization",
"ChatGPT",
"chatgpt",
"def calculate_loss(predictions, targets):",
"الذكاء الاصطناعي", # Arabic: "artificial intelligence"
]
for text in examples:
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]
print(f"{text!r}")
print(f" Tokens ({len(tokens)}): {decoded}")
print()
A few things you’ll notice — and these directly explain behaviors I had seen but never understood:
"ChatGPT"and"chatgpt"tokenize differently — BPE is case-sensitive, so casing literally changes your token count- Python keywords and common code patterns often get their own tokens — the model was trained on a lot of code
- Arabic text fragments into far more tokens per word than English — the training corpus was English-heavy, so Arabic subwords were rarer and got merged less
"tokenization"splits into["token", "ization"]— the subword boundaries are semantically meaningful, not arbitrary
This last point hit me. BPE doesn’t know English. It has no grammar rules. It just counts. And yet it discovers meaningful subword structure because meaning and frequency are correlated in real text.
Why This Matters — Especially If You’re Optimizing Costs
This brings me back to where the story started: the bill.
When we built TRON to optimize our token usage, we were making decisions about which parts of a prompt to trim, which context to drop, how to compress system messages. We were doing that based on word counts and intuition. Now I understand the actual unit we were optimizing.
Here’s what knowing BPE concretely changes:
1. Tokens ≠ Words On average, 1 token ≈ 0.75 English words — but that’s just an average. In practice, code, URLs, numbers, and non-English text often produce many more tokens per word. If your system prompt is full of technical terminology, you’re likely spending more tokens than you realize.
2. Rare and domain-specific words are expensive A common word like the is always 1 token. By contrast, an uncommon word like xylitol might be 3-4 tokens. Medical records, legal documents, financial filings — these contain rare words that BPE never saw often enough to merge. Every unusual term is multiple tokens.
3. Tokenization shapes model behavior — the strawberry problem
This one deserves its own discussion, because it’s the most counterintuitive thing BPE does to model behaviour.
In 2024, a question went viral: “How many r’s are in the word strawberry?”
GPT-4 confidently said 2. The correct answer, however, is 3.
People laughed. Memes were made. “AI can’t count letters.” But I didn’t want to just accept that. I wanted to understand why. So I started experimenting with a Token Visualizer — a tool that shows you exactly how GPT tokenizes any input.
Experiment 1 — Type “strawberry”
strawberry → 3 tokens, 10 characters
['str'] ['aw'] ['berry']
The model doesn’t receive the word strawberry. It receives three separate chunks. The word has been sliced across syllable boundaries — str, aw, berry. That’s what goes into the model.
OK, interesting. But this alone doesn’t explain why it says 2. Let me keep digging.
Experiment 2 — Type “berry” alone
berry → 1 token, 5 characters
"berry" is a single, indivisible token in GPT’s vocabulary. The model knows "berry" as one whole unit — the way you’d recognise a face without counting its features. When the model sees "berry", it doesn’t automatically think b-e-r-r-y. It thinks berry.
Experiment 3 — Type “r”
r → 1 token, 1 character
Fine. A single r is one token. Expected.
Experiment 4 — Type “rr”
rr → 1 token, 2 characters
This stopped me.
"rr" — the double-r — is also a single token. Two characters, one token.
And once I saw that, I knew exactly why. Go back to the BPE algorithm we built earlier: it merges the most frequent adjacent character pairs, one by one, until the vocabulary is full. "rr" appears constantly in English — error, correct, arrow, mirror, carrot, strawberry, arrange — so during training, the pair (r, r) earned enough frequency to get merged into a single token. That’s not a special rule. That’s just BPE doing exactly what we described.
The algorithm we implemented by hand — the one that merged (l, o) into lo and then (lo, w) into low because they were frequent — is the same algorithm that quietly merged (r, r) into rr while training on billions of words of real text. The strawberry problem isn’t some mysterious failure. It’s the BPE merge rule, applied at scale, surfacing as unexpected model behaviour.
Putting it together
Now the model’s answer of 2 makes complete sense.
The LLM doesn’t count characters. It was never trained to. It’s a token-prediction machine — it thinks, reasons, and counts in tokens.
When asked “how many r’s in strawberry?”, the model receives ['str', 'aw', 'berry']. To answer the question, it needs to reason about r’s inside each token:
'str' → contains one r-token → count: 1
'aw' → no r → count: 0
'berry' → the model may further decompose this as [b, e, rr, y]
where 'rr' is ONE token → count: 1
──────────────────────────────────────────────────────
Total token-level r count → 2
That’s the answer the model gives. Not because it’s bad at maths. Because it counted correctly — at the token level. "rr" is one token, so it counts as one. The double-r is invisible when you’re thinking in token units, not character units.
We confirmed this ourselves:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for text in ["strawberry", "berry", "r", "rr"]:
ids = enc.encode(text)
print(f"{text!r:15} → {len(ids)} token(s), {len(text)} character(s) {[enc.decode([t]) for t in ids]}")
'strawberry' → 3 token(s), 10 character(s) ['str', 'aw', 'berry']
'berry' → 1 token(s), 5 character(s) ['berry']
'r' → 1 token(s), 1 character(s) ['r']
'rr' → 1 token(s), 2 character(s) ['rr']
The `”rr”` finding is what unlocks the explanation. It’s not a vague “the model got confused” story. It’s a direct consequence of how Byte Pair Encoding works: frequent character pairs get merged. `”rr”` appears constantly in English — `”error”`, `”correct”`, `”arrange”`, `”strawberry”` — so BPE merged it. Now it’s one unit. And the model, which thinks in those units, counts it as one.
The core insight: The model counts tokens, not characters. BPE decides what a token is. When BPE merges
"rr"into one token, the model genuinely has no native way to see it as two r’s — unless it has been explicitly fine-tuned to do so.
This is why the strawberry problem isn’t an isolated quirk. It’s a window into the fundamental nature of how LLMs represent text. And once you’ve seen it, you notice it everywhere:
- Models misspell rare words — BPE fused those characters into a token the model learned as a shape, not a sequence
- Arithmetic on long numbers is unreliable — BPE bundles digits into multi-digit tokens, so `”1234″` might be one token or four
- Reversing a string is hard — token order and character order aren’t the same thing
"ChatGPT"and"chatgpt"are genuinely different inputs — different tokens, different everything
Newer models answer the strawberry question correctly now — not because the tokenization changed, but because they were fine-tuned on thousands of character-level tasks and learned to compensate. The vocabulary still has "rr" as one token. The model just learned to work around it.
4. Before sending expensive prompts, check your token count. tiktoken for OpenAI models, the HuggingFace `tokenizers` library for open-source models. It takes 3 lines of code and can save real money at scale.
Key Takeaways
- Byte Pair Encoding starts from characters and iteratively merges the most frequent adjacent pairs — no linguistic knowledge, pure statistics
- The result is a vocabulary of subword tokens that gracefully handles common words, rare words, and completely new words
- Common words (
the,is,and) become single tokens; rare words split into recognizable pieces; nothing is ever truly unknown - The merge rules follow a fixed order — the sequence learned during training applies at inference time. Consistent, deterministic, reproducible.
- Tokens are not words. Once that clicked for me, a lot of confusing LLM behavior suddenly made sense
Try It Yourself
- Run the training code on a different corpus — try a paragraph from a Wikipedia article. What tokens does BPE naturally discover? Are they linguistically meaningful?
- Count tokens in your own prompts using
tiktoken. Take a system prompt you use regularly. Is the token count what you expected? Which parts are more expensive than you thought? - Break the model (intentionally): Ask GPT-4 “How many letter ‘e’s are in the word ‘telephone’?” Then tokenize
telephonewithtiktoken. Does the tokenization explain the wrong answer? - Explore the full vocabulary: Loop through token IDs and decode them —
enc.decode([i])foriin range. You’ll find single characters, common words, code fragments, and some genuinely surprising entries. - Compare languages: Tokenize the same sentence in English and then in another language you know. Count the tokens. Does the ratio match your intuition about how that language was represented in the training data?
I’m learning LLMs from scratch — reading, implementing, and writing as I go.
