Lab 01 — Tokenization From Scratch (Solution Walkthrough)
Phase: 1 — Foundations | Difficulty: ⭐⭐☆☆☆ | Time: 1–2 hours
Read
../HITCHHIKERS-GUIDE.md§Tokenization first. This document walks through the solution code line-by-line.
0. What you build and why
Three tokenizers, ranging from naïve to production:
| Tokenizer | Vocab | Round-trip safe? | Real-world use |
|---|---|---|---|
WhitespaceTokenizer | All unique whitespace-split tokens | ❌ (loses casing of OOV, no punctuation handling) | Pedagogical only |
RegexTokenizer | Tokens from GPT-2's pre-tokenization regex | ❌ for OOV | Used as the first stage of GPT-2/GPT-4 BPE |
ByteLevelTokenizer | Fixed 256 (one per byte) | ✅ for any UTF-8 input | The fallback in tiktoken/GPT-4; pure byte models |
You will see by direct measurement why naïve splitting fails on real text and why GPT-2 layers a regex on top of bytes/BPE.
Run
pip install -r requirements.txt
python lab.py # TODO scaffold — fill in
python solution.py # reference implementation
1. The GPT-2 pre-tokenization regex
GPT2_PAT = re.compile(
r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
This is the single most important regex in modern NLP. Read each alternative left-to-right (the | operator):
'(?:[sdmt]|ll|ve|re)— English contractions. Captures's,'d,'m,'t,'ll,'ve,'resoDon't→Don,'t(two reusable tokens).' ?\p{L}+'— an optional leading space followed by 1+ Unicode letter characters (\p{L}). The leading space is part of the token — that's why GPT models render text by simple"".join(tokens)(no" ".join)." hello"is one token, distinct from"hello".' ?\p{N}+'— same for numbers. Splits"abc123"into["abc", "123"].' ?[^\s\p{L}\p{N}]+'— runs of punctuation/symbols.\s+(?!\S)— runs of whitespace not followed by non-whitespace (trailing whitespace).\s+— any other whitespace run (final fallback).
Why is this regex the right pre-tokenization? Because BPE merges look at adjacent characters within a pre-token; if you don't pre-split, "the dog" could merge across the space into a single token "the dog" — wasteful and brittle. By forcing " dog" as a self-contained pre-token, BPE can only learn merges within " dog", never across.
You don't run BPE in this lab (that's an extension), but you set up the pre-tokenization layer that BPE consumes.
2. WhitespaceTokenizer
class WhitespaceTokenizer:
UNK = "<unk>"
def __init__(self):
self.token_to_id: dict[str, int] = {}
self.id_to_token: dict[int, str] = {}
Two parallel dicts — cheaper than list.index() lookups.
def train(self, corpus, min_freq=1, special_tokens=None):
specials = [self.UNK] + (special_tokens or [])
counts = Counter()
for line in corpus:
counts.update(line.split())
vocab = list(specials) + [tok for tok, c in counts.most_common()
if c >= min_freq and tok not in specials]
self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
Key choices:
- Special tokens reserve the lowest IDs (
<unk>is always id 0). Convention; production code hard-codespad_id=0orunk_id=0everywhere. most_common()ensures stable ordering: more frequent tokens get smaller ids → embedding tables are more cache-friendly.min_freqlets you drop hapaxes (words seen once). For natural text, ~50% of unique tokens appear once but contribute <2% of total tokens.
def encode(self, text):
unk = self.token_to_id[self.UNK]
return [self.token_to_id.get(t, unk) for t in text.split()]
def decode(self, ids):
return " ".join(self.id_to_token.get(i, self.UNK) for i in ids)
encode is information-lossy on OOV → <unk>. decode reinserts spaces but cannot recover the original whitespace structure — multiple spaces, tabs, newlines all become single spaces.
3. RegexTokenizer
Only difference from WhitespaceTokenizer: replace line.split() with GPT2_PAT.findall(line). This single change fixes:
- Punctuation:
"hello!"→["hello", "!"]. - Contractions:
"don't"→["don", "'t"]. - Number boundaries:
"abc123"→["abc", "123"].
Decoding uses "".join(...) — the leading-space tokens already carry their spaces. This is the trick that makes round-trip preserve spacing for in-vocab tokens.
4. ByteLevelTokenizer
class ByteLevelTokenizer:
def encode(self, text):
return list(text.encode("utf-8"))
def decode(self, ids):
return bytes(ids).decode("utf-8", errors="replace")
Three lines. Yet this is what production LLMs fall back to.
text.encode("utf-8")producesbytesin[0, 255]. UTF-8 is variable-length: ASCII is 1 byte, accented Latin is 2, CJK is 3, emoji is 4.- The vocab is fixed at 256 — no training step.
errors="replace"handles ill-formed byte sequences (truncated multi-byte chars from streaming generation) by insertingU+FFFDinstead of crashing.
Round-trip safety: for any input, decode(encode(x)) == x. Try emoji, Arabic, code, unicode whitespace.
Trade-off: a token = a byte. English averages ~1 char ≈ 1 byte ≈ 1 token. After BPE we collapse common byte sequences into ~0.25 tokens/char. So pure byte-level inflates sequence length 4× vs BPE — slower but trivially robust.
5. The runner
sample = "Hello, world! Don't tokenize naively — GPT-2's regex is smart."
corpus = [sample] * 100
Corpus is the same sentence ×100. Enough to populate vocabularies; lab is about correctness not training a useful tokenizer.
The round_trip_ok flag will reveal:
whitespace: True only because we trained on the exact string. Add an unseen word → False.regex: True for the same reason — but spacing/punctuation is preserved correctly.byte: Always True.
The sanity check against tiktoken confirms our regex matches GPT-2's. tiktoken then runs BPE merges on top — that's why its token count is even lower than our regex count.
6. Expected output
[whitespace] n_tokens= 10 round_trip_ok=True
[regex ] n_tokens= 17 round_trip_ok=True
[byte ] n_tokens= 64 round_trip_ok=True
[tiktoken ] n_tokens= 17
Numbers may differ by ±1 depending on how the em-dash is encoded. Try a slightly modified input — change one word to something not in corpus. The whitespace tokenizer will produce <unk> and break round-trip; the others won't.
7. Common pitfalls
revsregex— only theregexpackage supports\p{L}Unicode properties. Standardrewill silently match nothing.- Decoding regex tokens with
" ".join— adds extra spaces because the leading-space tokens already carry them. Always"".join. - Forgetting
errors="replace"in byte-level decoding — production streaming generation will yield partial multi-byte chars, and bare.decode("utf-8")raises. - Reserving
<unk>mid-vocab — always put specials at IDs0..k-1. Many downstream libraries assume this. text.split()vstext.split(" ")—split()(no arg) collapses runs of whitespace;split(" ")keeps empty strings.
8. Stretch exercises
- Implement BPE training (Sennrich 2016): start with byte vocab; repeatedly find the most frequent adjacent pair and merge it into a new token; stop at target vocab size. ~100 lines.
- Implement byte-level BPE like GPT-2: pre-tokenize with
GPT2_PAT, then do BPE within each pre-token over its UTF-8 bytes. Thebytes_to_unicode()helper is the trickiest piece. - Compute compression ratio (chars/token) on 1 MB of English Wikipedia. Whitespace ~5.0, regex ~4.5, byte 1.0, tiktoken
cl100k_base~4.0. - Run on Chinese / Arabic / code and compare. tiktoken
gpt2is famously bad on non-English (5–10× more tokens per char).cl100k_base(GPT-4) added more multilingual merges. - Visualize token boundaries with color coding (Karpathy's video on tokenization shows this beautifully).
9. What this lab proves about you
You can answer tokenizer interview questions ("explain BPE", "why does GPT-4 sometimes count letters wrong in 'strawberry'", "what's the failure mode of pure-whitespace tokenization for LLM pretraining") with code-level confidence. That's the Phase-1 milestone.