Lab 01 — Tokenization From Scratch (Solution Walkthrough)

Phase: 1 — Foundations | Difficulty: ⭐⭐☆☆☆ | Time: 1–2 hours

Read ../HITCHHIKERS-GUIDE.md §Tokenization first. This document walks through the solution code line-by-line.


0. What you build and why

Three tokenizers, ranging from naïve to production:

TokenizerVocabRound-trip safe?Real-world use
WhitespaceTokenizerAll unique whitespace-split tokens❌ (loses casing of OOV, no punctuation handling)Pedagogical only
RegexTokenizerTokens from GPT-2's pre-tokenization regex❌ for OOVUsed as the first stage of GPT-2/GPT-4 BPE
ByteLevelTokenizerFixed 256 (one per byte)✅ for any UTF-8 inputThe fallback in tiktoken/GPT-4; pure byte models

You will see by direct measurement why naïve splitting fails on real text and why GPT-2 layers a regex on top of bytes/BPE.

Run

pip install -r requirements.txt
python lab.py        # TODO scaffold — fill in
python solution.py   # reference implementation

1. The GPT-2 pre-tokenization regex

GPT2_PAT = re.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

This is the single most important regex in modern NLP. Read each alternative left-to-right (the | operator):

  1. '(?:[sdmt]|ll|ve|re) — English contractions. Captures 's, 'd, 'm, 't, 'll, 've, 're so Don'tDon, 't (two reusable tokens).
  2. ' ?\p{L}+' — an optional leading space followed by 1+ Unicode letter characters (\p{L}). The leading space is part of the token — that's why GPT models render text by simple "".join(tokens) (no " ".join). " hello" is one token, distinct from "hello".
  3. ' ?\p{N}+' — same for numbers. Splits "abc123" into ["abc", "123"].
  4. ' ?[^\s\p{L}\p{N}]+' — runs of punctuation/symbols.
  5. \s+(?!\S) — runs of whitespace not followed by non-whitespace (trailing whitespace).
  6. \s+ — any other whitespace run (final fallback).

Why is this regex the right pre-tokenization? Because BPE merges look at adjacent characters within a pre-token; if you don't pre-split, "the dog" could merge across the space into a single token "the dog" — wasteful and brittle. By forcing " dog" as a self-contained pre-token, BPE can only learn merges within " dog", never across.

You don't run BPE in this lab (that's an extension), but you set up the pre-tokenization layer that BPE consumes.


2. WhitespaceTokenizer

class WhitespaceTokenizer:
    UNK = "<unk>"

    def __init__(self):
        self.token_to_id: dict[str, int] = {}
        self.id_to_token: dict[int, str] = {}

Two parallel dicts — cheaper than list.index() lookups.

    def train(self, corpus, min_freq=1, special_tokens=None):
        specials = [self.UNK] + (special_tokens or [])
        counts = Counter()
        for line in corpus:
            counts.update(line.split())
        vocab = list(specials) + [tok for tok, c in counts.most_common()
                                  if c >= min_freq and tok not in specials]
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

Key choices:

  • Special tokens reserve the lowest IDs (<unk> is always id 0). Convention; production code hard-codes pad_id=0 or unk_id=0 everywhere.
  • most_common() ensures stable ordering: more frequent tokens get smaller ids → embedding tables are more cache-friendly.
  • min_freq lets you drop hapaxes (words seen once). For natural text, ~50% of unique tokens appear once but contribute <2% of total tokens.
    def encode(self, text):
        unk = self.token_to_id[self.UNK]
        return [self.token_to_id.get(t, unk) for t in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token.get(i, self.UNK) for i in ids)

encode is information-lossy on OOV → <unk>. decode reinserts spaces but cannot recover the original whitespace structure — multiple spaces, tabs, newlines all become single spaces.


3. RegexTokenizer

Only difference from WhitespaceTokenizer: replace line.split() with GPT2_PAT.findall(line). This single change fixes:

  • Punctuation: "hello!"["hello", "!"].
  • Contractions: "don't"["don", "'t"].
  • Number boundaries: "abc123"["abc", "123"].

Decoding uses "".join(...) — the leading-space tokens already carry their spaces. This is the trick that makes round-trip preserve spacing for in-vocab tokens.


4. ByteLevelTokenizer

class ByteLevelTokenizer:
    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8", errors="replace")

Three lines. Yet this is what production LLMs fall back to.

  • text.encode("utf-8") produces bytes in [0, 255]. UTF-8 is variable-length: ASCII is 1 byte, accented Latin is 2, CJK is 3, emoji is 4.
  • The vocab is fixed at 256 — no training step.
  • errors="replace" handles ill-formed byte sequences (truncated multi-byte chars from streaming generation) by inserting U+FFFD instead of crashing.

Round-trip safety: for any input, decode(encode(x)) == x. Try emoji, Arabic, code, unicode whitespace.

Trade-off: a token = a byte. English averages ~1 char ≈ 1 byte ≈ 1 token. After BPE we collapse common byte sequences into ~0.25 tokens/char. So pure byte-level inflates sequence length 4× vs BPE — slower but trivially robust.


5. The runner

sample = "Hello, world! Don't tokenize naively — GPT-2's regex is smart."
corpus = [sample] * 100

Corpus is the same sentence ×100. Enough to populate vocabularies; lab is about correctness not training a useful tokenizer.

The round_trip_ok flag will reveal:

  • whitespace: True only because we trained on the exact string. Add an unseen word → False.
  • regex: True for the same reason — but spacing/punctuation is preserved correctly.
  • byte: Always True.

The sanity check against tiktoken confirms our regex matches GPT-2's. tiktoken then runs BPE merges on top — that's why its token count is even lower than our regex count.


6. Expected output

[whitespace] n_tokens= 10  round_trip_ok=True
[regex     ] n_tokens= 17  round_trip_ok=True
[byte      ] n_tokens= 64  round_trip_ok=True
[tiktoken  ] n_tokens= 17

Numbers may differ by ±1 depending on how the em-dash is encoded. Try a slightly modified input — change one word to something not in corpus. The whitespace tokenizer will produce <unk> and break round-trip; the others won't.


7. Common pitfalls

  1. re vs regex — only the regex package supports \p{L} Unicode properties. Standard re will silently match nothing.
  2. Decoding regex tokens with " ".join — adds extra spaces because the leading-space tokens already carry them. Always "".join.
  3. Forgetting errors="replace" in byte-level decoding — production streaming generation will yield partial multi-byte chars, and bare .decode("utf-8") raises.
  4. Reserving <unk> mid-vocab — always put specials at IDs 0..k-1. Many downstream libraries assume this.
  5. text.split() vs text.split(" ")split() (no arg) collapses runs of whitespace; split(" ") keeps empty strings.

8. Stretch exercises

  • Implement BPE training (Sennrich 2016): start with byte vocab; repeatedly find the most frequent adjacent pair and merge it into a new token; stop at target vocab size. ~100 lines.
  • Implement byte-level BPE like GPT-2: pre-tokenize with GPT2_PAT, then do BPE within each pre-token over its UTF-8 bytes. The bytes_to_unicode() helper is the trickiest piece.
  • Compute compression ratio (chars/token) on 1 MB of English Wikipedia. Whitespace ~5.0, regex ~4.5, byte 1.0, tiktoken cl100k_base ~4.0.
  • Run on Chinese / Arabic / code and compare. tiktoken gpt2 is famously bad on non-English (5–10× more tokens per char). cl100k_base (GPT-4) added more multilingual merges.
  • Visualize token boundaries with color coding (Karpathy's video on tokenization shows this beautifully).

9. What this lab proves about you

You can answer tokenizer interview questions ("explain BPE", "why does GPT-4 sometimes count letters wrong in 'strawberry'", "what's the failure mode of pure-whitespace tokenization for LLM pretraining") with code-level confidence. That's the Phase-1 milestone.