Enter text below to compare how BERT (WordPiece), GPT (BPE), and T5 (SentencePiece) tokenize it differently.
BERT (WordPiece)
Splits out-of-vocabulary words into subwords, using the "##" prefix to mark continuation tokens.
Algorithm: Greedy longest-match-first. It matches the longest token in the vocabulary at the current position, then tokenizes the remainder the same way (see the sketch below).
Example: "playing" → ["play", "##ing"]
Vocab: ~30,000 tokens
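A minimal sketch of that greedy longest-match loop; the toy vocabulary is hypothetical, while real BERT vocabularies hold roughly 30,000 entries:

```python
# Minimal sketch of WordPiece's greedy longest-match-first tokenization.
def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the "##" prefix
            if candidate in vocab:
                piece = candidate              # longest span found in the vocabulary
                break
            end -= 1                           # otherwise try a shorter span
        if piece is None:
            return ["[UNK]"]                   # nothing matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##in", "##i"}       # hypothetical toy vocabulary
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```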
Uses "Ġ" prefix to mark word boundaries (spaces become part of the token).
Algorithm: Byte Pair Encoding: iteratively merges the most frequent adjacent byte pairs until vocabulary size is reached.
Example: " playing" → ["Ġplay", "ing"]
Vocab: ~50,000 tokens
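A simplified sketch of applying already-learned merges to one pre-tokenized word. The merge list here is a hypothetical excerpt; real GPT-2 ships around 50,000 learned merges and picks the highest-priority pair present at each step rather than scanning the list in order:

```python
# Simplified sketch: apply an ordered list of learned BPE merges to one word.
def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:                          # merges are ordered by learned priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]       # merge the adjacent pair in place
            else:
                i += 1
    return symbols

# Hypothetical excerpt of a merge table, not the real GPT-2 merges.
merges = [("p", "l"), ("pl", "a"), ("pla", "y"), ("i", "n"), ("in", "g"), ("Ġ", "play")]
print(apply_bpe("Ġplaying", merges))             # ['Ġplay', 'ing']
```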
Uses "▁" (underscore) prefix to mark word starts. Language-agnostic, works on raw text.
Algorithm: Unigram model: starts with large vocabulary, iteratively removes tokens that least impact the likelihood.
Example: "playing" → ["▁play", "ing"]
Vocab: ~32,000 tokens
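To reproduce the comparison offline, something like the following works with the Hugging Face transformers library. The checkpoint names are the usual public ones and are assumptions here; the exact splits depend on each model's learned vocabulary:

```python
# Sketch: compare how the three tokenizer families split the same text.
from transformers import AutoTokenizer

text = "Transformers tokenize text differently."
for name in ["bert-base-uncased", "gpt2", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)   # downloads the checkpoint on first use
    print(f"{name:18s} {tok.tokenize(text)}")
```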
Watch how each tokenizer builds its vocabulary by merging frequent pairs:
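A minimal sketch of that BPE-style training loop over a made-up word-frequency corpus (the words and counts are placeholders):

```python
# Minimal BPE training sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # represent each word as a tuple of symbols, starting from single characters
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():           # rewrite the corpus with the new merge
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return merges

# Hypothetical word frequencies; prints the learned merge rules in order.
print(train_bpe({"playing": 5, "played": 3, "player": 2}, num_merges=4))
```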
Visualize the transformation from Token ID to Final Input Vector:
Word Embedding (Lookup)
Each token is mapped to a dense vector using a learned embedding matrix. This is a simple table lookup: the token ID indexes into a matrix of shape [vocab_size × d_model].
Key idea: Similar words have similar vectors, capturing semantic meaning (e.g., "king" and "queen" are close).
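A sketch of the lookup step; the sizes, the token IDs, and the random initialization are placeholders for illustration, not trained weights:

```python
# Sketch of the embedding lookup: a token ID selects one row of a learned matrix.
import numpy as np

vocab_size, d_model = 30_000, 768
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # [vocab_size x d_model]

token_ids = np.array([2003, 2652, 1012])      # hypothetical IDs for a short sentence
token_vectors = embedding_matrix[token_ids]   # plain row lookup, no arithmetic involved
print(token_vectors.shape)                    # (3, 768)
```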
Positional Embedding (Calculated)
Transformers process all tokens in parallel, so they need explicit position information. The original paper uses sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Key idea: Each position gets a unique pattern that allows the model to learn relative positioning.
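A sketch of those sinusoidal encodings in NumPy, using d_model = 768 and a maximum length of 128 as illustrative sizes:

```python
# Sketch of the sinusoidal positional encodings:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                 # [max_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=768)
print(pe.shape)   # (128, 768)
# The final input vector is the sum: word embedding + positional encoding.
```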