Enter text below to compare how BERT (WordPiece), GPT (BPE), and T5 (SentencePiece) tokenize it differently.
BERT (WordPiece)
Splits out-of-vocabulary words into subwords, using the "##" prefix to mark continuation tokens.
Algorithm: Greedy longest-match-first. It matches the longest token in the vocabulary at the current position, then tokenizes the remainder the same way (see the sketch below).
Example: "playing" → ["play", "##ing"]
Vocab: ~30,000 tokens
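A minimal sketch of that greedy longest-match loop; the toy vocabulary is hypothetical, while real BERT vocabularies hold roughly 30,000 entries:

```python
# Minimal sketch of WordPiece's greedy longest-match-first tokenization.
def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the "##" prefix
            if candidate in vocab:
                piece = candidate              # longest span found in the vocabulary
                break
            end -= 1                           # otherwise try a shorter span
        if piece is None:
            return ["[UNK]"]                   # nothing matched: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##in", "##i"}       # hypothetical toy vocabulary
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```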
Uses "Ġ" prefix to mark word boundaries (spaces become part of the token).
Algorithm: Byte Pair Encoding: iteratively merges the most frequent adjacent byte pairs until vocabulary size is reached.
Example: " playing" → ["Ġplay", "ing"]
Vocab: ~50,000 tokens
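A simplified sketch of applying already-learned merges to one pre-tokenized word. The merge list here is a hypothetical excerpt; real GPT-2 ships around 50,000 learned merges and picks the highest-priority pair present at each step rather than scanning the list in order:

```python
# Simplified sketch: apply an ordered list of learned BPE merges to one word.
def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:                          # merges are ordered by learned priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]       # merge the adjacent pair in place
            else:
                i += 1
    return symbols

# Hypothetical excerpt of a merge table, not the real GPT-2 merges.
merges = [("p", "l"), ("pl", "a"), ("pla", "y"), ("i", "n"), ("in", "g"), ("Ġ", "play")]
print(apply_bpe("Ġplaying", merges))             # ['Ġplay', 'ing']
```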
Uses "▁" (underscore) prefix to mark word starts. Language-agnostic, works on raw text.
Algorithm: Unigram model: starts with large vocabulary, iteratively removes tokens that least impact the likelihood.
Example: "playing" → ["▁play", "ing"]
Vocab: ~32,000 tokens
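To reproduce the comparison offline, something like the following works with the Hugging Face transformers library. The checkpoint names are the usual public ones and are assumptions here; the exact splits depend on each model's learned vocabulary:

```python
# Sketch: compare how the three tokenizer families split the same text.
from transformers import AutoTokenizer

text = "Transformers tokenize text differently."
for name in ["bert-base-uncased", "gpt2", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)   # downloads the checkpoint on first use
    print(f"{name:18s} {tok.tokenize(text)}")
```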
Watch how each tokenizer builds its vocabulary by merging frequent pairs:
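A minimal sketch of that BPE-style training loop over a made-up word-frequency corpus (the words and counts are placeholders):

```python
# Minimal BPE training sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # represent each word as a tuple of symbols, starting from single characters
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():           # rewrite the corpus with the new merge
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return merges

# Hypothetical word frequencies; prints the learned merge rules in order.
print(train_bpe({"playing": 5, "played": 3, "player": 2}, num_merges=4))
```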
Visualize the transformation from Token ID to Final Input Vector:
Word Embedding (Lookup)
Each token is mapped to a dense vector using a learned embedding matrix. This is a simple table lookup: the token ID indexes into a matrix of shape [vocab_size × d_model].
Key idea: Similar words have similar vectors, capturing semantic meaning (e.g., "king" and "queen" are close).
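A sketch of the lookup step; the sizes, the token IDs, and the random initialization are placeholders for illustration, not trained weights:

```python
# Sketch of the embedding lookup: a token ID selects one row of a learned matrix.
import numpy as np

vocab_size, d_model = 30_000, 768
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # [vocab_size x d_model]

token_ids = np.array([2003, 2652, 1012])      # hypothetical IDs for a short sentence
token_vectors = embedding_matrix[token_ids]   # plain row lookup, no arithmetic involved
print(token_vectors.shape)                    # (3, 768)
```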
Positional Embedding (Calculated)
Transformers process all tokens in parallel, so they need explicit position information. The original paper uses sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Key idea: Each position gets a unique pattern that allows the model to learn relative positioning.
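A sketch of those sinusoidal encodings in NumPy, using d_model = 768 and a maximum length of 128 as illustrative sizes:

```python
# Sketch of the sinusoidal positional encodings:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                 # [max_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=768)
print(pe.shape)   # (128, 768)
# The final input vector is the sum: word embedding + positional encoding.
```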