> Understanding Transformer Attention
A deep dive into the "Attention Is All You Need" architecture with live TensorFlow.js computations.
Enter a sentence below to compute its attention matrix in real time using TensorFlow.js:
Unlike RNNs, Transformers process all tokens in parallel. To inject positional information, we add positional encodings using sine and cosine functions at different frequencies.
Sine waves encode position with a smooth, continuous function. At dimension 0, the wave oscillates rapidly. At higher dimensions, the wavelength increases, creating unique patterns.
Cosine is 90° phase-shifted from sine. Using both together creates a unique "fingerprint" for each position that the model can learn to interpret as relative distances.
The combination allows the model to learn relative positions. For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). This means the model can easily learn "3 tokens ahead" or "2 tokens behind" relationships.
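As a minimal sketch (the helper name and the small dimensions here are illustrative, not part of the live demo), these sinusoidal encodings can be precomputed with TensorFlow.js:

```ts
import * as tf from '@tensorflow/tfjs';

// Sinusoidal positional encodings: even columns use sine, odd columns cosine,
// with wavelengths growing geometrically as the dimension index increases.
//   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
//   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
function positionalEncoding(maxLen: number, dModel: number): tf.Tensor {
  const pe = tf.buffer([maxLen, dModel]);
  for (let pos = 0; pos < maxLen; pos++) {
    for (let i = 0; i < dModel; i += 2) {
      const angle = pos / Math.pow(10000, i / dModel);
      pe.set(Math.sin(angle), pos, i);
      if (i + 1 < dModel) pe.set(Math.cos(angle), pos, i + 1);
    }
  }
  return pe.toTensor();
}

// e.g. encodings for a 10-token sequence in a 16-dimensional model
positionalEncoding(10, 16).print();
```

Each row of the resulting matrix is the positional "fingerprint" that gets added to the corresponding token's embedding.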
Before attention can work, each token must be converted into a dense vector representation. This is done through a learned embedding matrix.
Each token ID indexes into an embedding matrix of shape [vocab_size × d_model]. This lookup retrieves a dense vector that captures semantic meaning (a minimal lookup sketch follows the list below).
- Learned during training - embeddings adapt to the task
- Semantic similarity - similar words cluster together
- Fixed dimension - all tokens become same-sized vectors
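As a hedged illustration of that lookup (the vocabulary size, dimensions, and token IDs are made up, and the matrix is random rather than trained), the operation amounts to gathering rows by token ID:

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical sizes and token IDs; the random matrix stands in for
// weights that would be learned during training.
const vocabSize = 1000;
const dModel = 16;
const embeddingMatrix = tf.randomNormal([vocabSize, dModel]); // [vocab_size × d_model]

const tokenIds = tf.tensor1d([12, 7, 412], 'int32');   // three example token IDs
const embedded = tf.gather(embeddingMatrix, tokenIds); // shape [3, dModel]
embedded.print();
```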
The core of the Transformer architecture is the Scaled Dot-Product Attention. This mechanism allows the model to focus on different parts of the input sequence.
- Query: What am I looking for?
- Key: What do I contain?
- Value: What do I provide?
When input enters the Transformer, each token goes through this pipeline:
The combined embedding (word + position) is projected into Query, Key, and Value vectors. The attention mechanism then computes how much each token should attend to every other token, allowing the model to capture contextual relationships regardless of distance.
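A minimal single-head sketch of this step in TensorFlow.js, assuming random projection matrices in place of learned weights and a hypothetical helper name:

```ts
import * as tf from '@tensorflow/tfjs';

// Single-head scaled dot-product attention over a sequence of combined
// (word + position) embeddings. The projection matrices are random here;
// in a trained model they are learned parameters.
function scaledDotProductAttention(x: tf.Tensor, dK: number): tf.Tensor {
  const dModel = x.shape[1];
  const wQ = tf.randomNormal([dModel, dK]); // query projection
  const wK = tf.randomNormal([dModel, dK]); // key projection
  const wV = tf.randomNormal([dModel, dK]); // value projection

  const q = tf.matMul(x, wQ); // [seqLen, dK]
  const k = tf.matMul(x, wK);
  const v = tf.matMul(x, wV);

  // Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
  const scores = tf.matMul(q, k, false, true).div(tf.sqrt(tf.scalar(dK)));
  const weights = tf.softmax(scores); // row i: how much token i attends to each token j
  return tf.matMul(weights, v);
}

// e.g. a 4-token sequence with d_model = 16 and d_k = 8
scaledDotProductAttention(tf.randomNormal([4, 16]), 8).print();
```

The `weights` matrix here corresponds to the attention matrix visualized by the demo above: entry (i, j) is how strongly token i attends to token j.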
The original Transformer uses an encoder stack to process the input and a decoder stack to generate the output. Each layer contains multi-head attention and feed-forward networks.
Encoder Stack
The encoder processes the entire input sequence in parallel using self-attention. Each layer refines the representation by allowing tokens to attend to all other tokens.
- Layer Normalization
- Feed-Forward Network
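Putting those pieces together, here is a rough sketch of a single encoder layer, assuming TensorFlow.js layers for normalization and the feed-forward network; `selfAttention` stands in for a multi-head attention function (such as the single-head sketch above, widened to return a [seqLen, d_model] tensor):

```ts
import * as tf from '@tensorflow/tfjs';

// One encoder layer: a self-attention sub-layer and a position-wise
// feed-forward sub-layer, each wrapped in a residual connection followed by
// layer normalization ("Add & Norm").
function encoderLayer(x: tf.Tensor, selfAttention: (t: tf.Tensor) => tf.Tensor): tf.Tensor {
  const dModel = x.shape[1];
  const norm1 = tf.layers.layerNormalization({ axis: -1 });
  const norm2 = tf.layers.layerNormalization({ axis: -1 });
  const ffn1 = tf.layers.dense({ units: 4 * dModel, activation: 'relu' }); // inner size d_ff = 4·d_model
  const ffn2 = tf.layers.dense({ units: dModel });

  // Sub-layer 1: self-attention + residual + layer normalization
  const attended = norm1.apply(x.add(selfAttention(x))) as tf.Tensor;

  // Sub-layer 2: feed-forward network + residual + layer normalization
  const ff = ffn2.apply(ffn1.apply(attended)) as tf.Tensor;
  return norm2.apply(attended.add(ff)) as tf.Tensor;
}
```

Stacking several such layers, each refining the previous layer's output, gives the encoder stack described above.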
Ready to see the implementation details?
Explore LDI: Code Walkthrough →