> Understanding Transformer Attention
A deep dive into the "Attention Is All You Need" architecture with live TensorFlow.js computations.
Enter a sentence below to compute its attention matrix in real time using TensorFlow.js:
Unlike RNNs, Transformers process all tokens in parallel. To inject positional information, we add positional encodings using sine and cosine functions at different frequencies.
Sine waves encode position with a smooth, continuous function. At dimension 0, the wave oscillates rapidly. At higher dimensions, the wavelength increases, creating unique patterns.
Cosine is 90° phase-shifted from sine. Using both together creates a unique "fingerprint" for each position that the model can learn to interpret as relative distances.
The combination allows the model to learn relative positions. For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). This means the model can easily learn "3 tokens ahead" or "2 tokens behind" relationships.
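As a minimal sketch (the helper name and the small dimensions here are illustrative, not part of the live demo), these sinusoidal encodings can be precomputed with TensorFlow.js:

```ts
import * as tf from '@tensorflow/tfjs';

// Sinusoidal positional encodings: even columns use sine, odd columns cosine,
// with wavelengths growing geometrically as the dimension index increases.
//   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
//   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
function positionalEncoding(maxLen: number, dModel: number): tf.Tensor {
  const pe = tf.buffer([maxLen, dModel]);
  for (let pos = 0; pos < maxLen; pos++) {
    for (let i = 0; i < dModel; i += 2) {
      const angle = pos / Math.pow(10000, i / dModel);
      pe.set(Math.sin(angle), pos, i);
      if (i + 1 < dModel) pe.set(Math.cos(angle), pos, i + 1);
    }
  }
  return pe.toTensor();
}

// e.g. encodings for a 10-token sequence in a 16-dimensional model
positionalEncoding(10, 16).print();
```

Each row of the resulting matrix is the positional "fingerprint" that gets added to the corresponding token's embedding.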
Before attention can work, each token must be converted into a dense vector representation. This is done through a learned embedding matrix.
Each token ID indexes into an embedding matrix of shape [vocab_size × d_model]. This lookup retrieves a dense vector that captures semantic meaning (a minimal lookup sketch follows the list below).
- Learned during training - embeddings adapt to the task
- Semantic similarity - similar words cluster together
- Fixed dimension - all tokens become same-sized vectors
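As a hedged illustration of that lookup (the vocabulary size, dimensions, and token IDs are made up, and the matrix is random rather than trained), the operation amounts to gathering rows by token ID:

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical sizes and token IDs; the random matrix stands in for
// weights that would be learned during training.
const vocabSize = 1000;
const dModel = 16;
const embeddingMatrix = tf.randomNormal([vocabSize, dModel]); // [vocab_size × d_model]

const tokenIds = tf.tensor1d([12, 7, 412], 'int32');   // three example token IDs
const embedded = tf.gather(embeddingMatrix, tokenIds); // shape [3, dModel]
embedded.print();
```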
The core of the Transformer architecture is the Scaled Dot-Product Attention. This mechanism allows the model to focus on different parts of the input sequence.
- Query: What am I looking for?
- Key: What do I contain?
- Value: What do I provide?
When input enters the Transformer, each token goes through this pipeline:
The combined embedding (word + position) is projected into Query, Key, and Value vectors. The attention mechanism then computes how much each token should attend to every other token, allowing the model to capture contextual relationships regardless of distance.
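A minimal single-head sketch of this step in TensorFlow.js, assuming random projection matrices in place of learned weights and a hypothetical helper name:

```ts
import * as tf from '@tensorflow/tfjs';

// Single-head scaled dot-product attention over a sequence of combined
// (word + position) embeddings. The projection matrices are random here;
// in a trained model they are learned parameters.
function scaledDotProductAttention(x: tf.Tensor, dK: number): tf.Tensor {
  const dModel = x.shape[1];
  const wQ = tf.randomNormal([dModel, dK]); // query projection
  const wK = tf.randomNormal([dModel, dK]); // key projection
  const wV = tf.randomNormal([dModel, dK]); // value projection

  const q = tf.matMul(x, wQ); // [seqLen, dK]
  const k = tf.matMul(x, wK);
  const v = tf.matMul(x, wV);

  // Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
  const scores = tf.matMul(q, k, false, true).div(tf.sqrt(tf.scalar(dK)));
  const weights = tf.softmax(scores); // row i: how much token i attends to each token j
  return tf.matMul(weights, v);
}

// e.g. a 4-token sequence with d_model = 16 and d_k = 8
scaledDotProductAttention(tf.randomNormal([4, 16]), 8).print();
```

The `weights` matrix here corresponds to the attention matrix visualized by the demo above: entry (i, j) is how strongly token i attends to token j.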
The original Transformer uses an encoder stack to process the input and a decoder stack to generate the output. Each layer contains multi-head attention and feed-forward networks.
Encoder Stack
The encoder processes the entire input sequence in parallel using self-attention. Each layer refines the representation by allowing tokens to attend to all other tokens.
- Layer Normalization
- Feed-Forward Network
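Putting those pieces together, here is a rough sketch of a single encoder layer, assuming TensorFlow.js layers for normalization and the feed-forward network; `selfAttention` stands in for a multi-head attention function (such as the single-head sketch above, widened to return a [seqLen, d_model] tensor):

```ts
import * as tf from '@tensorflow/tfjs';

// One encoder layer: a self-attention sub-layer and a position-wise
// feed-forward sub-layer, each wrapped in a residual connection followed by
// layer normalization ("Add & Norm").
function encoderLayer(x: tf.Tensor, selfAttention: (t: tf.Tensor) => tf.Tensor): tf.Tensor {
  const dModel = x.shape[1];
  const norm1 = tf.layers.layerNormalization({ axis: -1 });
  const norm2 = tf.layers.layerNormalization({ axis: -1 });
  const ffn1 = tf.layers.dense({ units: 4 * dModel, activation: 'relu' }); // inner size d_ff = 4·d_model
  const ffn2 = tf.layers.dense({ units: dModel });

  // Sub-layer 1: self-attention + residual + layer normalization
  const attended = norm1.apply(x.add(selfAttention(x))) as tf.Tensor;

  // Sub-layer 2: feed-forward network + residual + layer normalization
  const ff = ffn2.apply(ffn1.apply(attended)) as tf.Tensor;
  return norm2.apply(attended.add(ff)) as tf.Tensor;
}
```

Stacking several such layers, each refining the previous layer's output, gives the encoder stack described above.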
Ready to see the implementation details?
Explore LDI: Code Walkthrough →