Comparing Transformer Architectures
BERT, GPT, and T5 all use the Transformer, but with different attention patterns and pre-training objectives.
The core Transformer can be adapted into three main architectures, each optimized for different tasks:
- BERT (encoder-only): best for understanding
- GPT (decoder-only): best for generation
- T5 (encoder-decoder): best for sequence-to-sequence tasks
Each variant stacks transformer blocks differently:
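As a rough sketch of how the three stacks differ (not taken from the original post), here is how the layouts map onto PyTorch's built-in transformer modules; the depths and sizes are illustrative, not those of the real models:

```python
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # illustrative sizes only

# Encoder-only (BERT-style): a stack of bidirectional self-attention blocks.
encoder_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Decoder-only (GPT-style): structurally the same stack, but run with a
# causal mask so each position only sees earlier positions.
decoder_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Encoder-decoder (T5-style): an encoder stack plus a decoder stack whose
# blocks add cross-attention over the encoder outputs.
encoder_decoder = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=n_layers,
    num_decoder_layers=n_layers,
    batch_first=True,
)
```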
The key difference lies in the attention mask M:
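The mask enters through the standard scaled dot-product attention (the usual Transformer formula, restated here since the original rendering is not reproduced):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V
$$

With $M_{ij} = 0$ the query at position $i$ may attend to the key at position $j$; with $M_{ij} = -\infty$ that connection is zeroed out after the softmax.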
BERT: Full Attention (Bidirectional)
In BERT, every token can attend to every other token in the sequence. This bidirectional attention allows the model to understand context from both left and right sides simultaneously.
In the resulting attention matrix, every entry is non-zero: each row (query token) attends to all columns (key tokens). This is ideal for understanding tasks like classification and question answering.
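Since the interactive matrix view is not reproduced here, a minimal sketch of the two mask patterns, using an arbitrary sequence length, might look like this:

```python
import torch

seq_len = 5  # arbitrary example length

# Bidirectional (BERT-style): the additive mask is all zeros, so every
# query position attends to every key position.
bidirectional_mask = torch.zeros(seq_len, seq_len)

# Causal (GPT-style): positions above the diagonal are -inf, so each token
# attends only to itself and earlier tokens.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

print(bidirectional_mask)  # all zeros -> fully filled attention matrix
print(causal_mask)         # -inf above the diagonal -> lower-triangular attention
```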
Each model learns through a different self-supervised task:
Masked Language Modeling (MLM)
BERT randomly masks 15% of input tokens and learns to predict them using bidirectional context. The model sees "[MASK]" and must predict the original word.
1. Randomly select tokens to mask
2. Replace them with the [MASK] token
3. Predict the original tokens using all surrounding context (see the sketch below)
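A minimal sketch of these three steps in plain PyTorch (the mask id, masking probability, and toy inputs are placeholders; real BERT additionally replaces some selected tokens with random tokens or leaves them unchanged):

```python
import torch

MASK_ID = 103     # placeholder [MASK] id (103 in bert-base-uncased's vocab)
MLM_PROB = 0.15   # fraction of tokens selected for prediction

def mask_tokens(input_ids: torch.Tensor):
    """Return (masked inputs, labels) for masked language modeling."""
    labels = input_ids.clone()

    # 1. Randomly select ~15% of the positions.
    selected = torch.rand(input_ids.shape) < MLM_PROB

    # Non-selected positions are ignored by the loss (-100 is PyTorch's
    # default ignore_index for cross-entropy).
    labels[~selected] = -100

    # 2. Replace the selected positions with [MASK].
    masked_inputs = input_ids.clone()
    masked_inputs[selected] = MASK_ID

    # 3. The model predicts `labels` at the masked positions from the
    #    bidirectional context in `masked_inputs`.
    return masked_inputs, labels

# Toy batch of random token ids, just to show the shapes involved.
inputs, labels = mask_tokens(torch.randint(1000, 2000, (2, 8)))
```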
Different architectures excel at different tasks:
- BERT (understanding): text classification, named entity recognition, question answering
- GPT (generation): text generation, code completion, creative writing
- T5 (seq-to-seq): translation, summarization, multi-task learning
| Aspect | BERT | GPT | T5 |
|---|---|---|---|
| Architecture | Encoder | Decoder | Enc-Dec |
| Attention | Bidirectional | Causal | Mixed |
| Pre-training | MLM | CLM | Denoise |
| Best For | Understanding | Generation | Seq2Seq |