
IPC: Interactive Path Comparator

> Comparing Transformer Architectures

BERT, GPT, and T5 all use the Transformer, but with different attention patterns and pre-training objectives.

>1. The Three Transformer Variants

The core Transformer can be adapted into three main architectures, each optimized for different tasks:

| Model | Architecture | Attention | Best for |
|-------|--------------|-----------|----------|
| BERT | Encoder-only | Bidirectional context | Understanding |
| GPT | Decoder-only | Causal (left-to-right) | Generation |
| T5 | Encoder-decoder | Full + causal | Seq-to-seq |
>2. Architecture Comparison

Each variant stacks transformer blocks differently:

  • BERT: a stack of encoder blocks only (bidirectional self-attention + feed-forward).
  • GPT: a stack of decoder blocks only (masked self-attention + feed-forward), with no cross-attention.
  • T5: an encoder stack feeding a decoder stack, connected through cross-attention.
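
The sketch below approximates this stacking with PyTorch's generic nn.TransformerEncoderLayer and nn.TransformerDecoderLayer modules. It is a minimal illustration, not the real BERT, GPT, or T5 code: the layer sizes are arbitrary, and the GPT-style stack is simulated by reusing the same self-attention blocks with a causal mask.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers, seq_len = 64, 4, 2, 8
x = torch.randn(1, seq_len, d_model)  # (batch, sequence, features), arbitrary sizes

# Causal mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# BERT-style: encoder blocks only, no mask, so attention is bidirectional.
enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers)
bert_like = encoder(x)

# GPT-style: the same self-attention blocks, but every layer applies the causal mask.
gpt_like = encoder(x, mask=causal_mask)

# T5-style: an encoder stack plus a decoder stack; the decoder attends causally to
# itself and, through cross-attention, to the encoder output ("memory").
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers)
t5_like = decoder(tgt=x, memory=bert_like, tgt_mask=causal_mask)

print(bert_like.shape, gpt_like.shape, t5_like.shape)  # each: torch.Size([1, 8, 64])
```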
>3. Attention Mask Patterns

The key difference lies in the attention mask M.
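
In the standard scaled dot-product formulation, M is added to the attention logits before the softmax, so any position set to negative infinity receives zero attention weight. BERT uses an all-zero M, while GPT sets every entry above the diagonal to negative infinity:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if query } i \text{ may attend to key } j,\\
-\infty & \text{otherwise.}
\end{cases}
```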


BERT: Full Attention (Bidirectional)

In BERT, every token can attend to every other token in the sequence. This bidirectional attention allows the model to understand context from both left and right sides simultaneously.

Key Observation:

The entire matrix is filled in: each row (query token) has non-zero attention to all columns (key tokens). This is ideal for understanding tasks like classification and question answering.
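
As a concrete, if toy, illustration, the sketch below uses random scores in place of the real query-key products and compares BERT's all-zero mask with a GPT-style causal mask for a hypothetical five-token sentence:

```python
import torch

tokens = ["the", "cat", "sat", "on", "mat"]   # hypothetical test sentence
n = len(tokens)
scores = torch.randn(n, n)                    # stand-in for QK^T / sqrt(d_k)

# BERT: M is all zeros, so every query attends to every key (full matrix).
bert_mask = torch.zeros(n, n)
bert_attn = torch.softmax(scores + bert_mask, dim=-1)

# GPT: M is -inf above the diagonal, so token i only attends to tokens j <= i.
gpt_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
gpt_attn = torch.softmax(scores + gpt_mask, dim=-1)

print(bert_attn)  # every entry non-zero: the filled matrix described above
print(gpt_attn)   # zeros above the diagonal: the causal (lower-triangular) pattern
```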

>4. Pre-training Objectives

Each model learns through a different self-supervised task:

Masked Language Modeling (MLM)

BERT randomly masks 15% of input tokens and learns to predict them using bidirectional context. The model sees "[MASK]" and must predict the original word.

How it works:
  1. Randomly select tokens to mask
  2. Replace with [MASK] token
  3. Predict original token using all surrounding context
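
A toy version of the masking step might look like the sketch below. It follows the simplified three-step recipe above, where every selected token becomes [MASK]; real BERT additionally replaces some selected tokens with random words or leaves them unchanged. The function name and the 15% default are illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Mask each token with probability `mask_rate`; return inputs and targets."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:   # 1. randomly select tokens to mask
            masked[i] = mask_token        # 2. replace with the [MASK] token
            targets[i] = tok              # 3. the model must predict this token
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence)
print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', ...]  (varies per run)
print(targets)  # {position: original token}, used as the training labels
```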
>5. Performance Comparison

Different architectures excel at different tasks:

BERT Strengths
  • Text classification
  • Named entity recognition
  • Question answering

GPT Strengths
  • Text generation
  • Code completion
  • Creative writing

T5 Strengths
  • Translation
  • Summarization
  • Multi-task learning
>Summary: Choosing the Right Model
| Aspect | BERT | GPT | T5 |
|--------|------|-----|-----|
| Architecture | Encoder | Decoder | Enc-Dec |
| Attention | Bidirectional | Causal | Mixed |
| Pre-training | MLM | CLM | Denoise |
| Best For | Understanding | Generation | Seq2Seq |
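
If you want to experiment before moving on, one way to load a representative checkpoint from each family is the Hugging Face transformers library. This assumes transformers and PyTorch are installed; the checkpoint names are common public ones, not the only options.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,    # BERT-style: encoder-only with an MLM head
    AutoModelForCausalLM,    # GPT-style: decoder-only with a next-token head
    AutoModelForSeq2SeqLM,   # T5-style: encoder-decoder
)

bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Quick seq-to-seq check using T5's text-to-text interface.
tok = AutoTokenizer.from_pretrained("t5-small")
inputs = tok("translate English to German: The house is small.", return_tensors="pt")
print(tok.decode(t5.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```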

Want to see the code implementation?

← Back to LDI: Code Walkthrough