Comparing Transformer Architectures
BERT, GPT, and T5 all use the Transformer, but with different attention patterns and pre-training objectives.
The core Transformer can be adapted into three main architectures, each optimized for different tasks:
- BERT (encoder-only): best for understanding
- GPT (decoder-only): best for generation
- T5 (encoder-decoder): best for sequence-to-sequence tasks
Each variant stacks transformer blocks differently:
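As a rough sketch of how the three stacks differ (not taken from the original post), here is how the layouts map onto PyTorch's built-in transformer modules; the depths and sizes are illustrative, not those of the real models:

```python
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # illustrative sizes only

# Encoder-only (BERT-style): a stack of bidirectional self-attention blocks.
encoder_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Decoder-only (GPT-style): structurally the same stack, but run with a
# causal mask so each position only sees earlier positions.
decoder_only = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Encoder-decoder (T5-style): an encoder stack plus a decoder stack whose
# blocks add cross-attention over the encoder outputs.
encoder_decoder = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=n_layers,
    num_decoder_layers=n_layers,
    batch_first=True,
)
```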
The key difference lies in the attention mask M:
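The mask enters through the standard scaled dot-product attention (the usual Transformer formula, restated here since the original rendering is not reproduced):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V
$$

With $M_{ij} = 0$ the query at position $i$ may attend to the key at position $j$; with $M_{ij} = -\infty$ that connection is zeroed out after the softmax.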
BERT: Full Attention (Bidirectional)
In BERT, every token can attend to every other token in the sequence. This bidirectional attention allows the model to understand context from both left and right sides simultaneously.
In the resulting attention matrix, every entry is non-zero: each row (query token) attends to all columns (key tokens). This is ideal for understanding tasks like classification and question answering.
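Since the interactive matrix view is not reproduced here, a minimal sketch of the two mask patterns, using an arbitrary sequence length, might look like this:

```python
import torch

seq_len = 5  # arbitrary example length

# Bidirectional (BERT-style): the additive mask is all zeros, so every
# query position attends to every key position.
bidirectional_mask = torch.zeros(seq_len, seq_len)

# Causal (GPT-style): positions above the diagonal are -inf, so each token
# attends only to itself and earlier tokens.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

print(bidirectional_mask)  # all zeros -> fully filled attention matrix
print(causal_mask)         # -inf above the diagonal -> lower-triangular attention
```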
Each model learns through a different self-supervised task:
Masked Language Modeling (MLM)
BERT randomly masks 15% of input tokens and learns to predict them using bidirectional context. The model sees "[MASK]" and must predict the original word.
1. Randomly select tokens to mask
2. Replace them with the [MASK] token
3. Predict the original tokens using all surrounding context (see the sketch below)
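A minimal sketch of these three steps in plain PyTorch (the mask id, masking probability, and toy inputs are placeholders; real BERT additionally replaces some selected tokens with random tokens or leaves them unchanged):

```python
import torch

MASK_ID = 103     # placeholder [MASK] id (103 in bert-base-uncased's vocab)
MLM_PROB = 0.15   # fraction of tokens selected for prediction

def mask_tokens(input_ids: torch.Tensor):
    """Return (masked inputs, labels) for masked language modeling."""
    labels = input_ids.clone()

    # 1. Randomly select ~15% of the positions.
    selected = torch.rand(input_ids.shape) < MLM_PROB

    # Non-selected positions are ignored by the loss (-100 is PyTorch's
    # default ignore_index for cross-entropy).
    labels[~selected] = -100

    # 2. Replace the selected positions with [MASK].
    masked_inputs = input_ids.clone()
    masked_inputs[selected] = MASK_ID

    # 3. The model predicts `labels` at the masked positions from the
    #    bidirectional context in `masked_inputs`.
    return masked_inputs, labels

# Toy batch of random token ids, just to show the shapes involved.
inputs, labels = mask_tokens(torch.randint(1000, 2000, (2, 8)))
```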
Different architectures excel at different tasks:
- BERT (understanding): text classification, named entity recognition, question answering
- GPT (generation): text generation, code completion, creative writing
- T5 (seq-to-seq): translation, summarization, multi-task learning
| Aspect | BERT | GPT | T5 |
|---|---|---|---|
| Architecture | Encoder | Decoder | Enc-Dec |
| Attention | Bidirectional | Causal | Mixed |
| Pre-training | MLM | CLM | Denoise |
| Best For | Understanding | Generation | Seq2Seq |