Compare BERT, GPT, and T5 — how do they differ in pre-training objectives and architecture?

Rahul Pal · Accepted Answer

## Overview

The Transformer architecture, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, forms the shared foundation for BERT, GPT, and T5 (Vaswani et al., 2017). Building on this foundation, BERT, GPT, and T5 each adopt distinct pre-training paradigms and architectural configurations to address different NLP challenges (Devlin et al., 2019).

---

## Key Concepts

- **Transformer**: A network architecture based solely on attention mechanisms, enabling parallel sequence modeling without recurrent connections and achieving state-of-the-art results on machine translation tasks (Vaswani et al., 2017).
- **BERT**: Pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context, using masked language modeling (MLM) and next sentence prediction objectives (Devlin et al., 2019).
- **GPT**: Demonstrates that large gains on NLP tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, using a unidirectional (left-to-right) language modeling objective (Radford et al., 2018).
- **T5**: Introduces a unified framework that converts every NLP problem into a text-to-text format, enabling a single encoder-decoder model to handle diverse tasks.

---

## System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      TRANSFORMER BACKBONE                       │
│                (Multi-Head Self-Attention + FFN)                │
└───────────────┬─────────────────┬──────────
```
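The architectural split between the three models comes down largely to which positions each token is allowed to attend to. The sketch below is illustrative only: the function names are hypothetical, it uses plain Python lists rather than any real library, and it shows the masking pattern, not an actual attention implementation.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT, T5 encoder): every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (GPT, T5 decoder): position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cross_attention_mask(tgt_len, src_len):
    """Encoder-decoder cross-attention (T5): each target position sees the whole source."""
    return [[True] * src_len for _ in range(tgt_len)]


def render(mask):
    """Draw a mask as text: 'x' = attention allowed, '.' = blocked."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


print("Encoder-only self-attention (bidirectional):")
print(render(bidirectional_mask(4)))
print("Decoder-only self-attention (causal):")
print(render(causal_mask(4)))
print("Encoder-decoder cross-attention (3 target x 4 source positions):")
print(render(cross_attention_mask(3, 4)))
```

In this picture, an encoder-only model uses only the first mask, a decoder-only model uses only the second, and an encoder-decoder model combines all three: a bidirectional mask in the encoder, a causal mask in the decoder, and full cross-attention between them.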

## Technical Details or Comparison

| Feature | BERT | GPT | T5 |
| --- | --- | --- | --- |
| Architecture | Encoder-only Transformer | Decoder-only Transformer | Encoder-decoder Transformer |
| Pre-training objective | Masked language modeling (MLM) + next sentence prediction | Unidirectional (causal) language modeling | Span-corruption denoising, cast as text-to-text generation |
| Context direction | Bidirectional (left + right) | Unidirectional (left-to-right) | Bidirectional encoder + autoregressive decoder |
| Key innovation | Deep bidirectional representations via joint left-right context conditioning | Generative pre-training on unlabeled text for NLP gains | Unified text-to-text framework for all NLP tasks |
| Parameters | Base: ~110M; Large: ~340M | GPT-1: ~117M; GPT-3: 175B | Up to 11B |
| Strengths | Strong NLU: QA, NER, classification | Few-shot learning (GPT-3); strong text generation | Versatile; handles generation and understanding uniformly |
| Weaknesses | Not natively generative; limited for open-ended generation | Unidirectional context; weaker on token-level NLU | Expensive encoder-decoder stack; higher compute cost |
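To make the three pre-training objectives concrete, the sketch below builds one training example of each kind from a token list. It is a simplified illustration under stated assumptions: the function names are hypothetical, whitespace tokens stand in for subword tokens, the MLM function always substitutes `[MASK]` (BERT's actual recipe also sometimes keeps or randomizes the selected token), the spans for T5 are passed in by hand rather than sampled, and the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
import random

MASK = "[MASK]"


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM: hide a random subset of tokens.

    Labels are set only at hidden positions (None elsewhere), so the model
    is scored only on the tokens it could not see.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def causal_lm_example(tokens):
    """GPT-style causal LM: the target at each step is simply the next token."""
    return tokens[:-1], tokens[1:]


def span_corruption_example(tokens, spans):
    """T5-style span corruption (assumes non-overlapping (start, end) spans).

    Each span is replaced by one sentinel in the input; the target lists each
    sentinel followed by the tokens it replaced, then a closing sentinel.
    """
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
print(mlm_example(tokens))
print(causal_lm_example(tokens))
print(span_corruption_example(tokens, [(2, 4), (8, 9)]))
```

The contrast mirrors the table: MLM predicts isolated hidden tokens using context on both sides, causal LM predicts every next token from the left context only, and span corruption has the decoder generate the missing spans as a short output sequence.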
