Compare BERT, GPT, T5

Question

Rahul Pal · Accepted Answer

## Overview

The Transformer architecture, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, forms the shared backbone for BERT, GPT, and T5. Each of these models adapts this foundation for a distinct pre-training paradigm, leading to different strengths across NLP tasks. Devlin et al. (2019)

## Key Concepts

- **Transformer**: A network architecture based solely on attention mechanisms, enabling parallel sequence modeling without recurrent connections and achieving state-of-the-art results on translation tasks.
- **BERT**: Pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, using masked language modeling (MLM) and next sentence prediction objectives. Devlin et al. (2019)
- **GPT**: Demonstrates that large gains on NLP tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, using a unidirectional (left-to-right) language modeling objective. Radford et al. (2018)
- **T5**: Introduces a unified framework that converts all text-based language problems into a text-to-text format, combining insights from systematic exploration of transfer learning techniques with the large-scale C4 dataset. Raffel et al. (2020)

## System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   Transformer Foundation                    │
│         (Multi-Head Self-Attention + Feed-Forward)          │
└───────────────┬───────────
```
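The architectural split between the three models comes down largely to which positions each token is allowed to attend to. The sketch below is a minimal illustration in plain Python, not code from any of the cited papers; the function names (`bidirectional_mask`, `causal_mask`, `encoder_decoder_masks`) are hypothetical and chosen only for this example. A `1` means "may attend", a `0` means "blocked".

```python
def bidirectional_mask(n):
    """Encoder-style (BERT) mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (GPT) mask: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def encoder_decoder_masks(src_len, tgt_len):
    """Encoder-decoder (T5-style) layout, simplified to three masks.

    - encoder self-attention is bidirectional over the source,
    - decoder self-attention is causal over the target,
    - cross-attention lets each target position see the whole source.
    """
    return {
        "encoder": bidirectional_mask(src_len),
        "decoder": causal_mask(tgt_len),
        "cross": [[1] * src_len for _ in range(tgt_len)],
    }


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
```

Real implementations usually express these masks as tensors and add padding masks on top; the lists of lists here are only meant to make the attention pattern visible.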

## Technical Details or Comparison

| Feature | BERT | GPT / GPT-3 | T5 |
| --- | --- | --- | --- |
| Architecture | Encoder-only (bidirectional) | Decoder-only (unidirectional, autoregressive) | Encoder-decoder |
| Parameters | 110 million (BERT-Base), 340 million (BERT-Large) | 175 billion (GPT-3) | Up to 11 billion (T5-11B) |
| Training Data | Unlabeled text (BooksCorpus and English Wikipedia) | Diverse corpus of unlabeled text | Colossal Clean Crawled Corpus (C4), about 750 GB |
| Key Innovation | Bidirectional context via MLM and next sentence prediction | Generative pre-training; few-shot learning at scale | Unified text-to-text framework for all NLP tasks |
| Strengths | State-of-the-art on NLU tasks (QA, classification, NER) with minimal task-specific modifications | Task-agnostic few-shot performance without gradient updates or fine-tuning | Covers summarization, QA, classification, and more under one framework |
| Weaknesses | Not natively generative; needs labeled data for task-specific fine-tuning | Unidirectional context; weaker on token-level NLU | Expensive encoder-decoder stack; the largest model has 11B parameters |
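To make the "Key Innovation" row concrete, the following sketch shows how one tokenized sentence could be turned into a training example under each objective. It is a simplified illustration under stated assumptions: the helper names are hypothetical, the masked positions are passed in explicitly rather than sampled, and BERT's actual procedure (random selection of about 15% of tokens with an 80/10/10 replacement rule) is deliberately left out. The `summarize` prefix follows the task-prefix convention described by Raffel et al. (2020).

```python
MASK = "[MASK]"


def mlm_example(tokens, mask_positions):
    """BERT-style masked LM: hide the chosen tokens and predict them
    using context from both the left and the right."""
    positions = set(mask_positions)
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets


def causal_lm_example(tokens):
    """GPT-style left-to-right LM: at each step, predict the next token
    from the tokens seen so far."""
    return tokens[:-1], tokens[1:]


def text_to_text_example(task_prefix, source_text, target_text):
    """T5-style text-to-text: input and output are both plain strings,
    and the task is named by a prefix on the input."""
    return f"{task_prefix}: {source_text}", target_text


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    print(mlm_example(tokens, [1, 4]))
    print(causal_lm_example(tokens))
    print(text_to_text_example("summarize", "long article text", "short summary"))
```

The contrast is in the targets: the masked objective predicts only the hidden positions, the causal objective predicts every next token, and the text-to-text objective treats the whole output string as the target regardless of task.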

