compare BERT GPT T5 transformer architecture
Below is a structured comparison drawing only from the retrieved evidence. Where the evidence does not state a detail (e.g., exact parameter counts for BERT or the original Transformer), I note that explicitly rather than filling in the gap.
Model-by-Model Overview
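The Transformer is the baseline architecture: an encoder-decoder model built on attention alone, with no recurrence or convolutions. The retrieved evidence reports it on WMT 2014 translation tasks, where it is more parallelizable and significantly faster to train than prior recurrent models.
BERT is a deep bidirectional encoder that conditions on left and right context in all layers. It is pre-trained on unlabeled text with masked language modeling (MLM) and next-sentence prediction (NSP), and can be fine-tuned with "just one additional output layer" for diverse NLU tasks.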
GPT takes the opposite directional approach: it is a unidirectional, left-to-right language model. Radford et al. (2018) demonstrated "that large gains on NLP tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text." The retrieved chunks say little more about GPT-1; more detail is available for GPT-3. Brown et al. (2020) trained GPT-3 as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model." Critically, GPT-3 is applied "without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction," demonstrating few-shot generalization at scale.
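To make "tasks and few-shot demonstrations specified purely via text interaction" concrete, here is a minimal sketch of how such a prompt could be assembled. The function name `build_few_shot_prompt` and the `Input:` / `Output:` template are hypothetical choices for illustration, not the format used by Brown et al. (2020), and no model is called.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a plain-text prompt: description, K demonstrations, then the open query.

    The "Input:" / "Output:" labels are an assumed template, chosen only for illustration.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")  # blank line between demonstrations
    # The query is left unanswered; an autoregressive model would continue from here.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The point of the sketch is that the "training signal" for the task lives entirely in the prompt text: changing the task means changing the string, not the model weights.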
T5 unifies these paradigms. Raffel et al. (2020) introduced "a unified framework that converts all text-based language problems into a text-to-text format," systematically comparing pre-training objectives, architectures, and datasets. Their largest model, T5-11B, has 11 billion parameters, uses span corruption as its pre-training objective, and was trained on the "Colossal Clean Crawled Corpus" (C4).
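Span corruption can be sketched as a text-to-text transformation: contiguous spans are removed from the input and replaced by sentinel tokens, and the target lists each sentinel followed by the text it replaced. The sketch below is a simplification under stated assumptions: the spans are passed in explicitly rather than sampled at random, tokens are whitespace-split words rather than subword pieces, and the `<extra_id_N>` sentinel naming and the function name `span_corrupt` are illustrative.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input_text, target_text).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Real pre-training samples spans randomly; here they are given explicitly.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the target restores the removed span
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because both the corrupted input and the target are plain strings, the pre-training objective has the same text-in, text-out shape as the downstream tasks.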
Comparison Table
| Feature | Transformer | BERT | GPT / GPT-3 | T5 |
|---|---|---|---|---|
| Architecture | Encoder-decoder; attention-only, no recurrence or convolutions | Deep bidirectional encoder; conditions on left and right context in all layers | Autoregressive (unidirectional) language model; generative pre-training | Encoder-decoder; unified text-to-text framework |
| Parameters | Not stated in retrieved evidence | Not stated in retrieved evidence | GPT-3: 175 billion | T5-11B: 11 billion |
| Training Data | WMT 2014 translation corpora (reported tasks) | Unlabeled text (bidirectional context) | GPT: diverse unlabeled text corpus; GPT-3: large text corpus | C4 (Colossal Clean Crawled Corpus) |
| Key Innovation | Attention-only architecture enabling full parallelization | Bidirectional pre-training via MLM + NSP | Generative pre-training; few-shot learning without fine-tuning at scale | Unified text-to-text format; systematic study of transfer learning factors |
| Strengths | More parallelizable and significantly faster to train than prior recurrent models; strong MT performance | Fine-tunable with "just one additional output layer" for diverse NLU tasks | GPT-3 achieves strong performance "without any gradient updates or fine-tuning" across translation, QA, cloze, and reasoning tasks | State-of-the-art results on benchmarks covering summarization, QA, text classification, and more |
| Weaknesses / Limitations | Evidence only discusses MT tasks; generalization scope not detailed in retrieved chunks | Requires task-specific fine-tuning datasets* | Fine-tuning paradigm (prior to GPT-3) requires "thousands or tens of thousands of examples" | Best results require scale (11B parameters) and a massive bespoke corpus (C4) |
* Brown et al. (2020) characterize the fine-tuning requirement as a limitation of the pre-train–fine-tune paradigm generally, which applies to BERT-style models.
Key Takeaways
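- The Transformer is the attention-only encoder-decoder architecture, with no recurrence or convolutions, that the other three models build on.
- BERT and T5 both rely on fine-tuning, differing in that BERT is encoder-only/bidirectional while T5 frames everything as text-to-text with an encoder-decoder.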
- GPT / GPT-3 use a unidirectional autoregressive design; GPT-3's scale enables few-shot task performance without any fine-tuning at all.
- T5 is the only model among these whose evidence explicitly reports a systematic comparison of pre-training objectives and architectures, positioning it as a meta-study as much as a model proposal.
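The bidirectional versus unidirectional distinction in these takeaways comes down to which positions each token is allowed to attend to. The following is a minimal sketch of the two attention masks in plain Python; the function names are hypothetical, and real implementations apply such masks to attention scores inside the model rather than building nested lists.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Autoregressive mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for name, mask in [("bidirectional", bidirectional_mask(4)), ("causal", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

A BERT-style encoder corresponds to the first mask and a GPT-style language model to the second. An encoder-decoder such as T5 or the original Transformer combines both: a bidirectional mask in the encoder and a causal mask in the decoder.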
Honesty note: Parameter counts for the original Transformer and BERT are not present in the retrieved evidence. I have not filled those cells with values from outside the provided sources, per grounding rules.