Brown et al. (2020) [1] introduced GPT-3, an autoregressive language model with 175 billion parameters, described as 10x more than any previous non-sparse language model at the time [1]. A key finding of that paper is that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes reaching competitiveness with prior state-of-the-art fine-tuning approaches [1].
[1] Tom B. Brown, Benjamin Mann et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS).
Crucially, GPT-3 is applied without any gradient updates or fine-tuning; tasks and few-shot demonstrations are specified purely via text interaction with the model [1]. This contrasts with traditional NLP fine-tuning, which typically requires task-specific datasets of thousands or tens of thousands of examples [1]. GPT-3 achieves strong performance across translation, question-answering, cloze tasks, and tasks requiring on-the-fly reasoning or domain adaptation [1].
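To make "specified purely via text" concrete, here is a minimal sketch of how a few-shot prompt might be assembled. The `build_few_shot_prompt` helper and the `Input:`/`Output:` layout are illustrative assumptions, not the format used by Brown et al.; the point is only that the task description and the demonstrations are plain text, and no model weights change.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a prompt: task description, K worked examples, then the open query.

    The "Input:/Output:" layout is a hypothetical convention chosen for
    illustration; any consistent textual format could be used instead.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final item is left open so the model continues with the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Two demonstrations (K = 2) for a toy translation task.
demos = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", demos, "cat")
print(prompt)
```

The resulting string would be sent to the model as-is. Changing the task means changing the text, not retraining.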
Architectural Foundations
The Transformer architecture introduced by Vaswani et al. (2017) [2], based solely on attention mechanisms and dispensing with recurrence and convolutions, underlies models like GPT-3 and enables parallel sequence modeling [2]. Earlier work by Radford et al. (2018) demonstrated that generative pre-training on a diverse corpus of unlabeled text yields large gains on NLP tasks, laying the groundwork for GPT-3.
[2] Ashish Vaswani, Noam Shazeer et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
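The attention mechanism at the core of the Transformer can be sketched in a few lines. The following is a simplified, pure-Python illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, for a single head. It leaves out the learned projections, multiple heads, and the causal masking that autoregressive models apply, and the toy inputs are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Each output row is the attention-weighted average of the value rows.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return outputs


# Toy inputs: 2 positions, d_k = 2 (values chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Each query row is handled independently of the others, which is what allows all positions in a sequence to be processed in parallel rather than step by step as in a recurrent model.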
Zhao et al. (2026) survey the field and identify in-context learning and prompt engineering as key utilization strategies that optimize real-world LLM deployment. Separately, Chung et al. (2022) found that instruction finetuning, when scaled in both the number of tasks and model size, dramatically improves few-shot performance across prompting setups and evaluation benchmarks, with Flan-PaLM 540B outperforming PaLM 540B by +9.4% on average.
Applied Example
In a clinical NLP application, Augusto et al. (2024) compared GPT-3.5 and GPT-4 against smaller fine-tuned models, finding that GPT-4 "vastly outperformed all other models for this task at any level of in-context learning," correctly annotating 94% of hydroxychloroquine and 95% of prednisone medication signatures with 100 in-context examples.