AI Research Answer

how does BERT pre-training work

5 cited papers · May 25, 2026 · Powered by Researchly AI

TL;DR

BERT (Bidirectional Encoder Representations from Transformers) is a pre-training approach that learns deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [1][2] (Devlin et al., 2019).

BERT is built on the Transformer architecture, which relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

[1] Jacob Devlin, Ming-Wei Chang et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
[2] Kaitao Song, Xu Tan et al. (2019). MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv.
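The attention mechanism mentioned above can be illustrated with a minimal sketch of scaled dot-product self-attention in plain Python. This is a simplification under stated assumptions: it omits the learned query/key/value projections and the multiple heads that a real Transformer layer uses, and the toy vectors and the function names `softmax` and `self_attention` are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of equal-length float vectors, one per token.
    No mask is applied, so every token attends to every other token,
    to its left and to its right.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (illustrative numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Because no causal mask is applied, each row of `weights` puts non-zero weight on positions both before and after the current token, which is the property BERT's bidirectional conditioning relies on.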
  • Masked Language Modeling (MLM): BERT randomly masks a subset of input tokens (15% in the original paper) and trains the model to predict them from the surrounding left and right context, which forces it to learn bidirectional representations. [1][2] (Devlin et al., 2019)
  • Next Sentence Prediction (NSP): A second pre-training objective in which the model predicts whether the second of two input sentences follows the first in the original text, which helps it learn inter-sentence relationships. [1]
  • Transformer Encoder: The underlying architecture, based solely on self-attention, processes the full input sequence in parallel and produces a contextual representation for every token. [3]
  • Fine-tuning: After pre-training, BERT can be adapted to downstream tasks (e.g., question answering, named entity recognition, sentence classification) by adding just one additional output layer, without substantial task-specific architecture modifications. [1]
[1] Jacob Devlin, Ming-Wei Chang et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
[2] Kaitao Song, Xu Tan et al. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv.
[3] Ashish Vaswani, Noam Shazeer et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
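The MLM corruption step from the list above can be sketched as follows. The 15% selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follow Devlin et al. (2019); everything else is an assumption for illustration: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip, which only approximates choosing exactly 15% of positions.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never corrupt special tokens; skip positions that are not selected.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # hypothetical toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot assume that a non-[MASK] token is always correct, which reduces the mismatch with fine-tuning, where [MASK] never appears.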
Diagram
    RAW TEXT CORPUS (Unlabeled)
                 │
                 ▼
┌─────────────────────────────────┐
│          TOKENIZATION           │
│  [CLS] tok1 [MASK] tok3 [SEP]   │
│  tok5 tok6 tok7 [MASK] [SEP]    │
└────────────────┬────────────────┘
                 │ Token + Segment + Position Embeddings
                 ▼
┌─────────────────────────────────┐
│         EMBEDDING LAYER         │
│  Token Emb + Segment Emb        │
│  + Position Emb                 │
│  Output dim: [Batch x Seq x H]  │
└────────────────┬────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│    TRANSFORMER ENCODER STACK    │
│  ┌───────────────────────────┐  │
│  │ Layer 1: Multi-Head       │  │
│  │ Self-Attention + FFN      │  │
│  └─────────────┬─────────────┘  │
│                │                │
│  ┌─────────────▼─────────────┐  │
│  │ Layer 2: Multi-Head       │  │
│  │ Self-Attention + FFN      │  │
│  └─────────────┬─────────────┘  │
│                │ (x N layers)   │
│  ┌─────────────▼─────────────┐  │
│  │ Layer N: Multi-Head       │  │
│  │ Self-Attention + FFN      │  │
│  └─────────────┬─────────────┘  │
└────────────────┼────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│   CONTEXTUAL REPRESENTATIONS    │
│   [Batch x Seq x Hidden_dim]    │
└──────┬──────────────────┬───────┘
       │                  │
       ▼                  ▼
┌─────────────┐   ┌───────────────┐
│  MLM HEAD   │   │   NSP HEAD    │
│  Predict    │   │   Is Next     │
│ masked toks │   │   Sentence?   │
│ (vocab size)│   │   (binary)    │
└──────┬──────┘   └───────┬───────┘
       │                  │
       ▼                  ▼
┌─────────────────────────────────┐
│   COMBINED PRE-TRAINING LOSS    │
│          L_MLM + L_NSP          │
└────────────────┬────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│     PRE-TRAINED BERT MODEL      │
│  (Saved weights / checkpoint)   │
└────────────────┬────────────────┘
                 │ Add task-specific output layer
                 ▼
┌─────────────────────────────────┐
│           FINE-TUNING           │
│ QA / NER / Classification etc.  │
└─────────────────────────────────┘
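The two-segment input at the top of the diagram ([CLS] sentence A [SEP] sentence B [SEP]) and its NSP label can be sketched as below. The 50/50 split between a true next sentence and a random one follows Devlin et al. (2019). The helper name `make_nsp_example` and the toy document are hypothetical, and drawing the negative sentence from the same short list is a simplification; the original setup samples it from the corpus.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one NSP example from a list of tokenized sentences.

    Returns (tokens, segment_ids, is_next). Assumes at least three
    sentences and index < len(sentences) - 1.
    """
    first = sentences[index]
    is_next = rng.random() < 0.5
    if is_next:
        second = sentences[index + 1]  # the true next sentence
    else:
        # Simplification: pick any sentence other than the pair itself.
        candidates = [i for i in range(len(sentences))
                      if i not in (index, index + 1)]
        second = sentences[rng.choice(candidates)]
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, is_next

doc = [["the", "cat", "sat"], ["it", "was", "tired"],
       ["stocks", "fell", "today"], ["rain", "is", "likely"]]
tokens, segment_ids, is_next = make_nsp_example(doc, 0, random.Random(0))
```

The `segment_ids` list is what the segment embedding in the diagram would look up, and `is_next` is the binary target for the NSP head.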
BERT uses two simultaneous pre-training objectives: MLM, where tokens are randomly masked and the model predicts them using bidirectional context, and NSP, where the model predicts whether one sentence follows another. Both objectives jointly shape the learned representations. [1] A key limitation of MLM noted in subsequent work is that BERT neglects the dependency among predicted tokens, since masked tokens are predicted independently of each other. [2][3] (Song et al., 2020) Furthermore, research has shown that BERT sentence embeddings without fine-tuning induce a non-smooth anisotropic semantic space, which harms performance on semantic similarity tasks (Li et al., 2020).

[1] Jacob Devlin, Ming-Wei Chang et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
[2] Kaitao Song, Xu Tan et al. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv.
[3] Kaitao Song, Xu Tan et al. (2019). MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv.
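The combined loss L_MLM + L_NSP from the diagram can be sketched for a single example as follows. Averaging the MLM term over masked positions matches the description in Devlin et al. (2019); the function names, the toy probability vectors, and the label convention [P(NotNext), P(IsNext)] are assumptions made for illustration.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])

def pretraining_loss(mlm_probs, mlm_targets, nsp_probs, nsp_target):
    """L = L_MLM + L_NSP for one example.

    mlm_probs:   one probability vector over the vocabulary per position
    mlm_targets: target vocabulary index per position, or None if not masked
    nsp_probs:   [P(NotNext), P(IsNext)] (assumed label order)
    nsp_target:  0 or 1
    """
    masked = [(p, t) for p, t in zip(mlm_probs, mlm_targets) if t is not None]
    # Only masked positions contribute to the MLM term.
    l_mlm = sum(cross_entropy(p, t) for p, t in masked) / len(masked) if masked else 0.0
    l_nsp = cross_entropy(nsp_probs, nsp_target)
    return l_mlm + l_nsp

# Toy example: 3 positions, vocabulary of size 3, positions 1 and 2 masked.
mlm_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
mlm_targets = [None, 1, 2]
loss = pretraining_loss(mlm_probs, mlm_targets, [0.1, 0.9], 1)
```

Both terms are plain cross-entropies, so a single backward pass through the shared encoder updates it for both objectives at once.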
Table
Aspect                  | Detail
Pre-training objectives | MLM + NSP
Context direction       | Bidirectional (left + right)
Architecture base       | Transformer Encoder
Fine-tuning overhead    | One additional output layer
Downstream tasks        | QA, NER, classification
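To make the "one additional output layer" row concrete, here is a minimal sketch of a classification head: a single linear layer plus softmax applied to the encoder's final [CLS] vector. The class name `ClassificationHead`, the hidden size of 8, the three labels, and the random stand-in for the [CLS] vector are all illustrative assumptions; a real setup would take that vector from the pre-trained encoder and train the head and encoder together.

```python
import math
import random

class ClassificationHead:
    """A single linear layer followed by softmax, applied to the [CLS] vector."""

    def __init__(self, hidden_size, num_labels, rng):
        # Small random weights; the scale 0.02 is an illustrative choice.
        self.weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                        for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, cls_vector):
        logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
                  for row, b in zip(self.weights, self.bias)]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

rng = random.Random(0)
# Stand-in for the encoder's final hidden state at the [CLS] position.
cls_vector = [rng.uniform(-1.0, 1.0) for _ in range(8)]
head = ClassificationHead(hidden_size=8, num_labels=3, rng=rng)
probs = head(cls_vector)
```

The head adds only hidden_size x num_labels weights plus num_labels biases, which is why the table describes the fine-tuning overhead as a single layer.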
BERT's MLM objective neglects dependency among the predicted (masked) tokens, since each masked position is predicted independently, which limits the model's ability to capture inter-token relationships during pre-training. [1] Additionally, BERT sentence embeddings without fine-tuning produce a non-smooth anisotropic semantic space that poorly captures sentence-level semantic meaning. [2]
[1] Kaitao Song, Xu Tan et al. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv.
[2] Bohan Li, Hao Zhou et al. (2020). On the Sentence Embeddings from Pre-trained Language Models. OpenAlex.
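The independence assumption described above can be written out explicitly. The notation here is an assumption introduced for illustration, not taken from the cited papers: $\mathcal{M} = \{t_1, \dots, t_m\}$ is the set of masked positions, $x_{\mathcal{M}}$ the original tokens at those positions, and $x_{\setminus \mathcal{M}}$ the corrupted input that the model sees.

```latex
% MLM treats masked tokens as conditionally independent given the corrupted input:
P(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}})
  \approx \prod_{t \in \mathcal{M}} P(x_t \mid x_{\setminus \mathcal{M}}),
\qquad
\mathcal{L}_{\mathrm{MLM}}
  = -\sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}}).

% An exact chain-rule factorization would also condition on the other masked tokens:
P(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}})
  = \prod_{k=1}^{m} P\bigl(x_{t_k} \mid x_{t_1}, \dots, x_{t_{k-1}}, x_{\setminus \mathcal{M}}\bigr).
```

The gap between the two factorizations is the dependency among predicted tokens that BERT neglects: when two masked tokens belong to the same phrase, the prediction for one cannot use the identity of the other.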
  • BERT pre-trains deep bidirectional representations by jointly conditioning on left and right context across all layers. [1]
  • The two core pre-training objectives are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). [1][2][3][4]
  • The Transformer encoder, based solely on attention mechanisms, is the architectural backbone enabling BERT's parallel, context-rich processing. [1][5]
  • MLM's independence assumption among masked tokens is a known limitation that later models like MPNet sought to address. [2]
  • Fine-tuning BERT for downstream tasks requires only one additional output layer, making it highly adaptable. [1]
[1] Jacob Devlin, Ming-Wei Chang et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
[2] Kaitao Song, Xu Tan et al. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv.
[3] Bohan Li, Hao Zhou et al. (2020). On the Sentence Embeddings from Pre-trained Language Models. OpenAlex.
[4] Kaitao Song, Xu Tan et al. (2019). MASS: Masked Sequence to Sequence Pre-training for Language Generation. arXiv.
[5] Ashish Vaswani, Noam Shazeer et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).