AI Research Answer

What is BERT and how does it work

Generated by Researchly AI · May 25, 2026 · 9 sources

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained deep learning model designed to learn bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.¹ (Devlin et al., 2019)

BERT is built upon the Transformer architecture, which relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
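To make the attention-only idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the learned query/key/value projections and the multiple heads of a real Transformer layer are omitted, and the function name `self_attention` and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token, to its left and to its right.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy 2-dimensional "token" vectors (assumed values, not real embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = self_attention(tokens, tokens, tokens)
```

Because no mask is applied to the scores, every token receives a nonzero weight from every other token, which is the property that lets an encoder condition on both left and right context.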

Transformer Backbone: A network architecture based solely on attention mechanisms, forming the foundational building block upon which BERT is constructed.¹,²

Masked Language Modeling (MLM): A pre-training objective in which tokens in the input are randomly masked and the model is trained to predict them, enabling deep bidirectional context learning across all layers.²,³ (Devlin et al., 2019)

Next Sentence Prediction (NSP): A second pre-training objective in which BERT learns to predict whether two sentences appear consecutively in a document, supporting tasks like question answering and sentence classification.² (Devlin et al., 2019)

Fine-tuning: After pre-training, BERT can be adapted to downstream NLU tasks such as question answering, named entity recognition, and sentiment analysis by adding just one additional output layer, without substantial task-specific architecture modifications.⁴,² (Devlin et al., 2019)
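The NSP objective described above can be sketched as a data-construction step. This is a hedged illustration, not the original pipeline: the helper `make_nsp_example` is hypothetical, whitespace splitting stands in for WordPiece, the 50/50 split between true and random pairs is stated from the BERT paper rather than from this page, and negatives are drawn from the same toy list rather than from a different document.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one sentence pair with segment ids and an is-next label."""
    first = sentences[index]
    if rng.random() < 0.5 and index + 1 < len(sentences):
        # Positive pair: sentence B really follows sentence A.
        second, is_next = sentences[index + 1], True
    else:
        # Negative pair: any sentence except A itself and its true successor.
        candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, is_next = rng.choice(candidates), False
    first_tokens, second_tokens = first.split(), second.split()
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    boundary = len(first_tokens) + 2
    segment_ids = [0] * boundary + [1] * (len(tokens) - boundary)
    return tokens, segment_ids, is_next

rng = random.Random(0)
doc = ["the cat sat", "it purred", "dogs bark", "birds sing"]
tokens, segment_ids, is_next = make_nsp_example(doc, 0, rng)
```

The segment ids produced here are what the "Segment Embeddings (Sent A / B)" step of the input pipeline consumes.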

Diagram

        ┌─────────────────────────────────────┐
        │           RAW TEXT INPUT            │
        │     "The cat [MASK] on the mat"     │
        └────────────────┬────────────────────┘
                         │
                         ▼
        ┌─────────────────────────────────────┐
        │            TOKENIZATION             │
        │         WordPiece tokenizer         │
        │  → [CLS] The cat [MASK] on the mat  │
        │    [SEP]                            │
        └────────────────┬────────────────────┘
                         │
                         ▼
        ┌─────────────────────────────────────┐
        │          INPUT EMBEDDINGS           │
        │          Token Embeddings           │
        │                  +                  │
        │   Segment Embeddings (Sent A / B)   │
        │                  +                  │
        │         Position Embeddings         │
        │     → Combined vector per token     │
        └────────────────┬────────────────────┘
                         │
                         ▼
        ┌─────────────────────────────────────┐
        │      TRANSFORMER ENCODER STACK      │
        │  ┌───────────────────────────────┐  │
        │  │ Layer 1: Multi-Head Self-     │  │
        │  │ Attention + Feed-Forward      │  │
        │  └──────────────┬────────────────┘  │
        │                 │                   │
        │  ┌──────────────▼────────────────┐  │
        │  │ Layer 2: Multi-Head Self-     │  │
        │  │ Attention + Feed-Forward      │  │
        │  └──────────────┬────────────────┘  │
        │                 │ (× N layers)      │
        │  ┌──────────────▼────────────────┐  │
        │  │ Layer N: Multi-Head Self-     │  │
        │  │ Attention + Feed-Forward      │  │
        │  └───────────────────────────────┘  │
        │  BERT-Base: 12 layers, 768 hidden   │
        │  BERT-Large: 24 layers, 1024 hidden │
        └────────────────┬────────────────────┘
                         │
          ┌──────────────┴──────────────┐
          │                             │
          ▼                             ▼
┌────────────────────┐  ┌────────────────────────┐
│    PRE-TRAINING    │  │      FINE-TUNING       │
│                    │  │                        │
│ Objective 1: MLM   │  │ Add task-specific      │
│ Predict [MASK]     │  │ output layer           │
│ tokens             │  │                        │
│                    │  │ Tasks:                 │
│ Objective 2: NSP   │  │ - Question Answering   │
│ Predict if Sent B  │  │ - NER                  │
│ follows Sent A     │  │ - Sentiment Analysis   │
└────────────────────┘  │ - Text Classification  │
                        └────────────────────────┘
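The "Input Embeddings" stage of the diagram can be sketched as an element-wise sum of three lookups. This is a toy sketch under assumptions: the tables hold small random numbers instead of learned weights, `HIDDEN` is 4 rather than 768, the vocabulary is a hand-picked handful of tokens, and the layer normalization and dropout that BERT implementations apply after the sum are left out.

```python
import random

HIDDEN = 4          # toy size; the diagram lists 768 for BERT-Base
MAX_POSITIONS = 16  # toy limit on sequence length (assumed)

rng = random.Random(0)

def make_table(num_rows):
    # Stand-in for a learned embedding matrix: one small random vector per row.
    return [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(num_rows)]

vocab = {tok: i for i, tok in enumerate(
    ["[CLS]", "[SEP]", "[MASK]", "the", "cat", "on", "mat"])}
token_table = make_table(len(vocab))
segment_table = make_table(2)            # sentence A / sentence B
position_table = make_table(MAX_POSITIONS)

def embed(tokens, segment_ids):
    """Return one combined vector per token: token + segment + position."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        parts = (token_table[vocab[token]],
                 segment_table[segment],
                 position_table[position])
        vectors.append([t + s + p for t, s, p in zip(*parts)])
    return vectors

tokens = ["[CLS]", "the", "cat", "[MASK]", "on", "the", "mat", "[SEP]"]
embedded = embed(tokens, [0] * len(tokens))
```

Note that the two occurrences of "the" get different combined vectors, because their position embeddings differ even though the token and segment embeddings are identical.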

BERT's pre-training uses two complementary objectives: MLM, which randomly masks 15% of input tokens for the model to predict, and NSP, which trains the model to understand inter-sentence relationships.¹,² (Devlin et al., 2019) This bidirectional design contrasts with unidirectional approaches like GPT, which processes text in only one direction. That distinction is highlighted in comparative evaluations of GPT, BERT, and XLNet across benchmarks such as SQuAD and GLUE.³ (Zhou, 2024)

Despite BERT's strong contextual representations, research has found that BERT sentence embeddings without fine-tuning induce a non-smooth, anisotropic semantic space that can harm performance on semantic similarity tasks.¹ (Li et al., 2020)
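The 15% masking step of MLM mentioned above can be sketched as follows. Two details are assumptions beyond this page: the 80/10/10 split (replace with `[MASK]`, replace with a random token, or leave unchanged) is stated from the BERT paper, and drawing independently per token is a simplification of selecting a fixed share of positions. The helper name `mask_tokens` and the tiny replacement vocabulary are hypothetical.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never select special tokens; select others with probability mask_prob.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

rng = random.Random(1)
tokens = ["[CLS]"] + ["w%d" % i for i in range(200)] + ["[SEP]"]
masked, labels = mask_tokens(tokens, ["a", "b", "c"], rng)
```

Only positions with a non-`None` label contribute to the MLM loss; all other positions pass through unchanged and serve as context.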

Table

Feature	BERT
Architecture	Transformer Encoder
Pre-training Objectives	MLM + NSP
Context Direction	Bidirectional
Fine-tuning	Single output layer added
Key Tasks	QA, NER, Classification
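The "single output layer added" row can be illustrated with a minimal classification head: a linear layer plus softmax applied to the final hidden vector of the `[CLS]` token. This is a sketch with made-up numbers; `cls_vector`, `weights`, and `bias` are illustrative values, and in real fine-tuning the head and the encoder weights are trained together.

```python
import math

def classify(cls_vector, weights, bias):
    """Linear layer followed by softmax over class logits."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-dimensional [CLS] representation and a 2-class head (assumed values).
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.0, 0.0, 0.3],
           [-0.2, 0.1, 0.0, -0.1]]
bias = [0.0, 0.0]
probs = classify(cls_vector, weights, bias)
```

The head adds only `hidden_size × num_classes` weights plus one bias per class, which is why adapting BERT to a new classification task needs so little task-specific architecture.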

BERT pre-trains deep bidirectional representations by jointly conditioning on left and right context in all layers using MLM and NSP objectives.¹,² (Devlin et al., 2019)

The Transformer's attention-only architecture is the foundational backbone that makes BERT's design possible.³,¹

BERT can be fine-tuned with just one additional output layer for tasks like question answering, NER, and sentence classification.⁴,⁵,¹ (Devlin et al., 2019)

Without fine-tuning, BERT sentence embeddings form an anisotropic space that limits semantic similarity performance.¹ (Li et al., 2020)

BERT's MLM does not model dependencies among masked tokens, a known limitation of the objective. (Song et al., 2020)
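One way to build intuition for the anisotropy point above is to compare the average pairwise cosine similarity of two sets of vectors. This is a hedged illustration only: the vectors are toy 2-dimensional values rather than BERT embeddings, and mean pairwise cosine is one simple diagnostic, not necessarily the measurement used by Li et al. (2020).

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Vectors crowded into a narrow cone: an anisotropic toy space.
narrow_cone = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1]]
# Vectors spread across directions: a more isotropic toy space.
spread_out = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

cone_score = mean_pairwise_cosine(narrow_cone)
spread_score = mean_pairwise_cosine(spread_out)
```

When nearly all vectors point the same way, cosine similarity is high for every pair, so it stops discriminating between related and unrelated sentences.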
