Self-Attention Mechanism in Transformers
1. Origins: The Transformer Architecture
2. Mathematical Grounding of Self-Attention
3. The Quadratic Scaling Problem
Standard self-attention scales quadratically with sequence length, which makes long sequences prohibitively expensive in time and memory for Transformer-based models. Beltagy et al. (2020) introduced the Longformer to address this: it combines local windowed attention with task-motivated global attention, so the attention cost scales linearly with sequence length.
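To make the scaling difference concrete, the sketch below counts query-key score entries under full attention and under a sliding-window pattern with a few global tokens. The function names, the window size, and the choice of the first `num_global` tokens as the global ones are illustrative assumptions, not details taken from Beltagy et al. (2020).

```python
def full_attention_pairs(n: int) -> int:
    """Score entries in standard self-attention: every token attends to every token."""
    return n * n


def windowed_attention_pairs(n: int, window: int, num_global: int = 0) -> int:
    """Score entries for a sliding window plus a few global tokens.

    Assumptions (hypothetical, for illustration only): each token attends to
    neighbours within window // 2 positions on either side, and the first
    num_global tokens attend to, and are attended by, every token.
    """
    half = window // 2
    g = min(num_global, n)
    total = 0
    for i in range(n):
        if i < g:
            total += n  # a global token attends to the whole sequence
            continue
        lo = max(0, i - half)
        hi = min(n - 1, i + half)
        local = hi - lo + 1
        # Global columns that fall to the left of the local window.
        extra_global = min(g, lo)
        total += local + extra_global
    return total


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=64, num_global=2))
```

Doubling `n` quadruples the full-attention count but only roughly doubles the windowed count, which is the linear behavior the Longformer design relies on.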
4. System Pipeline — ASCII Diagram
│
▼
│
│  ┌───────────────────────────────────────────────────────┐
└─►│ MULTI-HEAD SELF-ATTENTION                             │
   │ (H independent attention heads in parallel)           │
   │                                                       │
   │ For each head h:                                      │
   │                                                       │
   │    Input                                              │
   │      │                                                │
   │      ├───────────┬──────────────────────────────┐     │
   │      ▼           ▼                              ▼     │
   │   ┌─────┐     ┌─────┐                        ┌─────┐  │
   │   │ Wq  │     │ Wk  │                        │ Wv  │  │
   │   │proj.│     │proj.│                        │proj.│  │
   │   └──┬──┘     └──┬──┘                        └──┬──┘  │
   │      │           │                              │     │
   │      ▼           ▼                              │     │
   │   ┌─────┐     ┌─────┐                           │     │
   │   │Query│     │ Key │                           │     │
   │   │  Q  │     │  K  │                           │     │
   │   └──┬──┘     └──┬──┘                           │     │
   │      │           │                              │     │
   │      ▼           ▼                              │     │
   │  ┌──────────────────┐                           │     │
   │  │      Q · Kᵀ      │                           │     │
   │  │  (Asymmetric     │◄── Directional            │     │
   │  │  dot-product     │    relationships          │     │
   │  │  for directional │    [CITATION:             │     │
   │  │  relationships)  │    e90v9as]               │     │
   │  └────────┬─────────┘                           │     │
   │           │                                     │     │
   │           ▼                                     │     │
   │  ┌──────────────────┐                           │     │
   │  │     Softmax      │◄── Scales quadratically   │     │
   │  │    (Attention    │    with seq. length n     │     │
   │  │     Weights)     │    O(n²) cost             │     │
   │  └────────┬─────────┘                           │     │
   │           │                                     │     │
   │           ▼                                     ▼     │
   │  ┌─────────────────────────────────────────────────┐  │
   │  │  Attention Weights · Value (V)                  │  │
   │  │  → Weighted contextual output                   │  │
   │  └────────────────────────┬────────────────────────┘  │
   │                           │                           │
   └───────────────────────────┼───────────────────────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │   Concatenate   │  ← All H heads combined
                      │   all H heads   │
                      └────────┬────────┘
                               │
                               ▼
                     ┌───────────────────┐
                     │ Output Projection │
                     │        Wo         │
                     └─────────┬─────────┘
                               │
                               ▼
                       CONTEXTUAL OUTPUT
                        REPRESENTATIONS
                       [o₁, o₂, ..., oₙ]
── LONGFORMER VARIANT ───────────────────────────────────────
Standard O(n²) self-attention ──► replaced with:
   ┌──────────────────────────────────┐
   │ Local Windowed Attention         │  ← O(n) scaling
   │                +                 │
   │ Task-Motivated Global Attention  │
   └──────────────────────────────────┘
   Enables processing documents of
   thousands of tokens or longer
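The per-head pipeline in the diagram (project to Q, K, V; score with Q · Kᵀ; softmax; weight V; concatenate the heads; apply Wo) can be sketched in dependency-free Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names and the toy dimensions in the usage lines are hypothetical, and the division by √d_k follows the common scaled dot-product formulation (Vaswani et al., 2017), which the diagram itself does not show.

```python
import math
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> tuple[Matrix, Matrix]:
    """One head: returns (weighted output, n x n attention weights)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # n x n matrix, hence the quadratic cost
    # Assumption: scores are scaled by sqrt(d_k) before the softmax.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x: Matrix, heads: list[tuple[Matrix, Matrix, Matrix]], wo: Matrix) -> Matrix:
    """Run every head on the same input, concatenate per token, then apply Wo."""
    head_outputs = [attention_head(x, wq, wk, wv)[0] for wq, wk, wv in heads]
    concat = [[v for out in head_outputs for v in out[i]] for i in range(len(x))]
    return matmul(concat, wo)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Hypothetical helper: random weights stand in for learned projections."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d_model, num_heads, d_head = 4, 6, 2, 3  # toy sizes, chosen arbitrarily
    x = random_matrix(n, d_model, rng)
    heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(num_heads)]
    wo = random_matrix(num_heads * d_head, d_model, rng)
    out = multi_head_self_attention(x, heads, wo)
    print(len(out), len(out[0]))  # one d_model-sized vector per input token
```

Each head produces its own n x n weight matrix, so the memory and compute of the scoring step grow with the square of the sequence length regardless of how many heads are used.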
5. Key Findings Summary
| Aspect | Finding | Source |
|---|---|---|
| Core design | Attention-only; no recurrence or convolutions | |
| Q-K-V mechanism | Asymmetric projection for directional relationships | |
| Positional encodings | Structured refinement of projection principle | |
| Multi-head attention | Structured refinement of same projection principle | |
| Scaling cost | Quadratic O(n²) in sequence length | |
| Longformer fix | Linear O(n) via local + global attention combination | Beltagy et al. (2020) |
Scope note: The specific dimension values (e.g., d_model, d_k) are not stated in the retrieved evidence and are therefore omitted per grounding rules.