📄 Research Paper

Attention Is All You Need — Vaswani et al., 2017

Vaswani et al. · NIPS 2017 · arXiv:1706.03762

Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Łukasz Kaiser Illia Polosukhin
📅 Published: June 2017 (v7: Aug 2023)
🏢 Google Brain / Google Research
📍 NIPS 2017, Long Beach, CA
🔗 arxiv.org/abs/1706.03762
💻 tensor2tensor on GitHub
🌟 Why This Paper Matters
This paper introduced the Transformer architecture, the foundation of modern LLMs such as GPT-4, Claude, Gemini, and LLaMA. Understanding it is essential background for any AI engineer.

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

Key Numbers at a Glance

  • 28.4 BLEU on EN→DE (big model, SOTA)
  • 41.8 BLEU on EN→FR (big model, single-model SOTA)
  • 3.5 days to train the big model (8× P100)
  • 8 attention heads (base)
  • d_model = 512 (base)
  • 6 encoder layers and 6 decoder layers
  • 65M parameters (base)
  • 3.3×10¹⁸ FLOPs to train (base)

1. Introduction

Prior to the Transformer, the dominant approaches to sequence modeling were recurrent neural networks (RNNs) — specifically LSTMs and GRUs. These models process sequences token-by-token, maintaining a hidden state that is updated at each step.

⚠ The Problem with RNNs
Sequential computation prevents parallelization. Hidden state h_t depends on h_{t-1}, so you cannot process position 10 until positions 1–9 are done. This becomes a critical bottleneck at long sequence lengths.
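
The dependency can be seen in a minimal sketch, assuming a toy scalar RNN. The weights `w_h` and `w_x` and the function name `rnn_states` are illustrative placeholders, not anything from the paper. Each hidden state needs the previous one, so the loop cannot be split across positions.

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    The weights are illustrative placeholders, not values from the paper.
    """
    states = []
    h = h0
    for x in inputs:  # step t cannot start until step t-1 has produced h
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

states = rnn_states([1.0, 0.0, -1.0])
```

Self-attention, by contrast, computes every position's output from the same input matrix, so all positions can be processed at once.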

Attention mechanisms were already used alongside RNNs to model dependencies regardless of their distance in the input or output sequences. In all but a few cases, however, they were used in conjunction with a recurrent network.

💡 The Key Insight
The Transformer proposes using attention mechanisms alone, dispensing with recurrence and convolutions entirely. This allows significantly more parallelization during training and achieves state-of-the-art quality with far less training time.

2. Background

Earlier attempts to reduce sequential computation used convolutional neural networks (ByteNet, ConvS2S). In those models, the number of operations to relate two positions grows with their distance — linearly for ConvS2S, logarithmically for ByteNet.

Self-attention (intra-attention) relates different positions of a single sequence to compute a representation. It had been used in reading comprehension, summarization, and textual entailment tasks.

ℹ️ Key Claim
The Transformer is the first transduction model relying entirely on self-attention to compute its input/output representations — without sequence-aligned RNNs or convolutions.

3. Model Architecture

The Transformer follows the standard encoder-decoder structure. The encoder maps input (x₁, …, xₙ) to continuous representations z = (z₁, …, zₙ). The decoder generates output (y₁, …, yₘ) one element at a time, auto-regressively.

[Diagram: encoder (×6) with multi-head self-attention and feed-forward sub-layers, each followed by Add & Norm, producing output z; decoder (×6) with masked multi-head self-attention, cross-attention over K and V from the encoder, and a feed-forward sub-layer, followed by Linear + Softmax.]

Figure 1: The Transformer model architecture — encoder (left, blue) and decoder (right, purple) each composed of 6 identical layers.

Figure 1b: Full Transformer architecture in detail — showing encoder (left) and decoder (right) with Nx repetition of identical layers, input/output embeddings, positional encoding, and cross-attention connections.

3.1 Encoder and Decoder Stacks

🔵
Encoder
Stack of N=6 identical layers. Each has: (1) Multi-head self-attention, (2) Position-wise FFN. Both are wrapped in a residual connection followed by LayerNorm. All sub-layers and the embedding layers produce outputs of dimension d_model = 512.
🟣
Decoder
Stack of N=6 identical layers. Each has: (1) Masked self-attention, (2) Cross-attention over encoder output, (3) Position-wise FFN. Masking prevents attending to future positions.
💡 Residual Connections
Every sub-layer computes LayerNorm(x + Sublayer(x)). The residual connection allows gradients to flow unchanged through the network, enabling training of deep stacks.
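
A minimal sketch of the LayerNorm(x + Sublayer(x)) wrapper for a single position vector, written with plain Python lists to stay dependency-free. It omits the learned gain and bias of a full LayerNorm as well as dropout. The lambda is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance.

    The learned gain and bias of full LayerNorm are omitted in this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer standing in for attention or the FFN.
out = residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [0.1 * t for t in v])
```

Because the sub-layer output is added to its input, the sub-layer must return a vector of the same dimension as x, which is why all sub-layers share one output width.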

3.2 Attention

An attention function maps a query and a set of key-value pairs to an output. The output is a weighted sum of values, where weights come from the compatibility of the query with each key.

3.2.1 Scaled Dot-Product Attention

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
Equation (1) — Scaled Dot-Product Attention

Queries and keys have dimension d_k; values have dimension d_v. The scaling factor 1/√d_k prevents the dot products from growing large (which would push the softmax into regions with very small gradients).
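
Equation (1) can be sketched in a few lines of plain Python. This is an illustrative version operating on lists of row vectors, not the paper's implementation. The optional `mask` argument is an assumption added here for convenience, and every row is assumed to have at least one unmasked key.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention on lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    mask[i][j] == False sets score (i, j) to -inf before the softmax.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf")
                      for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# One query that matches the first key more closely than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so here the output leans toward the first value vector.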

⚠ Why Scale by √d_k?
Assume components of q and k are i.i.d. with mean 0, variance 1. Then q·k = Σ qᵢkᵢ has mean 0 but variance d_k. Scaling by 1/√d_k brings the variance back to 1, keeping softmax gradients healthy.
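
The variance argument can be checked numerically. The sketch below assumes standard normal components, which is one distribution that satisfies the mean-0, variance-1 assumption. The function name, trial count, and seed are arbitrary choices made for illustration.

```python
import random
import statistics

def dot_product_variances(d_k, trials=5000, seed=0):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for N(0, 1) components."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / d_k ** 0.5 for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

# With d_k = 64, the raw variance should land near 64 and the scaled one near 1.
raw_var, scaled_var = dot_product_variances(d_k=64)
```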

Figure 2: Scaled Dot-Product Attention — the core building block. Queries (Q), Keys (K), and Values (V) flow through MatMul, Scale, Mask, SoftMax, and MatMul to produce attention-weighted output.

3.2.2 Multi-Head Attention

Instead of one large attention function, project queries, keys and values h times with different learned linear projections, run attention in parallel, then concatenate.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Equation — Multi-Head Attention
💡 Why Multiple Heads?
Each head can attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. In the base model: h = 8, d_k = d_v = d_model/h = 64.
Figure 3: Multi-Head Attention — multiple attention heads (shown as parallel operations) process the input independently, then their outputs are concatenated and passed through a final linear layer.
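
A compact sketch of multi-head self-attention under toy assumptions: random matrices stand in for learned projections, and the sizes (d_model = 8, h = 2) are chosen only to keep the example small. The helper names `matmul`, `multi_head`, and `random_matrix` are illustrative.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def multi_head(X, heads, W_O):
    """Self-attention over X with one (W_Q, W_K, W_V) triple per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the h head outputs along the feature axis, then project.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    """Random weights as a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, h = 8, 2            # toy sizes; the base model uses 512 and 8
d_k = d_model // h           # per-head dimension, d_k = d_v = d_model / h
X = random_matrix(3, d_model, rng)   # 3 positions
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
Y = multi_head(X, heads, W_O)        # 3 x d_model, same shape as X
```

Because each head works in a d_model/h-dimensional subspace, the total cost stays close to that of one full-width attention head.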

3.2.3 Applications of Attention in the Transformer

  • Encoder self-attention: Each encoder position attends to all positions in the previous encoder layer.
  • Decoder masked self-attention: Each decoder position attends to all positions up to and including itself (future positions masked with −∞).
  • Encoder-decoder (cross) attention: Decoder queries attend over all encoder output positions — the keys and values come from the encoder.
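
The decoder's masking rule can be sketched as follows. This is a minimal illustration with hypothetical helper names, showing that a score set to −∞ receives exactly zero weight after the softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row after setting disallowed scores to -inf."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Uniform raw scores: position 1 splits its weight over positions 0 and 1 only.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])
```

Encoder self-attention and cross-attention use no such mask, so every query there can see every key.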

3.3 Position-wise Feed-Forward Networks

Each encoder/decoder layer contains a fully connected FFN applied identically to each position:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Equation (2) — Two linear transformations with ReLU between them

d_model = 512 on input/output; inner layer d_ff = 2048.
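
Equation (2) for a single position, as a small sketch with toy sizes (d_model = 2, d_ff = 3) and hand-picked weights chosen purely for illustration:

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network for a single position vector x.

    W1 is d_model x d_ff and W2 is d_ff x d_model, both as lists of rows.
    """
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Toy sizes (d_model=2, d_ff=3) with hand-picked weights for illustration.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = ffn([2.0, 1.0], W1, b1, W2, b2)
```

The same function is applied to every position independently, which is what "position-wise" means here. The second hidden unit is negative before the ReLU and therefore contributes nothing to the output.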

3.4 Embeddings and Softmax

Learned embeddings convert input and output tokens to vectors of dimension d_model. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, those weights are multiplied by √d_model.
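
A minimal sketch of the weight sharing, assuming a hypothetical 3-token vocabulary and d_model = 4. The matrix `E` and both function names are made up for illustration. The same matrix serves as the lookup table on the way in and as the projection on the way out.

```python
import math

def embed(token_id, E):
    """Look up a token's row in E and scale it by sqrt(d_model)."""
    d_model = len(E[0])
    return [math.sqrt(d_model) * w for w in E[token_id]]

def output_logits(hidden, E):
    """Pre-softmax projection that reuses E: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

# Hypothetical 3-token vocabulary with d_model = 4.
E = [[0.1, 0.2, 0.3, 0.4],
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]
x = embed(1, E)
logits = output_logits(x, E)   # one logit per vocabulary entry
```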

3.5 Positional Encoding

Since the Transformer has no recurrence or convolution, it has no inherent sense of token order. Positional encodings are added to input embeddings at the bottom of both encoder and decoder stacks.

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Sinusoidal positional encoding — each dimension is a sinusoid of different frequency
ℹ️ Why Sinusoids?
For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_pos, which the authors hypothesized would let the model easily learn to attend by relative positions. Sinusoidal encodings may also allow the model to extrapolate to sequences longer than those seen during training.
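
The sketch below computes the sinusoidal encoding and illustrates the linearity claim. Shifting a position by k amounts to rotating each (sin, cos) pair by an angle that depends only on k, which follows from the angle-addition identities. The helper `shift` is a hypothetical name used here to demonstrate the property; it is not part of the model.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) by rotating each (sin, cos) pair.

    The rotation depends only on the offset k, not on pos.
    """
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(theta) + c * math.sin(theta),
                    c * math.cos(theta) - s * math.sin(theta)])
    return out

d_model = 8
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)   # matches `shifted` up to rounding
```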

4. Why Self-Attention

Three motivating criteria for preferring self-attention over recurrent and convolutional layers:

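| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
| Self-Attn (restricted) | O(r · n · d) | O(1) | O(n/r) |

Here n is the sequence length, d the representation dimension, k the convolution kernel size, and r the neighborhood size in restricted self-attention.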
🔑 Key Insight
Self-attention connects all positions with O(1) sequential operations. For typical sentence-level tasks where n < d, self-attention is also faster than recurrent layers per-layer. The O(n²) complexity is manageable for typical NLP sequence lengths.
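
To make the n < d condition concrete, the sketch below plugs sample sizes into the leading-order terms, with constant factors dropped. The sequence lengths and the kernel width k = 3 are arbitrary assumptions chosen for illustration.

```python
def per_layer_cost(n, d, k=3):
    """Leading-order operation counts per layer (constants dropped)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

short = per_layer_cost(n=50, d=512)     # sentence-length input
long = per_layer_cost(n=4096, d=512)    # long-document input
```

The two costs cross at n = d: below that, self-attention does less work per layer than a recurrent layer, and above it the n² term dominates.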

5. Training

📦
Data
WMT 2014 EN-DE: ~4.5M sentence pairs (37K token BPE vocab). WMT 2014 EN-FR: 36M sentences (32K wordpiece vocab).
🖥️
Hardware
8× NVIDIA P100 GPUs. Base model: 0.4s/step, 100K steps (~12 hours). Big model: 1.0s/step, 300K steps (3.5 days).
📐
Optimizer
Adam (β₁=0.9, β₂=0.98, ε=10⁻⁹). Warmup for 4000 steps then inverse square root decay.
🛡️
Regularization
Residual dropout P_drop = 0.1. Label smoothing ε_ls = 0.1 (hurts perplexity but improves BLEU).
lrate = d_model^{-0.5} · min(step_num^{-0.5}, step_num · warmup_steps^{-1.5})
Equation (3) — Learning rate schedule: linear warmup, then inverse sqrt decay
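
The schedule is short enough to write directly. This sketch uses the base-model values d_model = 512 and warmup_steps = 4000 as defaults. The function name `lrate` mirrors the equation and is otherwise arbitrary.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given step (step_num starts at 1)."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

peak = lrate(4000)       # the two branches meet at step_num == warmup_steps
halfway = lrate(2000)    # linear warmup: half the peak at half the warmup
later = lrate(16000)     # inverse-sqrt decay: half the peak at 4x the warmup
```

With these defaults the peak learning rate is roughly 7×10⁻⁴, reached at step 4000.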

6. Results

6.1 Machine Translation

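  • 28.4 BLEU on EN→DE: more than 2 BLEU above the best previously reported models, including ensembles
  • 41.8 BLEU on EN→FR: new single-model SOTA at less than ¼ of the previous state-of-the-art model's training cost
  • 3.3×10¹⁸ FLOPs to train the base model, versus 2.3×10¹⁹ for GNMT+RL on EN→DE, roughly a 7× reduction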
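| Model | EN-DE BLEU | EN-FR BLEU | EN-DE FLOPs | EN-FR FLOPs |
|---|---|---|---|---|
| GNMT + RL | 24.6 | 39.92 | 2.3×10¹⁹ | 1.4×10²⁰ |
| ConvS2S | 25.16 | 40.46 | 9.6×10¹⁸ | 1.5×10²⁰ |
| MoE | 26.03 | 40.56 | 2.0×10¹⁹ | 1.2×10²⁰ |
| ConvS2S Ensemble | 26.36 | 41.29 | 7.7×10¹⁹ | 1.2×10²¹ |
| Transformer (base) | 27.3 | 38.1 | 3.3×10¹⁸ | |
| Transformer (big) | 28.4 | 41.8 | 2.3×10¹⁹ | |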
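
The Transformer FLOPs figures can be reproduced from the training times. The paper estimates training cost as training time × number of GPUs × sustained single-precision throughput per GPU, and uses 9.5 TFLOPS for the P100. That throughput value comes from the paper's footnote rather than from this summary. The function name is illustrative.

```python
P100_FLOPS = 9.5e12  # sustained single-precision estimate used in the paper

def training_flops(hours, gpus=8, flops_per_gpu=P100_FLOPS):
    """Estimated training cost: wall-clock seconds x GPUs x FLOPS per GPU."""
    return hours * 3600 * gpus * flops_per_gpu

base = training_flops(12)         # base model, about 12 hours
big = training_flops(3.5 * 24)    # big model, 3.5 days
```

These come out near 3.3×10¹⁸ and 2.3×10¹⁹, matching the two Transformer rows. Note that this is an estimate from hardware time, not a count of arithmetic operations in the model.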

6.2 Model Variations (Ablations)

Key findings from ablation experiments (Table 3 in the paper):

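  • Attention heads: Single-head attention is 0.9 BLEU worse than the best setting, and quality also drops off with too many heads.
  • Key dimension d_k: Reducing d_k hurts quality, which suggests that determining compatibility is not easy and that a more sophisticated compatibility function than the dot product may help.
  • Model size and dropout: Bigger models are better, and dropout is very helpful in avoiding overfitting.
  • Positional encoding: Learned embeddings yield nearly identical results to sinusoidal encodings; the authors chose the sinusoidal version because it may extrapolate to sequences longer than those seen in training.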

6.3 English Constituency Parsing

To test generalization, the authors trained a 4-layer Transformer (d_model = 1024) on the WSJ Penn Treebank (~40K sentences).

✅ Result
The 4-layer Transformer achieves 92.7 F1 in the semi-supervised setting, outperforming all previously reported models except the Recurrent Neural Network Grammar, despite the lack of task-specific tuning.

7. Conclusion

The Transformer is the first sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. Key advantages:

  • Significantly faster to train than RNN/CNN architectures
  • Achieves new SOTA on EN-DE and EN-FR translation
  • Generalizes to other tasks (constituency parsing) without task-specific modifications
🚀 Legacy
This architecture became the foundation for BERT (2018), GPT (2018), T5 (2019), and the large language models that followed. Many state-of-the-art models in NLP, vision, audio, and multimodal AI now build on the Transformer.

Review Questions

  1. Why does standard RNN processing prevent parallelization, and how does the Transformer solve this?
  2. Derive the scaling factor 1/√d_k from the variance argument. What happens without it?
  3. What is the difference between encoder self-attention, decoder self-attention, and cross-attention?
  4. Why does the paper prefer sinusoidal positional encodings over learned embeddings?
  5. In Table 1 of the paper, self-attention has O(n²·d) complexity per layer. When does this become worse than recurrent layers?
  6. What is label smoothing and why does it hurt perplexity but improve BLEU?
  7. The paper uses beam search at inference. What is beam size and how does it trade off quality vs. speed?