Transformer Architecture Research Notes
Overview
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized natural language processing by dispensing with recurrence and convolutions entirely and relying solely on self-attention to model dependencies between positions.
Architecture Visualization
```mermaid
graph TD
    subgraph "Transformer Architecture"
        input[Input Embeddings] --> pe[Positional Encoding]
        subgraph "Encoder Stack (Nx)"
            pe --> sa1[Self Attention]
            sa1 --> add1[Add & Norm]
            add1 --> ff1[Feed Forward]
            ff1 --> add2[Add & Norm]
        end
        subgraph "Decoder Stack (Nx)"
            output[Output Embeddings] --> pe2[Positional Encoding]
            pe2 --> sa2[Masked Self Attention]
            sa2 --> add3[Add & Norm]
            add3 --> ca[Cross Attention]
            ca --> add4[Add & Norm]
            add4 --> ff2[Feed Forward]
            ff2 --> add5[Add & Norm]
        end
        add2 --> ca
        add5 --> linear[Linear]
        linear --> softmax[Softmax]
    end
```
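To make the encoder path in the diagram concrete, here is a minimal sketch of one encoder layer (self-attention and a position-wise feed-forward network, each wrapped in Add & Norm, post-norm as in the original paper). The `EncoderLayer` name and wiring are illustrative assumptions; PyTorch's built-in `nn.MultiheadAttention` is used so the snippet runs on its own, and the custom `MultiHeadAttention` implemented below could be substituted.

```python
from typing import Optional

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block from the diagram: self-attention and a position-wise
    feed-forward network, each followed by a residual connection and LayerNorm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Built-in attention keeps the sketch self-contained.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Post-norm residual connections, as in the original paper.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
```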
Multi-Head Attention Detail
```mermaid
graph LR
    subgraph "Single Attention Head"
        Q[Query Matrix] --> QK[Q × K^T]
        K[Key Matrix] --> QK
        QK --> scale[Scale by √d_k]
        scale --> sm[Softmax]
        sm --> AV[× Value Matrix]
        V[Value Matrix] --> AV
    end
```
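The single-head diagram maps directly onto scaled dot-product attention, softmax(QK^T / √d_k) · V. Below is a minimal sketch; the function name `scaled_dot_product_attention` and the optional `mask`/`dropout` arguments are my additions, chosen to match the `MultiHeadAttention` module implemented further down.

```python
import math
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(
    query: torch.Tensor,                 # (batch, heads, seq_len, d_k)
    key: torch.Tensor,                   # (batch, heads, seq_len, d_k)
    value: torch.Tensor,                 # (batch, heads, seq_len, d_k)
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Module] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns output and attention weights."""
    d_k = query.size(-1)
    # Q × K^T, scaled by sqrt(d_k) as in the diagram.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (mask == 0) are set to -inf before the softmax.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        # Dropout on the attention weights, as in common reference implementations.
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```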
Implementation
Multi-Head Attention
```python
import torch
import torch.nn as nn
from typing import Optional, Tuple
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model: int = d_model
        self.num_heads: int = num_heads
        self.d_k: int = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
```
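The class above stops at the projections; a forward pass is still needed to split heads, apply scaled dot-product attention, and recombine. The method below is one plausible completion (my sketch, not original code from these notes), reusing the `scaled_dot_product_attention` helper sketched earlier; the optional `mask` argument covers the decoder's masked self-attention.

```python
    def forward(
        self,
        query: torch.Tensor,                 # (batch, seq_len, d_model)
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        batch_size = query.size(0)

        def split_heads(x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
            # Project, then reshape to (batch, num_heads, seq_len, d_k).
            return proj(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.W_q)
        k = split_heads(key, self.W_k)
        v = split_heads(value, self.W_v)

        # Per-head scaled dot-product attention (helper sketched above).
        attn_output, attn_weights = scaled_dot_product_attention(q, k, v, mask, self.dropout)

        # Recombine heads back to (batch, seq_len, d_model), then final projection.
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(attn_output), attn_weights
```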
Positional Encoding
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000):
        super().__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
```
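As with the attention module, only the constructor is implemented so far; the standard forward pass simply adds the precomputed encodings to the input embeddings. A minimal completion (my sketch):

```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the matching slice of the
        # precomputed sinusoidal table registered as a buffer above.
        return x + self.pe[:, :x.size(1)]
```

Because `pe` is registered as a buffer rather than a parameter, it moves with the module across devices but is not updated by the optimizer.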
Key Parameters
| Parameter | Value | Description |
|---|---|---|
| d_model | 512 | Model dimension |
| num_heads | 8 | Number of attention heads |
| d_ff | 2048 | Feed-forward network dimension |
| num_layers | 6 | Number of encoder/decoder layers |
| dropout | 0.1 | Dropout rate |
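The base-model hyperparameters above can be plugged straight into the modules in these notes. The snippet below is an illustrative wiring check, assuming the `EncoderLayer` and `PositionalEncoding.forward` sketches above; the vocabulary size is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

# Base-model hyperparameters from the table above.
d_model, num_heads, d_ff, num_layers, dropout = 512, 8, 2048, 6, 0.1
vocab_size = 32000  # placeholder vocabulary size, not from the paper

embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = PositionalEncoding(d_model)
encoder = nn.ModuleList(
    [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
)

# Forward pass over a dummy batch of token ids.
tokens = torch.randint(0, vocab_size, (2, 10))            # (batch=2, seq_len=10)
x = pos_encoding(embedding(tokens) * (d_model ** 0.5))    # embeddings scaled by sqrt(d_model)
for layer in encoder:
    x = layer(x)
print(x.shape)  # torch.Size([2, 10, 512])
```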
References
- Vaswani, A., et al. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
- The Annotated Transformer - Harvard NLP
- PyTorch Documentation - nn.MultiheadAttention
TODO Implementation Tasks [0/3]
- [ ] Add layer normalization implementation
- [ ] Implement full encoder block
- [ ] Add training loop with example data
Notes
- The architecture eliminates the need for recurrence and convolutions
- Attention weights provide interpretability
- Positional encoding enables sequence awareness
- Parallel processing enables efficient training