Transformer Architecture Research Notes

Overview

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized natural language processing by dispensing with recurrence and convolutions and relying entirely on self-attention.

Architecture Visualization

   graph TD
       subgraph "Transformer Architecture"
           input[Input Embeddings] --> pe[Positional Encoding]
           
           subgraph "Encoder Stack (Nx)"
               pe --> sa1[Self Attention]
               sa1 --> add1[Add & Norm]
               add1 --> ff1[Feed Forward]
               ff1 --> add2[Add & Norm]
           end
           
           subgraph "Decoder Stack (Nx)"
               output[Output Embeddings] --> pe2[Positional Encoding]
               pe2 --> sa2[Masked Self Attention]
               sa2 --> add3[Add & Norm]
               add3 --> ca[Cross Attention]
               ca --> add4[Add & Norm]
               add4 --> ff2[Feed Forward]
               ff2 --> add5[Add & Norm]
           end
           
           add2 --> ca
           add5 --> linear[Linear]
           linear --> softmax[Softmax]
       end
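
For reference, a minimal sketch of one encoder block from the diagram above, assuming post-norm residual connections (Add & Norm after each sub-layer) as in the original paper. It uses PyTorch's built-in nn.MultiheadAttention purely to keep the snippet self-contained; the class name and defaults here are illustrative, not from the paper.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """One encoder layer: self-attention + feed-forward, each wrapped in Add & Norm."""

        def __init__(self, d_model: int = 512, num_heads: int = 8,
                     d_ff: int = 2048, dropout: float = 0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                                   dropout=dropout, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Self-attention sub-layer with residual connection and LayerNorm
            attn_out, _ = self.self_attn(x, x, x, need_weights=False)
            x = self.norm1(x + self.dropout(attn_out))
            # Position-wise feed-forward sub-layer with residual connection and LayerNorm
            return self.norm2(x + self.dropout(self.ff(x)))

    # Example: encode a batch of two 10-token sequences
    block = EncoderBlock()
    h = block(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)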

Multi-Head Attention Detail

   graph LR
       subgraph "Single Attention Head"
           Q[Query Matrix] --> QK[Q × K^T]
           K[Key Matrix] --> QK
        QK --> scale[Scale by √d_k]
           scale --> sm[Softmax]
           sm --> AV[× Value Matrix]
           V[Value Matrix] --> AV
       end
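
The head-level computation shown above is scaled dot-product attention, softmax(QK^T / √d_k) V. A minimal standalone sketch (the function name and tensor shapes are illustrative):

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Q, K, V: (batch, seq_len, d_k); mask broadcastable to (batch, seq_len, seq_len)
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # Q × K^T, scaled by √d_k
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)             # attention weights
        return weights @ V, weights                         # weighted sum of the values

    # Example: one head with d_k = 64 on two sequences of 10 tokens
    Q = torch.randn(2, 10, 64)
    K = torch.randn(2, 10, 64)
    V = torch.randn(2, 10, 64)
    out, w = scaled_dot_product_attention(Q, K, V)          # out: (2, 10, 64)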

Implementation

Multi-Head Attention

    import torch
    import torch.nn as nn
    from typing import Optional, Tuple
    import math

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
            super().__init__()
            assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
            
            self.d_model: int = d_model
            self.num_heads: int = num_heads
            self.d_k: int = d_model // num_heads
            
            # Linear projections
            self.W_q = nn.Linear(d_model, d_model)
            self.W_k = nn.Linear(d_model, d_model)
            self.W_v = nn.Linear(d_model, d_model)
            self.W_o = nn.Linear(d_model, d_model)
            
            self.dropout = nn.Dropout(dropout)
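
        # A sketch of the forward pass implied by the diagrams above: project,
        # split into heads, apply scaled dot-product attention, concatenate,
        # and project back. Argument names and the mask convention are assumptions.
        def forward(
            self,
            query: torch.Tensor,
            key: torch.Tensor,
            value: torch.Tensor,
            mask: Optional[torch.Tensor] = None,
        ) -> torch.Tensor:
            batch_size = query.size(0)

            # Project inputs and reshape to (batch, num_heads, seq_len, d_k)
            Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

            # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
            scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
            if mask is not None:
                scores = scores.masked_fill(mask == 0, float('-inf'))
            attn = self.dropout(torch.softmax(scores, dim=-1))

            # Concatenate heads and apply the output projection
            context = torch.matmul(attn, V).transpose(1, 2).contiguous()
            context = context.view(batch_size, -1, self.d_model)
            return self.W_o(context)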

Positional Encoding

    class PositionalEncoding(nn.Module):
        def __init__(self, d_model: int, max_seq_length: int = 5000):
            super().__init__()
            
            pe = torch.zeros(max_seq_length, d_model)
            position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(
                torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
            )
            
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            pe = pe.unsqueeze(0)
            
            self.register_buffer('pe', pe)
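
        # A sketch of the forward pass: add the precomputed (fixed) encodings
        # to the input embeddings; x is assumed to be (batch, seq_len, d_model).
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.pe[:, :x.size(1)]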

Key Parameters

Parameter     Value   Description
------------  ------  --------------------------------
d_model       512     Model dimension
num_heads     8       Number of attention heads
d_ff          2048    Feed-forward network dimension
num_layers    6       Number of encoder/decoder layers
dropout       0.1     Dropout rate
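
As a quick sanity check, the modules above can be instantiated with the base-model values from the table. This usage sketch assumes the forward passes sketched in the Implementation section; the tensor shapes are illustrative.

    mha = MultiHeadAttention(d_model=512, num_heads=8, dropout=0.1)
    pos_enc = PositionalEncoding(d_model=512)

    x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
    x = pos_enc(x)                # add fixed positional information
    out = mha(x, x, x)            # self-attention: query = key = value
    print(out.shape)              # torch.Size([2, 10, 512])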

References

  1. Vaswani, A., et al. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
  2. The Annotated Transformer - Harvard NLP
  3. PyTorch Documentation - nn.MultiheadAttention

TODO Implementation Tasks [0/3]

  • [ ] Add layer normalization implementation
  • [ ] Implement full encoder block
  • [ ] Add training loop with example data

Notes

  • The architecture eliminates the need for recurrence and convolutions
  • Attention weights provide interpretability
  • Positional encoding enables sequence awareness
  • Parallel processing enables efficient training

Author: Jason Walsh

Last Updated: 2025-07-30 13:45:27