Transformer Architecture Research Notes
Overview
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), revolutionized natural language processing by dispensing with recurrence and convolutions entirely and relying solely on self-attention to model dependencies between positions.
Architecture Visualization
```mermaid
graph TD
    subgraph "Transformer Architecture"
        input[Input Embeddings] --> pe[Positional Encoding]
        subgraph "Encoder Stack (Nx)"
            pe --> sa1[Self Attention]
            sa1 --> add1[Add & Norm]
            add1 --> ff1[Feed Forward]
            ff1 --> add2[Add & Norm]
        end
        subgraph "Decoder Stack (Nx)"
            output[Output Embeddings] --> pe2[Positional Encoding]
            pe2 --> sa2[Masked Self Attention]
            sa2 --> add3[Add & Norm]
            add3 --> ca[Cross Attention]
            ca --> add4[Add & Norm]
            add4 --> ff2[Feed Forward]
            ff2 --> add5[Add & Norm]
        end
        add2 --> ca
        add5 --> linear[Linear]
        linear --> softmax[Softmax]
    end
```
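To make the encoder path in the diagram concrete, here is a minimal sketch of one encoder layer (self-attention and a position-wise feed-forward network, each wrapped in Add & Norm, post-norm as in the original paper). The `EncoderLayer` name and wiring are illustrative assumptions; PyTorch's built-in `nn.MultiheadAttention` is used so the snippet runs on its own, and the custom `MultiHeadAttention` implemented below could be substituted.

```python
from typing import Optional

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block from the diagram: self-attention and a position-wise
    feed-forward network, each followed by a residual connection and LayerNorm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Built-in attention keeps the sketch self-contained.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Post-norm residual connections, as in the original paper.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
```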
Multi-Head Attention Detail
```mermaid
graph LR
    subgraph "Single Attention Head"
        Q[Query Matrix] --> QK[Q × K^T]
        K[Key Matrix] --> QK
        QK --> scale[Scale by √d_k]
        scale --> sm[Softmax]
        sm --> AV[× Value Matrix]
        V[Value Matrix] --> AV
    end
```
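The single-head diagram maps directly onto scaled dot-product attention, softmax(QK^T / √d_k) · V. Below is a minimal sketch; the function name `scaled_dot_product_attention` and the optional `mask`/`dropout` arguments are my additions, chosen to match the `MultiHeadAttention` module implemented further down.

```python
import math
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(
    query: torch.Tensor,                 # (batch, heads, seq_len, d_k)
    key: torch.Tensor,                   # (batch, heads, seq_len, d_k)
    value: torch.Tensor,                 # (batch, heads, seq_len, d_k)
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Module] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns output and attention weights."""
    d_k = query.size(-1)
    # Q × K^T, scaled by sqrt(d_k) as in the diagram.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (mask == 0) are set to -inf before the softmax.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        # Dropout on the attention weights, as in common reference implementations.
        weights = dropout(weights)
    return torch.matmul(weights, value), weights
```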
Implementation
Multi-Head Attention
```python
import torch
import torch.nn as nn
from typing import Optional, Tuple
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model: int = d_model
        self.num_heads: int = num_heads
        self.d_k: int = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
```
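The class above stops at the projections; a forward pass is still needed to split heads, apply scaled dot-product attention, and recombine. The method below is one plausible completion (my sketch, not original code from these notes), reusing the `scaled_dot_product_attention` helper sketched earlier; the optional `mask` argument covers the decoder's masked self-attention.

```python
    def forward(
        self,
        query: torch.Tensor,                 # (batch, seq_len, d_model)
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        batch_size = query.size(0)

        def split_heads(x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
            # Project, then reshape to (batch, num_heads, seq_len, d_k).
            return proj(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.W_q)
        k = split_heads(key, self.W_k)
        v = split_heads(value, self.W_v)

        # Per-head scaled dot-product attention (helper sketched above).
        attn_output, attn_weights = scaled_dot_product_attention(q, k, v, mask, self.dropout)

        # Recombine heads back to (batch, seq_len, d_model), then final projection.
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(attn_output), attn_weights
```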
Positional Encoding
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 5000):
        super().__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
```
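As with the attention module, only the constructor is implemented so far; the standard forward pass simply adds the precomputed encodings to the input embeddings. A minimal completion (my sketch):

```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the matching slice of the
        # precomputed sinusoidal table registered as a buffer above.
        return x + self.pe[:, :x.size(1)]
```

Because `pe` is registered as a buffer rather than a parameter, it moves with the module across devices but is not updated by the optimizer.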
Key Parameters
| Parameter | Value | Description |
|---|---|---|
| d_model | 512 | Model dimension |
| num_heads | 8 | Number of attention heads |
| d_ff | 2048 | Feed-forward network dimension |
| num_layers | 6 | Number of encoder/decoder layers |
| dropout | 0.1 | Dropout rate |
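The base-model hyperparameters above can be plugged straight into the modules in these notes. The snippet below is an illustrative wiring check, assuming the `EncoderLayer` and `PositionalEncoding.forward` sketches above; the vocabulary size is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

# Base-model hyperparameters from the table above.
d_model, num_heads, d_ff, num_layers, dropout = 512, 8, 2048, 6, 0.1
vocab_size = 32000  # placeholder vocabulary size, not from the paper

embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = PositionalEncoding(d_model)
encoder = nn.ModuleList(
    [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
)

# Forward pass over a dummy batch of token ids.
tokens = torch.randint(0, vocab_size, (2, 10))            # (batch=2, seq_len=10)
x = pos_encoding(embedding(tokens) * (d_model ** 0.5))    # embeddings scaled by sqrt(d_model)
for layer in encoder:
    x = layer(x)
print(x.shape)  # torch.Size([2, 10, 512])
```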
References
- Vaswani, A., et al. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
- The Annotated Transformer - Harvard NLP
- PyTorch Documentation - nn.MultiheadAttention
TODO Implementation Tasks [0/3]
- [ ] Add layer normalization implementation
- [ ] Implement full encoder block
- [ ] Add training loop with example data
Notes
- The architecture eliminates the need for recurrence and convolutions
- Attention weights provide interpretability
- Positional encoding enables sequence awareness
- Parallel processing enables efficient training