Machine Learning Design Patterns

Overview

Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson, and Michael Munn (O'Reilly, 2020) provides solutions to common challenges in ML engineering. The book organizes patterns into categories addressing data representation, problem framing, model training, resilient serving, reproducibility, and responsible AI.

This document summarizes the core patterns from the book and extends them with modern developments from 2024-2025, particularly around Large Language Models (LLMs), foundation models, and contemporary MLOps practices.

Background

The book emerged from the authors' experience at Google Cloud, distilling recurring solutions to ML problems into reusable patterns. Unlike traditional software design patterns, ML design patterns must address:

  • Data quality and representation challenges
  • Training/serving skew
  • Model reproducibility
  • Continuous model improvement
  • Fairness and explainability requirements

Key Concepts

Pattern Categories (Original Book)

Data Representation Patterns

Hashed Feature

Transform high-cardinality categorical variables into fixed-size representations using hashing. Useful when the full vocabulary is unknown at training time or too large to enumerate.

import hashlib

def hash_feature(value, num_buckets=1000):
    # Deterministic hash of the raw value, reduced modulo a fixed bucket
    # count; unseen categories map to some bucket instead of failing.
    hash_value = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hash_value % num_buckets

Embeddings

Learn dense, low-dimensional representations of sparse, high-dimensional data. Essential for text, user IDs, and categorical features.
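
A minimal PyTorch sketch: an embedding table maps integer IDs to trainable dense vectors (the vocabulary size and dimensionality here are illustrative):

import torch
import torch.nn as nn

# Map 10,000 sparse IDs (e.g., user IDs) into 16-dimensional dense vectors.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=16)

user_ids = torch.tensor([3, 17, 42])  # batch of integer IDs
dense = embedding(user_ids)           # shape: (3, 16), learned during training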

Feature Cross

Combine multiple features into a single synthetic feature so that even simple (e.g., linear) models can capture non-linear interactions.
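
As an illustration, a crossed feature can be built by concatenating the component values and reusing hash_feature from above (the separator and bucket count are illustrative):

def cross_feature(values, num_buckets=1000):
    # Concatenate the component features, then hash the crossed value into
    # a fixed vocabulary (hash_feature is defined in Hashed Feature above).
    crossed = "_x_".join(str(v) for v in values)
    return hash_feature(crossed, num_buckets)

# e.g., hour-of-day crossed with day-of-week captures commute-time effects
bucket = cross_feature(["9", "monday"])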

Multimodal Input

Handle inputs from different modalities (text, images, structured data) by learning separate representations and combining them.

Problem Representation Patterns

Reframing

Transform a problem type to leverage different ML approaches (e.g., regression to classification, or vice versa).

Multilabel

Handle cases where instances can belong to multiple classes simultaneously.

Ensemble

Combine multiple models to improve predictions through averaging, voting, or stacking.
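
A minimal scikit-learn sketch of a soft-voting ensemble; it assumes X_train and y_train are already defined:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predicted probabilities across heterogeneous models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)  # X_train / y_train assumed to exist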

Cascade

Chain models where each addresses a specific subtask or confidence level.

Neutral Class

Add an explicit "unknown" or "uncertain" class to handle ambiguous cases.

Model Training Patterns

Useful Overfitting

Intentionally overfit on small datasets or specific cases (e.g., lookup tables, memorization).

Checkpoints

Save model state periodically during training for recovery and analysis.

Transfer Learning

Leverage pre-trained models and adapt them to new tasks with less data.

Distribution Strategy

Parallelize training across multiple devices or machines.

Hyperparameter Tuning

Systematically search the hyperparameter space using grid search, random search, or Bayesian optimization.
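
A hedged scikit-learn sketch of random search; the estimator and search space are illustrative, and X_train / y_train are assumed to exist:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Sample 20 configurations from a log-uniform range for C.
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-3, 1e2), "kernel": ["rbf", "linear"]},
    n_iter=20,
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)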

Resilient Serving Patterns

Stateless Serving Function

Package models as stateless functions for scalable, reliable serving.
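
As a sketch, a stateless FastAPI endpoint; load_model is a hypothetical loader that runs once per worker process:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = load_model("model/")  # hypothetical loader; executed once at startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # No per-request state: the same input always yields the same output,
    # so the service can be replicated freely behind a load balancer.
    return {"prediction": model.predict([req.features]).tolist()}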

Batch Serving

Process predictions in bulk for offline workloads.

Continued Model Evaluation

Monitor model performance continuously in production.

Two-Phase Predictions

Split a use case into two phases: a simpler model handles the common case (often on-device), deferring to a larger model only for hard or low-confidence inputs.

Keyed Predictions

Track predictions through the system with unique identifiers.

Reproducibility Patterns

Transform

Encapsulate feature engineering logic for consistent application in training and serving.
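
A minimal scikit-learn sketch in which the transforms live inside the pipeline, so identical logic runs at training and serving time (the column names and data are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Persisting this single object captures both the transforms and the model,
# eliminating one common source of training/serving skew.
pipeline = Pipeline([
    ("transform", ColumnTransformer([
        ("scale", StandardScaler(), ["age", "income"]),
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_df, labels)  # train_df / labels assumed to exist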

Repeatable Splitting

Ensure deterministic train/validation/test splits.
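
A Python analogue of the book's FARM_FINGERPRINT approach, sketched here by keying the split on a stable identifier column rather than a random number:

import hashlib

def assign_split(key: str) -> str:
    # Hash a stable column (e.g., user_id) so the same row always lands in
    # the same split across reruns, even as new data arrives.
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"          # 80%
    return "valid" if bucket == 8 else "test"  # 10% / 10%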

Bridged Schema

Handle schema evolution between training data and serving inputs.

Windowed Inference

Process streaming data in windows for consistent feature computation.

Workflow Pipeline

Orchestrate ML workflows as directed acyclic graphs (DAGs).

Feature Store

Centralize feature computation and serving for consistency.

Model Versioning

Track model versions with their associated code, data, and hyperparameters.

Responsible AI Patterns

Heuristic Benchmark

Establish baseline performance using simple rules or heuristics.

Explainable Predictions

Provide interpretable outputs alongside predictions.

Fairness Lens

Evaluate model performance across different demographic groups.

Modern Updates (2024-2025)

LLM-Specific Patterns

Retrieval-Augmented Generation (RAG)

Combine LLMs with external knowledge retrieval to ground responses in factual, up-to-date information.

Architecture Components
  • Document ingestion and chunking pipeline
  • Embedding model for semantic encoding
  • Vector database for similarity search
  • Retrieval strategy (dense, sparse, or hybrid)
  • LLM for response generation with retrieved context
Implementation Pattern
# LangChain's package layout changed in 0.1+; these imports target the
# langchain-community / langchain-openai split.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Create vector store from documents (`chunks` comes from the ingestion
# and chunking pipeline above)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
Advanced RAG Variants
  • HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve retrieval
  • RAPTOR: Recursive abstractive processing for hierarchical retrieval
  • Corrective RAG: Verify and refine retrieved documents
  • Self-RAG: Model decides when retrieval is needed
  • Graph RAG: Combine knowledge graphs with vector retrieval

Prompt Engineering Patterns

Few-Shot Learning

Provide examples in the prompt to guide model behavior without fine-tuning.

Classify the sentiment of movie reviews.

Review: "The acting was superb and the plot kept me engaged."
Sentiment: Positive

Review: "Waste of two hours. Predictable and boring."
Sentiment: Negative

Review: "{{user_review}}"
Sentiment:

Chain-of-Thought (CoT)

Encourage step-by-step reasoning for complex tasks.

Q: If a train travels 120 miles in 2 hours, what is its speed?

Let me think step by step:
1. Speed = Distance / Time
2. Distance = 120 miles
3. Time = 2 hours
4. Speed = 120 / 2 = 60 miles per hour

A: 60 miles per hour

ReAct (Reasoning + Acting)

Interleave reasoning traces with actions (tool use, retrieval).

Self-Consistency

Sample multiple reasoning paths and select the most consistent answer.
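
A sketch of self-consistency, where generate and extract_answer are placeholder helpers for sampling the model and parsing its final answer:

from collections import Counter

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    # Sample n reasoning paths at non-zero temperature, then majority-vote
    # on the extracted final answers.
    answers = [extract_answer(generate(prompt, temperature=0.7))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]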

Tree of Thoughts

Explore multiple reasoning branches for complex problem-solving.

Structured Output

Constrain LLM output to specific formats (JSON, XML) for reliable parsing.

from pydantic import BaseModel
from openai import OpenAI

class MovieReview(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": review_text}],  # review_text: input string
    response_format=MovieReview,
)
review = response.choices[0].message.parsed  # a validated MovieReview instance

Fine-Tuning Patterns

Parameter-Efficient Fine-Tuning (PEFT)

Adapt large models with minimal trainable parameters.

  • LoRA (Low-Rank Adaptation): Add low-rank decomposition matrices to attention layers
  • QLoRA: Combine LoRA with quantization for memory efficiency
  • Prefix Tuning: Prepend learnable tokens to input
  • Adapter Layers: Insert small trainable modules between frozen layers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically a fraction of a percent

Instruction Tuning

Fine-tune on instruction-following datasets to improve task generalization.

RLHF (Reinforcement Learning from Human Feedback)

Align model outputs with human preferences using reward models.

DPO (Direct Preference Optimization)

Simpler alternative to RLHF that directly optimizes on preference data.
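
A sketch of the core DPO objective from Rafailov et al. (2023), assuming per-sequence log-probabilities have already been gathered; in practice a trainer library handles batching and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Maximize the margin between chosen and rejected completions, measured
    # relative to a frozen reference model (all inputs are torch tensors).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()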

LLM Serving Patterns

Speculative Decoding

Use a smaller draft model to propose tokens, verified by the larger model in parallel.

Continuous Batching

Dynamically batch requests to maximize GPU utilization.

KV-Cache Optimization

Efficiently manage key-value caches for transformer attention.

Quantization for Inference

Reduce model precision (INT8, INT4) for faster inference with minimal quality loss.
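
A hedged sketch of 4-bit weight quantization at load time using bitsandbytes via transformers; flag names follow recent releases and may shift:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)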

MLOps Patterns (2024-2025)

Feature Stores

Centralized platforms for feature management, serving, and discovery.

Key Capabilities
  • Online and offline serving
  • Point-in-time correctness for training
  • Feature versioning and lineage
  • Feature discovery and reuse
  • Real-time feature computation
Popular Implementations
  • Feast (open source)
  • Tecton
  • Databricks Feature Store
  • Amazon SageMaker Feature Store
  • Vertex AI Feature Store
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get training data with point-in-time join
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:age",
        "user_features:purchase_count_30d",
        "item_features:price",
    ]
).to_df()

# Get online features for serving
online_features = store.get_online_features(
    features=["user_features:purchase_count_30d"],
    entity_rows=[{"user_id": 123}]
).to_dict()

Model Registries

Centralized repositories for model versioning, staging, and deployment.

Key Capabilities
  • Model versioning and lineage
  • Stage transitions (staging, production, archived)
  • Model metadata and documentation
  • Approval workflows
  • Integration with CI/CD
Popular Implementations
  • MLflow Model Registry
  • Weights & Biases
  • Neptune
  • Amazon SageMaker Model Registry
  • Vertex AI Model Registry

A/B Testing and Experimentation

Traffic Splitting Patterns
  • Shadow Mode: New model receives mirrored traffic, but its responses are not served to users
  • Canary Deployment: Gradually shift traffic to the new model (a minimal sketch follows this list)
  • Multi-Armed Bandit: Dynamically allocate traffic based on observed performance
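
A minimal canary-routing sketch; production_model and candidate_model are placeholders for two already-loaded models:

import random

def route_to_model(request, canary_fraction=0.05):
    # Send a small slice of traffic to the candidate; ramp canary_fraction
    # up as monitoring confirms the new model is healthy.
    if random.random() < canary_fraction:
        return candidate_model.predict(request)
    return production_model.predict(request)
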
Experiment Tracking
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.sklearn.log_model(model, "model")

Model Monitoring

Data Drift Detection

Monitor input feature distributions for changes from training data.

Prediction Drift

Track changes in model output distributions.

Performance Degradation

Monitor business metrics and model accuracy over time.

from evidently.report import Report  # import paths vary across Evidently releases
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")

Foundation Model Patterns

Model Selection Pattern

Choose the right model based on task requirements:

  • General chat: GPT-4, Claude 3, Gemini
  • Code generation: Codex, StarCoder, CodeLlama
  • Embeddings: text-embedding-3, BGE, E5
  • Image generation: DALL-E 3, Stable Diffusion, Midjourney
  • Speech: Whisper, ElevenLabs
  • Multimodal: GPT-4V, Gemini Pro Vision, LLaVA

Model Routing Pattern

Route requests to different models based on complexity, cost, or latency requirements.

def route_request(query: str, complexity_score: float) -> str:
    # call_gpt4 / call_gpt35 / call_local_model are placeholder helpers;
    # route expensive models only to the queries that need them.
    if complexity_score > 0.8:
        return call_gpt4(query)
    elif complexity_score > 0.5:
        return call_gpt35(query)
    else:
        return call_local_model(query)

Caching Pattern

Cache LLM responses for identical or semantically similar queries.

Exact Match Caching
import hashlib
from redis import Redis

cache = Redis()

def cached_llm_call(prompt: str) -> str:
    # `llm` is a placeholder client; cached entries expire after one hour.
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
    response = llm.generate(prompt)
    cache.setex(cache_key, 3600, response)
    return response

Semantic Caching

Use embeddings to find similar past queries and return cached responses.
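
A minimal semantic-cache lookup, assuming unit-normalized embeddings so the dot product equals cosine similarity; the threshold is illustrative:

import numpy as np

def semantic_lookup(query_embedding, cache, threshold=0.9):
    # cache: list of (embedding, response) pairs with unit-normalized
    # embeddings. Return the closest cached response if it is similar enough.
    for cached_emb, response in cache:
        if float(np.dot(query_embedding, cached_emb)) >= threshold:
            return response
    return None  # cache miss: call the LLM and append to the cache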

Guardrails Pattern

Implement safety and quality controls around LLM inputs and outputs.

# Assumes Guardrails 0.4+ with validators installed from the Guardrails Hub;
# validator names and import paths vary by release.
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII

guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),
    DetectPII(on_fail="fix"),
)

validated_output = guard(
    llm.generate,   # `llm` is a placeholder client
    prompt=user_input,
)

Vector Databases and Embeddings

Embedding Patterns

Text Embeddings

Convert text to dense vectors for semantic similarity search.

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

Multimodal Embeddings

Embed images and text into a shared vector space (CLIP, ImageBind).
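
A short CLIP sketch with Hugging Face transformers; the checkpoint is the public openai/clip-vit-base-patch32 model and the image path is illustrative:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=["a photo of a cat"], images=Image.open("cat.jpg"),
    return_tensors="pt", padding=True,
)
outputs = model(**inputs)
# Text and image embeddings share one space, so cosine similarity between
# them is meaningful.
text_emb, image_emb = outputs.text_embeds, outputs.image_embeds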

Chunking Strategies
  • Fixed-size chunks: Simple but may break semantic units (a minimal sketch follows this list)
  • Semantic chunking: Split on natural boundaries (paragraphs, sections)
  • Recursive chunking: Hierarchical splitting with overlap
  • Late chunking: Embed the full document first, then derive chunk vectors from the token embeddings
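
As a sketch of the first strategy above, fixed-size chunking with overlap so that text cut at a boundary reappears intact in the neighboring chunk:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a fixed-size window over the text; consecutive chunks share
    # `overlap` characters. Sizes are illustrative and usually tuned.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks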

Vector Database Patterns

Index Types
  • HNSW (Hierarchical Navigable Small World): Fast approximate search (sketched after this list)
  • IVF (Inverted File Index): Cluster-based search
  • PQ (Product Quantization): Compressed vectors for memory efficiency
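
A small faiss sketch of the HNSW index referenced in the first item above; the dimensionality and data are random placeholders:

import faiss
import numpy as np

d = 384                              # embedding dimensionality (illustrative)
index = faiss.IndexHNSWFlat(d, 32)   # 32 neighbors per node in the HNSW graph
index.add(np.random.rand(10_000, d).astype("float32"))

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 10)  # approximate top-10 neighbors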

Hybrid Search

Combine dense vector search with sparse keyword search (BM25).

from qdrant_client import QdrantClient
from qdrant_client.models import NamedVector

client = QdrantClient("localhost", port=6333)

# Dense-vector search against a named vector; for true hybrid retrieval,
# fuse these scores with a sparse (e.g., BM25) ranking.
results = client.search(
    collection_name="documents",
    query_vector=NamedVector(
        name="dense",
        vector=dense_embedding,  # precomputed query embedding
    ),
    limit=10,
)
Popular Vector Databases
  • Pinecone: Managed, serverless, metadata filtering
  • Weaviate: GraphQL API, modules for vectorization
  • Qdrant: Rust-based, filtering, hybrid search
  • Milvus: Distributed, GPU acceleration
  • Chroma: Simple, Python-native, good for prototyping
  • pgvector: PostgreSQL extension, familiar SQL interface

Metadata and Filtering

Attach metadata to vectors for filtered search.

# Store with metadata (LangChain-style API; metadata can also live on
# Document objects passed to add_documents)
vectorstore.add_texts(
    texts=texts,
    metadatas=[
        {"source": "arxiv", "date": "2024-01", "topic": "llm"},
        {"source": "blog", "date": "2024-03", "topic": "mlops"},
    ],
)

# Search with a metadata filter (filter syntax varies by vector store)
results = vectorstore.similarity_search(
    query="RAG implementation",
    filter={"source": "arxiv", "date": {"$gte": "2024-01"}},
)

Emerging Patterns (Late 2024-2025)

Agentic Patterns

Tool Use / Function Calling

LLMs invoke external tools and APIs to accomplish tasks.
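
A minimal function-calling sketch with the OpenAI Chat Completions API; the get_weather tool is a hypothetical example:

from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model chose to call the tool, execute it and send the result back
# in a follow-up "tool" message.
tool_calls = response.choices[0].message.tool_calls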

Multi-Agent Systems

Multiple specialized agents collaborate on complex tasks.

Planning and Reflection

Agents create plans, execute steps, and reflect on results.

Long-Context Patterns

Context Window Management

Strategies for working with 100K+ token context windows.

Memory Systems

Implement short-term and long-term memory for conversations.

Evaluation Patterns

LLM-as-Judge

Use LLMs to evaluate outputs of other LLMs.
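
A sketch of a rubric-style judge; llm_call is a placeholder for any chat-completion helper, and the rubric is illustrative:

JUDGE_PROMPT = """Rate the answer below for factual accuracy on a 1-5 scale.
Question: {question}
Answer: {answer}
Respond with only the number."""

def judge(question: str, answer: str) -> int:
    # A stronger model typically scores a weaker one; parse the raw score.
    score = llm_call(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(score.strip())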

Automated Red-Teaming

Systematically test models for vulnerabilities and failures.

Implementation

Project Structure for Modern ML Systems

ml-project/
├── src/
│   ├── features/          # Feature engineering
│   ├── models/            # Model definitions
│   ├── pipelines/         # Training and inference pipelines
│   ├── serving/           # Serving infrastructure
│   └── evaluation/        # Evaluation and monitoring
├── configs/               # Hydra/YAML configurations
├── data/
│   ├── raw/
│   ├── processed/
│   └── features/
├── experiments/           # Experiment tracking
├── tests/
├── notebooks/
├── Dockerfile
├── pyproject.toml
└── dvc.yaml              # Data versioning

Recommended Stack (2024-2025)

  • Experiment Tracking: MLflow, Weights & Biases, Neptune
  • Pipeline Orchestration: Kubeflow, Airflow, Prefect, Dagster
  • Feature Store: Feast, Tecton
  • Model Serving: TensorFlow Serving, Triton, BentoML, vLLM
  • Vector Database: Qdrant, Pinecone, Weaviate, pgvector
  • LLM Framework: LangChain, LlamaIndex, Haystack
  • Evaluation: RAGAS, DeepEval, promptfoo
  • Monitoring: Evidently, Arize, WhyLabs

References

Original Book

  • Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps. O'Reilly Media.

Modern Resources

  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
  • Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
  • Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv:2305.18290

Notes

  • The original book patterns remain highly relevant as foundational concepts
  • LLM patterns are rapidly evolving; validate against current best practices
  • Consider cost/latency/quality tradeoffs when selecting patterns
  • Combine traditional ML patterns with LLM patterns for hybrid systems
  • Evaluation and monitoring are critical for production LLM systems

Author: Jason Walsh

j@wal.sh

Last Updated: 2026-01-10 17:13:42
