Machine Learning Design Patterns

Overview

Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson, and Michael Munn (O'Reilly, 2020) provides solutions to common challenges in ML engineering. The book organizes patterns into categories addressing data representation, problem framing, model training, resilient serving, reproducibility, and responsible AI.

This document summarizes the core patterns from the book and extends them with modern developments from 2024-2025, particularly around Large Language Models (LLMs), foundation models, and contemporary MLOps practices.

Background

The book emerged from the authors' experience at Google Cloud, distilling recurring solutions to ML problems into reusable patterns. Unlike traditional software design patterns, ML design patterns must address:

  • Data quality and representation challenges
  • Training/serving skew
  • Model reproducibility
  • Continuous model improvement
  • Fairness and explainability requirements

Key Concepts

Pattern Categories (Original Book)

Data Representation Patterns

Hashed Feature

Transform high-cardinality categorical variables into fixed-size representations using hashing. Useful when the full vocabulary is unknown at training time or too large to enumerate.

import hashlib

def hash_feature(value, num_buckets=1000):
    # Deterministic hash of the raw value, reduced modulo a fixed bucket
    # count; unseen categories map to some bucket instead of failing.
    hash_value = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hash_value % num_buckets

Embeddings

Learn dense, low-dimensional representations of sparse, high-dimensional data. Essential for text, user IDs, and categorical features.
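
A minimal PyTorch sketch: an embedding table maps integer IDs to trainable dense vectors (the vocabulary size and dimensionality here are illustrative):

import torch
import torch.nn as nn

# Map 10,000 sparse IDs (e.g., user IDs) into 16-dimensional dense vectors.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=16)

user_ids = torch.tensor([3, 17, 42])  # batch of integer IDs
dense = embedding(user_ids)           # shape: (3, 16), learned during training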

Feature Cross

Combine multiple features into a single synthetic feature so that even simple (e.g., linear) models can capture non-linear interactions.
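
As an illustration, a crossed feature can be built by concatenating the component values and reusing hash_feature from above (the separator and bucket count are illustrative):

def cross_feature(values, num_buckets=1000):
    # Concatenate the component features, then hash the crossed value into
    # a fixed vocabulary (hash_feature is defined in Hashed Feature above).
    crossed = "_x_".join(str(v) for v in values)
    return hash_feature(crossed, num_buckets)

# e.g., hour-of-day crossed with day-of-week captures commute-time effects
bucket = cross_feature(["9", "monday"])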

Multimodal Input

Handle inputs from different modalities (text, images, structured data) by learning separate representations and combining them.

Problem Representation Patterns

Reframing

Transform a problem type to leverage different ML approaches (e.g., regression to classification, or vice versa).

Multilabel

Handle cases where instances can belong to multiple classes simultaneously.

Ensemble

Combine multiple models to improve predictions through averaging, voting, or stacking.
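
A minimal scikit-learn sketch of a soft-voting ensemble; it assumes X_train and y_train are already defined:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predicted probabilities across heterogeneous models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)  # X_train / y_train assumed to exist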

Cascade

Chain models where each addresses a specific subtask or confidence level.

Neutral Class

Add an explicit "unknown" or "uncertain" class to handle ambiguous cases.

Model Training Patterns

Useful Overfitting

Intentionally overfit on small datasets or specific cases (e.g., lookup tables, memorization).

Checkpoints

Save model state periodically during training for recovery and analysis.

Transfer Learning

Leverage pre-trained models and adapt them to new tasks with less data.

Distribution Strategy

Parallelize training across multiple devices or machines.

Hyperparameter Tuning

Systematically search the hyperparameter space using grid search, random search, or Bayesian optimization.
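
A hedged scikit-learn sketch of random search; the estimator and search space are illustrative, and X_train / y_train are assumed to exist:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Sample 20 configurations from a log-uniform range for C.
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-3, 1e2), "kernel": ["rbf", "linear"]},
    n_iter=20,
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)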

Resilient Serving Patterns

Stateless Serving Function

Package models as stateless functions for scalable, reliable serving.
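
As a sketch, a stateless FastAPI endpoint; load_model is a hypothetical loader that runs once per worker process:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = load_model("model/")  # hypothetical loader; executed once at startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # No per-request state: the same input always yields the same output,
    # so the service can be replicated freely behind a load balancer.
    return {"prediction": model.predict([req.features]).tolist()}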

Batch Serving

Process predictions in bulk for offline workloads.

Continued Model Evaluation

Monitor model performance continuously in production.

Two-Phase Predictions

Split a use case into two phases: a simpler model handles the common case (often on-device), deferring to a larger model only for hard or low-confidence inputs.

Keyed Predictions

Track predictions through the system with unique identifiers.

Reproducibility Patterns

Transform

Encapsulate feature engineering logic for consistent application in training and serving.
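
A minimal scikit-learn sketch in which the transforms live inside the pipeline, so identical logic runs at training and serving time (the column names and data are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Persisting this single object captures both the transforms and the model,
# eliminating one common source of training/serving skew.
pipeline = Pipeline([
    ("transform", ColumnTransformer([
        ("scale", StandardScaler(), ["age", "income"]),
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_df, labels)  # train_df / labels assumed to exist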

Repeatable Splitting

Ensure deterministic train/validation/test splits.
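
A Python analogue of the book's FARM_FINGERPRINT approach, sketched here by keying the split on a stable identifier column rather than a random number:

import hashlib

def assign_split(key: str) -> str:
    # Hash a stable column (e.g., user_id) so the same row always lands in
    # the same split across reruns, even as new data arrives.
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"          # 80%
    return "valid" if bucket == 8 else "test"  # 10% / 10%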

Bridged Schema

Handle schema evolution between training data and serving inputs.

Windowed Inference

Process streaming data in windows for consistent feature computation.

Workflow Pipeline

Orchestrate ML workflows as directed acyclic graphs (DAGs).

Feature Store

Centralize feature computation and serving for consistency.

Model Versioning

Track model versions with their associated code, data, and hyperparameters.

Responsible AI Patterns

Heuristic Benchmark

Establish baseline performance using simple rules or heuristics.

Explainable Predictions

Provide interpretable outputs alongside predictions.

Fairness Lens

Evaluate model performance across different demographic groups.

Modern Updates (2024-2025)

LLM-Specific Patterns

Retrieval-Augmented Generation (RAG)

Combine LLMs with external knowledge retrieval to ground responses in factual, up-to-date information.

Architecture Components
  • Document ingestion and chunking pipeline
  • Embedding model for semantic encoding
  • Vector database for similarity search
  • Retrieval strategy (dense, sparse, or hybrid)
  • LLM for response generation with retrieved context
Implementation Pattern
# LangChain's package layout changed in 0.1+; these imports target the
# langchain-community / langchain-openai split.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Create vector store from documents (`chunks` comes from the ingestion
# and chunking pipeline above)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
Advanced RAG Variants
  • HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve retrieval
  • RAPTOR: Recursive abstractive processing for hierarchical retrieval
  • Corrective RAG: Verify and refine retrieved documents
  • Self-RAG: Model decides when retrieval is needed
  • Graph RAG: Combine knowledge graphs with vector retrieval

Prompt Engineering Patterns

Few-Shot Learning

Provide examples in the prompt to guide model behavior without fine-tuning.

Classify the sentiment of movie reviews.

Review: "The acting was superb and the plot kept me engaged."
Sentiment: Positive

Review: "Waste of two hours. Predictable and boring."
Sentiment: Negative

Review: "{{user_review}}"
Sentiment:

Chain-of-Thought (CoT)

Encourage step-by-step reasoning for complex tasks.

Q: If a train travels 120 miles in 2 hours, what is its speed?

Let me think step by step:
1. Speed = Distance / Time
2. Distance = 120 miles
3. Time = 2 hours
4. Speed = 120 / 2 = 60 miles per hour

A: 60 miles per hour

ReAct (Reasoning + Acting)

Interleave reasoning traces with actions (tool use, retrieval).

Self-Consistency

Sample multiple reasoning paths and select the most consistent answer.
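
A sketch of self-consistency, where generate and extract_answer are placeholder helpers for sampling the model and parsing its final answer:

from collections import Counter

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    # Sample n reasoning paths at non-zero temperature, then majority-vote
    # on the extracted final answers.
    answers = [extract_answer(generate(prompt, temperature=0.7))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]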

Tree of Thoughts

Explore multiple reasoning branches for complex problem-solving.

Structured Output

Constrain LLM output to specific formats (JSON, XML) for reliable parsing.

from pydantic import BaseModel
from openai import OpenAI

class MovieReview(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": review_text}],  # review_text: input string
    response_format=MovieReview,
)
review = response.choices[0].message.parsed  # a validated MovieReview instance

Fine-Tuning Patterns

Parameter-Efficient Fine-Tuning (PEFT)

Adapt large models with minimal trainable parameters.

  • LoRA (Low-Rank Adaptation): Add low-rank decomposition matrices to attention layers
  • QLoRA: Combine LoRA with quantization for memory efficiency
  • Prefix Tuning: Prepend learnable tokens to input
  • Adapter Layers: Insert small trainable modules between frozen layers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically a fraction of a percent

Instruction Tuning

Fine-tune on instruction-following datasets to improve task generalization.

RLHF (Reinforcement Learning from Human Feedback)

Align model outputs with human preferences using reward models.

DPO (Direct Preference Optimization)

Simpler alternative to RLHF that directly optimizes on preference data.
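
A sketch of the core DPO objective from Rafailov et al. (2023), assuming per-sequence log-probabilities have already been gathered; in practice a trainer library handles batching and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Maximize the margin between chosen and rejected completions, measured
    # relative to a frozen reference model (all inputs are torch tensors).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()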

LLM Serving Patterns

Speculative Decoding

Use a smaller draft model to propose tokens, verified by the larger model in parallel.

Continuous Batching

Dynamically batch requests to maximize GPU utilization.

KV-Cache Optimization

Efficiently manage key-value caches for transformer attention.

Quantization for Inference

Reduce model precision (INT8, INT4) for faster inference with minimal quality loss.
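
A hedged sketch of 4-bit weight quantization at load time using bitsandbytes via transformers; flag names follow recent releases and may shift:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)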

MLOps Patterns (2024-2025)

Feature Stores

Centralized platforms for feature management, serving, and discovery.

Key Capabilities
  • Online and offline serving
  • Point-in-time correctness for training
  • Feature versioning and lineage
  • Feature discovery and reuse
  • Real-time feature computation
Popular Implementations
  • Feast (open source)
  • Tecton
  • Databricks Feature Store
  • Amazon SageMaker Feature Store
  • Vertex AI Feature Store
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get training data with point-in-time join
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:age",
        "user_features:purchase_count_30d",
        "item_features:price",
    ]
).to_df()

# Get online features for serving
online_features = store.get_online_features(
    features=["user_features:purchase_count_30d"],
    entity_rows=[{"user_id": 123}]
).to_dict()

Model Registries

Centralized repositories for model versioning, staging, and deployment.

Key Capabilities
  • Model versioning and lineage
  • Stage transitions (staging, production, archived)
  • Model metadata and documentation
  • Approval workflows
  • Integration with CI/CD
Popular Implementations
  • MLflow Model Registry
  • Weights & Biases
  • Neptune
  • Amazon SageMaker Model Registry
  • Vertex AI Model Registry

A/B Testing and Experimentation

Traffic Splitting Patterns
  • Shadow Mode: New model receives mirrored traffic, but its responses are not served to users
  • Canary Deployment: Gradually shift traffic to the new model (a minimal sketch follows this list)
  • Multi-Armed Bandit: Dynamically allocate traffic based on observed performance
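
A minimal canary-routing sketch; production_model and candidate_model are placeholders for two already-loaded models:

import random

def route_to_model(request, canary_fraction=0.05):
    # Send a small slice of traffic to the candidate; ramp canary_fraction
    # up as monitoring confirms the new model is healthy.
    if random.random() < canary_fraction:
        return candidate_model.predict(request)
    return production_model.predict(request)
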
Experiment Tracking
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.sklearn.log_model(model, "model")

Model Monitoring

Data Drift Detection

Monitor input feature distributions for changes from training data.

Prediction Drift

Track changes in model output distributions.

Performance Degradation

Monitor business metrics and model accuracy over time.

from evidently.report import Report  # import paths vary across Evidently releases
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")

Foundation Model Patterns

Model Selection Pattern

Choose the right model based on task requirements:

  • General chat: GPT-4, Claude 3, Gemini
  • Code generation: Codex, StarCoder, CodeLlama
  • Embeddings: text-embedding-3, BGE, E5
  • Image generation: DALL-E 3, Stable Diffusion, Midjourney
  • Speech: Whisper, ElevenLabs
  • Multimodal: GPT-4V, Gemini Pro Vision, LLaVA

Model Routing Pattern

Route requests to different models based on complexity, cost, or latency requirements.

def route_request(query: str, complexity_score: float) -> str:
    # call_gpt4 / call_gpt35 / call_local_model are placeholder helpers;
    # route expensive models only to the queries that need them.
    if complexity_score > 0.8:
        return call_gpt4(query)
    elif complexity_score > 0.5:
        return call_gpt35(query)
    else:
        return call_local_model(query)

Caching Pattern

Cache LLM responses for identical or semantically similar queries.

Exact Match Caching
import hashlib
from redis import Redis

cache = Redis()

def cached_llm_call(prompt: str) -> str:
    # `llm` is a placeholder client; cached entries expire after one hour.
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
    response = llm.generate(prompt)
    cache.setex(cache_key, 3600, response)
    return response

Semantic Caching

Use embeddings to find similar past queries and return cached responses.
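
A minimal semantic-cache lookup, assuming unit-normalized embeddings so the dot product equals cosine similarity; the threshold is illustrative:

import numpy as np

def semantic_lookup(query_embedding, cache, threshold=0.9):
    # cache: list of (embedding, response) pairs with unit-normalized
    # embeddings. Return the closest cached response if it is similar enough.
    for cached_emb, response in cache:
        if float(np.dot(query_embedding, cached_emb)) >= threshold:
            return response
    return None  # cache miss: call the LLM and append to the cache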

Guardrails Pattern

Implement safety and quality controls around LLM inputs and outputs.

# Assumes Guardrails 0.4+ with validators installed from the Guardrails Hub;
# validator names and import paths vary by release.
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII

guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),
    DetectPII(on_fail="fix"),
)

validated_output = guard(
    llm.generate,   # `llm` is a placeholder client
    prompt=user_input,
)

Vector Databases and Embeddings

Embedding Patterns

Text Embeddings

Convert text to dense vectors for semantic similarity search.

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

Multimodal Embeddings

Embed images and text into a shared vector space (CLIP, ImageBind).
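
A short CLIP sketch with Hugging Face transformers; the checkpoint is the public openai/clip-vit-base-patch32 model and the image path is illustrative:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=["a photo of a cat"], images=Image.open("cat.jpg"),
    return_tensors="pt", padding=True,
)
outputs = model(**inputs)
# Text and image embeddings share one space, so cosine similarity between
# them is meaningful.
text_emb, image_emb = outputs.text_embeds, outputs.image_embeds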

Chunking Strategies
  • Fixed-size chunks: Simple but may break semantic units (a minimal sketch follows this list)
  • Semantic chunking: Split on natural boundaries (paragraphs, sections)
  • Recursive chunking: Hierarchical splitting with overlap
  • Late chunking: Embed the full document first, then derive chunk vectors from the token embeddings
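
As a sketch of the first strategy above, fixed-size chunking with overlap so that text cut at a boundary reappears intact in the neighboring chunk:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a fixed-size window over the text; consecutive chunks share
    # `overlap` characters. Sizes are illustrative and usually tuned.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks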

Vector Database Patterns

Index Types
  • HNSW (Hierarchical Navigable Small World): Fast approximate search (sketched after this list)
  • IVF (Inverted File Index): Cluster-based search
  • PQ (Product Quantization): Compressed vectors for memory efficiency
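
A small faiss sketch of the HNSW index referenced in the first item above; the dimensionality and data are random placeholders:

import faiss
import numpy as np

d = 384                              # embedding dimensionality (illustrative)
index = faiss.IndexHNSWFlat(d, 32)   # 32 neighbors per node in the HNSW graph
index.add(np.random.rand(10_000, d).astype("float32"))

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 10)  # approximate top-10 neighbors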

Hybrid Search

Combine dense vector search with sparse keyword search (BM25).

from qdrant_client import QdrantClient
from qdrant_client.models import NamedVector

client = QdrantClient("localhost", port=6333)

# Dense-vector search against a named vector; for true hybrid retrieval,
# fuse these scores with a sparse (e.g., BM25) ranking.
results = client.search(
    collection_name="documents",
    query_vector=NamedVector(
        name="dense",
        vector=dense_embedding,  # precomputed query embedding
    ),
    limit=10,
)
Popular Vector Databases
  • Pinecone: Managed, serverless, metadata filtering
  • Weaviate: GraphQL API, modules for vectorization
  • Qdrant: Rust-based, filtering, hybrid search
  • Milvus: Distributed, GPU acceleration
  • Chroma: Simple, Python-native, good for prototyping
  • pgvector: PostgreSQL extension, familiar SQL interface

Metadata and Filtering

Attach metadata to vectors for filtered search.

# Store with metadata (LangChain-style API; metadata can also live on
# Document objects passed to add_documents)
vectorstore.add_texts(
    texts=texts,
    metadatas=[
        {"source": "arxiv", "date": "2024-01", "topic": "llm"},
        {"source": "blog", "date": "2024-03", "topic": "mlops"},
    ],
)

# Search with a metadata filter (filter syntax varies by vector store)
results = vectorstore.similarity_search(
    query="RAG implementation",
    filter={"source": "arxiv", "date": {"$gte": "2024-01"}},
)

Emerging Patterns (Late 2024-2025)

Agentic Patterns

Tool Use / Function Calling

LLMs invoke external tools and APIs to accomplish tasks.
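
A minimal function-calling sketch with the OpenAI Chat Completions API; the get_weather tool is a hypothetical example:

from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model chose to call the tool, execute it and send the result back
# in a follow-up "tool" message.
tool_calls = response.choices[0].message.tool_calls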

Multi-Agent Systems

Multiple specialized agents collaborate on complex tasks.

Planning and Reflection

Agents create plans, execute steps, and reflect on results.

Long-Context Patterns

Context Window Management

Strategies for working with 100K+ token context windows.

Memory Systems

Implement short-term and long-term memory for conversations.

Evaluation Patterns

LLM-as-Judge

Use LLMs to evaluate outputs of other LLMs.
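
A sketch of a rubric-style judge; llm_call is a placeholder for any chat-completion helper, and the rubric is illustrative:

JUDGE_PROMPT = """Rate the answer below for factual accuracy on a 1-5 scale.
Question: {question}
Answer: {answer}
Respond with only the number."""

def judge(question: str, answer: str) -> int:
    # A stronger model typically scores a weaker one; parse the raw score.
    score = llm_call(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(score.strip())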

Automated Red-Teaming

Systematically test models for vulnerabilities and failures.

Implementation

Project Structure for Modern ML Systems

ml-project/
├── src/
│   ├── features/          # Feature engineering
│   ├── models/            # Model definitions
│   ├── pipelines/         # Training and inference pipelines
│   ├── serving/           # Serving infrastructure
│   └── evaluation/        # Evaluation and monitoring
├── configs/               # Hydra/YAML configurations
├── data/
│   ├── raw/
│   ├── processed/
│   └── features/
├── experiments/           # Experiment tracking
├── tests/
├── notebooks/
├── Dockerfile
├── pyproject.toml
└── dvc.yaml              # Data versioning

Recommended Stack (2024-2025)

  • Experiment Tracking: MLflow, Weights & Biases, Neptune
  • Pipeline Orchestration: Kubeflow, Airflow, Prefect, Dagster
  • Feature Store: Feast, Tecton
  • Model Serving: TensorFlow Serving, Triton, BentoML, vLLM
  • Vector Database: Qdrant, Pinecone, Weaviate, pgvector
  • LLM Framework: LangChain, LlamaIndex, Haystack
  • Evaluation: RAGAS, DeepEval, promptfoo
  • Monitoring: Evidently, Arize, WhyLabs

References

Original Book

  • Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps. O'Reilly Media.

Modern Resources

  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
  • Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
  • Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv:2305.18290

Notes

  • The original book patterns remain highly relevant as foundational concepts
  • LLM patterns are rapidly evolving; validate against current best practices
  • Consider cost/latency/quality tradeoffs when selecting patterns
  • Combine traditional ML patterns with LLM patterns for hybrid systems
  • Evaluation and monitoring are critical for production LLM systems

Author: Jason Walsh

j@wal.sh

Last Updated: 2026-01-10 17:13:42
