Machine Learning Design Patterns
Table of Contents
- Overview
- Background
- Key Concepts
- Modern Updates (2024-2025)
- Implementation
- References
- Notes
Overview
Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson, and Michael Munn (O'Reilly, 2020) provides solutions to common challenges in ML engineering. The book organizes patterns into categories addressing data representation, problem framing, model training, resilient serving, reproducibility, and responsible AI.
This document summarizes the core patterns from the book and extends them with modern developments from 2024-2025, particularly around Large Language Models (LLMs), foundation models, and contemporary MLOps practices.
Background
The book emerged from the authors' experience at Google Cloud, distilling recurring solutions to ML problems into reusable patterns. Unlike traditional software design patterns, ML design patterns must address:
- Data quality and representation challenges
- Training/serving skew
- Model reproducibility
- Continuous model improvement
- Fairness and explainability requirements
Key Concepts
Pattern Categories (Original Book)
Data Representation Patterns
Hashed Feature
Transform high-cardinality categorical variables into fixed-size representations using hashing. Useful when vocabulary is incomplete or too large.
```python
import hashlib

def hash_feature(value: str, num_buckets: int = 1000) -> int:
    # Deterministically map an arbitrary string into a fixed number of buckets;
    # collisions are accepted in exchange for a bounded representation
    hash_value = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hash_value % num_buckets
```
Embeddings
Learn dense, low-dimensional representations of sparse, high-dimensional data. Essential for text, user IDs, and categorical features.
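A minimal sketch, assuming PyTorch (vocabulary size and dimension are illustrative): an embedding layer maps sparse integer IDs to dense, trainable vectors.
```python
import torch
import torch.nn as nn

# Map up to 10,000 user IDs into 32-dimensional dense vectors
# (sizes are illustrative)
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=32)

user_ids = torch.tensor([7, 42, 9001])
dense = embedding(user_ids)  # shape (3, 32); trained jointly with the model
```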
Feature Cross
Combine multiple features into a synthetic cross feature so that simple (e.g., linear) models can capture non-linear interactions.
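A hedged sketch in pandas with hypothetical column names; crossed values are often hashed afterward (as in the Hashed Feature pattern) to bound cardinality.
```python
import pandas as pd

# Hypothetical categorical features
df = pd.DataFrame({
    "day_of_week": ["Mon", "Sat", "Sat"],
    "hour_bucket": ["morning", "evening", "morning"],
})

# The cross lets a linear model learn e.g. a weekend-evening effect
# that neither feature captures on its own
df["day_x_hour"] = df["day_of_week"] + "_" + df["hour_bucket"]
```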
Multimodal Input
Handle inputs from different modalities (text, images, structured data) by learning separate representations and combining them.
Problem Representation Patterns
Reframing
Transform a problem type to leverage different ML approaches (e.g., regression to classification, or vice versa).
Multilabel
Handle cases where instances can belong to multiple classes simultaneously.
Ensemble
Combine multiple models to improve predictions through averaging, voting, or stacking.
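For instance, a minimal soft-voting ensemble in scikit-learn (`X_train`/`y_train` are assumed placeholders):
```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predicted class probabilities across heterogeneous models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train)
# ensemble.predict_proba(X_test)
```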
Cascade
Chain models where each addresses a specific subtask or confidence level.
Neutral Class
Add an explicit "unknown" or "uncertain" class to handle ambiguous cases.
Model Training Patterns
Useful Overfitting
Intentionally overfit on small datasets or specific cases (e.g., lookup tables, memorization).
Checkpoints
Save model state periodically during training for recovery and analysis.
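A minimal sketch assuming Keras (model and datasets are placeholders): keep the best weights after each epoch so training can resume after a failure.
```python
import tensorflow as tf

# Keep the best model seen so far, judged by validation loss
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/model-{epoch:02d}.keras",
    monitor="val_loss",
    save_best_only=True,
)
# model.fit(train_ds, validation_data=val_ds, epochs=20,
#           callbacks=[checkpoint_cb])
```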
Transfer Learning
Leverage pre-trained models and adapt them to new tasks with less data.
Distribution Strategy
Parallelize training across multiple devices or machines.
Hyperparameter Tuning
Systematically search the hyperparameter space using grid search, random search, or Bayesian optimization.
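As an illustration, random search with scikit-learn (estimator and parameter ranges are illustrative):
```python
from scipy.stats import loguniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Sample 20 configurations; the learning rate is drawn log-uniformly
search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "learning_rate": loguniform(1e-3, 1e-1),
        "n_estimators": [100, 200, 400],
        "max_depth": [2, 3, 4],
    },
    n_iter=20,
    cv=3,
)
# search.fit(X_train, y_train); search.best_params_
```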
Resilient Serving Patterns
Stateless Serving Function
Package models as stateless functions for scalable, reliable serving.
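A minimal sketch with FastAPI; `load_model` is a hypothetical loader, and any framework exposing a `predict` method would do.
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = load_model("models/v3")  # hypothetical; loaded once at startup

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # No per-request state is kept, so any replica can serve any request
    # and the service scales horizontally behind a load balancer
    return {"prediction": model.predict([req.features]).tolist()}
```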
Batch Serving
Process predictions in bulk for offline workloads.
Continued Model Evaluation
Monitor model performance continuously in production.
Two-Phase Predictions
Split a prediction task into two phases, e.g., a simpler on-device model that decides when to invoke a more complex cloud model, so the expensive model runs only when needed.
Keyed Predictions
Track predictions through the system with unique identifiers.
Reproducibility Patterns
Transform
Encapsulate feature engineering logic for consistent application in training and serving.
Repeatable Splitting
Ensure deterministic train/validation/test splits.
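The book demonstrates this with BigQuery's FARM_FINGERPRINT; a rough Python analog hashes a stable key (a date or a customer ID, never a random number) into split buckets:
```python
import hashlib

def split_bucket(key: str) -> str:
    """Deterministic 80/10/10 split: the same key always lands in the same split."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "valid"
    return "test"
```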
Bridged Schema
Handle schema evolution between training data and serving inputs.
Windowed Inference
Process streaming data in windows for consistent feature computation.
Workflow Pipeline
Orchestrate ML workflows as directed acyclic graphs (DAGs).
Feature Store
Centralize feature computation and serving for consistency.
Model Versioning
Track model versions with their associated code, data, and hyperparameters.
Responsible AI Patterns
Heuristic Benchmark
Establish baseline performance using simple rules or heuristics.
Explainable Predictions
Provide interpretable outputs alongside predictions.
Fairness Lens
Evaluate model performance across different demographic groups.
Modern Updates (2024-2025)
LLM-Specific Patterns
Retrieval-Augmented Generation (RAG)
Combine LLMs with external knowledge retrieval to ground responses in factual, up-to-date information.
Architecture Components
- Document ingestion and chunking pipeline
- Embedding model for semantic encoding
- Vector database for similarity search
- Retrieval strategy (dense, sparse, or hybrid)
- LLM for response generation with retrieved context
Implementation Pattern
```python
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Create vector store from documents
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
```
Advanced RAG Variants
- HyDE (Hypothetical Document Embeddings): Generate hypothetical answers to improve retrieval
- RAPTOR: Recursive abstractive processing for hierarchical retrieval
- Corrective RAG: Verify and refine retrieved documents
- Self-RAG: Model decides when retrieval is needed
- Graph RAG: Combine knowledge graphs with vector retrieval
Prompt Engineering Patterns
Few-Shot Learning
Provide examples in the prompt to guide model behavior without fine-tuning.
```
Classify the sentiment of movie reviews.

Review: "The acting was superb and the plot kept me engaged."
Sentiment: Positive

Review: "Waste of two hours. Predictable and boring."
Sentiment: Negative

Review: "{{user_review}}"
Sentiment:
```
Chain-of-Thought (CoT)
Encourage step-by-step reasoning for complex tasks.
```
Q: If a train travels 120 miles in 2 hours, what is its speed?

Let me think step by step:
1. Speed = Distance / Time
2. Distance = 120 miles
3. Time = 2 hours
4. Speed = 120 / 2 = 60 miles per hour

A: 60 miles per hour
```
ReAct (Reasoning + Acting)
Interleave reasoning traces with actions (tool use, retrieval).
Self-Consistency
Sample multiple reasoning paths and select the most consistent answer.
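A minimal sketch, assuming hypothetical helpers `sample_llm` (calls the model with temperature > 0) and `extract_final_answer` (parses the final answer line):
```python
from collections import Counter

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    # Sample several chain-of-thought completions, keep only the final
    # answers, and return the one that appears most often
    answers = [
        extract_final_answer(sample_llm(prompt, temperature=0.8))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```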
Tree of Thoughts
Explore multiple reasoning branches for complex problem-solving.
Structured Output
Constrain LLM output to specific formats (JSON, XML) for reliable parsing.
```python
from pydantic import BaseModel
from openai import OpenAI

class MovieReview(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": review_text}],
    response_format=MovieReview
)
```
Fine-Tuning Patterns
Parameter-Efficient Fine-Tuning (PEFT)
Adapt large models with minimal trainable parameters.
- LoRA (Low-Rank Adaptation): Add low-rank decomposition matrices to attention layers
- QLoRA: Combine LoRA with quantization for memory efficiency
- Prefix Tuning: Prepend learnable tokens to input
- Adapter Layers: Insert small trainable modules between frozen layers
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)

peft_model = get_peft_model(model, lora_config)
# Only ~0.1% of parameters are trainable
```
Instruction Tuning
Fine-tune on instruction-following datasets to improve task generalization.
RLHF (Reinforcement Learning from Human Feedback)
Align model outputs with human preferences using reward models.
DPO (Direct Preference Optimization)
Simpler alternative to RLHF that directly optimizes on preference data.
LLM Serving Patterns
Speculative Decoding
Use a smaller draft model to propose tokens, verified by the larger model in parallel.
Continuous Batching
Dynamically batch requests to maximize GPU utilization.
KV-Cache Optimization
Efficiently manage key-value caches for transformer attention.
Quantization for Inference
Reduce model precision (INT8, INT4) for faster inference with minimal quality loss.
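For example, loading a model in 4-bit with Hugging Face transformers and bitsandbytes (the model ID is illustrative):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are stored in 4-bit; compute runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=bnb_config,
)
```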
MLOps Patterns (2024-2025)
Feature Stores
Centralized platforms for feature management, serving, and discovery.
Key Capabilities
- Online and offline serving
- Point-in-time correctness for training
- Feature versioning and lineage
- Feature discovery and reuse
- Real-time feature computation
Popular Implementations
- Feast (open source)
- Tecton
- Databricks Feature Store
- Amazon SageMaker Feature Store
- Vertex AI Feature Store
```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get training data with point-in-time join
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:age",
        "user_features:purchase_count_30d",
        "item_features:price",
    ],
).to_df()

# Get online features for serving
online_features = store.get_online_features(
    features=["user_features:purchase_count_30d"],
    entity_rows=[{"user_id": 123}],
).to_dict()
```
Model Registries
Centralized repositories for model versioning, staging, and deployment.
Key Capabilities
- Model versioning and lineage
- Stage transitions (staging, production, archived)
- Model metadata and documentation
- Approval workflows
- Integration with CI/CD
Popular Implementations
- MLflow Model Registry
- Weights & Biases
- Neptune
- Amazon SageMaker Model Registry
- Vertex AI Model Registry
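As a brief sketch with MLflow (the run ID placeholder and model name are illustrative):
```python
import mlflow
from mlflow import MlflowClient

# Register a logged model; each registration creates a new version
result = mlflow.register_model("runs:/<run_id>/model", "churn-classifier")

# Attach documentation to the new version
MlflowClient().update_model_version(
    name="churn-classifier",
    version=result.version,
    description="GBM retrained on 2024 data",
)
```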
A/B Testing and Experimentation
Traffic Splitting Patterns
- Shadow Mode: New model receives a copy of live traffic, but its responses are logged rather than served to users
- Canary Deployment: Gradually increase traffic to new model
- Multi-Armed Bandit: Dynamically allocate traffic based on performance
Experiment Tracking
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.sklearn.log_model(model, "model")
```
Model Monitoring
Data Drift Detection
Monitor input feature distributions for changes from training data.
Prediction Drift
Track changes in model output distributions.
Performance Degradation
Monitor business metrics and model accuracy over time.
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare live feature distributions against the training reference
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")
```
Foundation Model Patterns
Model Selection Pattern
Choose the right model based on task requirements:
| Task Type | Recommended Approach |
|---|---|
| General chat | GPT-4, Claude 3, Gemini |
| Code generation | Codex, StarCoder, CodeLlama |
| Embeddings | text-embedding-3, BGE, E5 |
| Image generation | DALL-E 3, Stable Diffusion, Midjourney |
| Speech | Whisper, ElevenLabs |
| Multimodal | GPT-4V, Gemini Pro Vision, LLaVA |
Model Routing Pattern
Route requests to different models based on complexity, cost, or latency requirements.
```python
def route_request(query: str, complexity_score: float) -> str:
    # Send hard queries to the strongest (most expensive) model
    # and easy ones to a cheap local model
    if complexity_score > 0.8:
        return call_gpt4(query)
    elif complexity_score > 0.5:
        return call_gpt35(query)
    else:
        return call_local_model(query)
```
Caching Pattern
Cache LLM responses for identical or semantically similar queries.
Exact Match Caching
```python
import hashlib
from redis import Redis

cache = Redis()

def cached_llm_call(prompt: str) -> str:
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
    response = llm.generate(prompt)
    cache.setex(cache_key, 3600, response)  # expire after one hour
    return response
```
Semantic Caching
Use embeddings to find similar past queries and return cached responses.
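A minimal in-memory sketch, assuming a hypothetical `embed()` that returns unit-normalized vectors; production systems would back this with a vector database.
```python
import numpy as np

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str) -> str | None:
        q = embed(query)  # hypothetical: unit-normalized embedding
        for vec, response in self.entries:
            # Dot product equals cosine similarity for unit vectors
            if float(np.dot(q, vec)) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```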
Guardrails Pattern
Implement safety and quality controls around LLM inputs and outputs.
```python
from guardrails import Guard
from guardrails.validators import ToxicLanguage, PIIFilter

guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),
    PIIFilter(on_fail="fix")
)

validated_output = guard(
    llm.generate,
    prompt=user_input
)
```
Vector Databases and Embeddings
Embedding Patterns
Text Embeddings
Convert text to dense vectors for semantic similarity search.
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding
```
Multimodal Embeddings
Embed images and text into a shared vector space (CLIP, ImageBind).
Chunking Strategies
- Fixed-size chunks: Simple but may break semantic units (see the sketch after this list)
- Semantic chunking: Split on natural boundaries (paragraphs, sections)
- Recursive chunking: Hierarchical splitting with overlap
- Late chunking: Embed the full document first, then pool token embeddings into chunks so each chunk retains document-level context
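A sketch of the simplest strategy, fixed-size character chunks with overlap so context spans chunk boundaries:
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping windows reduce the chance of splitting a sentence or
    # fact across two chunks that share no context
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```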
Vector Database Patterns
Index Types
- HNSW (Hierarchical Navigable Small World): Fast approximate search
- IVF (Inverted File Index): Cluster-based search
- PQ (Product Quantization): Compressed vectors for memory efficiency
Hybrid Search
Combine dense vector search with sparse keyword search (BM25).
```python
from qdrant_client import QdrantClient
from qdrant_client.models import NamedVector

client = QdrantClient("localhost", port=6333)

# Dense search against a named vector; for true hybrid search, combine
# this with a sparse (BM25-style) query and fuse the two result lists
results = client.search(
    collection_name="documents",
    query_vector=NamedVector(
        name="dense",
        vector=dense_embedding
    ),
    query_filter=None,
    limit=10
)
```
Popular Vector Databases
| Database | Key Features |
|---|---|
| Pinecone | Managed, serverless, metadata filtering |
| Weaviate | GraphQL API, modules for vectorization |
| Qdrant | Rust-based, filtering, hybrid search |
| Milvus | Distributed, GPU acceleration |
| Chroma | Simple, Python-native, good for prototyping |
| pgvector | PostgreSQL extension, familiar SQL interface |
Metadata and Filtering
Attach metadata to vectors for filtered search.
```python
# Store with metadata
vectorstore.add_documents(
    documents=docs,
    metadatas=[
        {"source": "arxiv", "date": "2024-01", "topic": "llm"},
        {"source": "blog", "date": "2024-03", "topic": "mlops"}
    ]
)

# Search with filter
results = vectorstore.similarity_search(
    query="RAG implementation",
    filter={"source": "arxiv", "date": {"$gte": "2024-01"}}
)
```
Emerging Patterns (Late 2024-2025)
Agentic Patterns
Tool Use / Function Calling
LLMs invoke external tools and APIs to accomplish tasks.
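A minimal sketch with the OpenAI chat completions API; `get_weather` is a hypothetical tool the application would implement.
```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# The model returns a structured tool call instead of free text;
# the application executes it and feeds the result back
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"city": "Oslo"}
```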
Multi-Agent Systems
Multiple specialized agents collaborate on complex tasks.
Planning and Reflection
Agents create plans, execute steps, and reflect on results.
Long-Context Patterns
Context Window Management
Strategies for working with 100K+ token context windows.
Memory Systems
Implement short-term and long-term memory for conversations.
Evaluation Patterns
LLM-as-Judge
Use LLMs to evaluate outputs of other LLMs.
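A hedged sketch, assuming a hypothetical `call_llm` wrapper around any chat-completion API (temperature 0 keeps grades stable):
```python
JUDGE_PROMPT = """Rate the following answer for factual accuracy on a 1-5 scale.
Respond with only the integer.

Question: {question}
Answer: {answer}
"""

def judge(question: str, answer: str) -> int:
    reply = call_llm(
        JUDGE_PROMPT.format(question=question, answer=answer),
        temperature=0,
    )
    return int(reply.strip())
```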
Automated Red-Teaming
Systematically test models for vulnerabilities and failures.
Implementation
Project Structure for Modern ML Systems
```
ml-project/
├── src/
│   ├── features/        # Feature engineering
│   ├── models/          # Model definitions
│   ├── pipelines/       # Training and inference pipelines
│   ├── serving/         # Serving infrastructure
│   └── evaluation/      # Evaluation and monitoring
├── configs/             # Hydra/YAML configurations
├── data/
│   ├── raw/
│   ├── processed/
│   └── features/
├── experiments/         # Experiment tracking
├── tests/
├── notebooks/
├── Dockerfile
├── pyproject.toml
└── dvc.yaml             # Data versioning
```
Recommended Stack (2024-2025)
| Category | Tools |
|---|---|
| Experiment Tracking | MLflow, Weights & Biases, Neptune |
| Pipeline Orchestration | Kubeflow, Airflow, Prefect, Dagster |
| Feature Store | Feast, Tecton |
| Model Serving | TensorFlow Serving, Triton, BentoML, vLLM |
| Vector Database | Qdrant, Pinecone, Weaviate, pgvector |
| LLM Framework | LangChain, LlamaIndex, Haystack |
| Evaluation | RAGAS, DeepEval, promptfoo |
| Monitoring | Evidently, Arize, WhyLabs |
References
Original Book
- Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine Learning Design Patterns. O'Reilly Media.
- GitHub repository: https://github.com/GoogleCloudPlatform/ml-design-patterns
Modern Resources
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
- Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
- Rafailov, R., et al. (2023). Direct Preference Optimization. arXiv:2305.18290
Tools and Frameworks
- LangChain: https://langchain.com
- LlamaIndex: https://llamaindex.ai
- Feast: https://feast.dev
- MLflow: https://mlflow.org
- vLLM: https://vllm.ai
Notes
- The original book patterns remain highly relevant as foundational concepts
- LLM patterns are rapidly evolving; validate against current best practices
- Consider cost/latency/quality tradeoffs when selecting patterns
- Combine traditional ML patterns with LLM patterns for hybrid systems
- Evaluation and monitoring are critical for production LLM systems