Kubeflow ML Pipelines
Overview
Kubeflow is an open-source machine learning platform designed to make deployments of ML workflows on Kubernetes simple, portable, and scalable. Originally developed at Google and released in 2017, Kubeflow enables data scientists and ML engineers to build, train, and deploy machine learning models using Kubernetes as the orchestration layer, abstracting away infrastructure complexity while providing enterprise-grade scalability.
The project's name combines "Kube" from Kubernetes and "Flow" from TensorFlow, reflecting its origins as a way to run TensorFlow jobs on Kubernetes. Today, Kubeflow has evolved into a comprehensive MLOps platform supporting multiple ML frameworks including PyTorch, scikit-learn, XGBoost, and more.
Background
History and Evolution
- 2017: Kubeflow originated as an internal Google project to simplify running TensorFlow on Kubernetes
- 2018: Released as open-source under the Apache 2.0 license
- 2020: Kubeflow 1.0 released with stable APIs and production-ready core components
- 2021: KFServing rebranded as KServe and spun out as a standalone project
- 2022-2024: Continued maturation with improved multi-tenancy and security features
Design Philosophy
Kubeflow embraces several key principles:
- Composability: Each component can be used independently or together
- Portability: Deploy the same pipelines across any Kubernetes cluster (on-premises, cloud, hybrid)
- Scalability: Leverage Kubernetes autoscaling for training and serving workloads
- Reproducibility: Version control for data, code, and model artifacts
Key Concepts
Core Components
Kubeflow Pipelines
The heart of Kubeflow for MLOps workflows: Pipelines lets you author, schedule, and monitor multi-step ML workflows as directed acyclic graphs (DAGs).
```python
from kfp import dsl
from kfp import compiler

@dsl.component(packages_to_install=["pandas"])
def preprocess_data(input_path: str, output_path: str) -> str:
    """Preprocess raw data for training."""
    import pandas as pd
    df = pd.read_csv(input_path)
    # Preprocessing logic goes here.
    df.to_csv(output_path, index=False)
    return output_path

@dsl.component(base_image="tensorflow/tensorflow:2.13.0")
def train_model(data_path: str, model_path: str, epochs: int = 10):
    """Train an ML model on preprocessed data."""
    import tensorflow as tf
    # Training logic elided; placeholder model, replace with your architecture and fit() call.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.save(model_path)

@dsl.pipeline(name="ml-training-pipeline")
def training_pipeline(input_data: str, output_model: str):
    preprocess_task = preprocess_data(
        input_path=input_data, output_path="/tmp/processed.csv"
    )
    train_task = train_model(
        data_path=preprocess_task.output, model_path=output_model
    )

compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```
Key features:
- Visual pipeline editor and execution viewer
- Artifact tracking and lineage
- Experiment management and comparison
- Recurring run scheduling (see the SDK sketch below)
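The experiment management and recurring-run scheduling listed above can also be driven programmatically. A minimal sketch, assuming a Pipelines API port-forwarded to http://localhost:8080 and the compiled pipeline.yaml from the example above; parameter and attribute names follow the KFP v2 SDK client:

```python
from kfp.client import Client

# Connect to the Kubeflow Pipelines API (assumes a port-forwarded endpoint).
client = Client(host="http://localhost:8080")

# Group runs under a named experiment for comparison in the UI.
experiment = client.create_experiment(name="nightly-training")

# Schedule a recurring run of the compiled pipeline every day at 02:00
# (KFP cron expressions include a leading seconds field).
client.create_recurring_run(
    experiment_id=experiment.experiment_id,
    job_name="nightly-training-job",
    pipeline_package_path="pipeline.yaml",
    cron_expression="0 0 2 * * *",
)
```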
Kubeflow Notebooks
Jupyter notebook servers running natively on Kubernetes with:
- Pre-configured images for TensorFlow, PyTorch, and other frameworks
- GPU support for accelerated computing
- Persistent volumes for data storage
- Integration with the Kubeflow Pipelines SDK (see the sketch below)
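For the Pipelines SDK integration, a notebook server running inside a Kubeflow namespace can typically reach the Pipelines API without an explicit host. A minimal sketch, assuming the notebook's namespace has been configured for in-cluster Pipelines access (a per-namespace Kubeflow setup step not shown here):

```python
import kfp

# Inside a Kubeflow notebook, Client() can usually pick up the in-cluster
# Pipelines endpoint and credentials; outside the cluster, pass host=...
client = kfp.Client()

# Quick connectivity check: list experiments visible to this namespace.
print(client.list_experiments())
```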
KServe (formerly KFServing)
Serverless inference platform for deploying ML models on Kubernetes:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/iris"
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
```
Features:
- Autoscaling (including scale-to-zero)
- Canary rollouts and A/B testing
- Request batching for efficiency
- Model explainability with Alibi
- Support for TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and custom models
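Once an InferenceService like the sklearn-iris example above reports Ready, it can be called over KServe's V1 inference protocol. A minimal sketch, assuming the Istio ingress gateway is port-forwarded to localhost:8080; the Host header value is a placeholder for your service's external hostname (shown by kubectl get inferenceservice):

```python
import requests

# V1 inference protocol endpoint: POST /v1/models/<name>:predict
url = "http://localhost:8080/v1/models/sklearn-iris:predict"
# The Host header must match the InferenceService's external URL when
# traffic goes through the Istio ingress gateway (placeholder below).
headers = {"Host": "sklearn-iris.default.example.com"}

# One iris sample: sepal length/width, petal length/width.
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

resp = requests.post(url, json=payload, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [0]}
```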
Katib
Automated machine learning (AutoML) system for hyperparameter tuning and neural architecture search:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperparameter-tuning
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
```
Supported algorithms:
- Grid search, random search
- Bayesian optimization
- Tree-structured Parzen Estimator (TPE)
- Hyperband and ASHA
- Neural Architecture Search (NAS)
Training Operators
Kubernetes operators for distributed training:
- TFJob: Distributed TensorFlow training
- PyTorchJob: Distributed PyTorch training with DDP
- MXNetJob: Apache MXNet distributed training
- XGBoostJob: Distributed XGBoost training
- MPIJob: MPI-based distributed training (Horovod)
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command: ["python", "train.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```
Architecture
```
+------------------------------------------------------------------+
|                         Kubeflow Platform                         |
+------------------------------------------------------------------+
|  +-------------+ +-------------+ +-------------+ +-----------+    |
|  |  Notebooks  | |  Pipelines  | |   KServe    | |   Katib   |    |
|  +-------------+ +-------------+ +-------------+ +-----------+    |
|  +-------------+ +-------------+ +-------------+ +-----------+    |
|  | TF Operator | | PT Operator | | MPI Operator| |  Volumes  |    |
|  +-------------+ +-------------+ +-------------+ +-----------+    |
+------------------------------------------------------------------+
|                        Kubernetes Cluster                         |
|  +-------------+ +-------------+ +-------------+ +-----------+    |
|  |    Istio    | |   Knative   | |   Storage   | |    GPU    |    |
|  | (Networking)| | (Serverless)| |  (PV/PVC)   | | Scheduler |    |
|  +-------------+ +-------------+ +-------------+ +-----------+    |
+------------------------------------------------------------------+
|                       Infrastructure Layer                        |
|           (AWS EKS / GCP GKE / Azure AKS / On-Premises)           |
+------------------------------------------------------------------+
```
Implementation
Getting Started
Prerequisites
- Kubernetes cluster (1.22+) with at least 4 CPUs and 16GB RAM
- kubectl configured for your cluster
- kustomize (v3.2.0+) for deployment customization
Installation Methods
1. Kubeflow Manifests (Full Platform)
```bash
# Clone the manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Deploy using kustomize
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Access the dashboard (default credentials: user@example.com / 12341234)
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```
2. Standalone Kubeflow Pipelines
```bash
# Simpler deployment for pipelines-only use case
export PIPELINE_VERSION=2.0.5
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-emissary?ref=$PIPELINE_VERSION"
```
3. Cloud Provider Distributions
- AWS: Use AWS Kubeflow Distribution with Cognito and RDS integration
- GCP: Use Kubeflow on GKE with Cloud IAP authentication
- Azure: Use Kubeflow on AKS with Azure AD integration
Pipeline SDK Usage
Python SDK (KFP v2)
```python
from kfp import dsl, compiler
from kfp.client import Client

# Define components
@dsl.component(base_image="python:3.10")
def load_data(dataset_name: str) -> str:
    import urllib.request
    url = f"https://data.example.com/{dataset_name}.csv"
    path = f"/tmp/{dataset_name}.csv"
    urllib.request.urlretrieve(url, path)
    return path

@dsl.component(base_image="tensorflow/tensorflow:2.13.0")
def train_tensorflow_model(
    data_path: str,
    epochs: int,
    learning_rate: float
) -> str:
    import tensorflow as tf
    # Load and prepare data
    # Build and train model
    model_path = "/tmp/model"
    # model.save(model_path)
    return model_path

@dsl.component(base_image="python:3.10")
def evaluate_model(model_path: str, test_data: str) -> float:
    # Load model and evaluate
    accuracy = 0.95  # placeholder
    return accuracy

# Define pipeline
@dsl.pipeline(
    name="TensorFlow Training Pipeline",
    description="End-to-end ML pipeline with TensorFlow"
)
def tf_training_pipeline(
    dataset: str = "mnist",
    epochs: int = 10,
    learning_rate: float = 0.001
):
    load_task = load_data(dataset_name=dataset)
    train_task = train_tensorflow_model(
        data_path=load_task.output,
        epochs=epochs,
        learning_rate=learning_rate
    )
    eval_task = evaluate_model(
        model_path=train_task.output,
        test_data=load_task.output
    )

# Compile and submit
compiler.Compiler().compile(tf_training_pipeline, "pipeline.yaml")

# Submit to Kubeflow
client = Client(host="http://localhost:8080")
run = client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={"dataset": "cifar10", "epochs": 20}
)
```
Integration with ML Frameworks
MLflow Integration
Use MLflow for experiment tracking alongside Kubeflow Pipelines:
```python
@dsl.component(
    base_image="python:3.10",
    packages_to_install=["mlflow", "scikit-learn"]
)
def train_with_mlflow(
    data_path: str,
    mlflow_tracking_uri: str,
    experiment_name: str
) -> str:
    import mlflow
    from sklearn.ensemble import RandomForestClassifier

    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        # Train model
        model = RandomForestClassifier(n_estimators=100)
        # model.fit(X_train, y_train)

        # Log parameters and metrics
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("accuracy", 0.95)

        # Log model
        mlflow.sklearn.log_model(model, "model")

        # Return the run ID while the run is still active.
        return run.info.run_id
```
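To wire the component above into a pipeline, pass the tracking URI as a pipeline parameter. A minimal sketch using the train_with_mlflow component defined above; the in-cluster MLflow address is a hypothetical placeholder for your own tracking server:

```python
from kfp import dsl, compiler

@dsl.pipeline(name="mlflow-tracked-training")
def mlflow_pipeline(
    data_path: str,
    # Hypothetical in-cluster MLflow service; replace with your tracking server.
    tracking_uri: str = "http://mlflow.mlflow.svc.cluster.local:5000",
):
    train_with_mlflow(
        data_path=data_path,
        mlflow_tracking_uri=tracking_uri,
        experiment_name="kubeflow-rf-demo",
    )

compiler.Compiler().compile(mlflow_pipeline, "mlflow_pipeline.yaml")
```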
PyTorch Distributed Training
@dsl.component(base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime") def distributed_pytorch_training( data_path: str, model_output: str, world_size: int = 4 ): import torch import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP # Initialize distributed training dist.init_process_group(backend="nccl") local_rank = int(os.environ["LOCAL_RANK"]) torch.cuda.set_device(local_rank) # Create model and wrap with DDP model = MyModel().cuda(local_rank) model = DDP(model, device_ids=[local_rank]) # Training loop # ... # Save model on rank 0 if dist.get_rank() == 0: torch.save(model.module.state_dict(), model_output)
Production Deployment
Model Serving with KServe
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: production-model
  annotations:
    sidecar.istio.io/inject: "true"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 80
    scaleMetric: concurrency
    tensorflow:
      storageUri: "s3://models/production/v1"
      runtimeVersion: "2.13.0"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          nvidia.com/gpu: "1"
  transformer:
    containers:
      - name: feature-transformer
        image: myregistry/feature-transformer:v1
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
```
Canary Deployment
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-canary
spec:
  predictor:
    canaryTrafficPercent: 20
    tensorflow:
      storageUri: "s3://models/production/v2"  # New version
# The previously deployed revision continues to receive the remaining 80% of traffic
```
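Promotion or rollback is then a matter of adjusting canaryTrafficPercent on the InferenceService. One hedged way to do this from Python, assuming the kubernetes client package is installed and you have kubeconfig access to the cluster; the namespace and resource name match the manifest above:

```python
from kubernetes import client, config

# Load kubeconfig (use config.load_incluster_config() when running in a pod).
config.load_kube_config()
api = client.CustomObjectsApi()

# Shift all traffic to the new revision by raising canaryTrafficPercent;
# lowering it (or removing the field) rolls traffic back to the previous revision.
api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    name="model-canary",
    body={"spec": {"predictor": {"canaryTrafficPercent": 100}}},
)
```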
References
Official Documentation
- Kubeflow documentation: https://www.kubeflow.org/docs/
- Kubeflow Pipelines documentation: https://www.kubeflow.org/docs/components/pipelines/
- KServe documentation: https://kserve.github.io/website/
- Katib documentation: https://www.kubeflow.org/docs/components/katib/
Community Resources
- Kubeflow GitHub organization: https://github.com/kubeflow
- Kubeflow manifests repository: https://github.com/kubeflow/manifests
- Kubeflow community page (Slack, mailing lists, meetings): https://www.kubeflow.org/docs/about/community/
Related Technologies
- MLflow - ML experiment tracking and model registry
- TensorFlow Extended (TFX) - End-to-end ML platform
- Seldon Core - Alternative model serving platform
- BentoML - ML model serving framework
Books and Courses
- "Kubeflow for Machine Learning" by Holden Karau et al. (O'Reilly)
- "Building Machine Learning Pipelines" by Hannes Hapke (O'Reilly)
- MLOps Specialization on Coursera
Notes
Comparison with Other MLOps Platforms
| Feature | Kubeflow | MLflow | SageMaker |
|---|---|---|---|
| Infrastructure | Kubernetes | Agnostic | AWS-native |
| Experiment Tracking | Via MLflow/custom | Native | Native |
| Pipeline Authoring | Python SDK | Limited | Python SDK |
| Model Serving | KServe | MLflow Serving | SageMaker Endpoints |
| AutoML | Katib | Third-party | Autopilot |
| Cost | Infrastructure | Open-source | Pay-per-use |
| Learning Curve | Steep | Moderate | Moderate |
Best Practices
- Start Small: Begin with standalone Pipelines before full Kubeflow deployment
- Use Caching: Enable pipeline caching to avoid redundant computation (see the task-level sketch after this list)
- Version Everything: Use artifact versioning for reproducibility
- Resource Limits: Always set resource requests and limits for components
- Multi-tenancy: Use Kubeflow profiles for team isolation
- Monitoring: Integrate Prometheus and Grafana for observability
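Two of the practices above, caching and resource limits, can be set directly on pipeline tasks. A minimal sketch using a hypothetical heavy_step component; the method names are from the KFP v2 task API:

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def heavy_step(n: int) -> int:
    # Placeholder for an expensive computation.
    return n * n

@dsl.pipeline(name="tuned-pipeline")
def tuned_pipeline(n: int = 42):
    task = heavy_step(n=n)
    # Reuse cached results when the component and its inputs are unchanged.
    task.set_caching_options(True)
    # Pin CPU and memory limits so the scheduler can place the pod predictably.
    task.set_cpu_limit("1").set_memory_limit("2G")
```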
Common Challenges
- Complexity: Full Kubeflow deployment requires expertise in Kubernetes, Istio, and multiple components
- Resource Intensive: Minimum viable deployment requires significant cluster resources
- Upgrade Path: Major version upgrades can be challenging due to CRD changes
- Debugging: Distributed pipeline failures require understanding of Kubernetes internals
Future Directions
- Improved integration with LLM workflows and fine-tuning pipelines
- Enhanced support for federated learning
- Better multi-cloud and hybrid deployment options
- Simplified installation and management