Comprehensive Workflow and Version Control in Machine Learning Using Various Cloud Platforms

Workflow

scikit-learn

  • Load
  • Data analysis (numeric vs. categorical)
  • Preprocess Pipeline (features union, impute, fill, scale)
  • Features, Target
  • Estimator (fit)
  • Pickle
  • Version and Deploy
  • Input
  • Preprocess Pipeline
  • Predict
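
The scikit-learn steps above can be sketched end to end as follows. This is a minimal illustration, not the author's actual pipeline: the tiny in-memory dataset, the column names (sqft, city, price), and the MODEL_VERSION label are hypothetical, and ColumnTransformer is used here in place of a hand-built features union.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load: a hypothetical in-memory dataset stands in for a real file or query.
df = pd.DataFrame({
    "sqft": [850.0, 1200.0, np.nan, 1500.0, 950.0, 2000.0],
    "city": ["A", "B", "A", np.nan, "B", "A"],
    "price": [100.0, 150.0, 120.0, 180.0, 115.0, 240.0],
})

# Features, Target
X, y = df.drop(columns=["price"]), df["price"]

# Data analysis: separate numeric from categorical columns.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Preprocess pipeline: impute and scale numerics, fill and encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("fill", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Estimator (fit): preprocessing and estimator travel together in one object.
model = Pipeline([("preprocess", preprocess), ("estimator", LinearRegression())])
model.fit(X, y)

# Pickle, then version: the label is a hypothetical naming scheme.
MODEL_VERSION = "v1"
blob = pickle.dumps(model)

# Input -> Preprocess Pipeline -> Predict, using the restored artifact.
restored = pickle.loads(blob)
new_input = pd.DataFrame({"sqft": [1100.0], "city": ["B"]})
prediction = restored.predict(new_input)
```

Pickling the whole Pipeline, rather than the estimator alone, means the prediction path reuses exactly the preprocessing that was fitted at training time.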

AWS (Machine Learning)

https://docs.aws.amazon.com/machine-learning/latest/dg/the-machine-learning-process.html

  • Analyze your data
  • Split data into training and evaluation datasources
  • Shuffle your training data
  • Process features
  • Train the model
  • Select model parameters
  • Evaluate the model performance
  • Feature selection
  • Set a score threshold for prediction accuracy
  • Use the model
  • Prediction input to S3
  • Boto3 batch prediction
  • Waiter poll
  • Prediction response (S3)
  • Clean data source, model, and batch prediction resources
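
The batch prediction, waiter, and cleanup steps above might look roughly like this with Boto3. It is a sketch under assumptions: the function names and all IDs are hypothetical, the client is expected to come from boto3.client("machinelearning"), and the model and the prediction datasource are assumed to exist already.

```python
def run_batch_prediction(client, model_id, datasource_id, prediction_id, output_uri):
    """Start a batch prediction, wait for it, and return its S3 output URI.

    `client` is assumed to be boto3.client("machinelearning").
    """
    client.create_batch_prediction(
        BatchPredictionId=prediction_id,
        BatchPredictionName=prediction_id,  # reuse the ID as the name
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_uri,  # an S3 prefix the service can write to
    )

    # Waiter poll: blocks until the matching batch prediction completes.
    waiter = client.get_waiter("batch_prediction_available")
    waiter.wait(FilterVariable="Name", EQ=prediction_id)

    # Prediction response: the results are written under this S3 location.
    result = client.get_batch_prediction(BatchPredictionId=prediction_id)
    return result["OutputUri"]


def clean_up(client, prediction_id, datasource_id, model_id):
    """Delete the batch prediction, datasource, and model resources."""
    client.delete_batch_prediction(BatchPredictionId=prediction_id)
    client.delete_data_source(DataSourceId=datasource_id)
    client.delete_ml_model(MLModelId=model_id)
```

Passing the client in as an argument keeps credentials and region configuration outside these helpers and makes them easy to exercise with a stub.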

AWS (SageMaker)

https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-notebooks-instances.html

  • Explore and Preprocess Data
  • Model Training
  • Model Deployment
  • Validating Models
  • Programming Model

Google (ML Engine)

https://github.com/GoogleCloudPlatform/cloudml-samples

  • Model
  • Version
  • Framework (e.g., scikit-learn)
  • Notebook: joblib: model.pkl

Azure (Machine Learning)

https://github.com/Azure/MachineLearningNotebooks

  • Experiments
  • Pipelines
  • Compute
  • Models
  • Images
  • Deployments
  • Activities

Models

  • Azure notebooks
  • Load data
  • Cleanse data
  • Convert types and filter
  • Split and rename columns
  • Transform data
  • Clean up resources
  • Train the automatic regression model
  • Test the best model accuracy

IBM

Oracle

Project Pipeline

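Task                | Features v2 | Target v2 | Clustering | Model 2 | Model 3
--------------------+-------------+-----------+------------+---------+--------
Data Collection     |             |           |            |         |
Data Integration    |             |           |            |         |
Data Cleaning       |             |           |            |         |
Analysis Tools      |             |           |            |         |
Data Analysis       |             |           |            |         |
Feature Engineering |             |           |            |         |
Pipeline Management |             |           |            |         |
Model Training      |             |           |            |         |
Tuning              |             |           |            |         |
Model Evaluation    |             |           |            |         |
Configuration       |             |           |            |         |
Deployment          |             |           |            |         |
A/B Testing         |             |           |            |         |
Resource Management |             |           |            |         |
Feature Extraction  |             |           |            |         |
Target Management   |             |           |            |         |
Model Deprecation   |             |           |            |         |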

Training and Prediction Input Pipelines

Versioning

Validation

  • MSE
  • Training Error
  • Resubstitution
  • Hold-out
  • K-fold cross-validation
  • LOOCV
  • Random subsampling
  • Bootstrapping
  • Overfitting
  • Confidence
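
Several of the validation strategies listed above differ only in how the data is split before computing MSE. The sketch below compares resubstitution (training error), hold-out, k-fold cross-validation, and LOOCV on a one-feature least-squares line. The helper names and the small noisy dataset are hypothetical, and random subsampling and bootstrapping are not shown.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def _score(xs, ys, train, test):
    """Fit on the train indices, then report MSE on the test indices."""
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in test], [ys[i] for i in test])


def hold_out_mse(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle once, train on the first part, evaluate on the remainder."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    return _score(xs, ys, idx[:cut], idx[cut:])


def k_fold_mse(xs, ys, k, seed=0):
    """Average test MSE over k disjoint folds; k == len(xs) gives LOOCV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        scores.append(_score(xs, ys, train, fold))
    return sum(scores) / len(scores)


# Hypothetical data: y = 2x + 1 plus small fixed noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.2]
xs = [float(x) for x in range(12)]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

resubstitution = mse(fit_line(xs, ys), xs, ys)  # training error
hold_out = hold_out_mse(xs, ys)
four_fold = k_fold_mse(xs, ys, k=4)
loocv = k_fold_mse(xs, ys, k=len(xs))
```

Resubstitution scores the model on the same points it was fitted to, so it tends to understate the error on unseen data; for a least-squares fit the LOOCV estimate is never lower than the training error.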

Exploration

Deployment

Algorithms

AWS Machine Learning

For our purposes, we simply use linear regression with a squared loss function, trained with stochastic gradient descent (SGD).
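
To make the combination of squared loss and SGD concrete, here is a minimal pure-Python sketch of one-feature linear regression. It is an illustration only, not the AWS implementation: the function name, learning rate, epoch count, and toy data are all assumptions.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by SGD on the per-sample loss 0.5 * (w*x + b - y)**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of the per-sample loss: d/dw = err * x, d/db = err.
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)
```

Each update uses a single sample, which is what distinguishes SGD from batch gradient descent; shuffling the order every epoch is the same idea as the "Shuffle your training data" step in the AWS workflow.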

Models

Frameworks

Training / Conferences

Author: Jason Walsh

j@wal.sh

Last Updated: 2024-08-14 06:08:49