
Comprehensive Workflow and Version Control in Machine Learning Using Various Cloud Platforms

Table of Contents

Workflow
Project Pipeline
Training and Prediction Input Pipelines
Versioning
Validation
Exploration
Deployment
Algorithms
- AWS Machine Learning
Models
Frameworks
Training / Conferences

Workflow

scikit-learn

Load
Data analysis (numeric vs. categorical)
Preprocess Pipeline (features union, impute, fill, scale)
Features, Target
Estimator (fit)
Pickle
Version and Deploy
Input
Preprocess Pipeline
Predict
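
The scikit-learn steps above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the dataset, the column names (rooms, area, city, price), and the choice of LinearRegression are all hypothetical, and ColumnTransformer is used here in place of a features union to combine the numeric and categorical branches.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load: a tiny hypothetical dataset with one missing numeric value.
df = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 4.0, 3.0, 5.0],
    "area": [50.0, 70.0, 65.0, 90.0, 72.0, 120.0],
    "city": ["a", "b", "a", "b", "a", "b"],
    "price": [100.0, 150.0, 130.0, 200.0, 155.0, 260.0],
})

# Data analysis: separate numeric from categorical columns.
numeric = ["rooms", "area"]
categorical = ["city"]

# Preprocess pipeline: impute and scale numeric columns,
# one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Features, target.
X, y = df[numeric + categorical], df["price"]

# Estimator (fit): preprocessing and estimator travel together in one object.
model = Pipeline([("preprocess", preprocess), ("estimator", LinearRegression())])
model.fit(X, y)

# Pickle: serialize the fitted pipeline (bytes here; a file in practice).
blob = pickle.dumps(model)

# Input -> preprocess pipeline -> predict, using the restored object.
restored = pickle.loads(blob)
new_input = pd.DataFrame({"rooms": [3.0], "area": [80.0], "city": ["a"]})
prediction = restored.predict(new_input)
```

Pickling the whole pipeline rather than the bare estimator means the prediction side reuses exactly the preprocessing that was fitted at training time.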

AWS (Machine Learning)

https://docs.aws.amazon.com/machine-learning/latest/dg/the-machine-learning-process.html

Analyze your data
Split data into training and evaluation datasources
Shuffle your training data
Process features
Train the model
Select model parameters
Evaluate the model performance
Feature selection
Set a score threshold for prediction accuracy
Use the model
Prediction input to S3
Boto3 batch prediction
Waiter poll
Prediction response (S3)
Clean data source, model, and batch prediction resources
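
The batch prediction steps at the end of the list above (prediction input in S3, Boto3 batch prediction, waiter poll, cleanup) might look roughly like this. It is a sketch under assumptions: the function name, the ID prefixes, and all S3 URIs are hypothetical, the input file and its schema are assumed to be uploaded already, and the client is passed in so that the snippet does not need AWS credentials to be imported.

```python
def run_batch_prediction(client, model_id, input_s3_uri, schema_s3_uri,
                         output_s3_uri, run_id, delete_model=False):
    """Create a batch prediction, wait for it to finish, then clean up.

    `client` is expected to be boto3.client("machinelearning").
    The "ds-" and "bp-" ID prefixes are illustrative only.
    """
    datasource_id = f"ds-{run_id}"
    prediction_id = f"bp-{run_id}"

    # Prediction input to S3: register the uploaded file as a datasource.
    client.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName=datasource_id,
        DataSpec={
            "DataLocationS3": input_s3_uri,
            "DataSchemaLocationS3": schema_s3_uri,
        },
        ComputeStatistics=False,
    )

    # Boto3 batch prediction: results are written under output_s3_uri.
    client.create_batch_prediction(
        BatchPredictionId=prediction_id,
        BatchPredictionName=prediction_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_s3_uri,
    )

    # Waiter poll: block until the batch prediction is available.
    waiter = client.get_waiter("batch_prediction_available")
    waiter.wait(FilterVariable="Name", EQ=prediction_id)

    # Clean up batch prediction and datasource; the model only on request.
    client.delete_batch_prediction(BatchPredictionId=prediction_id)
    client.delete_data_source(DataSourceId=datasource_id)
    if delete_model:
        client.delete_ml_model(MLModelId=model_id)

    return output_s3_uri


# Hypothetical usage (requires AWS credentials and an existing model):
# import boto3
# client = boto3.client("machinelearning")
# run_batch_prediction(client, "ml-example", "s3://bucket/input.csv",
#                      "s3://bucket/input.csv.schema", "s3://bucket/output/", "001")
```

The prediction response is then read from the S3 output location before or after the cleanup calls, since deleting the batch prediction resource is a separate step from handling the files in the bucket.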

AWS (SageMaker)

https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-notebooks-instances.html

Explore and Preprocess Data
Model Training
Model Deployment
Validating Models
Programming Model

Google (ML Engine)

https://github.com/GoogleCloudPlatform/cloudml-samples

Model
Version
Framework (e.g., scikit-learn)
Notebook: serialize the model with joblib (model.joblib) or pickle (model.pkl)

Azure (Machine Learning)

https://github.com/Azure/MachineLearningNotebooks

Experiments
Pipelines
Compute
Models
Images
Deployments
Activities

Models

Azure notebooks
Load data
Cleanse data
Convert types and filter
Split and rename columns
Transform data
Clean up resources
Train the automatic regression model
Test the best model accuracy

IBM

Oracle

Project Pipeline

Task	Features v2	Target v2	Clustering	Model 2	Model 3
Data Collection
Data Integration
Data Cleaning
Analysis Tools
Data Analysis
Feature Engineering
Pipeline Management
Model Training
Tuning
Model Evaluation
Configuration
Deployment
A/B Testing
Resource Management
Feature Extraction
Target Management
Model Deprecation

Training and Prediction Input Pipelines

Versioning

https://blog.algorithmia.com/how-to-version-control-your-production-machine-learning-models/

Validation

MSE
Training Error
Resubstitution
Hold-out
K-fold cross-validation
LOOCV
Random subsampling
Bootstrapping
Over-Fit
Confidence
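
Several of the estimators listed above can be compared on one dataset. The sketch below uses scikit-learn with synthetic data; the data-generating coefficients, sample size, fold counts, and number of bootstrap rounds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    KFold, LeaveOneOut, ShuffleSplit, cross_val_score, train_test_split,
)
from sklearn.utils import resample

# Synthetic regression data: y = 3*x0 - 2*x1 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

model = LinearRegression()


def cv_mse(cv):
    """Mean MSE over the splits produced by a scikit-learn splitter."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()


# Resubstitution (training error): evaluate on the data used for fitting.
resub_mse = mean_squared_error(y, model.fit(X, y).predict(X))

# Hold-out: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_mse = mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))

# K-fold cross-validation, LOOCV, and random subsampling.
kfold_mse = cv_mse(KFold(n_splits=5, shuffle=True, random_state=0))
loocv_mse = cv_mse(LeaveOneOut())
subsample_mse = cv_mse(ShuffleSplit(n_splits=10, test_size=0.25, random_state=0))

# Bootstrapping: fit on a resample drawn with replacement,
# evaluate on the out-of-bag rows that were not drawn.
idx = np.arange(len(y))
boot = []
for seed in range(20):
    train_idx = resample(idx, replace=True, random_state=seed)
    oob_idx = np.setdiff1d(idx, train_idx)
    model.fit(X[train_idx], y[train_idx])
    boot.append(mean_squared_error(y[oob_idx], model.predict(X[oob_idx])))
bootstrap_mse = float(np.mean(boot))
```

The resubstitution error is the optimistic one: it scores the model on rows it has already seen, which is why the other estimates are the ones to watch when checking for over-fitting.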

Exploration

https://pair-code.github.io/what-if-tool/

Deployment

https://github.com/kubeflow/kubeflow

Algorithms

AWS Machine Learning

For our purposes, we simply use linear regression: a squared loss function minimized with stochastic gradient descent (SGD).
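
To make the squared loss and SGD combination concrete, here is a small pure-Python sketch. It is not the AWS implementation; the function name, learning rate, epoch count, and the noise-free toy data are assumptions chosen so the example converges quickly.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.1, epochs=50, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on the squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Toy data generated from y = 3x + 2 without noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + 2.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
```

Each update uses a single sample, which is what distinguishes SGD from batch gradient descent, where the gradient is averaged over the whole training set before every step.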

Models

Frameworks

https://github.com/apache/incubator-mxnet

Training / Conferences

https://www.eventbrite.com/e/odsc-east-2019-open-data-science-conference-save-50-for-limited-time-tickets-50666130761?aff=ebdssbdestsearch

STAY15

Author: Jason Walsh

j@wal.sh

Last Updated: 2024-10-30 16:43:54