Comprehensive Workflow and Version Control in Machine Learning Using Various Cloud Platforms
Table of Contents
Workflow
scikit-learn
- Load
- Data analysis (numeric vs. categorical)
- Preprocess Pipeline (features union, impute, fill, scale)
- Features, Target
- Estimator (fit)
- Pickle
- Version and Deploy
- Input
- Preprocess Pipeline
- Predict
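The scikit-learn steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the tiny in-memory dataset, the column names, the `MODEL_VERSION` tag, and the file name are all hypothetical, and `ColumnTransformer` is used here as a stand-in for the "features union" step.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load: a tiny hypothetical dataset standing in for a real file or query.
data = pd.DataFrame({
    "rooms": [3, 2, np.nan, 4, 3, 5],
    "area": [70.0, 55.0, 80.0, np.nan, 65.0, 120.0],
    "city": ["A", "B", "A", np.nan, "B", "A"],
    "price": [210.0, 150.0, 230.0, 300.0, 180.0, 400.0],
})

# Features, Target.
target = "price"
features = data.drop(columns=[target])

# Data analysis: separate numeric from categorical columns.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Preprocess pipeline: impute and scale numerics, impute and encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Estimator (fit): preprocessing and model travel together in one object.
model = Pipeline([("preprocess", preprocess), ("estimator", LinearRegression())])
model.fit(features, data[target])

# Pickle and version: the version tag in the file name is a hypothetical convention.
MODEL_VERSION = "v1"
model_path = Path(tempfile.mkdtemp()) / f"model-{MODEL_VERSION}.pkl"
with model_path.open("wb") as fh:
    pickle.dump(model, fh)

# Deploy side: load the pickled pipeline, then Input -> Preprocess -> Predict.
with model_path.open("rb") as fh:
    restored = pickle.load(fh)
new_input = pd.DataFrame({"rooms": [3], "area": [75.0], "city": ["C"]})
prediction = restored.predict(new_input)
print(prediction)
```

Pickling the whole `Pipeline`, rather than the estimator alone, means the prediction path reuses exactly the preprocessing that was fitted at training time.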
AWS (Machine Learning)
https://docs.aws.amazon.com/machine-learning/latest/dg/the-machine-learning-process.html
- Analyze your data
- Split data into training and evaluation datasources
- Shuffle your training data
- Process features
- Train the model
- Select model parameters
- Evaluate the model performance
- Feature selection
- Set a score threshold for prediction accuracy
- Use the model
- Prediction input to S3
- Boto3 batch prediction
- Waiter poll
- Prediction response (S3)
- Clean data source, model, and batch prediction resources
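The "Use the model" steps above could look roughly like the sketch below. It assumes the legacy Amazon Machine Learning service accessed through boto3, and the call names (`create_data_source_from_s3`, `create_batch_prediction`, the `batch_prediction_available` waiter, `get_batch_prediction`, and the `delete_*` calls) should be checked against the current boto3 documentation before use. The function name `run_batch_prediction`, its parameters, and the `ds-`/`bp-` ID prefixes are hypothetical.

```python
def run_batch_prediction(s3, ml, *, local_path, bucket, input_key,
                         schema_uri, output_uri, model_id, run_id,
                         delete_model=False):
    """Upload input, run a batch prediction, wait, and clean up.

    `s3` and `ml` are expected to behave like boto3.client("s3") and
    boto3.client("machinelearning"); they are passed in so the flow can be
    exercised with stand-in objects.
    """
    datasource_id = f"ds-{run_id}"   # hypothetical naming scheme
    prediction_id = f"bp-{run_id}"   # hypothetical naming scheme

    # Prediction input to S3.
    s3.upload_file(local_path, bucket, input_key)
    input_uri = f"s3://{bucket}/{input_key}"

    # A datasource describing the uploaded input.
    ml.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName=datasource_id,
        DataSpec={
            "DataLocationS3": input_uri,
            "DataSchemaLocationS3": schema_uri,
        },
        ComputeStatistics=False,  # statistics are only needed for training
    )

    # Boto3 batch prediction.
    ml.create_batch_prediction(
        BatchPredictionId=prediction_id,
        BatchPredictionName=prediction_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_uri,
    )

    # Waiter poll: blocks until the batch prediction is reported as available.
    waiter = ml.get_waiter("batch_prediction_available")
    waiter.wait(FilterVariable="Name", EQ=prediction_id)

    # Prediction response (S3): the results are written under OutputUri.
    result_uri = ml.get_batch_prediction(BatchPredictionId=prediction_id)["OutputUri"]

    # Clean up the batch prediction, the datasource, and optionally the model.
    ml.delete_batch_prediction(BatchPredictionId=prediction_id)
    ml.delete_data_source(DataSourceId=datasource_id)
    if delete_model:
        ml.delete_ml_model(MLModelId=model_id)

    return result_uri
```

In real use the two clients would come from `boto3.client("s3")` and `boto3.client("machinelearning")`, with credentials and region configured outside this function.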
AWS (SageMaker)
https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-notebooks-instances.html
- Explore and Preprocess Data
- Model Training
- Model Deployment
- Validating Models
- Programming Model
Google (ML Engine)
https://github.com/GoogleCloudPlatform/cloudml-samples
- Model
- Version
- Framework (e.g., scikit-learn)
- Notebook: joblib: model.pkl
Azure (Machine Learning)
https://github.com/Azure/MachineLearningNotebooks
- Experiments
- Pipelines
- Compute
- Models
- Images
- Deployments
- Activities
Models
- Azure notebooks
- Load data
- Cleanse data
- Convert types and filter
- Split and rename columns
- Transform data
- Clean up resources
- Train the automatic regression model
- Test the best model accuracy
IBM
Oracle
Project Pipeline
| Task | Features v2 | Target v2 | Clustering | Model 2 | Model 3 |
|---|---|---|---|---|---|
| Data Collection | | | | | |
| Data Integration | | | | | |
| Data Cleaning | | | | | |
| Analysis Tools | | | | | |
| Data Analysis | | | | | |
| Feature Engineering | | | | | |
| Pipeline Management | | | | | |
| Model Training | | | | | |
| Tuning | | | | | |
| Model Evaluation | | | | | |
| Configuration | | | | | |
| Deployment | | | | | |
| A/B Testing | | | | | |
| Resource Management | | | | | |
| Feature Extraction | | | | | |
| Target Management | | | | | |
| Model Deprecation | | | | | |
Training and Prediction Input Pipelines
Versioning
Validation
- MSE
- Training Error
- Resubstitution
- Hold-out
- K-fold cross-validation
- LOOCV
- Random subsampling
- Bootstrapping
- Over-Fit
- Confidence
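Several of the validation schemes listed above differ only in how they choose training and test indices. The sketch below illustrates that with a one-feature least-squares line and MSE as the score. It uses only the standard library; all function names and the synthetic data are hypothetical, chosen for illustration.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Closed-form least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def evaluate(xs, ys, train_idx, test_idx):
    """Fit on the training indices and return the MSE on the test indices."""
    a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    preds = [a * xs[i] + b for i in test_idx]
    return mse([ys[i] for i in test_idx], preds)


def resubstitution(xs, ys):
    """Training error: evaluate on the same points used for fitting."""
    idx = list(range(len(xs)))
    return evaluate(xs, ys, idx, idx)


def hold_out(xs, ys, test_fraction=0.3, seed=0):
    """Single shuffled split into a training part and a held-out part."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = max(1, int(len(idx) * test_fraction))
    return evaluate(xs, ys, idx[cut:], idx[:cut])


def random_subsampling(xs, ys, repeats=10):
    """Average of several hold-out runs with different shuffles."""
    return sum(hold_out(xs, ys, seed=s) for s in range(repeats)) / repeats


def k_fold(xs, ys, k, seed=0):
    """Each of the k folds is held out once; the k scores are averaged."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        scores.append(evaluate(xs, ys, train, fold))
    return sum(scores) / k


def loocv(xs, ys):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold(xs, ys, k=len(xs))


def bootstrap(xs, ys, rounds=50, seed=0):
    """Train on samples drawn with replacement, test on the out-of-bag points."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(rounds):
        sample = [rng.randrange(n) for _ in range(n)]
        out_of_bag = sorted(set(range(n)) - set(sample))
        if out_of_bag:
            scores.append(evaluate(xs, ys, sample, out_of_bag))
    return sum(scores) / len(scores) if scores else float("nan")


# Synthetic data: y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0, 1) for x in xs]

for name, score in [
    ("resubstitution", resubstitution(xs, ys)),
    ("hold-out", hold_out(xs, ys)),
    ("random subsampling", random_subsampling(xs, ys)),
    ("5-fold", k_fold(xs, ys, k=5)),
    ("LOOCV", loocv(xs, ys)),
    ("bootstrap", bootstrap(xs, ys)),
]:
    print(f"{name:>18}: MSE = {score:.3f}")
```

A resubstitution error that is much lower than the held-out estimates is the usual sign of over-fitting.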
Exploration
Deployment
Algorithms
AWS Machine Learning
For our purposes, we simply use linear regression (squared loss function and SGD).
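To make the "squared loss function and SGD" combination concrete, here is a minimal pure-Python sketch of linear regression trained by stochastic gradient descent. It illustrates the technique only and is not AWS's implementation; the function name, learning rate, epoch count, and toy data are assumptions.

```python
import random


def sgd_linear_regression(rows, targets, lr=0.1, epochs=500, seed=0):
    """Fit y = w . x + b by SGD on the squared loss 0.5 * (prediction - y) ** 2."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            prediction = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = prediction - targets[i]
            # Gradient of the per-sample loss: error * x for each weight, error for the bias.
            weights = [w - lr * error * x for w, x in zip(weights, rows[i])]
            bias -= lr * error
    return weights, bias


# Toy data on a scaled feature: y = 3x + 2 with no noise.
rows = [[i / 10] for i in range(11)]
targets = [3.0 * row[0] + 2.0 for row in rows]

weights, bias = sgd_linear_regression(rows, targets)
print(weights, bias)
```

Each update uses a single sample, which is why feature scaling and shuffling the training data matter: unscaled features or a fixed sample order can make the steps unstable or biased.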