Table of Contents
1. Workflow
1.1. scikit-learn
- Load
- Data analysis (numeric vs. categorical)
- Preprocess Pipeline (features union, impute, fill, scale)
- Features, Target
- Estimator (fit)
- Pickle
- Version and Deploy
- Input
- Preprocess Pipeline
- Predict
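The scikit-learn steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`rooms`, `area`, `city`, `price`), the toy data, and the choice of `LinearRegression` are assumptions made for the example, and `ColumnTransformer` is used here in place of a `FeatureUnion` to combine the numeric and categorical branches.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load: a tiny in-memory frame stands in for the real data source (hypothetical data).
data = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 4.0, 3.0, 5.0],
    "area": [50.0, 70.0, 65.0, np.nan, 80.0, 120.0],
    "city": ["a", "b", "a", np.nan, "b", "a"],
    "price": [100.0, 150.0, 130.0, 170.0, 160.0, 250.0],
})

# Features, Target
features = data.drop(columns=["price"])
target = data["price"]

# Data analysis: separate numeric from categorical columns.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Preprocess pipeline: impute and scale numeric columns, fill and encode categorical ones.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("fill", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

# Estimator (fit): preprocessing and estimator are fitted as one object.
model = Pipeline([("preprocess", preprocess), ("estimator", LinearRegression())])
model.fit(features, target)

# Pickle: serialize the fitted pipeline; a deployed service would load these bytes.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Input -> Preprocess Pipeline -> Predict: raw input goes through the same steps.
new_input = pd.DataFrame({"rooms": [3.0], "area": [np.nan], "city": ["c"]})
prediction = restored.predict(new_input)
```

Pickling the whole pipeline, rather than the estimator alone, keeps the prediction-time preprocessing identical to what was fitted during training.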
1.2. AWS (Machine Learning)
https://docs.aws.amazon.com/machine-learning/latest/dg/the-machine-learning-process.html
- Analyze your data
- Split data into training and evaluation datasources
- Shuffle your training data
- Process features
- Train the model
- Select model parameters
- Evaluate the model performance
- Feature selection
- Set a score threshold for prediction accuracy
- Use the model
- Prediction input to S3
- Boto3 batch prediction
- Waiter poll
- Prediction response (S3)
- Clean data source, model, and batch prediction resources
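The "Use the model" steps above might look roughly like the sketch below. It assumes a client created with `boto3.client("machinelearning")`, an already trained model, and prediction input plus a schema file already uploaded to S3. The function name, the ID naming scheme, and the `delete_model` flag are hypothetical and exist only for this illustration.

```python
import uuid


def run_batch_prediction(client, ml_model_id, input_s3_uri, schema_s3_uri,
                         output_s3_uri, delete_model=False):
    """Create a batch prediction, wait for it to finish, then clean up.

    `client` is expected to behave like boto3.client("machinelearning").
    """
    suffix = uuid.uuid4().hex[:8]
    data_source_id = f"ds-{suffix}"            # hypothetical naming scheme
    batch_id = f"bp-{suffix}"
    batch_name = f"batch-prediction-{suffix}"

    # Prediction input to S3: register the uploaded file as a datasource.
    client.create_data_source_from_s3(
        DataSourceId=data_source_id,
        DataSourceName=f"prediction-input-{suffix}",
        DataSpec={
            "DataLocationS3": input_s3_uri,
            "DataSchemaLocationS3": schema_s3_uri,
        },
        ComputeStatistics=False,
    )

    # Boto3 batch prediction: results are written under output_s3_uri.
    client.create_batch_prediction(
        BatchPredictionId=batch_id,
        BatchPredictionName=batch_name,
        MLModelId=ml_model_id,
        BatchPredictionDataSourceId=data_source_id,
        OutputUri=output_s3_uri,
    )

    # Waiter poll: block until the batch prediction completes or fails.
    waiter = client.get_waiter("batch_prediction_available")
    waiter.wait(FilterVariable="Name", EQ=batch_name)

    # Clean up the resources created for this run; the S3 output is not touched.
    client.delete_batch_prediction(BatchPredictionId=batch_id)
    client.delete_data_source(DataSourceId=data_source_id)
    if delete_model:
        client.delete_ml_model(MLModelId=ml_model_id)
    return batch_id
```

Passing the client in as a parameter keeps the function free of credentials and region configuration, and lets it be exercised with a stub before it is pointed at a real account.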
1.3. AWS (SageMaker)
https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-notebooks-instances.html
- Explore and Preprocess Data
- Model Training
- Model Deployment
- Validating Models
- Programming Model
1.4. Google (ML Engine)
https://github.com/GoogleCloudPlatform/cloudml-samples
- Model
- Version
- Framework (e.g., scikit-learn)
- Notebook: joblib: model.pkl
1.5. Azure (Machine Learning service)
https://github.com/Azure/MachineLearningNotebooks
- Experiments
- Pipelines
- Compute
- Models
- Images
- Deployments
- Activities
1.5.1. Models
- Azure notebooks
- Load data
- Cleanse data
- Convert types and filter
- Split and rename columns
- Transform data
- Clean up resources
- Train the automatic regression model
- Test the best model accuracy
1.6. IBM
1.7. Oracle
2. Project Pipeline
Task | Features v2 | Target v2 | Clustering | Model 2 | Model 3 |
Data Collection | |||||
Data Integration | |||||
Data Cleaning | |||||
Analysis Tools | |||||
Data Analysis | |||||
Feature Engineering | |||||
Pipeline Management | |||||
Model Training | |||||
Tuning | |||||
Model Evaluation | |||||
Configuration | |||||
Deployment | |||||
A/B Testing | |||||
Resource Management | |||||
Feature Extraction | |||||
Target Management | |||||
Model Deprecation | |||||
3. Training and Prediction Input Pipelines
4. Versioning
5. Validation
- MSE
- Training Error
- Resubstitution
- Hold-out
- K-fold cross-validation
- LOOCV
- Random subsampling
- Bootstrapping
- Over-Fit
- Confidence
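The validation schemes listed above differ mainly in how they split the data into training and test indices. The sketch below compares them on synthetic data using only the standard library. The one-feature least-squares model, the data, and all function names are assumptions chosen to keep the example small, and the bootstrap variant shown scores each round on its out-of-bag samples.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_line(xs, ys):
    """Least-squares fit for a single feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def evaluate(xs, ys, train_idx, test_idx):
    """Fit on the training indices and return the MSE on the test indices."""
    slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    preds = [slope * xs[i] + intercept for i in test_idx]
    return mse([ys[i] for i in test_idx], preds)


def k_fold_splits(n, k, rng):
    """Yield (train, test) index lists for k folds; k == n gives LOOCV."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def random_subsampling_splits(n, rounds, test_fraction, rng):
    """Yield repeated random hold-out splits; one round is a plain hold-out."""
    for _ in range(rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = max(1, int(n * test_fraction))
        yield idx[cut:], idx[:cut]


def bootstrap_splits(n, rounds, rng):
    """Yield splits that train on a sample with replacement, test out-of-bag."""
    for _ in range(rounds):
        train_idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(train_idx)
        test_idx = [i for i in range(n) if i not in chosen]
        if test_idx:
            yield train_idx, test_idx


rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]  # synthetic data
n = len(xs)
everything = list(range(n))


def avg(splits):
    return mean(evaluate(xs, ys, tr, te) for tr, te in splits)


results = {
    "resubstitution": evaluate(xs, ys, everything, everything),  # training error
    "hold_out": avg(random_subsampling_splits(n, 1, 0.3, rng)),
    "k_fold": avg(k_fold_splits(n, 5, rng)),
    "loocv": avg(k_fold_splits(n, n, rng)),
    "random_subsampling": avg(random_subsampling_splits(n, 10, 0.3, rng)),
    "bootstrap": avg(bootstrap_splits(n, 50, rng)),
}
```

A resubstitution error that is much lower than the cross-validated estimates is one common sign of overfitting, since the first is measured on data the model has already seen.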
6. Exploration
7. Deployment
8. Algorithms
8.1. AWS Machine Learning
For our purposes, we simply use linear regression (squared loss function, optimized with stochastic gradient descent, SGD).
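To make the combination of squared loss and SGD concrete, here is a minimal pure-Python sketch. It illustrates the technique only and is not the AWS implementation: the function name, learning rate, epoch count, and toy data are all assumptions.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w.x + b by stochastic gradient descent on the squared loss."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            pred = sum(wj * xj for wj, xj in zip(w, xs[i])) + b
            err = pred - ys[i]
            # Gradient of 0.5 * err**2 is err * x for w and err for b.
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            b -= lr * err
    return w, b


# Noise-free toy data generated from y = 3x + 2 (hypothetical).
xs = [[x / 10] for x in range(10)]
ys = [3 * row[0] + 2 for row in xs]
w, b = sgd_linear_regression(xs, ys)
```

Each update uses a single sample, so the cost per step does not depend on the size of the dataset. That is the usual reason for preferring SGD over a closed-form solution on large training sets.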