Electricity Company - MLOps Platform
Background
The Company maintains an electricity distribution network and supplies electricity to millions of homes across Australia.
Problem
The Company has built machine learning models that produce batch forecasts of electricity demand. However, the current solution has several problems:
- The models are run in batches directly on a virtual machine using Windows Task Scheduler, which makes execution logs difficult to monitor.
- The models are not versioned.
- The models are deployed manually onto the virtual machine, which makes deployment error-prone.
- The model’s local Python environment is not captured, which makes it difficult to replicate the environment in production.
- The model cannot perform inference in real time.
- Models were generally written as a single, very long Python script that lacked modularity and was difficult to cover with unit and integration tests.
- Several other models have been developed with inconsistent approaches, code patterns and deployment patterns.
Solution
MLOps platform
- Designed an MLOps platform that scales across the Company’s business units (more than 5 business units).
- Wrote Infrastructure as Code using ARM templates and PowerShell Core to automate deployment of the entire environment from simple YAML config files.
- The MLOps platform environment includes (each environment is fully configurable with a YAML file):
  - Azure ML Workspace for tracking experiments and model versions
  - Azure Compute for cloud training and experimentation with a Python ML environment loaded
  - Azure Databricks for cloud training and experimentation with a PySpark ML environment loaded
  - Kubernetes Cluster for real-time model inference
  - Azure VNet for network security
  - Policies and RBAC for governance and access control
  - Log Analytics Workspace for application logging
  - Azure Container Registry for storing Docker images of training environments
  - Azure DevOps CI/CD YAML pipelines to deploy the end-to-end infrastructure
- Designed and documented new development and deployment patterns for Data Scientists and ML engineers. New patterns include:
  - Unit and integration testing to improve code quality
  - Training pipelines to automate re-training if/when model drift is detected
  - Model versioning with support for version rollback if needed
  - Docker to encapsulate the local training environment for easier deployment later
- Trained Data Scientists and ML engineers on the new patterns.
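The re-training and version rollback patterns listed above can be illustrated with a minimal Python sketch. It is not the platform's actual implementation: `ModelRegistry` is a hypothetical in-memory stand-in for a real model registry, `retrain_if_drifted` and its `tolerance` parameter are invented for illustration, and drift is assumed here to mean that the live forecast error (MAPE) has risen well above the error measured at training time. The source does not say how drift was detected.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction (0.05 means 5%).

    Assumes every actual value is non-zero, which holds for demand figures.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry: keeps every version, tracks the active one."""

    versions: List[Any] = field(default_factory=list)
    active: int = 0  # 1-based version number; 0 means nothing is registered yet

    def register(self, model: Any) -> int:
        """Store a new model version and make it the active one."""
        self.versions.append(model)
        self.active = len(self.versions)
        return self.active

    def rollback(self, version: int) -> None:
        """Switch the active version back to an earlier one without deleting anything."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown model version: {version}")
        self.active = version

    def current(self) -> Any:
        """Return the model that is currently active."""
        if self.active == 0:
            raise LookupError("no model has been registered")
        return self.versions[self.active - 1]


def retrain_if_drifted(
    registry: ModelRegistry,
    train: Callable[[], Any],
    actual: Sequence[float],
    forecast: Sequence[float],
    baseline_mape: float,
    tolerance: float = 0.5,
) -> bool:
    """Retrain and register a new version when live error exceeds the baseline.

    Drift is flagged when live MAPE > baseline_mape * (1 + tolerance).
    Returns True if a new version was registered.
    """
    if mape(actual, forecast) <= baseline_mape * (1 + tolerance):
        return False
    registry.register(train())
    return True


# Usage: version 1 is live, recent forecasts are badly off, so version 2 is trained.
registry = ModelRegistry()
registry.register("model-v1")
retrained = retrain_if_drifted(
    registry,
    train=lambda: "model-v2",  # placeholder for a real training pipeline
    actual=[100.0, 120.0],
    forecast=[150.0, 60.0],
    baseline_mape=0.10,
)
# If the new version misbehaves, the previous one can be restored.
registry.rollback(1)
```

Keeping every version in the registry and only moving an "active" pointer is what makes rollback cheap: nothing is retrained or redeployed from scratch, and the drift check stays a separate, testable function.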
Outcome
- The MLOps platform received a health check assessment score of 92% from Microsoft (Perth).
- Feedback from Data Analytics Manager:
“Jonathan’s commitment and hard work throughout the project was a motivation for everyone else in the team. Many of us learned a lot from you and the best practices you adhere to when designing and building flexible and reliable data solutions. The advice you provided, the expertise in data & AI and the positive attitude you consistently showed during your time in both projects were a major factor for the achievement of my team’s goals.”
- Feedback from Solution Architect:
“The solution went beyond what we originally set out to do, which was great.”