Electricity Company - MLOps Platform

Background

The Company maintains an electricity distribution network and supplies electricity to millions of homes across Australia.

Problem

The Company has built machine learning models to perform batch forecasting of electricity demand. However, there are several problems with the current solution:

The models are run on a batch basis directly on a virtual machine using Windows Task Scheduler, which makes execution logs difficult to monitor.
The models are not versioned.
The models are manually deployed onto the virtual machine, resulting in error-prone deployments.
The model’s local Python environment is not captured, making it difficult to replicate the environment in production.
The model cannot perform inference in real time.
Models were generally written as a single, very long Python script that lacked modularity and was difficult to cover with unit and integration tests.
Several other models have been developed with inconsistent approaches, code patterns and deployment patterns.

Solution

MLOps platform

Designed an MLOps platform to scale across the Company’s business units (more than five).
Wrote Infrastructure as Code using ARM templates and PowerShell Core to automate the entire environment deployment from simple YAML config files.
Each MLOps platform environment is fully configurable through a YAML file and includes:
- Azure ML Workspace for tracking experiments and model versions
- Azure Compute for cloud training and experimentation, with a Python ML environment loaded
- Azure Databricks for cloud training and experimentation, with a PySpark ML environment loaded
- Kubernetes Cluster for real-time model inference
- Azure VNet for network security
- Policies and RBAC for governance and access control
- Log Analytics Workspace for application logging
- Azure Container Registry for storing Docker images of training environments
Built Azure DevOps CI/CD YAML pipelines to deploy the infrastructure end to end.
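
As a rough illustration of the config-driven approach, the sketch below shows how one environment's settings might be validated and converted into the `{"value": ...}` shape used by ARM parameter files. The field names, the `EnvironmentConfig` class and the `to_arm_parameters` function are hypothetical and are not taken from the actual platform. A plain dict stands in for the parsed YAML file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    """Hypothetical settings for one MLOps environment (field names are assumptions)."""
    name: str
    location: str
    aks_node_count: int = 3
    enable_databricks: bool = True

    def __post_init__(self):
        # Fail early on obviously invalid settings, before any deployment starts.
        if not self.name:
            raise ValueError("name must not be empty")
        if self.aks_node_count < 1:
            raise ValueError("aks_node_count must be at least 1")


def to_arm_parameters(config: EnvironmentConfig) -> dict:
    """Wrap each setting in the {"value": ...} shape used by ARM parameter files."""
    return {
        "environmentName": {"value": config.name},
        "location": {"value": config.location},
        "aksNodeCount": {"value": config.aks_node_count},
        "enableDatabricks": {"value": config.enable_databricks},
    }


# Stand-in for the parsed contents of a YAML config file (illustrative values).
raw_config = {"name": "forecasting-dev", "location": "australiaeast", "aks_node_count": 2}
params = to_arm_parameters(EnvironmentConfig(**raw_config))
```

Validating the config before deployment means a typo in a config file surfaces as a clear error rather than a failed deployment partway through.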
Designed and documented new development and deployment patterns for Data Scientists and ML engineers. New patterns include:
- Unit and integration testing to improve code quality
- Training pipelines to automate re-training if/when model drift is detected
- Model versioning with support for version rollback if needed
- Docker to encapsulate the local training environment for easier deployment later
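
The retraining and versioning patterns can be sketched as follows. This is a minimal, library-free illustration and not the platform's actual code. The drift measure (recent mean absolute error compared against a baseline), the 20% tolerance, and the names `ModelRegistry`, `mean_absolute_error` and `drift_detected` are all assumptions. On the platform itself, model versions were tracked in the Azure ML Workspace.

```python
class ModelRegistry:
    """Toy in-memory registry; a stand-in for a real model registry."""

    def __init__(self):
        self._versions = []   # registered model artifacts, oldest first
        self._active = None   # index of the version currently served

    def register(self, model) -> int:
        """Store a new version, make it active, and return its 1-based version number."""
        self._versions.append(model)
        self._active = len(self._versions) - 1
        return len(self._versions)

    def rollback(self) -> int:
        """Reactivate the previous version and return its version number."""
        if self._active is None or self._active == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self._active + 1

    @property
    def active_version(self):
        return None if self._active is None else self._active + 1


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def drift_detected(baseline_mae, recent_mae, tolerance=0.2):
    """Flag drift when recent error exceeds the baseline by more than `tolerance`."""
    return recent_mae > baseline_mae * (1 + tolerance)


registry = ModelRegistry()
registry.register("model-v1")

# Illustrative numbers only: error at training time versus error on recent data.
baseline_mae = 10.0
recent_mae = mean_absolute_error([100, 120, 90], [115, 100, 105])

if drift_detected(baseline_mae, recent_mae):
    # In a real pipeline this step would trigger the training pipeline.
    registry.register("model-v2")
```

If the retrained model underperforms in production, calling `registry.rollback()` reactivates the previous version without retraining.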
Trained Data Scientists and ML engineers on the new patterns.

Outcome

The MLOps platform received a health check assessment score of 92% from Microsoft (Perth).
Feedback from Data Analytics Manager:

“Jonathan’s commitment and hard work throughout the project was a motivation for everyone else in the team. Many of us learned a lot from you and the best practices you adhere to when designing and building flexible and reliable data solutions. The advice you provided, the expertise in data & AI and the positive attitude you consistently showed during your time in both projects were a major factor for the achievement of my team’s goals.”

Feedback from Solution Architect:

“The solution went beyond what we original set out to do which was great”