Zack's Blog

zack-machine-learning October 02, 2024

MLOps - Explore ML tools

In the last post, MLOps - Lab Setup, I set up the local ML lab environment and ran validation in a Jupyter Notebook to test the CUDA device and its performance on my local PC.

Jupyter Notebooks are user-friendly tools for ML practice: they offer easy interaction and immediate feedback, which simplifies testing and debugging. However, they have limitations, such as reproducibility issues, challenges in collaboration and version control, scalability concerns for larger projects, and a lack of automation for tasks like retraining.

In this post, I will try an ML project with tools like DVC, MLflow, Docker, Apache Airflow, and CI/CD frameworks to strengthen machine learning workflows. DVC supports reproducibility by tracking data versions alongside the code in Git, while MLflow logs metrics for effective experiment tracking. Together these tools automate processes, streamline workflows, and improve collaboration and scalability. Their initial setup can be complex and resource-intensive, though, so the full stack may be excessive for smaller ML projects.

ML Tools explained

Data Versioning (DVC): DVC allows teams to manage and version datasets just like code. This ensures that data changes are tracked, making it easier to revert to previous versions if necessary.
Experiment Tracking (MLflow): MLflow tracks experiments, capturing metrics, parameters, and model versions in one centralized location. This makes it easier to compare different runs and select the best-performing model.
Containerization (Docker): Docker creates isolated environments, ensuring that code runs consistently across different platforms without dependency issues. This helps avoid the "it works on my machine" problem.
Workflow Orchestration (Apache Airflow): Airflow schedules and manages complex workflows, allowing for the automation of tasks such as data retrieval, preprocessing, model training, and evaluation.
CI/CD (Jenkins): I use a local Jenkins image to run automatic testing and deployment of models and code changes. This ensures that new features or updates are integrated quickly without disrupting the existing workflow.

Combining these tools into a single pipeline enables reproducibility, scalability, and consistency in machine learning workflows.

Project Structure

Create a new project directory with the following structure:

(jupyter_env) root@zackz:/mnt/mlops-project# tree
mlops-project/
├── data/             # Data directory (for DVC)
├── models/           # Trained models
├── src/              # Source code for the ML model
├── notebooks/        # Jupyter notebooks for experimentation
├── Dockerfile        # Docker config for packaging
├── dvc.yaml          # DVC pipeline config
├── airflow_dags/     # Airflow DAG for automation
└── mlflow/           # MLflow tracking directory
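
If you would rather script the folder layout than create each directory by hand, a minimal sketch along these lines could do it. The function name `create_skeleton` is my own, the directory names are taken from the tree above, and the files (`Dockerfile`, `dvc.yaml`) are left for you to create.

```python
from pathlib import Path

# Directory names mirror the tree above.
DIRECTORIES = ["data", "models", "src", "notebooks", "airflow_dags", "mlflow"]


def create_skeleton(root):
    """Create the project directories under root and return root as a Path."""
    root = Path(root)
    for name in DIRECTORIES:
        # exist_ok=True makes the function safe to call more than once
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `create_skeleton("mlops-project")` from the parent folder creates the layout, and running it again leaves existing content untouched.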

Project Implementation

Step 1: Data Versioning with DVC

Initialize Git & DVC

pip install dvc
git init
dvc init -f

Add the Iris Dataset:

mkdir data
curl -o data/iris.csv https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
dvc add data/iris.csv
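
`dvc add` keeps the dataset itself out of Git and writes a small pointer file, `data/iris.csv.dvc`, that records a content hash of the data. Committing that pointer (`git add data/iris.csv.dvc data/.gitignore`) is what versions the dataset. The sketch below only illustrates the idea of identifying a file version by an MD5 digest. It is not DVC's implementation, `file_md5` is a hypothetical helper name, and the digest may not match what your DVC version records.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Any change to the file produces a different digest, which is how a pointer file can tell one version of the data from another.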

Step 2: Train the Model (Using MLflow)

Install MLflow

pip install mlflow

Create a Training Script (src/train.py)

vim src/train.py

from pathlib import Path

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset. The path is resolved relative to this file, so the script
# works when started from the project root (python src/train.py) or in Docker.
DATA_PATH = Path(__file__).resolve().parent.parent / 'data' / 'iris.csv'
data = pd.read_csv(DATA_PATH, header=None)
X = data.iloc[:, :-1]   # four measurement columns
y = data.iloc[:, -1]    # species label

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Track experiment with MLflow
with mlflow.start_run():
    # Train model (fixed seed so repeated runs are comparable)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    # Log model and metrics to MLflow
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
    print(f"Model accuracy: {accuracy}")

Run the Training Script

python src/train.py

(jupyter_env) root@zackz:/mnt/f/1/mlops-project# python src/train.py
2024/10/05 13:33:39 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Model accuracy: 1.0

Launch the MLflow UI

mlflow ui

Navigate to http://127.0.0.1:5000 to view the experiment

Step 3: Dockerize the Model for Deployment

Create Dockerfile:

vim Dockerfile

FROM python:3.8-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source code
COPY . .

# Run the model training script
CMD ["python", "src/train.py"]

Create a requirements.txt file and build the mlops-local-model Docker image:

vim requirements.txt

mlflow
scikit-learn
pandas
dvc

docker build -t mlops-local-model .
docker run mlops-local-model

(jupyter_env) root@zackz:~# docker run mlops-local-model
2024/10/05 02:50:58 WARNING mlflow.utils.git_utils: Failed to import Git (the Git executable is probably not on your PATH), so Git SHA is not available. Error: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
 - be included in your $PATH
 - be set via $GIT_PYTHON_GIT_EXECUTABLE
 - explicitly set via git.refresh()

All git commands will error until this is rectified.

This initial message can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
 - quiet|q|silence|s|silent|none|n|0: for no message or exception
 - warn|w|warning|log|l|1: for a warning message (logging level CRITICAL, displayed by default)
 - error|e|exception|raise|r|2: for a raised exception

Example:
 export GIT_PYTHON_REFRESH=quiet

2024/10/05 02:51:00 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
Model accuracy: 1.0
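
The long Git warning appears because the slim Python image has no `git` executable, so MLflow cannot record a commit SHA for the run. Training still completes. As the message itself says, the warning can be silenced with the `GIT_PYTHON_REFRESH` environment variable. A minimal sketch, assuming you prefer to set it from Python rather than in the Dockerfile; placing it above `import mlflow` in `src/train.py` is the simplest way to make sure it is set before GitPython is first imported:

```python
import os

# Set the variable before MLflow gets a chance to import GitPython.
# setdefault keeps any value that was already set outside the script.
os.environ.setdefault("GIT_PYTHON_REFRESH", "quiet")
```

If you want the commit SHA in MLflow instead, installing Git in the image is the other option.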

Step 4: Automate with Apache Airflow

Install Apache Airflow:

pip install apache-airflow

Create an Airflow DAG (airflow_dags/ml_pipeline.py)

vim airflow_dags/ml_pipeline.py

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from datetime import datetime
import os

# Define the DAG
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
}

dag = DAG(
    'mlops_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,  # do not backfill a run for every day since start_date
)

# Define the task to retrain the model
def retrain_model():
    # The relative path is resolved against the Airflow worker's working directory
    os.system('python src/train.py')

retrain_task = PythonOperator(
    task_id='retrain_model',
    python_callable=retrain_model,
    dag=dag,
)
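
One caveat with `os.system` in the task above: it returns the exit status instead of raising, so a crashing training script would still leave the Airflow task marked as successful. A possible alternative, sketched with only the standard library, is `subprocess.run` with `check=True`, which raises `CalledProcessError` on a non-zero exit code. The helper name `run_script` and its parameters are illustrative.

```python
import subprocess
import sys


def run_script(args, cwd=None):
    """Run the current Python interpreter with args; raise if it exits non-zero."""
    # sys.executable avoids depending on which `python` is first on PATH
    result = subprocess.run(
        [sys.executable, *args],
        cwd=cwd,              # working directory for relative paths in the script
        check=True,           # non-zero exit code raises CalledProcessError
        capture_output=True,  # collect stdout/stderr so they can be logged
        text=True,
    )
    return result.stdout
```

Inside the DAG, `retrain_model` could call `run_script(["src/train.py"], cwd=...)` with the project directory. The raised exception then lets Airflow apply the configured retry and mark the task as failed if it keeps failing.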

Run Airflow (the webserver and the scheduler each run in the foreground, so start them in separate terminals):

airflow db init
airflow webserver --port 8080
airflow scheduler

Create Airflow web UI Admin user

airflow users create \
 --username admin \
 --firstname Admin \
 --lastname User \
 --role Admin \
 --email admin@example.com \
 --password the_password

Navigate to http://127.0.0.1:8080 to view the Airflow UI

Step 5: CI/CD with Jenkins

Create a Jenkins pipeline for continuous model training with the following stages:

Conclusion

By integrating DVC, MLflow, Docker, Airflow, and CI/CD into a cohesive ML project environment, we can achieve enhanced efficiency, greater automation, and improved collaboration. This synergy not only streamlines the development process but also ensures that machine learning models are robust, reproducible, and ready for production deployment.

In summary, a production-level ML workflow integrates new data, automates model training and deployment, and continuously monitors model performance. By utilizing CI/CD pipelines, Docker for containerization, and tools for versioning and tracking, we can create a robust and efficient machine learning system that can adapt to changing data and business requirements.

In the next post, I will replace the local tools with AWS ML services to move the ML pipeline and deployment to the cloud.
