
MLOps - Deploy ML workload to K8S
A machine learning workload deployed on K8S with Minikube!
The local ML practice so far is just the beginning. In a real-world production environment, ML projects typically follow a structured lifecycle and are often deployed in scalable, cloud-based environments. Cloud providers like AWS, with managed Kubernetes services (EKS), provide the orchestration, scaling, and fault tolerance needed to handle ML workloads.
Path for deploying MLOps workloads in K8S
- Transition from local ML practice to a K8S-based deployment (This post)
- Start with MLOps tools like Kubeflow.
- Shift from local to cloud platforms (AWS) to deploy ML workloads on EKS.
- Practice deploying models using REST APIs (local) and API Gateway (AWS).
- Try local data engineering (ETL pipelines, data lakes, etc.), then move to AWS data services and solutions for ML workloads.
Setting up Minikube with GPU on WSL Ubuntu
First, let's create a Minikube cluster with GPU support on WSL Ubuntu. This will be the local K8S environment for testing and deploying ML workloads.
```bash
# Install Minikube
sudo apt-get update
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Start Minikube with the Docker driver and GPU support
minikube start --driver docker --container-runtime docker --gpus all --force --cpus=8 --memory=16g --addons=nvidia-gpu-device-plugin

# Verify Minikube addon with Nvidia GPU
root@zackz:/mnt/f/ml-local/local-minikube/complex# minikube addons list | grep NVIDIA
| nvidia-device-plugin     | minikube | enabled ✅ | 3rd party (NVIDIA) |
| nvidia-driver-installer  | minikube | disabled   | 3rd party (NVIDIA) |
| nvidia-gpu-device-plugin | minikube | disabled   | 3rd party (NVIDIA) |

# Verify Minikube node
root@zackz:/mnt/f/ml-local/local-minikube# kubectl get node
NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   34m   v1.32.0

# Verify Minikube node GPU capacity
root@zackz:/mnt/f/ml-local/local-minikube# kubectl describe node $(kubectl get nodes -o name | cut -d'/' -f2) | grep -A 10 "Capacity"
Capacity:
  cpu:                20
  ephemeral-storage:  1055762868Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49238360Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                20
  ephemeral-storage:  1055762868Ki
```
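If the node does not report any nvidia.com/gpu capacity, it is worth confirming the host side before digging into Minikube itself. Below is a quick sanity-check sketch, assuming the NVIDIA driver and the NVIDIA Container Toolkit are already installed on the WSL Ubuntu host:

```bash
# Confirm the NVIDIA driver is visible inside WSL
nvidia-smi

# Confirm Docker can pass the GPU through to a container
# (same CUDA base image as the test pod in the next step)
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

# Confirm the device plugin pod is running inside Minikube
kubectl get pods -n kube-system | grep nvidia-device-plugin
```

If either of the first two commands fails, the problem is on the host (driver or container toolkit), not in the cluster.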
Next, once the cluster is ready, let's run a GPU pod to test whether a K8S pod can access the GPU.
```yaml
# vim gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.6.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1   # Request 1 GPU
      command: ["nvidia-smi"]
```

```bash
root@zackz:/mnt/f/ml-local/local-minikube# kubectl apply -f gpu-test.yaml
pod/gpu-pod created

root@zackz:/mnt/f/ml-local/local-minikube# kubectl get po -A
NAMESPACE     NAME                                   READY   STATUS      RESTARTS        AGE
default       gpu-pod                                0/1     Completed   1 (2s ago)      3s
kube-system   coredns-668d6bf9bc-2nwph               1/1     Running     0               3m1s
kube-system   etcd-minikube                          1/1     Running     0               3m7s
kube-system   kube-apiserver-minikube                1/1     Running     0               3m7s
kube-system   kube-controller-manager-minikube       1/1     Running     0               3m6s
kube-system   kube-proxy-vblkm                       1/1     Running     0               3m1s
kube-system   kube-scheduler-minikube                1/1     Running     0               3m6s
kube-system   nvidia-device-plugin-daemonset-72jwz   1/1     Running     0               3m1s
kube-system   storage-provisioner                    1/1     Running     1 (2m39s ago)   3m5s

root@zackz:/mnt/f/ml-local/local-minikube# kubectl logs gpu-pod
Fri Jan 24 22:55:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.02              Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   53C    P8             17W /  186W |    1736MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        27    G   /Xwayland                                   N/A        |
|    0   N/A  N/A        37    G   /Xwayland                                   N/A        |
+-----------------------------------------------------------------------------------------+
```
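The pod runs nvidia-smi once and exits, which is why its status shows Completed rather than Running. As a small follow-up, the test pod can be cleaned up and the node's allocatable GPU count re-checked; the jsonpath expression below is just one way to do it (the dot in the resource name has to be escaped):

```bash
# Remove the one-shot GPU test pod
kubectl delete pod gpu-pod

# Re-check the allocatable GPU count on the node
kubectl get node minikube -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```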
Now let's create a more complex GPU workload that trains a simple CNN on the MNIST dataset, with a ResourceQuota and LimitRange to govern GPU allocation in Minikube. We will use the tensorflow/tensorflow:2.14.0-gpu image, which better supports CUDA and the NVIDIA drivers, to run the training job.
```yaml
# MNIST-Training.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist-gpu-training
  namespace: gpu-workloads
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mnist-gpu-training
  template:
    metadata:
      labels:
        app: mnist-gpu-training
    spec:
      containers:
        - name: tensorflow
          image: tensorflow/tensorflow:2.14.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "8Gi"
            requests:
              memory: "4Gi"
          command: ["python3"]
          args:
            - "-c"
            - |
              import tensorflow as tf
              import time

              # Load and preprocess MNIST data
              (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
              x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
              x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0

              # Build CNN model
              model = tf.keras.Sequential([
                  tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
                  tf.keras.layers.MaxPooling2D(),
                  tf.keras.layers.Conv2D(64, 3, activation='relu'),
                  tf.keras.layers.MaxPooling2D(),
                  tf.keras.layers.Flatten(),
                  tf.keras.layers.Dense(128, activation='relu'),
                  tf.keras.layers.Dense(10, activation='softmax')
              ])

              # Compile model
              model.compile(
                  optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy']
              )

              print("Starting training...")
              start_time = time.time()

              # Train model
              history = model.fit(
                  x_train, y_train,
                  epochs=5,
                  validation_data=(x_test, y_test),
                  batch_size=128
              )

              end_time = time.time()
              print(f"\nTraining completed in {end_time - start_time:.2f} seconds")

              # Evaluate model
              test_loss, test_accuracy = model.evaluate(x_test, y_test)
              print(f"\nTest accuracy: {test_accuracy:.4f}")
```

```yaml
# gpu-quota.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-workloads
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: gpu-workloads
spec:
  hard:
    requests.nvidia.com/gpu: "1"
    limits.nvidia.com/gpu: "1"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: gpu-workloads
spec:
  limits:
    - type: Container
      defaultRequest:
        nvidia.com/gpu: "1"
      default:
        nvidia.com/gpu: "1"
      max:
        nvidia.com/gpu: "1"
```
Deploy the resource quota and the MNIST training workload into Minikube.
```bash
root@zackz:/mnt/f/ml-local/local-minikube/complex# kubectl apply -f gpu-quota.yaml
namespace/gpu-workloads created
resourcequota/gpu-quota created
limitrange/gpu-limits created

root@zackz:/mnt/f/ml-local/local-minikube/complex# kubectl apply -f MNIST-Training.yaml
deployment.apps/mnist-gpu-training created

root@zackz:/mnt/f/ml-local/local-minikube# kubectl get po -A -w
NAMESPACE     NAME                                   READY   STATUS              RESTARTS      AGE
default       tensorflow-gpu-test-d445455dc-slsg4    0/1     ContainerCreating   0             2m59s
kube-system   coredns-668d6bf9bc-2nwph               1/1     Running             0             25m
kube-system   etcd-minikube                          1/1     Running             0             25m
kube-system   kube-apiserver-minikube                1/1     Running             0             25m
kube-system   kube-controller-manager-minikube       1/1     Running             0             25m
kube-system   kube-proxy-vblkm                       1/1     Running             0             25m
kube-system   kube-scheduler-minikube                1/1     Running             0             25m
kube-system   nvidia-device-plugin-daemonset-72jwz   1/1     Running             0             25m
kube-system   storage-provisioner                    1/1     Running             1 (25m ago)   25m

root@zackz:/mnt/f/ml-local/local-minikube/complex# kubectl logs -n gpu-workloads mnist-gpu-training-778fb7bcf7-zsscr
2025-01-24 23:18:15.642781: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 2s 0us/step
2025-01-24 23:18:19.429788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5558 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3070 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6
Starting training...
Epoch 1/5
2025-01-24 23:18:20.175141: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
2025-01-24 23:18:20.399678: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f25deb91fb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2025-01-24 23:18:20.399719: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Ti, Compute Capability 8.6
2025-01-24 23:18:20.402577: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2025-01-24 23:18:20.461905: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
469/469 [==============================] - 3s 4ms/step - loss: 0.2068 - accuracy: 0.9409 - val_loss: 0.0664 - val_accuracy: 0.9778
Epoch 2/5
469/469 [==============================] - 2s 3ms/step - loss: 0.0555 - accuracy: 0.9828 - val_loss: 0.0476 - val_accuracy: 0.9849
Epoch 3/5
469/469 [==============================] - 2s 3ms/step - loss: 0.0407 - accuracy: 0.9875 - val_loss: 0.0329 - val_accuracy: 0.9900
Epoch 4/5
469/469 [==============================] - 2s 4ms/step - loss: 0.0304 - accuracy: 0.9905 - val_loss: 0.0385 - val_accuracy: 0.9883
Epoch 5/5
469/469 [==============================] - 2s 4ms/step - loss: 0.0229 - accuracy: 0.9929 - val_loss: 0.0296 - val_accuracy: 0.9899

Training completed in 9.94 seconds
313/313 [==============================] - 1s 2ms/step - loss: 0.0296 - accuracy: 0.9899

Test accuracy: 0.9899
```
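To double-check that the ResourceQuota and LimitRange are actually doing their job, here is a rough verification sketch. The scale test is optional and only demonstrates that a second GPU pod would be rejected by the quota; the nvidia-smi call assumes the runtime injects the binary into the container, as it did for the gpu-pod test earlier.

```bash
# See how much of the GPU quota the namespace is consuming
kubectl describe resourcequota gpu-quota -n gpu-workloads

# Watch GPU utilisation from inside the running training pod
kubectl exec -n gpu-workloads deploy/mnist-gpu-training -- nvidia-smi

# Optional: try to scale to 2 replicas; the second pod should be rejected
# by the quota (visible as a FailedCreate event on the ReplicaSet)
kubectl scale deployment mnist-gpu-training -n gpu-workloads --replicas=2
kubectl get events -n gpu-workloads --sort-by=.lastTimestamp | tail
kubectl scale deployment mnist-gpu-training -n gpu-workloads --replicas=1
```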
What we've achieved
Here we were able to set up Minikube with local GPU support, integrate an NVIDIA GPU (RTX 3070 Ti), validate CUDA and TensorFlow GPU integration, and deploy CNN (Convolutional Neural Network) training on MNIST with a final test accuracy of 98.99%.
In the next post I will explore how to use Kubeflow for more complex machine learning scenarios and enable Prometheus and Grafana for ML workload monitoring.