MLOps - Get Started with KubeRay
Ray is an open-source framework designed for scalable and distributed ML workloads, including training, tuning, and inference. It provides a simple API for scaling Python applications across multiple nodes.
- Distributed Training: Easily scale PyTorch, TensorFlow, and other ML jobs across multiple GPUs/instances.
- Hyperparameter Tuning: Integrates with Optuna and Ray Tune for distributed hyperparameter optimization (see the short sketch below).
- Parallel Inference: Supports inference pipelines that scale out dynamically based on demand.
- Fault Tolerance: If a node fails, Ray can reschedule tasks on other available nodes.
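To make the hyperparameter-tuning point concrete, here is a minimal Ray Tune sketch. It is not taken from this walkthrough; the objective function is made up purely for illustration.

from ray import tune

def objective(config):
    # Hypothetical "training": score how far the learning rate is from an arbitrary optimum.
    return {"score": (config["lr"] - 0.01) ** 2}

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8, metric="score", mode="min"),
)
results = tuner.fit()
print(results.get_best_result().config)  # best lr found across the 8 trials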
Together with EKS and Karpenter, Ray lets a modern ML team achieve dynamic auto-scaling and GPU resource allocation, from local development all the way to running distributed ML jobs in a cloud production environment:
- Dynamic Auto-Scaling: Karpenter scales worker nodes based on Ray's demand (CPU, GPU, memory), while the Ray autoscaler scales Ray worker pods dynamically. There is no need to pre-provision expensive GPU nodes.
- Multi-Tenant Resource Sharing and Seamless Transition from Local to Cloud: ML teams can submit workloads without managing Kubernetes pods directly; Ray manages job execution and ensures efficient resource utilization. ML engineers can run jobs locally with Ray (ray.init()) and later scale seamlessly to AWS EKS by switching to ray.init(address="ray://...") (see the sketch after this list).
- Cost Optimization with Spot & On-Demand Nodes: Karpenter provisions spot instances for non-critical ML training, while on-demand nodes handle critical, low-latency inference.
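The local-to-cloud switch from the second bullet can be sketched in a few lines. This is only an illustration, not code from this walkthrough; the remote address is an assumption and would be the head service's Ray client port (10001) on EKS.

import ray

# Local development: starts a throwaway Ray instance on the workstation.
ray.init()

# Later, the same code targets a remote cluster by switching only the address.
# Hypothetical address: the head service's Ray client port (10001) on EKS.
# ray.init(address="ray://my-ray-cluster-kuberay-head-svc:10001")

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]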
Deploying a Ray cluster on Minikube with NVIDIA GPU support
Here I deploy Ray on a local Minikube Kubernetes cluster to get started with KubeRay.
# start minikube with GPU support
root@zackz:~# minikube start --driver docker --container-runtime docker --gpus all --force --cpus=12 --memory=36g
root@zackz:~# minikube addons enable nvidia-gpu-device-plugin

# install ray and ray cluster helm chart on minikube
root@zackz:~# helm repo add kuberay https://ray-project.github.io/kuberay-helm/
"kuberay" has been added to your repositories
root@zackz:~# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "aws-ebs-csi-driver" chart repository
...Successfully got an update from the "kuberay" chart repository
...Successfully got an update from the "karpenter" chart repository
...Successfully got an update from the "eks-charts" chart repository
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "external-secrets" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈

root@zackz:~# helm install kuberay-operator kuberay/kuberay-operator
NAME: kuberay-operator
LAST DEPLOYED: Fri Feb 14 10:26:51 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

root@zackz:~# helm install my-ray-cluster kuberay/ray-cluster
NAME: my-ray-cluster
LAST DEPLOYED: Fri Feb 14 10:27:25 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

# check ray pods
root@zackz:~# kubectl get pods
NAMESPACE   NAME                                              READY   STATUS    RESTARTS   AGE
default     kuberay-operator-975995b7d-xzjd4                  1/1     Running   0          2m19s
default     my-ray-cluster-kuberay-head-q6gcz                 1/1     Running   0          105s
default     my-ray-cluster-kuberay-workergroup-worker-mf5pr   1/1     Running   0          105s

root@zackz:~# kubectl get svc
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
kuberay-operator                  ClusterIP   10.98.253.157   <none>        8080/TCP                                        2m39s
kubernetes                        ClusterIP   10.96.0.1       <none>        443/TCP                                         20d
my-ray-cluster-kuberay-head-svc   ClusterIP   None            <none>        10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP   2m5s

# forward ray dashboard port
root@zackz:~# kubectl port-forward svc/my-ray-cluster-kuberay-head-svc 8265:8265
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Handling connection for 8265
Access the Ray dashboard via http://localhost:8265/ to verify that the Ray cluster head and worker nodes are running properly.
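The forwarded dashboard port also serves the Ray Jobs API, which gives a quick way to smoke-test the cluster from the host. A minimal sketch, assuming a local Ray installation and the port-forward above:

from ray.job_submission import JobSubmissionClient

# The forwarded dashboard port (8265) also exposes the Ray Jobs API.
client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    # A trivial entrypoint that prints what the cluster can see.
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
)
print(job_id, client.get_job_status(job_id))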
Resource Allocation and GPU training test
Here I exec into the Ray head pod to verify the resources and GPU support, and I find that the default Helm chart does not configure the Ray cluster to use the GPU.
root@zackz:~# kubectl exec -it my-ray-cluster-kuberay-head-q6gcz -- bash
(base) ray@my-ray-cluster-kuberay-head-q6gcz:~$ python -c "import ray; ray.init(); print(ray.cluster_resources())"
2025-02-13 15:31:21,699 INFO worker.py:1405 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-02-13 15:31:21,699 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.244.0.16:6379...
2025-02-13 15:31:21,706 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.244.0.16:8265
{'node:__internal_head__': 1.0, 'node:10.244.0.16': 1.0, 'CPU': 2.0, 'memory': 3000000000.0, 'object_store_memory': 540331621.0, 'node:10.244.0.17': 1.0}
(base) ray@my-ray-cluster-kuberay-head-q6gcz:~$ nvidia-smi
bash: nvidia-smi: command not found
Hence I need to create a custom Helm values file that adds a GPU resource request to the Ray worker group, and then upgrade the Ray cluster.
root@zackz:~# helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kuberay-operator default 1 2025-02-14 10:26:51.403488615 +1100 AEDT deployed kuberay-operator-1.2.2
my-ray-cluster default 1 2025-02-14 10:27:25.186940262 +1100 AEDT deployed ray-cluster-1.2.2
root@zackz:~# helm get values my-ray-cluster
USER-SUPPLIED VALUES:
null
vim ray-values.yaml
# Ray version
rayVersion: '2.9.0'

# Image configuration
image:
  repository: rayproject/ray
  tag: "2.9.0"
  pullPolicy: IfNotPresent

head:
  rayStartParams:
    dashboard-host: "0.0.0.0"
    num-cpus: "2"
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "1"
      memory: "2Gi"
  volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory

worker:
  replicas: 1
  rayStartParams:
    num-cpus: "2"
  resources:
    limits:
      cpu: "4"
      memory: "16Gi"
      nvidia.com/gpu: 1
    requests:
      cpu: "2"
      memory: "8Gi"
  volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
root@zackz:/mnt/f/ml-local/local-minikube/ray# helm upgrade my-ray-cluster kuberay/ray-cluster -f ray-values.yaml
Release "my-ray-cluster" has been upgraded. Happy Helming!
NAME: my-ray-cluster
LAST DEPLOYED: Fri Feb 14 11:40:24 2025
NAMESPACE: default
STATUS: deployed
REVISION: 3
TEST SUITE: None
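With the GPU limit added to the worker group, the cluster should now advertise a GPU resource. One way to confirm this from the host, rather than exec-ing into a pod again, is the Ray client over the head service's client port. This is a sketch under two assumptions: a local Ray install matching the cluster image version, and a port-forward of port 10001.

import ray

# Assumes: kubectl port-forward svc/my-ray-cluster-kuberay-head-svc 10001:10001
# and a local Ray install matching the cluster image version (2.9.0).
ray.init(address="ray://127.0.0.1:10001")

# After the upgrade, the resource dict should include a 'GPU' entry for the worker.
print(ray.cluster_resources())

ray.shutdown()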
Now I was able to run a Ray task-based workload with memory and custom resource constraints, adjusting Ray's memory thresholds to avoid OOM kills. First, I exec into the worker pod to confirm the GPU is now visible:
root@zackz:/mnt/f/ml-local/local-minikube/ray# kubectl exec -it my-ray-cluster-kuberay-workergroup-worker-6lcx5 -- bash
(base) ray@my-ray-cluster-kuberay-workergroup-worker-6lcx5:~$ nvidia-smi
Thu Feb 13 19:16:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.02 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:01:00.0 On | N/A |
| 47% 45C P8 19W / 232W | 2955MiB / 8192MiB | 28% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------|
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 27 G /Xwayland N/A |
| 0 N/A N/A 37 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
(base) ray@my-ray-cluster-kuberay-workergroup-worker-6lcx5:~$ exit
exit
root@zackz:/mnt/f/ml-local/local-minikube/ray# \
python -c "
import ray
from time import sleep

# Initialize with memory threshold adjustment
ray.init(runtime_env={
    'env_vars': {
        'RAY_memory_monitor_refresh_ms': '0',  # Disable OOM killing
        'RAY_memory_usage_threshold': '0.95'
    }
})

# Specify resource requirements for the task
@ray.remote(
    num_cpus=0.5,              # Use less CPU to allow multiple tasks
    memory=500 * 1024 * 1024,  # Request 500MB memory per task
    resources={'worker': 1}    # Ensure it runs on worker nodes (assumes a custom 'worker' resource is exposed via rayStartParams)
)
def train_model_simulation(model_id):
    sleep(2)
    return f'Model {model_id} trained'

# Run fewer parallel tasks initially
futures = [train_model_simulation.remote(i) for i in range(2)]
results = ray.get(futures)
print(results)
"
Check the job events and history in the Ray dashboard.
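Beyond nvidia-smi, a small GPU-requesting task confirms that Ray itself can schedule onto the GPU. This is a sketch I would run from inside the head pod (or via the Ray client), not part of the original run above:

import ray

ray.init()  # inside the head pod; use the Ray client address when running remotely

@ray.remote(num_gpus=1)
def gpu_check():
    # Ray assigns the GPU and sets CUDA_VISIBLE_DEVICES for this task.
    return ray.get_gpu_ids()

print(ray.get(gpu_check.remote()))  # expected: a single GPU id, e.g. [0]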
Conclusion
Here is what we achieved:
- Ray cluster setup on local Minikube with GPU support (KubeRay)
- Configuring Ray’s resource limits (memory, CPU allocation per task)
- Running remote tasks efficiently using Ray's distributed execution
- Observing Ray cluster resource usage in real-time
Next step: I will look at how to run a Ray cluster together with Karpenter on EKS once AWS approves my GPU EC2 instance quota request.