EKS - Debug Prometheus Metrics
AWS Managed Prometheus & Grafana is the "plug-and-play" choice for production workloads requiring minimal management, while on the other hand installing helm kube-prometheus-stack offers maximum control but requires more effort to maintain and scale effectively.
Hence for cost control and full customization, I decided to install kube-prometheus-stack on my local lab cluster.
Understand Prometheus Pull-based Monitoring Flow
- Expose Metrics: Applications expose metrics in Prometheus format (see the example after this list).
- Discover Targets:
- Kubernetes-native targets: Discovered via the Kubernetes API.
- Non-cloud-native targets: Defined statically or exposed through exporters.
- Scrape Metrics: Prometheus scrapes metrics periodically from /metrics endpoints.
- Store Metrics: Metrics are stored in Prometheus's time-series database.
- Visualize Metrics: Grafana (in kube-prometheus-stack) is often used to query and visualize metrics.
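As a quick illustration of the expose-and-scrape steps, every target serves plain-text samples in the Prometheus exposition format. A sketch against the node-exporter service that kube-prometheus-stack installs (the service name follows this lab's release name and 9100 is the node-exporter default port; adjust if yours differ):
kubectl port-forward -n monitoring svc/prometheus-stack-prometheus-node-exporter 9100:9100
# In a second terminal:
curl -s http://localhost:9100/metrics | head
# Typical exposition-format output:
#   # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
#   # TYPE node_cpu_seconds_total counter
#   node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78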
Prometheus metrics issue
Soon after installation, I found that Prometheus was unable to scrape metrics from several Kubernetes control-plane components (etcd, kube-controller-manager, kube-scheduler, and kube-proxy). These targets were marked as DOWN in the Prometheus UI with errors such as "connection refused".
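The target status can also be checked without exposing Prometheus externally by port-forwarding its service (a sketch; the service name below is what kube-prometheus-stack created in this lab, in the monitoring namespace):
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090
# Then open http://localhost:9090/targets in a browser, or query the targets API:
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c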
Troubleshooting Steps
Following the Prometheus monitoring flow above, let's start troubleshooting.
- Step 1: Check the Prometheus services, pods, and ServiceMonitors
root@asb:/home/ubuntu# kubectl get svc -n kube-system
NAME                                                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                        AGE
kube-dns                                             ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP         63d
metrics-server                                       ClusterIP   10.110.18.85   <none>        443/TCP                        61d
prometheus-stack-kube-prom-coredns                   ClusterIP   None           <none>        9153/TCP                       60d
prometheus-stack-kube-prom-kube-controller-manager   ClusterIP   None           <none>        10257/TCP                      60d
prometheus-stack-kube-prom-kube-etcd                 ClusterIP   None           <none>        2381/TCP                       60d
prometheus-stack-kube-prom-kube-proxy                ClusterIP   None           <none>        10249/TCP                      60d
prometheus-stack-kube-prom-kube-scheduler            ClusterIP   None           <none>        10259/TCP                      60d
prometheus-stack-kube-prom-kubelet                   ClusterIP   None           <none>        10250/TCP,10255/TCP,4194/TCP   60d

root@asb:/home/ubuntu# kubectl get po -n kube-system
NAME                                       READY   STATUS    RESTARTS        AGE
calico-kube-controllers-84b7b7fdbb-klzpf   1/1     Running   10 (42m ago)    39d
calico-node-bzf6d                          1/1     Running   20 (113m ago)   63d
calico-node-ggw7r                          1/1     Running   21 (112m ago)   63d
calico-node-t8jfn                          1/1     Running   20 (112m ago)   63d
coredns-5d5dd8cb46-pwsvt                   1/1     Running   1 (113m ago)    25h
coredns-5d5dd8cb46-vthsr                   1/1     Running   9 (112m ago)    39d
etcd-asb-mst                               1/1     Running   0               41m
kube-apiserver-asb-mst                     1/1     Running   19 (41m ago)    39d
kube-controller-manager-asb-mst            1/1     Running   0               41m
kube-proxy-956zq                           1/1     Running   0               28m
kube-proxy-9fcd4                           1/1     Running   0               28m
kube-proxy-tcnql                           1/1     Running   0               28m
kube-scheduler-asb-mst                     1/1     Running   0               41m
metrics-server-7766f59c77-xbxxr            1/1     Running   1 (112m ago)    25h

root@asb:/home/ubuntu# kubectl get servicemonitors.monitoring.coreos.com -n monitoring
NAME                                                  AGE
prometheus-stack-grafana                              60d
prometheus-stack-kube-prom-alertmanager               60d
prometheus-stack-kube-prom-apiserver                  60d
prometheus-stack-kube-prom-coredns                    60d
prometheus-stack-kube-prom-kube-controller-manager    60d
prometheus-stack-kube-prom-kube-etcd                  60d
prometheus-stack-kube-prom-kube-proxy                 60d
prometheus-stack-kube-prom-kube-scheduler             60d
prometheus-stack-kube-prom-kubelet                    60d
prometheus-stack-kube-prom-operator                   60d
prometheus-stack-kube-prom-prometheus                 60d
prometheus-stack-kube-state-metrics                   60d
prometheus-stack-prometheus-node-exporter             60d

- Step 2: Check the Prometheus ServiceMonitor configurations to ensure they match the service labels, ports, and namespaces, i.e. compare the ServiceMonitor selector vs the Service labels and selectors vs the Pod labels.
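For example, a sketch using the etcd objects from this lab (the component=etcd pod label is what kubeadm applies to its static pods):
kubectl get servicemonitor prometheus-stack-kube-prom-kube-etcd -n monitoring -o yaml | grep -A5 'selector:'
kubectl get svc prometheus-stack-kube-prom-kube-etcd -n kube-system --show-labels
kubectl get pods -n kube-system -l component=etcd --show-labels
# The ServiceMonitor's spec.selector must match the Service labels, and the Service's
# spec.selector must match the Pod labels, or the target never appears in Prometheus.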
- Step 3: Verify Metrics Exposure
kubectl port-forward -n kube-system svc/prometheus-stack-kube-prom-kube-etcd 2381:2381
curl http://localhost:2381/metrics

Through the port-forward, the /metrics endpoint was accessible and returned a full list of metrics, so next I created a debug pod to test the metrics endpoint from inside the cluster:

kubectl run -it --rm debug-pod --image=busybox --restart=Never -- sh
/ # wget http://11.0.1.231:2381/metrics
Connecting to 11.0.1.231:2381 (11.0.1.231:2381)
wget: server returned error: HTTP/1.1 503 Service Unavailable

The 503 Service Unavailable error indicates that the endpoint is reachable, but requests are not being served by the etcd pod as expected. This suggests an issue with the service configuration or with how the etcd pod exposes its metrics.
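Before digging into etcd itself, it is also worth confirming that the headless service has endpoints behind it (a sketch; the service name matches this lab's install):
kubectl get endpoints -n kube-system prometheus-stack-kube-prom-kube-etcd
# An empty ENDPOINTS column, or <node-ip>:2381 addresses that refuse connections,
# points at the etcd side rather than at Prometheus.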
- Step 4: Inspect ETCD configuration via /etc/kubernetes/manifests/etcd.yaml
The etcd pod must be configured to expose metrics on port 2381. Check the etcd deployment or static pod configuration file (often in /etc/kubernetes/manifests/ for static pods on control plane nodes).
The --listen-metrics-urls flag should include the :2381 endpoint. In this cluster it was bound to loopback only:
--listen-metrics-urls=http://127.0.0.1:2381
Bound to 127.0.0.1, the metrics endpoint is reachable only from the control-plane node itself, which explains why Prometheus, scraping the node IP on port 2381, got no valid response.
- Step 5: Implement Fixes
# vim /etc/kubernetes/manifests/etcd.yaml
--listen-metrics-urls=http://0.0.0.0:2381

# Update the kube-proxy ConfigMap and restart kube-proxy:
kubectl edit cm kube-proxy -n kube-system
kubectl rollout restart ds kube-proxy -n kube-system

After both changes, I confirmed that the affected Prometheus targets were UP.
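For reference, the kube-proxy side of the fix is the metricsBindAddress field in the kube-proxy ConfigMap (KubeProxyConfiguration); setting it to 0.0.0.0:10249, as done in this lab, makes the metrics port reachable by Prometheus. A quick way to confirm the change, assuming the same resource names:
kubectl get cm kube-proxy -n kube-system -o yaml | grep metricsBindAddress
# Expected after the edit (the default "" binds to 127.0.0.1:10249 only):
#   metricsBindAddress: 0.0.0.0:10249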
Conclusion
Common Prometheus Metrics debug steps:
- Check Prometheus pod, service, and servicemonitor status
- Check labels (servicemonitor label vs service label vs pod label)
- Check and test the metrics endpoint (e.g. /metrics), both via port-forward and from a pod inside the cluster
This troubleshooting experience highlights the importance of end-to-end configuration alignment in a Prometheus metrics and targets setup, from endpoint exposure to scrape configuration. The same approach can be used to debug similar metrics issues in Prometheus monitoring for both cloud-native and non-cloud-native applications.
Export Redis Metrics
As another exercise, I will install Redis together with a Redis exporter, have Prometheus scrape its metrics, and visualize them in Grafana.
# Redis and Redis exporter deployment
root@asb:~# cat k8s-redis-and-exporter-deployment.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: redis
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        securityContext:
          runAsUser: 59000
          runAsGroup: 59000
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9121
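Assuming the file name shown above, the manifest can be applied and checked with:
kubectl apply -f k8s-redis-and-exporter-deployment.yaml
kubectl get pods -n redis
# The redis pod should report 2/2 READY (the redis and redis-exporter containers)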
# redis service and servicemonitor
root@asb:~# cat k8s-redis-and-exporter-svc-svcmonitor.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: redis
  name: redis-metrics
  labels:
    app: redis
spec:
  selector:
    app: redis
  ports:
  - name: http-metrics
    port: 9121
    targetPort: 9121
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-monitor
  namespace: monitoring
  labels:
    app: redis
    release: prometheus-stack # important to match the kube-prometheus-stack release
spec:
  selector:
    matchLabels:
      app: redis
  namespaceSelector:
    matchNames:
    - redis
  endpoints:
  - port: http-metrics
    interval: 30s
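Before checking the Prometheus UI, the exporter can be verified directly through the new Service (a sketch; redis_up is a standard metric exposed by oliver006/redis_exporter):
kubectl apply -f k8s-redis-and-exporter-svc-svcmonitor.yaml
kubectl port-forward -n redis svc/redis-metrics 9121:9121
# In a second terminal:
curl -s http://localhost:9121/metrics | grep '^redis_up'
# redis_up 1 means the exporter can reach the Redis instance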
Head over to the Prometheus Targets page and we can see the Redis exporter target is UP and its metrics are being scraped.
Head over to Grafana, then import dashboard 763 to visualize the Redis metrics.
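If the dashboard panels stay empty, the underlying data can be checked directly against the Prometheus HTTP API (a sketch; the service name is what kube-prometheus-stack created in this lab, and redis_up / redis_memory_used_bytes are standard redis_exporter metrics):
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090
# In a second terminal:
curl -s 'http://localhost:9090/api/v1/query?query=redis_up'
curl -s 'http://localhost:9090/api/v1/query?query=redis_memory_used_bytes'
# A non-empty "result" array with value "1" for redis_up means Prometheus is storing exporter data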