30: Prometheus & Alerts

Objective

Understand the Azure Monitor managed Prometheus and Grafana monitoring stack for AKS. Learn how Prometheus metrics complement Container Insights logs, explore Grafana dashboards, and understand how to create alert rules for Kubernetes workloads.


Theory

Azure Monitor Managed Prometheus

Azure Monitor managed service for Prometheus is a fully managed, scalable Prometheus-compatible monitoring solution for AKS:

| Feature | Description |
| --- | --- |
| Metrics collection | Scrapes Prometheus metrics from AKS nodes, Pods, and workloads |
| Azure Monitor workspace | Stores Prometheus metrics (separate from Log Analytics) |
| PromQL | Query language for Prometheus metrics |
| Managed infrastructure | No need to deploy or manage your own Prometheus server |
| Retention | 18 months of metric data by default |

Managed Prometheus is enabled by the cluster administrator when creating or updating an AKS cluster:

# Instructor/admin commands — do not run
az aks create \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics

# Or on an existing cluster:
az aks update \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics
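Unlike the admin commands above, checking whether the addon is already enabled is a read-only operation. The `--query` path below assumes the current shape of the `az aks show` output (the `azureMonitorProfile` block); verify it in your environment:

```shell
# Read-only check: prints "true" when Managed Prometheus metrics collection
# is enabled on the cluster. <RG> and <CLUSTER> are placeholders.
az aks show \
  --resource-group <RG> \
  --name <CLUSTER> \
  --query "azureMonitorProfile.metrics.enabled" \
  --output tsv
```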

Azure Managed Grafana

Azure Managed Grafana provides pre-built dashboards for visualizing Kubernetes metrics:

  • Fully managed Grafana instance — no setup or maintenance required
  • Pre-configured data sources for Azure Monitor and Prometheus
  • Built-in Kubernetes dashboards:
    • Cluster overview (nodes, Pods, resource usage)
    • Namespace workloads
    • Pod-level metrics
    • Kubelet and API server metrics
  • Custom dashboards can be created for application-specific metrics

Prometheus vs Container Insights

Both are part of the AKS monitoring stack, but they serve different purposes:

| Aspect | Container Insights | Prometheus |
| --- | --- | --- |
| Primary focus | Logs and inventory | Metrics and time series |
| Query language | KQL (Kusto) | PromQL |
| Data store | Log Analytics workspace | Azure Monitor workspace |
| Best for | Troubleshooting, log analysis, auditing | Performance monitoring, dashboards, alerting |
| Visualization | Azure Portal Insights views | Grafana dashboards |
| Cost model | Per GB ingested | Per metric sample ingested |
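To make the split concrete, here is the same question (current memory working set per Pod) asked in each system. The PromQL uses the standard cAdvisor metric name; the KQL assumes the default Container Insights `Perf` table schema and its `memoryWorkingSetBytes` counter, so verify both against your workspace:

PromQL (Azure Monitor workspace, via Grafana):

```
sum(container_memory_working_set_bytes{container!=""}) by (pod)
```

KQL (Log Analytics workspace):

```
Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| summarize AvgWorkingSetBytes = avg(CounterValue) by InstanceName
```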

Alert Rules

Alert rules define conditions that trigger notifications or automated actions:

| Component | Description |
| --- | --- |
| Condition | The metric or log query that defines when to alert (e.g., CPU > 90%) |
| Threshold | The value that triggers the alert |
| Evaluation period | How often the condition is checked (e.g., every 5 minutes) |
| Action group | What happens when the alert fires (email, SMS, webhook, Logic App) |
| Severity | Alert severity level (0 = Critical to 4 = Verbose) |

Common alert scenarios for AKS:

  • Pod restart count exceeds threshold
  • Node CPU or memory utilization is high
  • Pods stuck in Pending state
  • Container OOM kills detected
  • Node NotReady status
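Assuming the default kube-state-metrics and node-exporter metric names that the managed addon scrapes, each scenario above maps to a short PromQL expression (sketches; tune thresholds and windows for your cluster):

```
# Pod restart count exceeds threshold (more than 3 restarts in 15 minutes)
increase(kube_pod_container_status_restarts_total[15m]) > 3

# Node CPU utilization above 90%
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9

# Pods stuck in Pending state
kube_pod_status_phase{phase="Pending"} == 1

# Container OOM kills (last termination reason)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Node NotReady
kube_node_status_condition{condition="Ready", status="true"} == 0
```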

Mermaid Diagram: Production Monitoring Stack

flowchart TB
    subgraph AKS Cluster
        N1[Node 1]
        N2[Node 2]
        P1[App Pods]
        P2[System Pods]
        AMA[ama-metrics<br>DaemonSet]
        AML[ama-logs<br>DaemonSet]
    end

    subgraph Azure Monitor
        AMW[Azure Monitor<br>Workspace<br>Prometheus metrics]
        LAW[Log Analytics<br>Workspace<br>Container logs]
        AR[Alert Rules]
    end

    subgraph Visualization
        G[Azure Managed<br>Grafana<br>Dashboards]
        CI[Container Insights<br>Azure Portal]
    end

    subgraph Notifications
        AG[Action Group<br>Email / SMS /<br>Webhook]
    end

    N1 --> AMA
    N2 --> AMA
    N1 --> AML
    N2 --> AML
    AMA -->|Prometheus metrics| AMW
    AML -->|Logs + inventory| LAW
    AMW --> G
    AMW --> AR
    LAW --> CI
    LAW --> AR
    AR -->|Triggers| AG

Practical Tasks

Note: This exercise is primarily instructor-led. Task 1 involves kubectl commands that participants can run. Tasks 2 and 3 are Azure Portal demos shown by the instructor. The theory and examples are included so participants understand the complete production monitoring stack.

Task 1: Verify Prometheus Metrics Collection (Participant Task)

Check if the Prometheus metrics agent is running on the cluster:

kubectl get pods -n kube-system | grep prometheus

With the managed addon, collection is handled by the Azure Monitor Agent (ama-metrics) Pods rather than a Pod named "prometheus":

kubectl get pods -n kube-system | grep ama-metrics

Check the DaemonSet status:

kubectl get daemonsets -n kube-system | grep -E "prometheus|ama-metrics"

Verify that metrics are being collected by checking node and pod metrics:

kubectl top nodes
kubectl top pods -n student-XX
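If the agent Pods exist but metrics look stale, the agent logs can confirm scraping activity. The DaemonSet name `ama-metrics-node` is the current default on AKS but may differ between addon versions, so confirm it against the DaemonSet listing above:

```shell
# Tail recent log lines from one Pod of the metrics DaemonSet
# (name assumed; adjust to match your cluster).
kubectl logs -n kube-system daemonset/ama-metrics-node --tail=20
```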

Task 2: Explore Grafana Dashboards (Instructor Demo)

Note: Azure Managed Grafana access is an instructor-led demo. Participants observe while the instructor navigates the dashboards.

This task is demonstrated by the instructor using the Azure Portal:

  1. Navigate to the Azure Managed Grafana instance in the Azure Portal
  2. Click Endpoint to open the Grafana UI
  3. Go to Dashboards in the left menu
  4. Explore the pre-built Kubernetes dashboards:
    • Kubernetes / Compute Resources / Cluster — overall cluster resource usage
    • Kubernetes / Compute Resources / Namespace (Workloads) — per-namespace resource usage
    • Kubernetes / Compute Resources / Pod — individual Pod CPU and memory
    • Kubernetes / Kubelet — kubelet performance metrics
    • Kubernetes / API Server — API server request rates and latencies

Key metrics to observe:

  • CPU and memory usage vs requests vs limits
  • Pod count by namespace
  • Network I/O per Pod
  • API server request rate and error rate

Task 3: Create a Basic Alert Rule (Instructor Demo)

Note: Creating alert rules requires Azure Portal access and is demonstrated by the instructor. Participants learn the concepts and observe the configuration process.

This task is demonstrated by the instructor in the Azure Portal.

Example: Alert when Pod restart count exceeds 3 in 5 minutes.

  1. Navigate to Azure Monitor > Alerts
  2. Click Create > Alert rule
  3. Select the AKS cluster as the target resource
  4. Configure the condition:
    • Signal: Restarting container count (from Container Insights metrics)
    • Threshold: Greater than 3
    • Evaluation period: 5 minutes
    • Frequency: Every 1 minute
  5. Configure the action group:
    • Action type: Email/SMS/Push/Voice
    • Enter notification recipients
  6. Set alert details:
    • Alert rule name: High Pod Restart Count
    • Severity: 2 - Warning
  7. Review and create
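For teams that script their alerting, roughly the same rule can be expressed with the Azure CLI. This is a sketch only: the metric name `restartingContainerCount` and the placeholder resource IDs are assumptions based on the Container Insights metric set, so verify them in your subscription before relying on this:

```shell
# Sketch: metric alert on restarting containers.
# <RG>, <AKS_CLUSTER_RESOURCE_ID>, and <ACTION_GROUP_ID> are placeholders;
# the metric name is an assumption to verify in your environment.
az monitor metrics alert create \
  --name "High Pod Restart Count" \
  --resource-group <RG> \
  --scopes <AKS_CLUSTER_RESOURCE_ID> \
  --condition "avg restartingContainerCount > 3" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action <ACTION_GROUP_ID>
```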

Example PromQL-based alert (if using Prometheus alert rules):

# Alert when any Pod has restarted more than 3 times in the last 5 minutes
increase(kube_pod_container_status_restarts_total[5m]) > 3

Example KQL-based alert (using Log Analytics):

KubePodInventory
| where TimeGenerated > ago(5m)
| summarize MaxRestarts = max(ContainerRestartCount) by Name, Namespace
| where MaxRestarts > 3

Common Problems

| Problem | Cause | Solution |
| --- | --- | --- |
| No Prometheus pods found | Managed Prometheus not enabled | Inform the instructor; enabled with az aks update --enable-azure-monitor-metrics |
| Grafana shows no data | Data source not configured or metrics not collected yet | Check data source configuration in Grafana settings; wait 5-10 minutes |
| kubectl top returns error | Metrics Server or Prometheus agent not running | Verify agent DaemonSets are healthy in kube-system |
| Alert not firing | Condition not met or evaluation period too long | Review alert rule configuration; check that the metric is being collected |
| High monitoring costs | Too many metrics or custom metrics scraped | Configure metric filtering; limit custom metric scraping to essential endpoints |

Best Practices

  • Deploy the full monitoring stack — Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)
  • Use Prometheus for metrics, Container Insights for logs — each tool is optimized for its purpose; using both provides complete observability
  • Start with recommended alerts — Azure provides pre-configured alert rule templates for AKS; enable them as a baseline
  • Create dashboards for your applications — extend the built-in Kubernetes dashboards with application-specific metrics
  • Configure action groups wisely — route critical alerts to on-call teams (PagerDuty, Slack webhook), informational alerts to email
  • Set appropriate thresholds — avoid alert fatigue by tuning thresholds based on baseline metrics; too many false alerts cause teams to ignore real issues
  • Use metric filtering — only scrape the Prometheus metrics you need to control costs and reduce noise
  • Monitor the monitors — set up alerts for monitoring agent health (ama-metrics, ama-logs DaemonSets) to detect monitoring gaps

Summary

In this exercise you learned:

  • Azure Monitor managed Prometheus scrapes and stores Prometheus metrics from AKS clusters
  • Azure Managed Grafana provides pre-built dashboards for Kubernetes monitoring
  • Prometheus (metrics) and Container Insights (logs) are complementary — use both for complete observability
  • Alert rules define conditions (metric thresholds, log queries) that trigger notifications via action groups
  • The production monitoring stack consists of: Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)
