# 30: Prometheus & Alerts
## Objective
Understand the Azure Monitor managed Prometheus and Grafana monitoring stack for AKS. Learn how Prometheus metrics complement Container Insights logs, explore Grafana dashboards, and understand how to create alert rules for Kubernetes workloads.
## Theory
### Azure Monitor Managed Prometheus
Azure Monitor managed service for Prometheus is a fully managed, scalable Prometheus-compatible monitoring solution for AKS:
| Feature | Description |
|---|---|
| Metrics collection | Scrapes Prometheus metrics from AKS nodes, Pods, and workloads |
| Azure Monitor workspace | Stores Prometheus metrics (separate from Log Analytics) |
| PromQL | Query language for Prometheus metrics |
| Managed infrastructure | No need to deploy or manage your own Prometheus server |
| Retention | 18 months of metric data by default |
Managed Prometheus is enabled by the cluster administrator when creating or updating an AKS cluster:

```bash
# Instructor/admin commands — do not run
az aks create \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics

# Or on an existing cluster:
az aks update \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics
```
### Azure Managed Grafana
Azure Managed Grafana provides pre-built dashboards for visualizing Kubernetes metrics:
- Fully managed Grafana instance — no setup or maintenance required
- Pre-configured data sources for Azure Monitor and Prometheus
- Built-in Kubernetes dashboards:
- Cluster overview (nodes, Pods, resource usage)
- Namespace workloads
- Pod-level metrics
- Kubelet and API server metrics
- Custom dashboards can be created for application-specific metrics
### Prometheus vs Container Insights
Both are part of the AKS monitoring stack, but they serve different purposes:
| Aspect | Container Insights | Prometheus |
|---|---|---|
| Primary focus | Logs and inventory | Metrics and time series |
| Query language | KQL (Kusto) | PromQL |
| Data store | Log Analytics workspace | Azure Monitor workspace |
| Best for | Troubleshooting, log analysis, auditing | Performance monitoring, dashboards, alerting |
| Visualization | Azure Portal Insights views | Grafana dashboards |
| Cost model | Per GB of logs ingested | Per metric sample ingested |
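To make the query-language difference concrete, here is the same question, per-Pod CPU usage over the last 5 minutes, expressed in both languages. This is a sketch: the PromQL metric name is the standard cAdvisor metric scraped by default, and the KQL side assumes the default Container Insights `Perf` schema; the `student-XX` namespace is a placeholder.

```promql
# PromQL: per-Pod CPU usage (cores), averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="student-XX"}[5m])) by (pod)
```

```kql
// KQL: per-container CPU usage from the Perf table written by Container Insights
Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores"
| summarize AvgCpuNanoCores = avg(CounterValue) by InstanceName
```

Note how PromQL operates on labeled time series directly, while KQL filters and aggregates rows in a table; this is why Prometheus suits dashboards and KQL suits log investigation.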
### Alert Rules
Alert rules define conditions that trigger notifications or automated actions:
| Component | Description |
|---|---|
| Condition | The metric or log query that defines when to alert (e.g., CPU > 90%) |
| Threshold | The value that triggers the alert |
| Evaluation period | How often the condition is checked (e.g., every 5 minutes) |
| Action group | What happens when the alert fires (email, SMS, webhook, Logic App) |
| Severity | Alert severity level (0-Critical to 4-Verbose) |
Common alert scenarios for AKS:
- Pod restart count exceeds threshold
- Node CPU or memory utilization is high
- Pods stuck in Pending state
- Container OOM kills detected
- Node NotReady status
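The first scenario above maps onto the alert-rule components like this. The sketch below uses standard Prometheus rule-file syntax for readability; with Azure managed Prometheus the same rule would be deployed as a Prometheus rule group resource rather than a file, and the `pod-health` group name is a hypothetical example.

```yaml
groups:
  - name: pod-health
    rules:
      - alert: HighPodRestartCount
        # Condition + threshold: any container restarting more than 3 times in 5 minutes
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        # Evaluation period: the condition must hold for 5 minutes before firing
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"
```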
### Mermaid Diagram: Production Monitoring Stack
```mermaid
flowchart TB
    subgraph AKS Cluster
        N1[Node 1]
        N2[Node 2]
        P1[App Pods]
        P2[System Pods]
        AMA[ama-metrics<br>DaemonSet]
        AML[ama-logs<br>DaemonSet]
    end
    subgraph Azure Monitor
        AMW[Azure Monitor<br>Workspace<br>Prometheus metrics]
        LAW[Log Analytics<br>Workspace<br>Container logs]
        AR[Alert Rules]
    end
    subgraph Visualization
        G[Azure Managed<br>Grafana<br>Dashboards]
        CI[Container Insights<br>Azure Portal]
    end
    subgraph Notifications
        AG[Action Group<br>Email / SMS /<br>Webhook]
    end
    N1 --> AMA
    N2 --> AMA
    N1 --> AML
    N2 --> AML
    AMA -->|Prometheus metrics| AMW
    AML -->|Logs + inventory| LAW
    AMW --> G
    AMW --> AR
    LAW --> CI
    LAW --> AR
    AR -->|Triggers| AG
```
## Practical Tasks
Note: This exercise is primarily instructor-led. Task 1 involves kubectl commands that participants can run. Tasks 2 and 3 are Azure Portal demos shown by the instructor. The theory and examples are included so participants understand the complete production monitoring stack.
### Task 1: Verify Prometheus Metrics Collection (Participant Task)
Check if the Prometheus metrics agent is running on the cluster:
```bash
kubectl get pods -n kube-system | grep prometheus
```
You may also see the Azure Monitor Agent collecting metrics:
```bash
kubectl get pods -n kube-system | grep ama-metrics
```
Check the DaemonSet status:
```bash
kubectl get daemonsets -n kube-system | grep -E "prometheus|ama-metrics"
```
Verify that metrics are being collected by checking node and pod metrics:
```bash
kubectl top nodes
kubectl top pods -n student-XX
```
### Task 2: Explore Grafana Dashboards (Instructor Demo)
Note: Azure Managed Grafana access is an instructor-led demo. Participants observe while the instructor navigates the dashboards.
This task is demonstrated by the instructor using the Azure Portal:
- Navigate to the Azure Managed Grafana instance in the Azure Portal
- Click Endpoint to open the Grafana UI
- Go to Dashboards in the left menu
- Explore the pre-built Kubernetes dashboards:
- Kubernetes / Compute Resources / Cluster — overall cluster resource usage
- Kubernetes / Compute Resources / Namespace (Workloads) — per-namespace resource usage
- Kubernetes / Compute Resources / Pod — individual Pod CPU and memory
- Kubernetes / Kubelet — kubelet performance metrics
- Kubernetes / API Server — API server request rates and latencies
Key metrics to observe:
- CPU and memory usage vs requests vs limits
- Pod count by namespace
- Network I/O per Pod
- API server request rate and error rate
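A few of these dashboard panels can be reproduced as ad-hoc PromQL queries in Grafana's Explore view. This is a sketch assuming the default kube-state-metrics and cAdvisor metric names that the built-in dashboards use; `student-XX` is a placeholder namespace.

```promql
# Memory working set vs. memory request, per Pod (a ratio above 1 means the Pod exceeds its request)
sum(container_memory_working_set_bytes{namespace="student-XX"}) by (pod)
/
sum(kube_pod_container_resource_requests{resource="memory", namespace="student-XX"}) by (pod)

# Pod count by namespace
count(kube_pod_info) by (namespace)
```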
### Task 3: Create a Basic Alert Rule (Instructor Demo)
Note: Creating alert rules requires Azure Portal access and is demonstrated by the instructor. Participants learn the concepts and observe the configuration process.
This task is demonstrated by the instructor in the Azure Portal.
Example: Alert when Pod restart count exceeds 3 in 5 minutes.
- Navigate to Azure Monitor > Alerts
- Click Create > Alert rule
- Select the AKS cluster as the target resource
- Configure the condition:
- Signal: Restarting container count (from Container Insights metrics)
- Threshold: Greater than 3
- Evaluation period: 5 minutes
- Frequency: Every 1 minute
- Configure the action group:
- Action type: Email/SMS/Push/Voice
- Enter notification recipients
- Set alert details:
  - Alert rule name: High Pod Restart Count
  - Severity: 2 - Warning
- Review and create
Example PromQL-based alert (if using Prometheus alert rules):

```promql
# Alert when any Pod has restarted more than 3 times in the last 5 minutes
increase(kube_pod_container_status_restarts_total[5m]) > 3
```
Example KQL-based alert (using Log Analytics; in KubePodInventory the Pod name is in the Name column):

```kql
KubePodInventory
| where TimeGenerated > ago(5m)
| summarize MaxRestarts = max(ContainerRestartCount) by Name, Namespace
| where MaxRestarts > 3
```
## Common Problems
| Problem | Cause | Solution |
|---|---|---|
| No Prometheus pods found | Managed Prometheus not enabled | Inform the instructor — enable with az aks update --enable-azure-monitor-metrics |
| Grafana shows no data | Data source not configured or metrics not collected yet | Check data source configuration in Grafana settings; wait 5-10 minutes |
| kubectl top returns error | Metrics Server or Prometheus agent not running | Verify agent DaemonSets are healthy in kube-system |
| Alert not firing | Condition not met or evaluation period too long | Review alert rule configuration; check that the metric is being collected |
| High monitoring costs | Too many metrics or custom metrics scraped | Configure metric filtering; limit custom metric scraping to essential endpoints |
## Best Practices
- Deploy the full monitoring stack — Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)
- Use Prometheus for metrics, Container Insights for logs — each tool is optimized for its purpose; using both provides complete observability
- Start with recommended alerts — Azure provides pre-configured alert rule templates for AKS; enable them as a baseline
- Create dashboards for your applications — extend the built-in Kubernetes dashboards with application-specific metrics
- Configure action groups wisely — route critical alerts to on-call teams (PagerDuty, Slack webhook), informational alerts to email
- Set appropriate thresholds — avoid alert fatigue by tuning thresholds based on baseline metrics; too many false alerts cause teams to ignore real issues
- Use metric filtering — only scrape the Prometheus metrics you need to control costs and reduce noise
- Monitor the monitors — set up alerts for monitoring agent health (ama-metrics, ama-logs DaemonSets) to detect monitoring gaps
## Summary
In this exercise you learned:
- Azure Monitor managed Prometheus scrapes and stores Prometheus metrics from AKS clusters
- Azure Managed Grafana provides pre-built dashboards for Kubernetes monitoring
- Prometheus (metrics) and Container Insights (logs) are complementary — use both for complete observability
- Alert rules define conditions (metric thresholds, log queries) that trigger notifications via action groups
- The production monitoring stack consists of: Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)