# 30: Prometheus & Alerts
## Objective
Understand the Azure Monitor managed Prometheus and Grafana monitoring stack for AKS. Learn how Prometheus metrics complement Container Insights logs, explore Grafana dashboards, and understand how to create alert rules for Kubernetes workloads.
## Theory
### Azure Monitor Managed Prometheus
Azure Monitor managed service for Prometheus is a fully managed, scalable Prometheus-compatible monitoring solution for AKS:
| Feature | Description |
|---|---|
| Metrics collection | Scrapes Prometheus metrics from AKS nodes, Pods, and workloads |
| Azure Monitor workspace | Stores Prometheus metrics (separate from Log Analytics) |
| PromQL | Query language for Prometheus metrics |
| Managed infrastructure | No need to deploy or manage your own Prometheus server |
| Retention | 18 months of metric data by default |
Managed Prometheus is enabled by the cluster administrator when creating or updating an AKS cluster:

```bash
# Instructor/admin commands — do not run
az aks create \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics

# Or on an existing cluster:
az aks update \
  --resource-group <RG> \
  --name <CLUSTER> \
  --enable-azure-monitor-metrics
```
### Azure Managed Grafana
Azure Managed Grafana provides pre-built dashboards for visualizing Kubernetes metrics:
- Fully managed Grafana instance — no setup or maintenance required
- Pre-configured data sources for Azure Monitor and Prometheus
- Built-in Kubernetes dashboards:
- Cluster overview (nodes, Pods, resource usage)
- Namespace workloads
- Pod-level metrics
- Kubelet and API server metrics
- Custom dashboards can be created for application-specific metrics
### Prometheus vs Container Insights
Both are part of the AKS monitoring stack, but they serve different purposes:
| Aspect | Container Insights | Prometheus |
|---|---|---|
| Primary focus | Logs and inventory | Metrics and time series |
| Query language | KQL (Kusto) | PromQL |
| Data store | Log Analytics workspace | Azure Monitor workspace |
| Best for | Troubleshooting, log analysis, auditing | Performance monitoring, dashboards, alerting |
| Visualization | Azure Portal Insights views | Grafana dashboards |
| Cost model | Per GB of logs ingested | Per metric sample ingested |
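To make the query-language difference concrete, here is the same question, per-Pod CPU usage over the last 5 minutes, expressed in both languages. This is a sketch: the PromQL metric name is the standard cAdvisor metric scraped by default, and the KQL side assumes the default Container Insights `Perf` schema; the `student-XX` namespace is a placeholder.

```promql
# PromQL: per-Pod CPU usage (cores), averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="student-XX"}[5m])) by (pod)
```

```kql
// KQL: per-container CPU usage from the Perf table written by Container Insights
Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores"
| summarize AvgCpuNanoCores = avg(CounterValue) by InstanceName
```

Note how PromQL operates on labeled time series directly, while KQL filters and aggregates rows in a table; this is why Prometheus suits dashboards and KQL suits log investigation.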
### Alert Rules
Alert rules define conditions that trigger notifications or automated actions:
| Component | Description |
|---|---|
| Condition | The metric or log query that defines when to alert (e.g., CPU > 90%) |
| Threshold | The value that triggers the alert |
| Evaluation period | How often the condition is checked (e.g., every 5 minutes) |
| Action group | What happens when the alert fires (email, SMS, webhook, Logic App) |
| Severity | Alert severity level (0-Critical to 4-Verbose) |
Common alert scenarios for AKS:
- Pod restart count exceeds threshold
- Node CPU or memory utilization is high
- Pods stuck in Pending state
- Container OOM kills detected
- Node NotReady status
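The first scenario above maps onto the alert-rule components like this. The sketch below uses standard Prometheus rule-file syntax for readability; with Azure managed Prometheus the same rule would be deployed as a Prometheus rule group resource rather than a file, and the `pod-health` group name is a hypothetical example.

```yaml
groups:
  - name: pod-health
    rules:
      - alert: HighPodRestartCount
        # Condition + threshold: any container restarting more than 3 times in 5 minutes
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        # Evaluation period: the condition must hold for 5 minutes before firing
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"
```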
### Mermaid Diagram: Production Monitoring Stack
```mermaid
flowchart TB
    subgraph AKS Cluster
        N1[Node 1]
        N2[Node 2]
        P1[App Pods]
        P2[System Pods]
        AMA[ama-metrics<br>DaemonSet]
        AML[ama-logs<br>DaemonSet]
    end
    subgraph Azure Monitor
        AMW[Azure Monitor<br>Workspace<br>Prometheus metrics]
        LAW[Log Analytics<br>Workspace<br>Container logs]
        AR[Alert Rules]
    end
    subgraph Visualization
        G[Azure Managed<br>Grafana<br>Dashboards]
        CI[Container Insights<br>Azure Portal]
    end
    subgraph Notifications
        AG[Action Group<br>Email / SMS /<br>Webhook]
    end
    N1 --> AMA
    N2 --> AMA
    N1 --> AML
    N2 --> AML
    AMA -->|Prometheus metrics| AMW
    AML -->|Logs + inventory| LAW
    AMW --> G
    AMW --> AR
    LAW --> CI
    LAW --> AR
    AR -->|Triggers| AG
```
## Practical Tasks
Note: This exercise is primarily instructor-led. Task 1 involves kubectl commands that participants can run. Tasks 2 and 3 are Azure Portal demos shown by the instructor. The theory and examples are included so participants understand the complete production monitoring stack.
### Task 1: Verify Prometheus Metrics Collection (Participant Task)
Check if the Prometheus metrics agent is running on the cluster:
```bash
kubectl get pods -n kube-system | grep prometheus
```
You may also see the Azure Monitor Agent collecting metrics:
```bash
kubectl get pods -n kube-system | grep ama-metrics
```
Check the DaemonSet status:
```bash
kubectl get daemonsets -n kube-system | grep -E "prometheus|ama-metrics"
```
Verify that metrics are being collected by checking node and pod metrics:
```bash
kubectl top nodes
kubectl top pods -n student-XX
```
### Task 2: Explore Grafana Dashboards (Instructor Demo)
Note: Azure Managed Grafana access is an instructor-led demo. Participants observe while the instructor navigates the dashboards.
This task is demonstrated by the instructor using the Azure Portal:
- Navigate to the Azure Managed Grafana instance in the Azure Portal
- Click Endpoint to open the Grafana UI
- Go to Dashboards in the left menu
- Explore the pre-built Kubernetes dashboards:
- Kubernetes / Compute Resources / Cluster — overall cluster resource usage
- Kubernetes / Compute Resources / Namespace (Workloads) — per-namespace resource usage
- Kubernetes / Compute Resources / Pod — individual Pod CPU and memory
- Kubernetes / Kubelet — kubelet performance metrics
- Kubernetes / API Server — API server request rates and latencies
Key metrics to observe:
- CPU and memory usage vs requests vs limits
- Pod count by namespace
- Network I/O per Pod
- API server request rate and error rate
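A few of these dashboard panels can be reproduced as ad-hoc PromQL queries in Grafana's Explore view. This is a sketch assuming the default kube-state-metrics and cAdvisor metric names that the built-in dashboards use; `student-XX` is a placeholder namespace.

```promql
# Memory working set vs. memory request, per Pod (a ratio above 1 means the Pod exceeds its request)
sum(container_memory_working_set_bytes{namespace="student-XX"}) by (pod)
/
sum(kube_pod_container_resource_requests{resource="memory", namespace="student-XX"}) by (pod)

# Pod count by namespace
count(kube_pod_info) by (namespace)
```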
### Task 3: Create a Basic Alert Rule (Instructor Demo)
Note: Creating alert rules requires Azure Portal access and is demonstrated by the instructor. Participants learn the concepts and observe the configuration process.
This task is demonstrated by the instructor in the Azure Portal.
Example: Alert when Pod restart count exceeds 3 in 5 minutes.
- Navigate to Azure Monitor > Alerts
- Click Create > Alert rule
- Select the AKS cluster as the target resource
- Configure the condition:
- Signal: Restarting container count (from Container Insights metrics)
- Threshold: Greater than 3
- Evaluation period: 5 minutes
- Frequency: Every 1 minute
- Configure the action group:
- Action type: Email/SMS/Push/Voice
- Enter notification recipients
- Set alert details:
  - Alert rule name: High Pod Restart Count
  - Severity: 2 - Warning
- Review and create
Example PromQL-based alert (if using Prometheus alert rules):

```promql
# Alert when any Pod has restarted more than 3 times in the last 5 minutes
increase(kube_pod_container_status_restarts_total[5m]) > 3
```
Example KQL-based alert (using Log Analytics; in KubePodInventory the Pod name is in the Name column):

```kql
KubePodInventory
| where TimeGenerated > ago(5m)
| summarize MaxRestarts = max(ContainerRestartCount) by Name, Namespace
| where MaxRestarts > 3
```
## Common Problems
| Problem | Cause | Solution |
|---|---|---|
| No Prometheus pods found | Managed Prometheus not enabled | Inform the instructor — enable with az aks update --enable-azure-monitor-metrics |
| Grafana shows no data | Data source not configured or metrics not collected yet | Check data source configuration in Grafana settings; wait 5-10 minutes |
| kubectl top returns error | Metrics Server or Prometheus agent not running | Verify agent DaemonSets are healthy in kube-system |
| Alert not firing | Condition not met or evaluation period too long | Review alert rule configuration; check that the metric is being collected |
| High monitoring costs | Too many metrics or custom metrics scraped | Configure metric filtering; limit custom metric scraping to essential endpoints |
## Best Practices
- Deploy the full monitoring stack — Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)
- Use Prometheus for metrics, Container Insights for logs — each tool is optimized for its purpose; using both provides complete observability
- Start with recommended alerts — Azure provides pre-configured alert rule templates for AKS; enable them as a baseline
- Create dashboards for your applications — extend the built-in Kubernetes dashboards with application-specific metrics
- Configure action groups wisely — route critical alerts to on-call teams (PagerDuty, Slack webhook), informational alerts to email
- Set appropriate thresholds — avoid alert fatigue by tuning thresholds based on baseline metrics; too many false alerts cause teams to ignore real issues
- Use metric filtering — only scrape the Prometheus metrics you need to control costs and reduce noise
- Monitor the monitors — set up alerts for monitoring agent health (ama-metrics, ama-logs DaemonSets) to detect monitoring gaps
## Summary
In this exercise you learned:
- Azure Monitor managed Prometheus scrapes and stores Prometheus metrics from AKS clusters
- Azure Managed Grafana provides pre-built dashboards for Kubernetes monitoring
- Prometheus (metrics) and Container Insights (logs) are complementary — use both for complete observability
- Alert rules define conditions (metric thresholds, log queries) that trigger notifications via action groups
- The production monitoring stack consists of: Container Insights (logs) + Prometheus (metrics) + Grafana (dashboards) + Alerts (notifications)