Prometheus Alert Rules
Learn to create effective alert rules using PromQL. Configure proactive monitoring to catch issues before they impact your applications and infrastructure.
Common Alert Rules
High CPU Usage
Alert when CPU usage exceeds threshold
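Monitor CPU utilization across nodes to prevent performance degradation. The query uses rate() rather than irate() because rate() averages over the whole window, which makes the alert less sensitive to brief spikes.
PromQL Query:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80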
Memory Usage Alert
Alert when memory usage is critically high
Track memory consumption to prevent out-of-memory errors and system instability.
PromQL Query:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
Pod CrashLoopBackOff
Detect pods in crash loop state
Identify pods whose containers are repeatedly failing and restarting, which usually indicates an application issue. The query assumes kube-state-metrics is deployed and scraped.
PromQL Query:
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
Disk Space Warning
Monitor available disk space
Prevent disk space issues that could cause service failures or data loss.
PromQL Query:
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
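As a rough sketch, the disk query above could be split into a warning tier and a critical tier. The group name, alert names, the 95% threshold, the "for" durations, and the fstype filter are all illustrative assumptions, not part of the original rule; the block uses the plain Prometheus rules-file layout, which corresponds to spec.groups in a PrometheusRule.

```yaml
groups:
  - name: disk.rules  # hypothetical group name
    rules:
      - alert: DiskSpaceWarning
        # Assumption: skip pseudo-filesystems; adjust the fstype filter for your hosts.
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Disk usage above 85% on {{ $labels.instance }}'
          # $value is the left-hand side of the comparison, i.e. the usage percentage.
          description: 'Filesystem {{ $labels.mountpoint }} is {{ printf "%.1f" $value }}% full.'
      - alert: DiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Disk usage above 95% on {{ $labels.instance }}'
          description: 'Filesystem {{ $labels.mountpoint }} is {{ printf "%.1f" $value }}% full.'
```

Above 95% both alerts fire, so the warning is typically suppressed on the Alertmanager side with an inhibition rule.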
Alert Rule Structure
PrometheusRule Components
Understanding the structure of PrometheusRule resources for effective alert configuration.
Alert Name
Unique identifier for the alert
HighCPUUsage
PromQL Expression
Query that defines the alert condition
cpu_usage_percent > 80
For Duration
How long the condition must persist before firing
5m
Labels
Key-value pairs attached to the alert
severity: warning
Annotations
Additional metadata for notifications
description: "CPU usage is high"
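Put together, the five components above map onto a single rule entry roughly as follows. This is a minimal sketch: the group name is made up, and cpu_usage_percent is the placeholder metric from the component list, not a metric that standard exporters provide.

```yaml
groups:
  - name: example.rules                # hypothetical group name
    rules:
      - alert: HighCPUUsage            # Alert Name
        expr: cpu_usage_percent > 80   # PromQL Expression (placeholder metric)
        for: 5m                        # For Duration
        labels:                        # Labels
          severity: warning
        annotations:                   # Annotations
          description: "CPU usage is high"
```

In a PrometheusRule resource, the same groups list sits under spec.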
PromQL Examples
CPU Usage Percentage
Calculate CPU usage percentage for each instance
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory Usage Percentage
Calculate used memory percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Pod Restart Rate
Calculate the per-second container restart rate, averaged over 15 minutes
rate(kube_pod_container_status_restarts_total[15m])
HTTP Request Rate
Calculate the per-second HTTP request rate, averaged over 5 minutes
rate(http_requests_total[5m])
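If expressions like these are reused across several alerts or dashboards, one option is to precompute them with recording rules and alert on the recorded series. The sketch below assumes that approach; the group name and the recorded series names are hypothetical and only loosely follow the level:metric:operations naming convention.

```yaml
groups:
  - name: precomputed.rules  # hypothetical group name
    rules:
      # Recording rules store the result of an expression as a new time series.
      - record: instance:node_cpu_usage:percent  # hypothetical series name
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_usage:percent  # hypothetical series name
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      # An alert can then reference the recorded series instead of repeating the query.
      - alert: HighMemoryUsage
        expr: instance:node_memory_usage:percent > 90
        for: 2m
        labels:
          severity: critical
```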
Best Practices
Alert Rule Guidelines
Follow these best practices to create effective and maintainable alert rules.
- Use meaningful alert names that clearly describe the issue
- Set appropriate severity levels (info, warning, critical)
- Include helpful descriptions and runbook links in annotations
- Test alert rules in staging environments first
- Use proper "for" durations to avoid false positives
- Group related alerts using consistent label naming
- Regularly review and clean up unused alert rules
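A rule that follows several of these guidelines might look like the sketch below. It assumes kube-state-metrics is installed (for kube_pod_container_status_waiting_reason); the group name, the team label, the 10m duration, and the runbook URL are placeholders chosen for illustration.

```yaml
groups:
  - name: kubernetes-pods.rules  # hypothetical group name
    rules:
      - alert: PodCrashLooping   # the name describes the issue
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 10m                 # the condition must hold before the alert fires
        labels:
          severity: critical     # one of: info, warning, critical
          team: platform         # hypothetical label, kept consistent across related alerts
        annotations:
          summary: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
          description: 'Container {{ $labels.container }} has been in CrashLoopBackOff for more than 10 minutes.'
          runbook_url: https://runbooks.example.com/PodCrashLooping  # placeholder URL
```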
Sample PrometheusRule
Complete Example
A complete PrometheusRule resource with multiple alert definitions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
            service: cpu
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for more than 5 minutes"
        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
          for: 2m
          labels:
            severity: critical
            service: memory
          annotations:
            summary: "High memory usage detected"
            description: "Memory usage is above 90% for more than 2 minutes"
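Before deploying, the HighMemoryUsage rule can be checked offline with promtool's rule unit tests. The sketch below assumes the groups section of the resource above has been copied into a plain Prometheus rules file named alerts.yml (a hypothetical filename), since promtool reads rules files rather than PrometheusRule resources. The input values are synthetic and chosen so that memory usage is 95%.

```yaml
# alerts_test.yml (hypothetical filename)
rule_files:
  - alerts.yml  # assumption: contains the "groups:" list from the PrometheusRule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic data: 5 of 100 bytes available, so usage is 95%.
      - series: 'node_memory_MemAvailable_bytes{instance="node-1"}'
        values: '5+0x10'
      - series: 'node_memory_MemTotal_bytes{instance="node-1"}'
        values: '100+0x10'
    alert_rule_test:
      # The rule has "for: 2m", so it should be firing by the 5 minute mark.
      - eval_time: 5m
        alertname: HighMemoryUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              service: memory
              instance: node-1
            exp_annotations:
              summary: "High memory usage detected"
              description: "Memory usage is above 90% for more than 2 minutes"
```

The test runs with: promtool test rules alerts_test.yml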