Actionable Alerting
Are you tired of getting paged in the middle of the night for noisy alerts or flapping systems, only to find no action can be taken?
Learn how to build actionable alerts from your SLOs and SLIs as a Site Reliability Engineer. Alerting on low-level metrics such as CPU usage or disk space doesn’t show whether our users are experiencing issues with our product or service. Instead, we should build our alerts on our SLOs. By tracking how quickly our remaining error budget is consumed over time, we can see how outages or partial outages will affect our SLO.
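To make the error-budget idea concrete, here is a minimal sketch of a burn-rate check. It assumes an availability-style SLI counted as bad events over total events. The function names, the two-window check, and the 14.4 threshold (roughly 2% of a 30-day budget spent in one hour) are illustrative assumptions, not something this talk prescribes.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the budget is being spent in a window.

    1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out sooner.
    """
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    return error_rate / error_budget(slo_target)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast (threshold is an assumed value).

    Requiring the short window as well keeps a flapping system that has
    already recovered from paging anyone.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


# Hypothetical numbers: 20 failed requests out of 1000 against a 99.9% SLO.
rate = burn_rate(bad_events=20, total_events=1000, slo_target=0.999)
print(f"burn rate: {rate:.1f}")
print("page:", should_page(long_window_rate=rate, short_window_rate=rate))
```

Unlike a CPU or disk threshold, this alert fires only when users are seeing enough failures to threaten the SLO, so every page has a clear reason behind it.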
Actionable alerts tie closely into the DevOps principles of expecting failure and creating a blameless culture.