Not too long ago, I had a discussion at work about alerting and monitoring that I figured would make good blog post fodder. Before I get started, let me differentiate alerts from monitors. To me (and your definition may vary, but it’s my blog post, so we get to use my terminology), a monitor is a piece of information that you may observe, whereas an alert is something that tries to get your attention. In other words, monitors are a pull technology whereas alerts are a push technology.
Let’s Tell A Story
A couple of weeks ago, we had a production issue caused by a system error. Our alerts eventually let us know that there was something going on, but we didn’t trust them at first. The reason for this is simple: this alert (which is one of our more successful alerts) fired on 9 separate occasions during the past 30 days; 8 of those occasions were false positives. That means 89% of the times it fired were false positives…which is still better than most of our alerts. The problem here is that our alert is sending us information through a noisy channel. Without going into too much detail, we are tracking a downstream revenue value and sending an alert if that value goes below our minimum expected value. The problem with this is that there are a number of things which could cause us to go below our minimum expected value: planned downtime, a lack of data coming into our system, a partial failure that our automated systems will retry and fix without human intervention, or even just a down day for our customers. When you have a number of potential causes but only need to intervene some of the time, you have a noisy channel.
If only we could have trusted this alert, we would have been able to act on the problem sooner. So let’s talk about what we can do to make trustworthy alerts.
What Makes A Good Alert?
With the above in mind, I want to define what makes for a good alert. To me, I see five critical characteristics of a good alert:
- A Problem Scenario. A problem scenario could be, “if we receive zero messages over the past hour, we are in an error state.” This is based on the assumption that our system is busy enough that we should always have some data streaming in, no matter the hour of night. This is a good assumption for our work system, except when we have planned downtime in which the messaging system is turned off.
- An Implementation. Given the above scenario, my implementation could be something like “I will write an SQL query to determine if we have received zero messages over the past hour.” Our messages get stored in the database and we can query those messages later. Implementation could be as simple as this or could involve lots of working parts, but the important thing here is that the implementation must answer the problem scenario. A huge danger in writing alerts is that we test downstream consequences instead of the actual problem, in part because we lack metrics on the actual problem. If you cannot measure the actual problem, you are likely going to alert against a noisy channel.
- An Action. In this case, my action is, “When the SQL query finds zero messages received in the past hour, send out an e-mail to the product team.” The action needs to be commensurate to the problem: if a critical process is down and the company is losing money, that alert had better be paging people left and right; but if it’s a minor problem that can wait an hour or so, an e-mail is fine.
- A Human Response. Simply having an action is not enough; there needs to be a human response. When receiving the zero message e-mail, what do we do? In this case, “When I (a product team member) receive the e-mail, I will check to ensure that there is no planned downtime. If not, I know that there is a catastrophic error with regard to the messaging system. I will check the error logs for corroborating information. If the error logs give me error X, then I need to restart service A. If the error logs give me error Y, then I need to check the database logs. If the database logs give me error Z, then I know that the problem is with procedure B; otherwise, I need to expand my investigation to other error logs and troubleshoot this problem.” I like this kind of flow because it shows what a person should do, and it also shows that this problem cannot be solved simply through automation. Certain parts could be automated, such as error X –> restart service A, but the product team might make a conscious decision to require a human to do this. What is important, however, is that there is a valid human response. If the appropriate response is simply to wait for an automated process to do its thing, this should not be an alert!
- A Low Failure Rate. This means relatively few false positives and relatively few false negatives. In the case of alerting, a false positive is a scenario in which the alert fires but there is no reasonable human response. This can be because the alert itself is wrong, because the reason the alert fired is outside of our hands (as in the scenario in which our customers simply had a bad day), or because some automated process will fix this without our intervention. A false negative is a scenario in which human intervention is necessary but the relevant alert did not fire.
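To make the first three characteristics concrete, here is a minimal sketch of the zero-message check in Python. Everything named here is an assumption for illustration, not our actual system: a SQLite table called `messages` with a `received_at` column holding epoch seconds, and a `notify` callable standing in for whatever sends the e-mail.

```python
import sqlite3
import time

WINDOW_SECONDS = 3600  # "the past hour" from the problem scenario


def messages_in_window(conn, now, window=WINDOW_SECONDS):
    """Count rows in the hypothetical `messages` table received in (now - window, now]."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE received_at > ? AND received_at <= ?",
        (now - window, now),
    ).fetchone()
    return row[0]


def check_zero_messages(conn, notify, now=None):
    """Implementation plus action: call `notify` when no messages arrived in the window.

    `notify` is a placeholder for the real action (an e-mail to the product team).
    Returns True when the alert fired.
    """
    now = time.time() if now is None else now
    if messages_in_window(conn, now) == 0:
        notify("No messages received in the past hour")
        return True
    return False
```

Note that the query measures the problem scenario directly (message count) rather than a downstream consequence such as revenue. The human response stays outside the code on purpose: the sketch only decides when to interrupt someone.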
In the case of an alert, we can get metrics fairly easily on true positives and false positives, but the true and false negatives are a lot harder. In our above example, we can go back in time and break down one-hour periods as far back as we have data, searching for our cases. If the alert check runs once an hour, each one-hour period is one event and we can fill in our 2×2 grid. Here is a sample:
There were 720 distinct events over the past 30 days. During this time, we had 1 error that was alerted, 3 errors with no alert, 8 alerts with no error, and 708 cases in which there was no error and no alert. In other words, we have 1 true positive, 8 false positives, 3 false negatives, and 708 true negatives.
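The backfill described above can be sketched in a few lines of Python. This assumes you can label each one-hour period with two booleans, whether the alert fired and whether a human actually needed to step in; the function and variable names are made up for this example, and the data is just the sample counts from this post rather than real records.

```python
from collections import Counter


def confusion_grid(periods):
    """Tally the 2x2 grid from (alert_fired, needed_human) pairs, one pair per hour."""
    counts = Counter()
    for alert_fired, needed_human in periods:
        if alert_fired and needed_human:
            counts["true_positive"] += 1
        elif alert_fired:
            counts["false_positive"] += 1
        elif needed_human:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Rebuild the sample: 720 hourly periods over 30 days.
sample = (
    [(True, True)] * 1        # error that was alerted
    + [(False, True)] * 3     # errors with no alert
    + [(True, False)] * 8     # alerts with no error
    + [(False, False)] * 708  # no error, no alert
)
grid = confusion_grid(sample)
```

The hard part in practice is the `needed_human` label, since finding false negatives means someone has to go back and identify problems that the alert missed.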
A naive measure would look at accuracy, which is defined as (True Positive + True Negative) / (Total Population), and say hey, we’re right 98.5% of the time, so everything’s good! This is…not such a great thing. Similar to medical tests, we expect True Negatives to overwhelm everything else, and so we need to get a little more sophisticated in our metrics.
My favorite measure is the positive predictive value, which is defined as (True Positive) / (Test Positive), where Test Positive is every time the alert fired (True Positive + False Positive). For alerting, that basically tells us the likelihood that an e-mail alert is valid. In our case, PPV = 1 / 9, or 11%. In other words, 8 out of our 9 alerts were false positives. When PPV is low, people tend to ignore alerts.
For alerting, I would also want to look at (True Positive) / (Test Positive + False Negative), which in this case would be 1 / (1 + 8 + 3) or 1 / 12, about 8%. The reason I want to ignore true negatives is that we expect true negatives to be an overwhelming percentage of our results. This metric tells us how well we can trust the alert, with a number closer to 1 indicating a more trustworthy alert.
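As a quick check on the arithmetic, here is a small Python sketch that computes the three measures discussed above from the sample counts. The function name and dictionary keys are just labels chosen for this example.

```python
def alert_metrics(tp, fp, fn, tn):
    """Compute accuracy, PPV, and TP / (alerts + misses) from the 2x2 grid.

    Assumes the alert fired at least once, so the denominators are non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),
        "tp_over_alerts_and_misses": tp / (tp + fp + fn),
    }


# Sample from this post: 1 true positive, 8 false positives,
# 3 false negatives, 708 true negatives.
metrics = alert_metrics(tp=1, fp=8, fn=3, tn=708)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With the sample counts this gives an accuracy of 98.5%, a PPV of 11.1%, and 8.3% for the last measure. The gap between the first number and the other two shows how much the true negatives flatter the alert.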
To improve your existing alerts and make them more trustworthy, go back to the five characteristics of a good alert: a written-down problem scenario, an implementation based on the problem scenario, an action, a human response, and a low failure rate. The best way to lower the failure rate is to build implementations on sound problem scenarios. If you alert based on a noisy channel, you complicate the human response aspect and get a higher failure rate.
Improving Alerts Through Monitors
But let’s suppose there are scenarios in which we need to use the noisy channel. In this case, it might be a good idea to build a monitor. As mentioned above, a monitor is a “pull” technology, in that a human needs to look at the monitor to gain information, as opposed to an alert which tries to interrupt a human and provide information. Good monitors generally tend to let you see data over time as a trend. People are pretty good at spotting trend differences over time, so making that data available can let people use their best judgement as to whether a problem is likely transient versus something which requires human intervention. As long as those monitors are in a common area that the product team can easily see, people can solve problems without filtering out potentially important alerts.
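For a monitor, the raw material is usually a time series rather than a yes/no check. As a rough sketch, assuming event timestamps are available as epoch seconds (the function name and inputs are hypothetical), this Python snippet buckets events into hourly counts that a dashboard could plot as a trend line:

```python
from collections import Counter


def hourly_counts(timestamps, start, end):
    """Count events per one-hour bucket in [start, end), keeping empty hours as 0."""
    counts = Counter(
        int((ts - start) // 3600) for ts in timestamps if start <= ts < end
    )
    n_buckets = int((end - start + 3599) // 3600)  # round up to whole hours
    return [counts.get(i, 0) for i in range(n_buckets)]
```

Keeping the empty hours in the series matters: a flat line at zero during planned downtime looks very different from a sudden drop in the middle of a busy day, and that difference is exactly what a person glancing at the monitor can judge.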