I’m giving a presentation on monitoring this Monday. As part of that, I want to firm up some thoughts on the differences between auditing, monitoring, and alerting. All three of these are vital for an organization, but they serve entirely different functions and have different requirements. I’ll hit a bunch of bullet points for each.
Auditing is all about understanding a process and what went on. Ideally, you would audit every business-relevant action, in order, and be able to “replay” that business action. Let’s say we have a process which grabs a flat file from an FTP server somewhere, dumps data into a staging table, and then performs ETL and puts rows into transactional tables. Our auditing process should be able to show what happened, when. We want to log each activity (grab flat file, insert rows into staging table, process ETL) down to its most granular level. If we make an external API call for each row as part of the ETL process, we should log the exact call. If we throw away a row, we should note that. If we modify attributes, we should note that.
Of course, this is a huge amount of data, and depending on processing requirements and available storage space, you will probably have to live with something much less thorough. So here are some thoughts:
- Keep as much information around errors as you can, including stack traces, full parameter listings, and calling processes.
- Build in (whenever possible) the full logging mentioned, but leave it as a debug/trace flag in your app. You could get creative and have custom tracing—maybe turn on debugging just for one customer. You might also think about automatically switching that debug mode back off after a certain amount of time.
- Add logical “process run” keys. If there are three or four systems which process the same data in a pipeline, it makes sense to track those chunks of data separately from the individual pipeline steps. In the extreme case, you might want to see how an individual row in a table somewhere got there, with a lineage ID that traces back to specific flat files or specific API calls or specific processes and tells you everything that happened to get to that point. Again, this is probably more of an ideal than a practical scenario, but dream big…
- Build an app to read your audit data. Reading text files is okay, but once you get processes interacting with one another, audit files can get really confusing.
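As a rough sketch of how a couple of these ideas might fit together, here is a small Python example: an audit trail keyed by a “process run” ID, plus a per-customer debug flag that switches itself back off after a set time. Everything in it (the `AuditLog` class, `new_run_id`, the step and customer names) is hypothetical and made up for illustration, and it keeps events in memory where a real system would write to durable storage.

```python
import time
import uuid


class AuditLog:
    """In-memory audit trail; a real system would write to durable storage."""

    def __init__(self):
        self.events = []
        # customer_id -> time (epoch seconds) at which detailed tracing lapses
        self._debug_until = {}

    def enable_debug(self, customer_id, minutes, now=None):
        """Turn on detailed tracing for one customer for a limited time."""
        now = time.time() if now is None else now
        self._debug_until[customer_id] = now + minutes * 60

    def debug_enabled(self, customer_id, now=None):
        """True while the customer's tracing window is still open."""
        now = time.time() if now is None else now
        return now < self._debug_until.get(customer_id, 0.0)

    def record(self, run_id, step, customer_id, level="info", detail=None, now=None):
        """Store an event; debug events are dropped unless tracing is on."""
        now = time.time() if now is None else now
        if level == "debug" and not self.debug_enabled(customer_id, now):
            return False
        self.events.append({
            "run_id": run_id,
            "step": step,
            "customer_id": customer_id,
            "level": level,
            "detail": detail or {},
            "at": now,
        })
        return True

    def lineage(self, run_id):
        """Every recorded step for one process run, in the order it was logged."""
        return [e["step"] for e in self.events if e["run_id"] == run_id]


def new_run_id():
    """A fresh key that ties together all steps of one logical process run."""
    return str(uuid.uuid4())
```

Each pipeline step would call `record` with the same run ID, so `lineage(run_id)` answers “what happened to this chunk of data?” without digging through separate per-step logs. The `now` parameter is only there so the expiry behavior can be checked without waiting on a real clock.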
Monitoring is all about seeing what’s going on in your system “right now.” You want nice visualizations which give you relevant information about currently-running processes, and I put “right now” in quotation marks because you can be monitoring a process which only updates once every X minutes.
There are a couple of important things to consider with monitoring:
- Track what’s important. Don’t track everything “just in case,” but focus on metrics you know are important. As you investigate problems and find new metrics which can help, add them in, but don’t be afraid to start small.
- Monitoring should focus on aggregations, streams, and trends. It’s your 50,000-foot view of your world. Ideally, your monitoring system will let you drill down to more detail, but at the very least, it should let you see if there’s a problem.
- Monitors are not directly actionable. In other words, the purpose of a monitor is to display information so a human can observe, orient, decide, and act. If you have an automated solution to a problem, you don’t need a monitor; you need an automated process to fix the issue! You can monitor the automated solution to make sure it’s still running and track how frequently it’s fixing things, of course, but the end consumer of a monitor is a human.
- Ideally, a monitor will display enough information to weed out cyclical noise. If you have a process which runs every 60 minutes and which always slams your SAN for the first 5 minutes of each hour, maybe graph the last 2 or 3 hours so you can see the cycles. If you have enough data, you can also build baselines of “normal” behavior and plot those against current behavior to make it easier for people to see if there is a potential issue.
- Monitors are a “pull” technology. You, as a consumer, go to the monitor application and look at what’s going on. The monitor does not jump out and send you messages and popups and try to get your attention.
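To make the baseline idea a little more concrete, here is a minimal Python sketch, assuming the metric cycles hourly so that “normal” can be keyed by minute of the hour. The function names and the sample numbers are invented for illustration; a real monitor would likely use a longer history and something sturdier than a plain mean.

```python
from collections import defaultdict
from statistics import mean


def build_baseline(samples):
    """samples: iterable of (minute_of_hour, value) pairs from past hours.

    Returns {minute_of_hour: average value seen at that minute}.
    """
    buckets = defaultdict(list)
    for minute, value in samples:
        buckets[minute % 60].append(value)
    return {minute: mean(values) for minute, values in buckets.items()}


def deviation_from_baseline(baseline, minute, value):
    """Ratio of the current value to the usual value for this minute.

    Returns None when there is no (or a zero) baseline for the minute.
    """
    usual = baseline.get(minute % 60)
    if not usual:
        return None
    return value / usual


# Hypothetical SAN throughput: heavy at the top of the hour, quiet at :30.
history = [(0, 900), (0, 1100), (30, 100), (30, 100)]
baseline = build_baseline(history)
```

With this history, a reading of 1000 at minute 0 comes out as a ratio of 1.0 (business as usual), while the same 1000 at minute 30 comes out as 10.0, which is the kind of difference a graph against the baseline makes obvious to a person glancing at it.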
Alerting is all about sending messages and getting your attention. This is because an alert is telling you something that you (as a trained operator) need to act upon. I think alerting is the hardest thing on the list to get right because there are several important considerations here:
- Alerts need to be actionable. If I page the guy on call at 3 AM, it’d better be because I need the guy on call to do something.
- Alerts need to be “complete.” The alert should provide enough information that a sleep-deprived technician can know exactly what to do. The alert can provide links to additional documentation, how-to guides, etc. It can also show the complete error message and even some secondary diagnostic stuff which is (potentially) related. In other words, the alert definitely needs to be more than an e-mail alert which reads “Error: object reference not set to an instance of an object.”
- Alerts need to be accurate. If you start throwing false positive alerts (alerting when there is no actual underlying problem), people will turn off the alert. If you have false negatives (not alerting when there is an underlying problem), your technicians are living under a false sense of security. In the worst case scenario, technicians will turn off (or ignore) the alerts and occasionally remember to check a monitor which lets them know that there was a problem two hours ago.
- Alerts need human intervention. If I get an alert saying that something failed and an automated process has kicked in to fix the problem, I don’t need that alert! If the automated process fails and I need to perform some action, then I should get an alert. Otherwise, just log the failure, have the automated process run to fix the problem, and let the technicians sleep. If management needs figures or wants to know what things looked like overnight, create reports and digests of this information and pass it along to them, but don’t bother your technicians.
- On a related note, alerts need to be for non-automatable issues. If you can automate a problem away, do so. Even if it takes a fair amount of time, there’s a lot less risk in a documented, tested, automated process than in waking up some groggy technician. People at 3 AM make mistakes, even when they have how-to documents and clear processes. People at all hours of the day make mistakes; we get distracted and miss steps, mis-type something, click the wrong button, follow the wrong process, think we have everything memorized but (whoopsie) forgot a piece. Computers are less likely to have these problems.
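One way to picture the “automate first, alert only when a human is needed” rule is the small Python sketch below. It is only an illustration under assumptions: `handle_failure`, the `Alert` fields, and the runbook URL are hypothetical names, and actually delivering the page (e-mail, pager, chat) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """What the on-call technician receives: enough to act without digging."""
    summary: str
    error: str
    runbook_url: str
    diagnostics: dict = field(default_factory=dict)


def handle_failure(error, remediate, runbook_url, diagnostics=None, audit=None):
    """Log the failure, try the automated fix, and alert only if the fix fails.

    remediate: a callable returning True when the automated fix worked.
    audit: a list that collects log lines (stand-in for a real audit log).
    Returns an Alert when a human is needed, otherwise None.
    """
    audit = audit if audit is not None else []
    audit.append(f"failure: {error}")
    try:
        fixed = bool(remediate())
    except Exception as exc:  # a crashing fix counts as a failed fix
        fixed = False
        audit.append(f"remediation raised: {exc!r}")
    if fixed:
        audit.append("remediation succeeded; no alert sent")
        return None
    audit.append("remediation failed; alerting on-call")
    return Alert(
        summary="Automated remediation failed; manual action needed",
        error=error,
        runbook_url=runbook_url,
        diagnostics=diagnostics or {},
    )
```

For example, `handle_failure("disk full", cleanup_temp_files, "https://wiki.example.com/runbooks/disk-full")` (where `cleanup_temp_files` is whatever automated fix you have) returns `None` and lets everyone sleep when the cleanup works. Only a failed cleanup produces an `Alert`, and that alert carries the full error, a runbook link, and any diagnostics rather than a bare error string. The audit lines are what you would roll up into the overnight digest for management.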
Auditing, monitoring, and alerting solve three different sets of problems. They also have three different sets of requirements for what kind of data to use, how frequently to refresh this data, and how people interact with them. It’s important to keep these clearly delineated for that reason.
During this, I’m also working on some toy monitoring stuff, so I hope that’ll be tomorrow’s TIL.