Unacceptable

SQL injection vulnerabilities were up in 2014.  Sounds like a bunch of product managers need to buy copies of Tribal SQL and read the SQL injection chapter.  Seriously, SQL injection should have died a decade ago and my presentation on the topic should simply have historical value.

On the Anthem breach, Chris Bell is fed up as well.  Check out the comments there for additional insight.  There’s no word yet on the exact nature of the breach, but given the frequency with which data gets out into the wild, someone else will get popped next week.

Mentoring

Paul Randal has announced that he will mentor six people over a two-month stretch.  As it happens, the night before he announced this, I had been watching Jason Alba’s Management 101 and Paul’s Communications:  How to Talk, Write, Present, and Get Ahead! courses on Pluralsight.  Both courses covered the idea of mentoring and the importance of finding a good mentor at various points in your career.

I am at one of those points in my career now.  I have spent the past several years establishing my technical chops, and it’s time for me to take the next step.  I can see two different visions of what “the next step” entails.  My first vision is a move into a technical leadership role, running a team.  For most of my career, I’ve been a lone wolf, the only database person around.  At my current position, I am a peer but not really a lead (because we have no database leads, due to the way the organization is structured).  As a result, taking my next step might involve moving to a new company…although I do like where I work, so this might be a tough call.

My second vision is to further develop my voice in the community.  Last year, I presented at three SQL Saturdays, gave a number of local user group talks, and even helped put on the first SQL Saturday in Raleigh since 2010.  This year, my goal is to present at a dozen SQL Saturdays (current stats:  1 down, 2 officially slated, and 9 to go), as well as to host another SQL Saturday and present at various Triangle area events.  I even want to go outside of my comfort zone and look at user groups tied to other technologies and concepts, like Azure, Hadoop, and analytics.  I may never present at PASS Summit (though do trust that I’ll try…at some point…), but teaching people techniques to help them solve their technical problems is enjoyable and I’d like to develop a reputation as a great teacher.  This means pouring time and energy into building a personal brand and establishing enough trustworthiness within the community that people are willing to spend some of their precious time listening to me.

These visions are not mutually exclusive and I think a mentorship with Paul Randal would help me with both.  I have thought of a few areas in which Paul could provide outstanding guidance, and I’ll list some of them here in bulleted format:

  • I want to work on my presentation skills.  As I watch other presenters, I take mental notes on good (and bad) things they do and try to integrate some of the good things into my own presentations.  I have spent some time reading blog posts about improving presentation skills as well, but being able to ask questions of an outstanding presenter (who happens to be married to an even better presenter) would not hurt at all.
  • I mentioned above my desire to take the next step with regard to the SQL Server community.  My game plan for this year is to present at more SQL Saturday events, but once I’m presenting across the eastern third of the US and Canada, what’s the next step?  There are definitely steps between SQL Saturdays and TechEd/PASS Summit, but I don’t really have a presenter roadmap.  I’m thinking that getting more active with the virtual user group chapters would be a stepping stone, but I admit that I don’t have a good game plan yet.
  • In addition to presenting in person, I’m thinking about trying to create some shorter videos on various topics.  My questions here would occasionally be technical in nature (e.g., recommendations on microphones and editing software), but mostly they would be about the human element.  For example, listening to Paul present at a user group (as he did remotely for our Raleigh user group last month), I can pick up some differences in style compared to watching his SQLskills or Pluralsight videos.  I’d like to be able to discuss these stylistic differences, improving my understanding of videos as a medium separate from in-person or remote presentations to a live audience.
  • Another avenue I have not really pursued up to this point is writing.  I was fortunate enough to be able to contribute to Tribal SQL back in 2013 and I enjoyed the experience enough that I’d like to continue writing.  I have a few ideas, but I’d love to be able to pick the brain of somebody who earns (some) money writing and ask questions about choosing topics and his writing workflow.
  • Following from my first vision, I would definitely love to discuss how to develop leadership skills.  Leadership is about much more than understanding the technical details; a great deal of it is about managing products under constraints (budget, time, political capital, etc.) and keeping your team members excited and productive.  I have some questions about how to do that, and being able to ask somebody who has run development teams at Microsoft and who currently manages a team of consultants would be fantastic.
  • The last topic I’ll hit here is work/life balance.  I will need to do most of the above on my own time, outside of my day job.  I look at some of the more frenetic members of the SQL Server community and wonder how, exactly, they do it.  In Paul’s case, I see scuba, travel, reading lots and lots of books, blogging, Twitter, work, managing a team of top-shelf consultants, conferences, Pluralsight videos, and giving a lot of presentations.  By contrast, I feel like I’m treading water too many days and I don’t want my home life (i.e., the lovely and talented missus) to suffer as a result of professional improvement.  If there are any techniques or practices I can glean to become more efficient or improve that work/life balance, I absolutely want to know.

These are thoughts I scribbled down while on the tarmac in Cleveland; I think that with a mentorship in place, I could expound upon these themes as well as several more.  To me, a mentor is not someone who tells you where to go, nor even really how to get there, but rather someone who helps you develop the mental tools to figure those things out for yourself.  I know where I want to go and I have some ideas on how to get there, and I believe that getting the guidance of an experienced person at the top of my field could help me considerably in making it to “the next step.”

Thoughts On Monitoring And Alerting

Not too long ago, I had a discussion at work about alerting and monitoring that I figured would make good blog post fodder.  Before I get started, let me differentiate alerts from monitors.  To me (and your definition may vary, but it’s my blog post, so we get to use my terminology), a monitor is a piece of information that you may observe, whereas an alert is something that tries to get your attention.  In other words, monitors are a pull technology whereas alerts are a push technology.

Let’s Tell A Story

A couple of weeks ago, we had a production issue caused by a system error.  Our alerts eventually let us know that something was going on, but we didn’t trust them at first.  The reason is simple:  this alert (one of our more successful alerts) fired on 9 separate occasions during the past 30 days, and 8 of those firings were false positives.  That works out to roughly 89% of firings being false alarms…which is still better than most of our alerts.  The problem here is that our alert is sending us information through a noisy channel.  Without going into too much detail, we are tracking a downstream revenue value and sending an alert if that value goes below our minimum expected value.  The problem is that a number of things could push us below that minimum:  planned downtime, a lack of data coming into our system, a partial failure that our automated systems will retry and fix without human intervention, or even just a down day for our customers.  When you have a number of potential causes but only need to intervene some of the time, you have a noisy channel.
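
To make the noisy-channel problem concrete, here is a minimal sketch of the kind of threshold check I’m describing.  The dbo.DownstreamRevenue table and its columns are hypothetical stand-ins rather than our actual schema:

    -- Hypothetical threshold alert:  fire if the trailing hour's revenue
    -- drops below the minimum expected value.  Note that this query cannot
    -- tell us WHY revenue is low:  planned downtime, missing inbound data,
    -- a transient failure that will self-heal, and a slow day for our
    -- customers all look exactly the same here.  That is the noisy channel.
    DECLARE @MinimumExpectedRevenue MONEY = 500.00;

    SELECT
        SUM(r.RevenueAmount) AS TrailingHourRevenue
    FROM dbo.DownstreamRevenue r
    WHERE r.EventTime >= DATEADD(HOUR, -1, GETUTCDATE())
    HAVING SUM(r.RevenueAmount) < @MinimumExpectedRevenue;
    -- If this query returns a row, the alerting job sends its e-mail.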

If only we could have trusted this alert, we would have been able to act on the problem sooner.  So let’s talk about what we can do to make trustworthy alerts.

What Makes A Good Alert?

With the above in mind, I want to define what makes for a good alert.  I see five critical characteristics:

  1. A Problem Scenario.  A problem scenario could be, “if we receive zero messages over the past hour, we are in an error state.”  This is based on the assumption that our system is busy enough that we should always have some data streaming in, no matter the hour of night.  This is a good assumption for our work system, except when we have planned downtime, during which the messaging system is turned off.
  2. An Implementation.  Given the above scenario, my implementation could be something like “I will write an SQL query to determine if we have received zero messages over the past hour.”  Our messages get stored in the database and we can query those messages later.  Implementation could be as simple as this or could involve lots of moving parts, but the important thing here is that the implementation must answer the problem scenario (I sketch one possible version of this check after this list).  A huge danger in writing alerts is that we test downstream consequences instead of the actual problem, in part because we lack metrics on the actual problem.  If you cannot measure the actual problem, you are likely going to alert against a noisy channel.
  3. An Action.  In this case, my action is, “When the SQL query finds zero messages received in the past hour, send out an e-mail to the product team.”  The action needs to be commensurate with the problem:  if a critical process is down and the company is losing money, that alert had better be paging people left and right; but if it’s a minor problem that can wait an hour or so, an e-mail is fine.
  4. A Human Response.  Simply having an action is not enough; there needs to be a human response.  When receiving the zero message e-mail, what do we do?  In this case, “When I (a product team member) receive the e-mail, I will check to ensure that there is no planned downtime.  If not, I know that there is a catastrophic error with regard to the messaging system.  I will check the error logs for corroborating information.  If the error logs give me error X, then I need to restart service A.  If the error logs give me error Y, then I need to check the database logs.  If the database logs give me error Z, then I know that the problem is with procedure B; otherwise, I need to expand my investigation to other error logs and troubleshoot this problem.”  I like this kind of flow because it shows what a person should do, and it also shows that this problem cannot be solved simply through automation.  Certain parts could be automated, such as error X -> restart service A, but the product team might make a conscious decision to require a human to do this.  What is important, however, is that there is a valid human response.  If the appropriate response is simply to wait for an automated process to do its thing, this should not be an alert!
  5. A Low Failure Rate.  This means relatively few false positives and relatively few false negatives.  In the case of alerting, a false positive is a scenario in which the alert fires but there is no reasonable human response.  This can be because the alert itself is wrong, because the reason the alert fired is outside of our hands (as in the scenario in which our customers simply had a bad day), or because some automated process will fix the problem without our intervention.  A false negative is a scenario in which human intervention is necessary but the relevant alert did not fire.
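
To tie the implementation and action characteristics together, here is the sketch promised above.  The dbo.InboundMessage table, mail profile, and recipient address are hypothetical stand-ins, and Database Mail is only one of several ways to send the e-mail:

    -- Test the problem scenario directly:  have we received any messages
    -- in the trailing hour?
    IF NOT EXISTS
    (
        SELECT 1
        FROM dbo.InboundMessage m
        WHERE m.ReceivedTime >= DATEADD(HOUR, -1, GETUTCDATE())
    )
    BEGIN
        -- The action, commensurate with the problem:  e-mail the product team.
        EXEC msdb.dbo.sp_send_dbmail
            @profile_name = N'AlertMailProfile',        -- hypothetical profile
            @recipients = N'product-team@example.com',  -- hypothetical address
            @subject = N'ALERT:  zero messages received in the past hour',
            @body = N'No inbound messages in the trailing hour.  Check for planned downtime first, then the error logs.';
    END;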

Alerting Metrics

In the case of an alert, we can get metrics fairly easily on true positives and false positives, but true and false negatives are a lot harder.  In our example above, we can go back in time and break the past 30 days into one-hour periods, as far back as we have data, labeling each period by whether an error occurred and whether the alert fired.  If the alert check runs once an hour, we can build our 2×2 grid.  Here is a sample:

            Error   No Error
Alert           1          8
No Alert        3        708

There were 720 distinct events over the past 30 days.  During this time, we had 1 error that was alerted, 3 errors with no alert, 8 alerts with no error, and 708 cases in which there was no error and no alert.  In other words, we have 1 true positive, 8 false positives, 3 false negatives, and 708 true negatives.
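
Backfilling this grid is straightforward if you keep a history of alert firings and can label the hours in which real errors occurred.  Here is a sketch, assuming hypothetical dbo.AlertHistory and dbo.ErrorHistory tables with one row per affected hour:

    -- Generate the 720 one-hour buckets covering the past 30 days, then
    -- cross-tabulate alert firings against actual errors.
    WITH Hours AS
    (
        SELECT TOP (720)
            DATEADD(HOUR,
                    -1 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
                    DATEADD(HOUR, DATEDIFF(HOUR, 0, GETUTCDATE()), 0)) AS HourStart
        FROM sys.all_objects
    )
    SELECT
        SUM(CASE WHEN e.HourStart IS NOT NULL AND a.HourStart IS NOT NULL THEN 1 ELSE 0 END) AS TruePositive,
        SUM(CASE WHEN e.HourStart IS NULL AND a.HourStart IS NOT NULL THEN 1 ELSE 0 END) AS FalsePositive,
        SUM(CASE WHEN e.HourStart IS NOT NULL AND a.HourStart IS NULL THEN 1 ELSE 0 END) AS FalseNegative,
        SUM(CASE WHEN e.HourStart IS NULL AND a.HourStart IS NULL THEN 1 ELSE 0 END) AS TrueNegative
    FROM Hours h
        LEFT OUTER JOIN dbo.AlertHistory a ON a.HourStart = h.HourStart
        LEFT OUTER JOIN dbo.ErrorHistory e ON e.HourStart = h.HourStart;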

A naive measure would look at accuracy, which is defined as (True Positive + True Negative) / (Total Population), and say hey, we’re right 98.5% of the time, so everything’s good!  This is…not such a great thing.  Similar to medical tests, we expect True Negatives to overwhelm everything else, and so we need to get a little more sophisticated in our metrics.

My favorite measure is the positive predictive value, which is defined as (True Positive) / (Test Positive).  For alerting, that basically tells us the likelihood that an e-mail alert is valid.  In our case, PPV = 1 / 9, or 11%.  In other words, 8 out of our 9 alert firings were false positives.  When PPV is low, people tend to ignore alerts.

For alerting, I would also want to look at (True Positive) / (Test Positive + False Negative), which in this case would be 1 / (1 + 8 + 3), or 1 / 12.  The reason I want to ignore true negatives is that we expect true negatives to be an overwhelming percentage of our results.  This metric (forecasters know it as the critical success index) tells us how well we can trust the alert, with a number closer to 1 indicating a more trustworthy alert.
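
All of these metrics are simple arithmetic over the four counts, so it is easy to compute them side by side.  Here is a quick sketch with the grid’s numbers plugged in:

    -- Counts taken from the 2x2 grid above.
    DECLARE @TP DECIMAL(9, 4) = 1,
            @FP DECIMAL(9, 4) = 8,
            @FN DECIMAL(9, 4) = 3,
            @TN DECIMAL(9, 4) = 708;

    SELECT
        (@TP + @TN) / (@TP + @FP + @FN + @TN) AS Accuracy,    -- 709/720, about 0.985
        @TP / (@TP + @FP) AS PositivePredictiveValue,         -- 1/9, about 0.111
        @TP / (@TP + @FP + @FN) AS CriticalSuccessIndex;      -- 1/12, about 0.083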

Improving Alerts

To improve your existing alerts and make them more trustworthy, go back to the five characteristics of a good alert:  a written-down problem scenario, an implementation based on that scenario, an action, a human response, and a low failure rate.  The best way to lower the failure rate is to build implementations on sound problem scenarios.  If you alert based on a noisy channel, you complicate the human response and end up with a higher failure rate.

Improving Alerts Through Monitors

But let’s suppose there are scenarios in which we need to use the noisy channel.  In that case, it might be a good idea to build a monitor.  As mentioned above, a monitor is a “pull” technology, in that a human needs to look at the monitor to gain information, as opposed to an alert, which tries to interrupt a human and provide information.  Good monitors generally let you see data over time as a trend.  People are pretty good at spotting trend differences over time, so making that data available lets people use their best judgement as to whether a problem is likely transient or something that requires human intervention.  As long as those monitors are in a common area that the product team can easily see, people can solve problems without the noise that trains them to filter out potentially important alerts.
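
As a concrete example, a monitor for the messaging scenario above could be as simple as a chart of hourly message counts.  Here is a sketch of the query that might feed it, reusing the hypothetical table from earlier:

    -- Hourly message counts for the past 7 days, suitable for a simple
    -- line chart on a team dashboard.  A human glancing at this trend can
    -- distinguish a transient dip from a flatline far better than a
    -- single-threshold alert can.
    SELECT
        DATEADD(HOUR, DATEDIFF(HOUR, 0, m.ReceivedTime), 0) AS HourStart,
        COUNT(*) AS MessageCount
    FROM dbo.InboundMessage m
    WHERE m.ReceivedTime >= DATEADD(DAY, -7, GETUTCDATE())
    GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, m.ReceivedTime), 0)
    ORDER BY HourStart;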