TIL: The VSS Writer

I’m working on a presentation for July involving security, and I got to the section in which I discuss turning off unneeded services.  I noticed the VSS Writer and realized I didn’t actually know what it did…

One trip to MSDN later, I learned that the Volume Shadow Copy Service (VSS) writer allows third-party applications to snapshot and copy SQL Server data and log files while those files are in use.  If this service is off, the data and log files are locked and the only way you can take a backup is through SQL Server full, differential, and transaction log backups.  If the service is on, third-party tools can make real-time copies of these files.

My inclination is to turn this service off and rely on SQL Server’s internal backup strategy, as it’s a known quantity, and having two services performing such similar work increases your system’s surface area.
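
If you do decide to disable it, a minimal sketch along these lines would do it.  This assumes the default service name SQLWriter for the SQL Server VSS Writer and needs to run from an elevated prompt on the server, so treat it as illustration rather than a vetted script.

```python
import subprocess

# Stop the SQL Server VSS Writer service and keep it from starting on boot.
# "SQLWriter" is the default service name; run this from an elevated prompt.
subprocess.run(["sc", "stop", "SQLWriter"], check=False)   # may already be stopped
subprocess.run(["sc", "config", "SQLWriter", "start=", "disabled"], check=True)
```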

TIL: Kafka Queueing Strategy

I had the privilege to sit down with a couple Hortonworks solution engineers and discuss a potential Hadoop solution in our environment.  During that time, I learned an interesting strategy for handling data in Kafka.

Our environment uses MSMQ for queueing.  We add items to a queue, and then consumers pop items off of the queue and process them.  The advantage is that you can easily see how many items are currently on the queue, and multiple consumer threads can interact with the queue without stepping on one another.  Queue items last for a certain amount of time—in our case, we typically expire them after one hour.

With Kafka, however, queues are handled a bit differently.  The queue manager does not know or care about which consumers read what data when (making it impossible for a consumer to tell how many items are left on the queue at a given point in time), and consumers have no ability to pop items off of the queue.  Instead, messages simply fall off of the queue once they pass the topic’s retention period.

Our particular scenario has some web servers which need to handle incoming clicks.  Ideally, we want to handle that click and dump it immediately onto a queue, freeing up the web server thread to handle the next request.  Once data gets into the queue, we want it to live until our internal systems have a chance to process that data—if we have a catastrophe and items fall off of this queue, we lose revenue.

The strategy in this case is to take advantage of multiple queues and multiple stages.  I had thought of “a” queue, into which the web server puts click data and out of which the next step pulls clicks for processing.  Instead, a better strategy (given what we do and our requirements) is to put the data onto a queue immediately and have consumers pull from that queue, perform some internal “enrichment” processes, and put the enriched data onto a second queue.  That second queue collects data, and an occasional batch job pulls it off to write to HDFS; this way, you don’t take the hit of streaming individual rows into HDFS.  As far as retaining data goes, we’d need to set our TTL long enough that we can survive a catastrophe in the internal processing engine, but not so long that we run out of disk space holding messages.  That’s a fine line, and a trade-off we’ll need to figure out as we go along.
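
To make that two-stage flow concrete, here’s a rough sketch using the kafka-python library.  The broker address, topic names, and enrich function are stand-ins for illustration; the batch job that drains the enriched topic into HDFS would be a separate process, and retention (the TTL above) is a topic-level setting such as retention.ms rather than anything the consumer controls.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # stand-in broker address

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_click(click):
    """Web tier: drop the click onto the raw topic and return immediately."""
    producer.send("clicks-raw", click)

def enrich(click):
    """Stand-in for our internal enrichment process."""
    click["enriched"] = True
    return click

# Enrichment stage: read raw clicks, enrich them, and write them to a second
# topic, which an occasional batch job drains into HDFS.
consumer = KafkaConsumer(
    "clicks-raw",
    bootstrap_servers=BROKER,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    producer.send("clicks-enriched", enrich(message.value))
```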

Summing things up, Kafka is quite a different product from MSMQ, and a workable architecture is going to look different depending upon which queueing product you use.

TIL: Bots

Last night, Jamie Dixon (whose book you should buy) talked about his experience at Build this year.  His main takeaway is that Microsoft is pushing their Bot Framework pretty hard.  Jamie showed how to create a stock lookup bot and deploy it to Slack using F# code and an Azure account.

About a month ago, I went to F8, and my main takeaway from that conference was that Facebook is pushing bots on its Messenger platform pretty hard.

Based on my (extremely) limited knowledge of both, it seems that the Facebook bot platform is a bit easier to deal with, as there’s a user interface for writing messages and tokenizing, but right now, Microsoft’s platform is a bit more customizable and developer-friendly.  What’s particularly interesting about both of these is that the analytics and intelligence engines are both closed-source—neither company will let you see the wizard behind the curtain.

TIL: Auditing, Monitoring, Alerting

I’m giving a presentation on monitoring this Monday.  As part of that, I want to firm up some thoughts on the differences between auditing, monitoring, and alerting.  All three of these are vital for an organization, but they serve entirely different functions and have different requirements.  I’ll hit a bunch of bullet points for each.

Auditing

Auditing is all about understanding a process and what went on.  Ideally, you would audit every business-relevant action, in order, and be able to “replay” that business action.  Let’s say we have a process which grabs a flat file from an FTP server somewhere, dumps data into a staging table, and then performs ETL and puts rows into transactional tables.  Our auditing process should be able to show what happened, when.  We want to log each activity (grab flat file, insert rows into staging table, process ETL) down to its most granular level.  If we make an external API call for each row as part of the ETL process, we should log the exact call.  If we throw away a row, we should note that.  If we modify attributes, we should note that.

Of course, this is a huge amount of data, and depending upon processing requirements and available storage space, you probably have to live with something much less thorough.  So here are some thoughts:

  • Keep as much information around errors as you can, including stack traces, full parameter listings, and calling processes.
  • Build in (whenever possible) the full logging mentioned above, but leave it behind a debug/trace flag in your app.  You could get creative with custom tracing—maybe turn on debugging just for one customer.  You might also think about automatically switching that debug mode back off after a certain amount of time.
  • Add logical “process run” keys.  If there are three or four systems which process the same data in a pipeline, it makes sense to track those chunks of data separately from the individual pipeline steps.  In the extreme case, you might want to see how an individual row in a table somewhere got there, with a lineage ID that traces back to specific flat files or specific API calls or specific processes and tells you everything that happened along the way.  Again, this is probably more of an ideal than a practical scenario, but dream big…there’s a small sketch of the idea after this list.
  • Build an app to read your audit data.  Reading text files is okay, but once you get processes interacting with one another, audit files can get really confusing.
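
For illustration, here’s a rough sketch of those last two ideas together: a pipeline-wide “process run” key carried through each step, plus a per-customer debug flag that switches on full trace detail.  Every name here (the logger, the debug list, the row shape) is invented for the example.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

# Hypothetical: customers for whom full trace logging is currently switched on.
DEBUG_CUSTOMERS = {"customer-42"}

def process_row(row, run_id):
    """One pipeline step; run_id ties this step back to the whole process run."""
    if row["customer"] in DEBUG_CUSTOMERS:
        logger.info("[run %s] full row detail: %r", run_id, row)
    # ... the actual ETL work would happen here ...
    logger.info("[run %s] processed row %s", run_id, row["id"])

run_id = uuid.uuid4()  # one logical "process run" key for this pipeline run
for row in [{"id": 1, "customer": "customer-42"}, {"id": 2, "customer": "acme"}]:
    process_row(row, run_id)
```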

Monitoring

Monitoring is all about seeing what’s going on in your system “right now.”  You want nice visualizations which give you relevant information about currently-running processes; I put “right now” in quotation marks because you can be monitoring a process which only updates once every X minutes.

There are a couple of important things to consider with monitoring:

  • Track what’s important.  Don’t track everything “just in case,” but focus on metrics you know are important.  As you investigate problems and find new metrics which can help, add them in, but don’t be afraid to start small.
  • Monitoring should focus on aggregations, streams, and trends.  It’s your 50,000-foot view of your world.  Ideally, your monitoring system will let you drill down to more detail, but at the very least, it should let you see if there’s a problem.
  • Monitors are not directly actionable.  In other words, the purpose of a monitor is to display information so a human can observe, orient, decide, and act.  If you have an automated solution to a problem, you don’t need a monitor; you need an automated process to fix the issue!  You can monitor the automated solution to make sure it’s still running and track how frequently it’s fixing things, of course, but the end consumer of a monitor is a human.
  • Ideally, a monitor will display enough information to weed out cyclical noise.  If you have a process which runs every 60 minutes and always slams your SAN during the top 5 minutes of each hour, maybe graph the last 2 or 3 hours so you can see the cycles.  If you have enough data, you can also build baselines of “normal” behavior and plot those against current behavior, making it easier for people to see if there is a potential issue (there’s a rough sketch of this after the list).
  • Monitors are a “pull” technology.  You, as a consumer, go to the monitor application and look at what’s going on.  The monitor does not jump out and send you messages and popups and try to get your attention.
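
As a rough illustration of the baseline idea above, here’s a sketch using pandas: build an hour-of-day baseline from history and compare a current reading against it.  The file name, column names, and two-standard-deviation threshold are all arbitrary choices for the example.

```python
import pandas as pd

# History of observations: a timestamp and a metric value per row.
history = pd.read_csv("san_throughput.csv", parse_dates=["ts"])  # hypothetical file
history["hour"] = history["ts"].dt.hour

# Baseline of "normal" behavior: mean and standard deviation per hour of day.
baseline = history.groupby("hour")["mb_per_sec"].agg(["mean", "std"])

def looks_abnormal(ts, value):
    """Flag readings more than two standard deviations above that hour's norm."""
    norm = baseline.loc[ts.hour]
    return value > norm["mean"] + 2 * norm["std"]
```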

Alerting

Alerting is all about sending messages and getting your attention.  This is because an alert is telling you something that you (as a trained operator) need to act upon.  I think alerting is the hardest thing on the list to get right because there are several important considerations here:

  • Alerts need to be actionable.  If I page the guy on call at 3 AM, it’d better be because I need the guy on call to do something.
  • Alerts need to be “complete.”  The alert should provide enough information that a sleep-deprived technician knows exactly what to do.  The alert can provide links to additional documentation, how-to guides, etc.  It can also show the complete error message and even some secondary diagnostics which are (potentially) related.  In other words, the alert definitely needs to be more than an e-mail which reads “Error:  object reference not set to an instance of an object.”  (There’s a sketch of a fuller alert payload after this list.)
  • Alerts need to be accurate.  If you start throwing false positive alerts—alerting when there is no actual underlying problem—people will turn the alert off.  If you have false negatives—not alerting when there is an underlying problem—your technicians are living under a false sense of security.  In the worst case, technicians will turn off (or ignore) the alerts and occasionally remember to check a monitor which lets them know that there was a problem two hours ago.
  • Alerts need human intervention.  If I get an alert saying that something failed and an automated process has kicked in to fix the problem, I don’t need that alert!  If the automated process fails and I need to perform some action, then I should get an alert.  Otherwise, just log the failure, have the automated process run to fix the problem, and let the technicians sleep.  If management needs figures or wants to know what things looked like overnight, create reports and digests of this information and pass it along to them, but don’t bother your technicians.
  • On a related note, alerts need to be for non-automatable issues.  If you can automate a problem away, do so.  Even if it takes a fair amount of time, there’s a lot less risk in a documented, tested, automated process than in waking up some groggy technician.  People at 3 AM make mistakes, even when they have how-to documents and clear processes.  People at all hours of the day make mistakes; we get distracted and miss steps, mis-type something, click the wrong button, follow the wrong process, think we have everything memorized but (whoopsie) forgot a piece.  Computers are less likely to have these problems.
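
Pulling a few of these ideas together, here’s a sketch of what a “complete,” actionable alert might carry.  The field names and the send_page function are placeholders; the point is that everything a groggy technician needs rides along with the page, and automation that already fixed the problem suppresses it.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """Everything a sleep-deprived technician needs, in one message."""
    summary: str                  # what broke, in one line
    error_detail: str             # the complete error message, not a fragment
    runbook_url: str              # link to the how-to guide for this failure
    diagnostics: dict = field(default_factory=dict)  # related secondary info
    automated_fix_succeeded: bool = False

def maybe_page(alert, send_page):
    """Only wake a human when there is something for a human to do."""
    if alert.automated_fix_succeeded:
        return  # just log the failure and let the technicians sleep
    send_page(alert)  # send_page stands in for your paging system
```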

Wrapping Up

Auditing, monitoring, and alerting solve three different sets of problems.  They also have three different sets of requirements for what kind of data to use, how frequently to refresh this data, and how people interact with them.  It’s important to keep these clearly delineated for that reason.

While working on this, I’m also building some toy monitoring stuff, so I hope that’ll be tomorrow’s TIL.

TIL: Installing Jupyter And R Support

I recently worked through some difficulties installing Jupyter and incorporating R support, so I wanted to write up a quick post on getting it all set up on Linux.

First, install Jupyter through Anaconda.  Notes:

  1. I grabbed the Python 3.5 version.  I don’t intend to write too much Python code here, so it shouldn’t make a huge difference to me.
  2. When you install, do not run sudo.  Just run the bash script.  It will install in your home directory by default, and for a one-off installation on a VM (like my scenario), this is fine.  It also makes future steps easier.

When running Jupyter, I started by following this guide.  Notes:

  1. Starting Jupyter is as easy as running “jupyter notebook” and navigating to http://localhost:8888 in a browser.
  2. Midori does not appear to be a good browser for Jupyter; when I tried to open a new notebook, I got an error message in the console and the browser seemed not to want to open up the notebook.  Firefox worked just fine.  Maybe I’m doing this wrong, though.

As for installing R support, Andrie de Vries has a nice post on the topic.  Notes:

  1. Here’s where not running sudo above pays off.  If you did run sudo, you’ll get an error saying that you can’t install in the home directory and that you should run a command to make a copy of the files…in your home directory.  If you accidentally ran sudo, you can change the ownership of all of the files in your anaconda3/ directory using “chown -R [user]:[user] anaconda3/” and correct the issue.
  2. Installation is as simple as running “conda install -c r ipython-notebook r-irkernel”.  Again, note that I’m not running sudo here.

TIL: SQL Server 2003

I’m starting off my Today I Learned series with something I didn’t actually learn today.  Fantastic start…

We’re looking for database engineers (check out the posting if you’re interested!).  Today I reviewed one candidate’s resume, and I noticed “SQL Server 2003” on it.

There is no SQL Server 2003.  SQL Server 2003 was a pirated copy of SQL Server 2000 with a different splash screen.  If you put knowledge of SQL Server 2003 on your resume, you are telling me one of four things:

  1. You worked at a company which used pirated software.  Furthermore, you either didn’t know or didn’t care, neither of which is the type of thing you want to broadcast during an interview for a job focused on that product.
  2. You have confused SQL Server versions with Windows Server versions.
  3. You are grabbing bad information from the Internet and passing it off as your own experience.
  4. You aren’t proofreading your resume.

In short, if you put SQL Server 2003 on your resume, it is a negative signal.  I’ll bring it up during the interview, just in case some unscrupulous recruiter “fixed up” your resume, but if you’re the one who put it in there, that’s a huge black mark in my book.

New Series: Today I Learned

I’ve been off the blog for a while due to other pressing obligations.  I’ve kept Curated SQL up to date but sadly have neglected this blog.

To make it easier for me to get back into regular blogging, I’m going to try to post a TIL once a day and link it up with Twitter.  I expect these typically to be short and sweet, and I will generally try to include only things I truly learned (or re-learned!)…generally…