Metrics For Free: SQL Server Monitoring With Telegraf

Not too long ago, I had the opportunity to put into place a free solution for monitoring SQL Server instances. I saw Tracy’s series on collecting performance metrics InfluxDB + Telegraf + Grafana, and then I saw her talk on the topic (Collecting Performance Metrics), but until I implemented it myself, I couldn’t believe how easy it was. I thought it was going to take two or three days of hard work to get done, but I had everything going within a few hours.

Let’s walk through the process together.

Step 1: We Need a Database

The first piece of the puzzle is a place to host InfluxDB, an open source time series database.

Step 1A: We Need a Server

My customer is running most of their infrastructure in Azure, so I spun up a Linux-based host running Ubuntu. This doesn’t need to be a big server: I’m using a Standard D2s v3 (2 vCPUs, 8 GB of RAM) on a 100GB Premium SSD to host 5 servers, and my CPU never gets above 2%. This is the most expensive line item and will run about $70 or so a month in the East US region if you pay as you go and as little as $27 a month if you reserve for three years.

For a bigger environment, Tracy’s advice for 500 servers is 8 vCPUs and 32 GB of RAM.

Step 1B: We Need InfluxDB

Tracy shows how to install with CentOS but if you prefer Ubuntu (as I do), you’ll have to change a couple of things around. The installation instructions are pretty easy if you’ve worked with Linux before, so I won’t dally there.

Step 2: We Need Pretty Pictures

While you’re tooling around on that Linux VM, you should also install Grafana, a great tool for analytics and monitoring. Here are a couple examples of dashboards that you’ll get with this solution. First up is the instances at a glance dashboard:

Instances at a glance. 70% CPU? Just fine. Nothing to see here.

Next up is details on a specific instance, which let you look at activity, memory utilization, waits, CPU load, I/O load, etc.

A snippet of instance details.

Installing Grafana is easy. Tracy has scripts for CentOS or other Red Hat-based distributions and the Grafana website has step-by-step instructions for installing on Ubuntu or Debian-based distributions.

Step 3: We Need an Agent

InfluxDB is going to store all of your metrics and Grafana is going to display all of them, but we need one more piece of the puzzle: an agent to send metrics to InfluxDB at fixed intervals. That’s what Telegraf does. Grab the stable version of Telegraf or the Windows nightly if you’re feeling lucky. Unzip the archive and you’ll see two files: telegraf.conf and telegraf.exe. Here are the changes I made to my telegraf.conf file:

  • In [[outputs.influxdb]] there is a section for URLs. Add your InfluxDB URL in a format like urls = ["http://10.10.10.10:8086"], replacing the IP address with your own. Your Telegraf agents need to be able to communicate with InfluxDB and by default, that’s over port 8086 but you can configure that port if you really need it to change.
  • Still in [[outputs.influxdb]], I have uncommented the following lines: retention_policy = "", write_consistency = "any", and timeout = "5s".
  • I added input plugins and shamelessly stole the code from Tracy. Here’s what that section looks like:
###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################
[[inputs.sqlserver]]
  servers = [
    "Server=localhost;Port=1433;User Id=telegraf;Password=<YOURPASSWORD>;app name=telegraf;log=1;",
  ]
  #
  ## Optional parameter, setting this to two will use a new version
  ## of the collection queries that break compatibility with the original
  ## dashboards. 
  query_version = 2
  ## If you are using AzureDB, setting this to True will gather resource utilization metrics
  # azuredb = False
  ## If you would like to exclude some of the metrics queries, list them here
  # exclude_query = [ 'PerformanceCounters','WaitStatsCatagorized' ]

[[inputs.win_perf_counters.object]]
  # Processor usage, alternative to native, reports on a per core.
  ObjectName = "Processor"
  Instances = ["*"]
  Counters = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time", "% Processor Time"]
  Measurement = "sql_cpu"
  #IncludeTotal=false #Set to true to include _Total instance when querying for all (*).

[[inputs.win_perf_counters.object]]
  ObjectName = "LogicalDisk"
  Instances = ["*"]
  Counters = ["% Free Space","Free Megabytes","Avg. Disk Read Queue Length","Avg. Disk Write Queue Length","Avg. Disk sec/Read","Avg. Disk sec/Write"]
  Measurement = "sql_disk"

Note that you might need to change the servers entry above if you use named instances; that entry can also handle multiple instances if you have them installed on the same machine.

Once you make these configuration changes, you can deploy telegraf.exe and telegraf.conf to your Windows servers. Deploy them to %PROGRAMFILES%\telegraf\ and register Telegraf as a service using Powershell using the block below, which I copied from Tracy:

Stop-Service -Name telegraf -ErrorAction SilentlyContinue
& "c:\program files\telegraf\telegraf.exe" --service install -config "c:\program files\telegraf\telegraf.conf"
Start-Service -Name telegraf

If you want the full Powershell script which includes deploying remotely to multiple servers, you’re going to have to click the link and read her post.

Step 4: We Need to Hook Up Grafana

Grafana is installed on our Linux VM, but it’s not doing anything quite yet. In Grafana, mouse over the settings menu (the gear icon), select Data Sources, and click the “Add data source” button.

Adding a new data source to Grafana.

From there, you’ll see a menu where you get to enter values. The most important values are:

  • Name — what you want to call this thing. I stuck with Tracy’s name because that way I could copy her dashboards with fewer changes on my part.
  • Type — this needs to be InfluxDB.
  • URL — because I’m hosting Grafana on the same server as InfluxDB, I’m going to localhost. If InfluxDB is on a different server, use that IP address instead.
  • InfluxDB Details — we created a database on Influx called telegraf, so use that for the name and fill in your username and password.
Adding a new data source. There’s a save button below all this noise.

As far as security goes, Grafana seems to be fairly trusting of the outside world. In my case, I locked down the VM so that only people already on our VPN can access the server, and you can go one further by using nginx as a reverse proxy to control access and improve your security posture.

Step 5: We Need Cool-Looking Visuals

At this point, we have all of the pieces in place and data should be flowing into InfluxDB. We just need a way to display that data. You can build your own dashboards, but Tracy and Mark Wilkinson have a great set of starter dashboards for you.

To create a new dashboard, mouse over the dashboards icon and click “Manage.” Then click the “Import” button on the right.

MORE DASHBOARDS NOM NOM NOM

This will bring you to a page where you can import a dashboard in one of a few ways. I just copied and pasted the JSON straight from GitHub:

Adding a new dashboard.

Once you have the dashboard JSON imported, you should see the dashboard on the dashboards list.

Step 6: Invoice the Customer

At this point, you’re done. I thought it was going to take hours and hours to do, but it was really easy. The trickiest part was getting Azure security policies right and making sure all of my instances could see the server with Influx running on it. Otherwise, it was smooth sailing.

Coda: This Does Not Cover Everything

One thing to note is that this solution is a high-level monitoring solution. It does not include information on specific query plans or let you drill into currently-running queries. The purpose of this tool is to let you know that your server is in pain and give you some idea where the pain is: in other words, understand the current resource limitations and allow you to begin formulating conjectures about why this performance issue is happening. It does not do a great job of giving you the precise detail. From there, we have other tools for digging deeper once we know there is a problem.

If you want to stay on the free side, a few options for gathering more information include Query Store (or OpenQueryStore for older versions of SQL Server), running sp_whoisactive and capturing outputs, and collecting data from Extended Events. For more information on some of these techniques, check out Tracy’s GroupBy talk on collecting performance metrics.

Or you could shell out money and buy a product. That works too.

Advertisements

One thought on “Metrics For Free: SQL Server Monitoring With Telegraf

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s