Capturing Deadlocks With Extended Events

Extended Events is the replacement for Profiler.  There are a number of advantages to XEs:  they are much more lightweight than server-side traces, they can capture more information, and there are more options for storing that information (including a ring buffer and writing out to disk).  My biggest problem with Extended Events is that, even with the GUI, it’s still easier to set up a Profiler trace than to futz about setting up an XE session.

Deadlocks Are Different

With deadlocks, you get a head start:  the default system_health Extended Events session already tracks deadlocks.  You can get to this session pretty easily in SQL Server Management Studio by connecting to your instance and going to Management —> Extended Events —> Sessions —> system_health.  Inside this session, there are two targets:  a ring buffer, which keeps track of recent events, and an event file, which retains a longer history of events on disk.  Depending upon how busy your server is, that event file might go back several days, or maybe just a few hours (or minutes on a very busy server).


Double-click on one of the session targets, depending upon whether you want to watch live data (ring buffer) or view older data (event file).  Once you do that, you’ll see the Extended Events viewer.  In SSMS, you will get a Filters button in the Extended Events menu.  Click the Filters button and you’ll be able to enter a filter.


Select “name” from the dropdown and set its value equal to xml_deadlock_report.


Once you’re done with this, you’ll see only XML deadlock reports.  You can grab the XML and also see the deadlock graph.
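
If you’d rather skip the GUI, you can pull the same reports out with T-SQL.  The sketch below reads the system_health event file directly; the file name pattern is the default one, so adjust the path if your instance writes its files elsewhere.

    -- Read deadlock reports from the system_health event file target.
    SELECT
        CAST(xef.event_data AS XML) AS DeadlockReport
    FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL) xef
    WHERE xef.object_name = N'xml_deadlock_report';

    -- The deadlock graph itself sits under /event/data/value/deadlock in that XML;
    -- save it with an .xdl extension and SSMS will render the graph for you.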



Once you have deadlock graphs and you know how to read them, you can use that information to fix your deadlocking issues.

Podcasts I’m Enjoying

With winter comes colder temperatures, and that means I have to put the top up on the Miata.  That is unfortunate for many reasons, but one of the plus sides is that I can comfortably listen to podcasts in my car, squeezing in a few extra hours of learning each week.

Here are the main podcasts I’m listening to today, in alphabetical order:

  • Away From the Keyboard.  Richie Rump and Cecil Phillip keep it a bit lighter, focusing more on stories than hardcore learning.  It’s a good podcast for unwinding after a long day.
  • Paul’s Security Weekly.  This used to be pauldotcom but whatever the name, Paul Asadoorian & co do a great job making security news entertaining.
  • SQL Data Partners podcast.  Carlos Chacon started podcasting not too long ago, and almost all of his episodes are interviews.
  • WOxPod.  Chris Bell’s episodes are also mostly interviews, but he mixes in contemplative monologues as well.

At one point in time, I had dozens of podcasts, but it got to be too much.  I’m starting over again with these four.

The Cost Of Synchronous Mirroring

About a month or so ago, I started dealing with a customer’s performance issues.  When I checked the wait stats using Glenn Berry’s fantastic set of DMV queries, I noticed that 99% of the wait time was mirroring-related.  In other words, 99% of the time SQL Server spent waiting to run queries was the primary instance waiting for the secondary instance to synchronize changes.
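
If you want to check this quickly on your own instance, a stripped-down version of that kind of wait stats query looks something like the sketch below (Glenn’s scripts filter out far more benign wait types than I do here); mirroring waits show up as DBMIRROR% wait types.

    -- Top waits since the last restart (or since wait stats were last cleared).
    SELECT TOP (10)
        wait_type,
        wait_time_ms,
        CAST(100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS DECIMAL(5, 2)) AS pct_of_total_waits
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT')
    ORDER BY wait_time_ms DESC;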

The reason the mirroring waits were that high is that my customer is using Standard Edition of SQL Server.  Unfortunately, Standard Edition only allows synchronous mirroring.  Now, I know that mirroring is deprecated, but my customer didn’t, and until SQL Server 2016 comes out and we get asynchronous (or synchronous) availability groups in Standard Edition, they didn’t have many high availability options.

Because this customer was having performance problems, we ended up breaking the mirror.  We did this after I discussed their Recovery Time Objectives and Recovery Point Objectives—that is, how long they can afford to be down and how much data we can afford to lose—and it turned out that synchronous mirroring just wasn’t necessary given the company model and RTO/RPO requirements.  Instead, I bumped up backup frequency and have a medium-term plan to introduce log shipping to reduce recovery time in the event of failure.
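
For reference, breaking the mirror itself is a one-line command run against the principal; the database name here is a placeholder.

    -- Run on the principal instance to remove mirroring for this database.
    ALTER DATABASE [YourDatabase] SET PARTNER OFF;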

But let’s say that this option wasn’t available to me.  Here are other things you can do to improve mirroring performance:

  1. Switch to asynchronous mode.  If you’re using Enterprise Edition, you can switch mirroring to asynchronous mode (see the example after this list), which improves performance considerably.  Of course, this comes at the risk of data loss in the event of failure:  a transaction can commit on the primary node before it commits on the secondary, so if the primary fails immediately after a commit, it’s possible that the secondary doesn’t have that transaction.  If you need your secondary to be synchronous, this isn’t an option.
  2. Improve storage and network subsystems.  In my customer’s case, they’re using a decent NAS.  They’re a small company and don’t need SANs with racks full of SSDs or on-board flash storage, and there’s no way they could afford that.  But if they needed synchronous mirroring, getting those writes to the secondary more quickly would help performance.
  3. Review mirroring.  In an interesting blog post on mirroring, Graham Kent looks at the kind of information he wants when troubleshooting problems with database mirroring, and also points us to Microsoft guidance on the topic.  It’s possible that my customer could have tweaked mirroring somehow to keep it going.
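
For the first option, the switch is a single command on the principal, though again it requires Enterprise Edition.  The database name below is a placeholder.

    -- Enterprise Edition only: move from high-safety (synchronous) to
    -- high-performance (asynchronous) mirroring.
    ALTER DATABASE [YourDatabase] SET PARTNER SAFETY OFF;

    -- And to switch back to synchronous mode:
    -- ALTER DATABASE [YourDatabase] SET PARTNER SAFETY FULL;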

In the end, after shutting off mirroring, we saw a significant performance improvement.  It wasn’t enough and I still needed to modify some code, but this at least helped them through the immediate crisis.  They lost the benefit of having mirrored instances—knowing that if one instance goes down, another can come up very quickly to take over—but because the RTO/RPO requirements were fairly loose, we decided that we could sacrifice this level of security in order to obtain sufficient performance.

Spinach And Databases

In a video, the pseudonymous Phil Factor explains the necessity of constraining data, preventing as much as possible the entry of bad data.

A few thoughts on the video:

  1. When reading Phil’s work, I built up a mental image of him.  Phil on video is pretty close to that image, though I can’t say his appearance is exactly what I expected.
  2. This video sounds like a Phil Factor essay read aloud.  I enjoyed it a lot.
  3. Phil mentions that the iron content in spinach was off due to a misplaced decimal point.  It looks like that was proven incorrect.  Nevertheless, I approve of his disapproval of that ghastly vegetable.

Curated SQL Is Live

Curated SQL is now live.  I’ve been putting links in for the past week, and I’m slowly starting to advertise the site.  My goal is to build it out further over the next several weeks, improving the Twitter account, trying to draw in a regular viewership, and showing the value of curating technical posts across the world of SQL Server blogs.

As part of this, I’m actively expanding my blogroll.  So far, I have somewhere in the range of 80-90 SQL Server-specific blogs, and I’ll keep expanding the list.  I may also put out some feelers for a collaborator or two, but I definitely want to give the site a distinctive feel before opening it up in that direction.

Beware Functions In SQL Server

Aren’t Functions A Good Thing?

As developers, we want to modularize code in order to prevent duplication.  There are several laudable reasons to prevent code duplication.  The simplest reason is that we all have better things to do than type the same lines of code over and over.  Putting code into a common module or function means that we make calling that code simpler and we type less.

Another reason to prevent code duplication is to prevent errors.  If we have five versions of a calculation and that calculation later changes, it’s easy to accidentally miss one of those changes and end up with two separate versions of the calculation.  The best case scenario is that this will cause users to doubt the veracity of your data; the worst case scenario is that you will permanently damage your application by unknowingly storing incorrect data that you cannot later fix.

Yet another reason developers want to use functions is to make testing easier.  When a piece of logic exists in one place, we can write a single set of tests to cover that logic.  If you have logic strewn about in many different functions, it makes testing harder for the same reason as above:  you might miss one version of the calculation or operation.

All of this shows that there are definite advantages to writing modular code using functions or other language constructs.

So Why Are They A Bad Thing?

In SQL Server, modularization usually comes in three forms:  stored procedures, functions, and views.  I am a huge fan of stored procedures, am somewhat wary of views, and am strongly anti-function.  The reason comes down to performance.  Stored procedures have no performance overhead compared to ad hoc SQL, so there is no performance-related reason not to use them.  Views theoretically should not have any performance impact on an environment, but when you nest views deeply enough, or hit complex enough views, the query optimizer can get confused and generate an unnecessarily complex execution plan which performs poorly.  Oftentimes, an easy fix to a poorly-performing query involving a view is to bring the relevant view code inline.

Functions are like views, except much, much worse.  The problem tends to be the same regardless of whether you use scalar, multi-statement table-valued, or inline table-valued functions:  they perform poorly, much worse than their alternatives.  If you create a user-defined scalar function and use it in your WHERE clause or in a JOIN condition, you can pretty much guarantee that SQL Server will go through your table row by row, running that function every time.  This is bad enough if your function is a fairly simple calculation, but if you need to look up data from another table, you’re performing a separate table lookup every single time.

Suppose you have 10 million rows in your main table and a function which does a quick lookup against a tiny table with just 30 rows.  To make the example starker, let’s say that the 10 million rows are ordered in such a way that we would profit most from a merge join against the 30-row table.  It would make sense to grab those 30 rows and stream the 10 million rows through, matching each against the relevant lookup row along the way.  But if we put this lookup operation into a function, it gets called once for each of the 10 million rows.  That most likely will not be a merge join, and it will lead to a huge amount of I/O.  The worst part is that your execution plan will show what one lookup looks like but fail to mention that it’s going to happen 10 million times, meaning that if you don’t understand the performance problems of functions, it’s easy to look at the plan and say that the function will actually perform better…until you see the total time elapsed!
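
As a contrived sketch of the problem (the table, column, and function names here are made up for illustration), compare a scalar function used per row against the equivalent inline join:

    -- A hypothetical scalar function which looks up a multiplier from a tiny table.
    CREATE FUNCTION dbo.GetRateMultiplier (@RateCode CHAR(2))
    RETURNS DECIMAL(5, 2)
    AS
    BEGIN
        DECLARE @Multiplier DECIMAL(5, 2);

        SELECT @Multiplier = r.Multiplier
        FROM dbo.Rate r
        WHERE r.RateCode = @RateCode;

        RETURN @Multiplier;
    END
    GO

    -- Calling the function forces a row-by-row lookup against dbo.Rate,
    -- once for every row in dbo.Shift.
    SELECT s.ShiftID, s.Hours * dbo.GetRateMultiplier(s.RateCode) AS AdjustedHours
    FROM dbo.Shift s;

    -- The inline equivalent lets the optimizer choose a set-based join
    -- (ideally a single pass over each table) instead.
    SELECT s.ShiftID, s.Hours * r.Multiplier AS AdjustedHours
    FROM dbo.Shift s
        INNER JOIN dbo.Rate r
            ON r.RateCode = s.RateCode;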

A Case Study:  User Total Time

A customer of mine produces software which optimizes employee schedules.  They track shift times, figure out when an employee hits the overtime cutoff, and can report on shift data from the past, present, and future.  The way they did this was with a function called UserTotalTime().  This function called a few scalar functions to check parameters, and then performed a somewhat expensive set of table joins, aggregating the data into a result set with the user ID, time period, and number of hours worked plus scheduled, leading to a determination of whether that employee has overtime hours or not.

When looking at one user, or even all users over a single time period, the process didn’t perform that badly, and so this function was integrated into more and more code and more and more complex procedures.  At that point, they started to notice the solution beginning to burst at the seams.  Queries for larger customers were taking an unacceptably long time, and they weren’t sure what they could do to improve the situation.  It got to the point where they started deleting some data to limit how many rows were in the tables; that provided a temporary boost to performance, but it wasn’t a good long-term strategy because it limited future analytics.

What To Do?

Taking a look at the function, the first thing I realized was that this generally did not need to be a function, nor did it need to be as complex as it was.  The scalar functions that checked parameters could be simplified or eliminated, as could part of the main calculation.  At that point, I could inline the remaining code and modify procedures to build the result set directly instead of calling a function.

In this case, however, I went a step further.  This customer’s database is really closer to a warehouse than a true transactional system:  the read-to-write ratio is in the thousands or tens of thousands; people rarely update data, but they definitely read it a lot.  As a result, I ended up creating a reporting table which holds the aggregated results of this query, updated as part of an ETL process which runs whenever rows get updated, as well as periodically (just in case somebody modifies data in a non-standard manner).  This works well for their environment, although it certainly isn’t a solution to every function-based performance problem.
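
A rough sketch of that reporting-table approach, with hypothetical table and column names and an arbitrary 40-hour overtime threshold (the real solution depends entirely on the customer’s schema and business rules):

    -- Pre-aggregated results, rebuilt by the ETL job rather than computed per query.
    CREATE TABLE dbo.UserTotalTimeReport
    (
        UserID INT NOT NULL,
        PeriodStartDate DATE NOT NULL,
        HoursWorked DECIMAL(7, 2) NOT NULL,
        HoursScheduled DECIMAL(7, 2) NOT NULL,
        IsOvertime BIT NOT NULL,
        CONSTRAINT [PK_UserTotalTimeReport] PRIMARY KEY CLUSTERED (UserID, PeriodStartDate)
    );

    -- The ETL process reloads the aggregate whenever shift data changes,
    -- plus on a schedule as a safety net.
    TRUNCATE TABLE dbo.UserTotalTimeReport;

    INSERT INTO dbo.UserTotalTimeReport
        (UserID, PeriodStartDate, HoursWorked, HoursScheduled, IsOvertime)
    SELECT
        s.UserID,
        s.PeriodStartDate,
        SUM(s.HoursWorked),
        SUM(s.HoursScheduled),
        CASE WHEN SUM(s.HoursWorked) + SUM(s.HoursScheduled) > 40 THEN 1 ELSE 0 END
    FROM dbo.Shift s
    GROUP BY s.UserID, s.PeriodStartDate;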

What Can You Do?

The easiest way to fix performance problems with functions is to remove the function and bring its code inline.  This doesn’t feel very nice as a developer, as it means copying code and violating the Don’t Repeat Yourself mantra, but if you’re running into performance problems, a little repetition is a small price to pay for code which performs up to business needs.

Bonus Answer:  APPLY Yourself

Another tool to help get rid of pesky functions is the APPLY operator.  I just happen to have a presentation on the topic.  The great thing about the APPLY operator is that it operates like a table-valued function, but because the code is inline, you don’t get the performance problems that you would with a function.
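
Here is a minimal sketch of that idea, again with made-up table names:  the per-user calculation sits in a CROSS APPLY block instead of a table-valued function, so the optimizer sees it as part of the main query.

    -- Inline the per-user aggregation with CROSS APPLY rather than
    -- calling a table-valued function once per user.
    SELECT
        u.UserID,
        u.UserName,
        t.TotalHours
    FROM dbo.Users u
        CROSS APPLY
        (
            SELECT SUM(s.HoursWorked) AS TotalHours
            FROM dbo.Shift s
            WHERE s.UserID = u.UserID
        ) t;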

Stay Tuned

In tomorrow’s blog post, I’m going to circle back around to UserTotalTime, showing why a function appears useful, how it can perform poorly, and how we can deconstruct a function and get back an acceptable level of performance.