Upcoming Events: Hampton Roads SQL Server User Group

Key Details

What: Hampton Roads SQL Server User Group.
Where: GoToWebinar.
When: Wednesday, August 19th.
Admission is free. Register on the Hampton Roads SQL Server User Group Meetup.

What I’m Presenting

6:30 PM — 8:30 PM EDT — The Curated Data Platform

This will be a preview of the talk that I will present at PASS Summit.

Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, DELETE, MERGE, or TRUNCATE) or modify tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole; you won't end up with some operations succeeding and others not.  It's really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that's a little far afield of where I want to go, I'll give you a link if you'd like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There's actually a fourth kind, batch-scoped transactions, but that one only applies when you're using Multiple Active Result Sets (MARS), and if you find yourself there, you've got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.
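
If you'd rather skim than watch, here is a minimal sketch of all three modes.  The table dbo.SomeTable and its columns are made up purely for illustration; the statements assume it already exists.

-- Autocommit (the default):  each statement is its own transaction
-- and commits automatically if it succeeds.
UPDATE dbo.SomeTable SET Val = Val + 1 WHERE Id = 1;

-- Explicit transactions:  you mark the boundaries yourself.
BEGIN TRANSACTION;
    UPDATE dbo.SomeTable SET Val = Val + 1 WHERE Id = 1;
    UPDATE dbo.SomeTable SET Val = Val - 1 WHERE Id = 2;
COMMIT TRANSACTION;

-- Implicit transactions:  SQL Server quietly opens a transaction on the
-- first qualifying statement, but committing or rolling back is on you.
SET IMPLICIT_TRANSACTIONS ON;
UPDATE dbo.SomeTable SET Val = Val + 1 WHERE Id = 1;
SELECT @@TRANCOUNT AS OpenTransactions;  -- Returns 1, even though we never wrote BEGIN TRANSACTION.
COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;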

Recommendations

The easy recommendation is:  don't use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior; the default is autocommit, so if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you turn implicit transactions on, there is no UI indication that they are on, and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it also becomes really easy to start a transaction, forget to commit or roll back, and leave for lunch, blocking access to a table for a considerable amount of time.  And if you're using Azure Data Studio, it appears that implicit transactions might not even work, so you'd be in a world of hurt if you were relying upon them.  So let's throw this one away as a recommendation.

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
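
As a sketch of that scenario, the pattern looks something like the following.  The table names, columns, and @SomeKey value are all hypothetical.

DECLARE @SomeKey INT = 42;

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.TableB (SomeKey, SomeValue)
    SELECT SomeKey, SomeValue FROM dbo.TableA WHERE SomeKey = @SomeKey;

    UPDATE dbo.TableC SET LastMovedDate = GETUTCDATE() WHERE SomeKey = @SomeKey;

    DELETE FROM dbo.TableA WHERE SomeKey = @SomeKey;

    -- All three operations succeeded, so make the whole unit of work permanent.
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Any failure undoes all three operations, not just the one that broke.
    IF ( @@TRANCOUNT > 0 )
        ROLLBACK TRANSACTION;

    THROW;
END CATCH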

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you're locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, which means you don't have to do that work while holding locks on the real table.  In the best case, autocommit will behave the same, but saying "Here is where I want my transaction to be" also lets you think about whether you really want to do everything in one statement or break it up into smaller chunks.
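
Here's a rough sketch of that shape, covering both of the last two points; the table and column names are made up for illustration.

-- The heavy lifting happens outside the transaction, against tempdb only.
SELECT
    s.CustomerID,
    SUM(s.Amount) AS TotalAmount
INTO #Staging
FROM dbo.Sales s
GROUP BY
    s.CustomerID;

-- Reshape, join, and massage #Staging as much as you like here:  no transaction needed yet.

-- Only the touch of the real table sits inside the explicit transaction.
BEGIN TRANSACTION;

UPDATE c
SET c.LifetimeValue = s.TotalAmount
FROM dbo.Customer c
    INNER JOIN #Staging s
        ON s.CustomerID = c.CustomerID;

COMMIT TRANSACTION;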

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.
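
As a minimal sketch of that trade-off, here are the two placements side by side against a hypothetical dbo.BigTable.

-- Option 1:  one transaction per iteration.  Commit as you go, so no single
-- block of rows stays locked for very long.
DECLARE @RowsAffected INT = 1;

WHILE ( @RowsAffected > 0 )
BEGIN
    BEGIN TRANSACTION;
        DELETE TOP (10000) FROM dbo.BigTable WHERE IsArchived = 1;
        SET @RowsAffected = @@ROWCOUNT;
    COMMIT TRANSACTION;
END;

-- Option 2:  one transaction in total.  The entire purge succeeds or fails
-- together, but locks are held until the final commit.
BEGIN TRANSACTION;

SET @RowsAffected = 1;
WHILE ( @RowsAffected > 0 )
BEGIN
    DELETE TOP (10000) FROM dbo.BigTable WHERE IsArchived = 1;
    SET @RowsAffected = @@ROWCOUNT;
END;

COMMIT TRANSACTION;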

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.
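
A bare-bones sketch of that parent-child pattern might look like the following; the procedure names are made up, and the assumption is that the child procedures modify data but do not open transactions of their own.

CREATE OR ALTER PROCEDURE dbo.DoParentThing
AS
BEGIN TRY
    BEGIN TRANSACTION;

    -- Each child performs its own data modification inside the parent's transaction.
    EXEC dbo.DoChildThing1;
    EXEC dbo.DoChildThing2;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- If either child fails, everything both children did rolls back.
    IF ( @@TRANCOUNT > 0 )
        ROLLBACK TRANSACTION;

    THROW;
END CATCH
GO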

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I'm blocking others.  This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, if one loop iteration fails, you'll have less to roll back, so you can fix the issue and pick back up where you left off.  Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that's a big enough concern, bite the bullet and put an explicit transaction around the loop.  Or see if you can make it faster without a loop.

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.
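
When I'm in that situation, the ad hoc version of the habit looks something like this (hypothetical table and criteria):

BEGIN TRANSACTION;

UPDATE dbo.ImportantTable
SET Status = 'Inactive'
WHERE LastOrderDate < '2015-01-01';

-- Sanity check before committing:  did I change roughly the number of rows I expected?
SELECT @@ROWCOUNT AS RowsAffected;

-- If that looks right:
-- COMMIT TRANSACTION;
-- If it doesn't:
-- ROLLBACK TRANSACTION;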

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
AS
 
DECLARE
    @AlreadyInTransaction BIT;

BEGIN TRY 
    IF ( @@TRANCOUNT > 0 )
    BEGIN
        SET @AlreadyInTransaction = 1;
    END
    ELSE
    BEGIN
        SET @AlreadyInTransaction = 0;
        BEGIN TRANSACTION;
    END;
 
    -- Note:  this is where you'd put your data modification statements.
    SELECT
        1.0 / @Divisor AS Quotient;
 
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        COMMIT TRANSACTION;
    END;
END TRY
BEGIN CATCH
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        ROLLBACK TRANSACTION;
    END;

    THROW;
END CATCH
GO

--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
SELECT @@TRANCOUNT;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or roll back at the beginning of the CATCH block.  Explicit transactions offer you more power, but they come with the responsibility of handling transactions appropriately.

Thoughts?

What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.

Upcoming Events: SQL Saturday Auckland

Key Details

What: SQL Saturday Auckland.
Where: New Zealand, but also the Internet.
When: Saturday, August 8th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

8:55 AM — 9:40 AM NZST — Applying Forensic Accounting Techniques with SQL and R

Please note that all of the times are in New Zealand Standard Time (NZST). This is UTC+12, so if you're in the Eastern Daylight Time zone like me (UTC-4), that's 16 hours ahead. In other words, I present on Friday starting at 4:55 PM EDT.

Upcoming Events: Data Platform Summit

Key Details

What: Data Platform Summit 2020
Where: Online
When: November 30th through December 8th
Register at the Data Platform Summit site.

What I’m Presenting

Monday, December 7th and Tuesday, December 8th, 12 PM through 4 PM EST — Data Analysis in Python and SQL

This is a paid post-event training. I promise to spend no more than about 20 minutes complaining about how data analysis is a lot easier to do in R… But I’m really looking forward to this—two separate 4-hour days digging into data analysis techniques using T-SQL as well as Python.

I’m also scheduled to present a regular session, though that hasn’t been announced yet, so stay tuned for that.

My Presentations at PASS Summit 2020

I have two sessions at PASS Summit 2020, one a full-day training and the other a general session.

SQL Server on the Edge: a Full-Day Training

I posted about this when Summit pre-cons were announced and the info is still good. The short version is, I'm going to take people through a real-ish IoT scenario involving SQL Server, .NET Core, Azure, and some IoT devices. We'll see how things work, dive into Azure SQL Edge, and have some fun with Raspberry Pis along the way.

Here is the PASS Summit writeup if you’d like more information. And if you do decide to register for this full-day training, you’ll save $200 off of what it would have been had we attended PASS Summit in person.

The Curated Data Platform

I am really excited about this talk. Several years ago, I had a talk called Big Data, Small Data, and Everything in Between, and the idea of the talk was to walk through various data platform technologies and see where they fit.

That talk isn't that out of date, but I decided to revamp it entirely, taking advantage of my insanity (er, dedication) as a Curator to give it a better name and a better theme.

The idea now is, let’s take a fictional but realistic company, walk through the types of data problems it experiences, and see which data platform technologies solve its problems, along with the biggest players in those spaces, and some reference architectures to boot.

The talk is currently under development and I plan to revise it a fair bit between now and Summit, but here’s a sneak peek of the agenda:

Register for PASS Summit

Registration is still open for PASS Summit 2020, so join me for the virtual event. And as a special offer, I’m giving away free hot takes for anyone who wants one. If that doesn’t seal the deal, I don’t know what would.

Upcoming Events: TriPASS Data Science

Key Details

What: TriPASS Data Science sub-group
Where: TriPASS on Twitch.
When: Tuesday, July 26th.
You can RSVP via Meetup or just show up on Twitch.

What I’m Presenting

6:00 PM — 7:30 PM EDT — IoT and Machine Learning in Azure

This won't be a formal talk so much as a discussion of IoT strategies around Azure. I'll talk about combining several services in Azure, cover where the pain points are, and discuss a few alternative strategies around processing and analyzing data.

With ML Services, Watch Those Resource Groups

I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.

Resource Governor and Default Memory

If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:

That’s a mighty fine cap you’re wearing.
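
If you'd like to check the cap on your own instance rather than squint at a screenshot, the relevant columns come straight from that catalog view:

SELECT
    external_pool_id,
    name,
    max_cpu_percent,
    max_memory_percent,
    max_processes
FROM sys.resource_governor_external_resource_pools;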

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
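
I won't reproduce the full scripts here, but the heart of the change is a single ALTER against the default external resource pool followed by a reconfigure. As a minimal sketch, with 50 as a stand-in value you'd tune for your environment:

-- Raise the external script memory cap from the default 20% to 50% (example value).
ALTER EXTERNAL RESOURCE POOL "default" WITH (MAX_MEMORY_PERCENT = 50);
-- The change does not take effect until you reconfigure Resource Governor.
ALTER RESOURCE GOVERNOR RECONFIGURE;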

Resource Governor and CPU

Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.

Well, what happens if you accidentally change that? I found out the answer the hard way!

In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.

Identities blurred to protect the innocent.
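
The remedy is the same flavor of statement as the memory change above; putting the default external pool back to its 100 percent CPU cap looks something like this:

-- Restore the default CPU cap for external scripts.
ALTER EXTERNAL RESOURCE POOL "default" WITH (MAX_CPU_PERCENT = 100);
ALTER RESOURCE GOVERNOR RECONFIGURE;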

The trickiest part about diagnosing this was that the error messages gave no indication of what the problem was:  the client saw a vague "could not connect to Launchpad" error, and the Launchpad error logs didn't have any messages about the failed queries. So that's one more thing to keep in mind when troubleshooting Machine Learning Services failures.

PolyBase and Excel: TOP Now Works

Back with SQL Server 2019 CU2, I reported an error with PolyBase connecting to Excel when trying to select TOP(10) from the table. I’m using the Microsoft Access Database Engine 2016 Redistributable’s Excel driver.

Here’s a sample external data source and table statement:

CREATE EXTERNAL DATA SOURCE VolcanoType WITH
(
    -- With the generic ODBC connector and the Excel driver, the location host is
    -- effectively a placeholder; DBQ below points at the actual workbook.
    LOCATION = 'odbc://noplace',
    CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=C:\SourceCode\Polybase\Scripts\04 - Data Virtualization\VolcanoTypes.xlsx'
);

CREATE EXTERNAL TABLE dbo.VolcanoType
(
    Type NVARCHAR(100),
    Description NVARCHAR(1000)
)
WITH
(
    -- The worksheet name acts as the "table" within the workbook.
    LOCATION = '[VolcanoTypes$]',
    DATA_SOURCE = VolcanoType
);

Prior to CU5, I could run SELECT * FROM dbo.VolcanoType successfully, but trying SELECT TOP(10) * FROM dbo.VolcanoType would return the following error:

Msg 7320, Level 16, State 110, Line 1
Cannot execute the query "Remote Query" against OLE DB provider "MSOLEDBSQL" for linked server "(null)". 105082;Generic ODBC error: [Microsoft][ODBC Excel Driver] Syntax error (missing operator) in query expression '`T_0`.`Type` `Type`'. .

Now, I could get this to work by turning off external pushdown: SELECT TOP(10) * FROM dbo.VolcanoType OPTION(DISABLE EXTERNALPUSHDOWN);

But now with CU5, I was excited to see bug reference 13525968 corrected:

Disables PolyBase Generic ODBC External Data Sources’ default behavior of pushing down the TOP operator and calling the SQLRowCount function

After installing CU5, I can run a TOP() operation hitting Excel and it won't fail, so that's good. Unfortunately, another bug still hasn't been fixed, so I'm hoping they're able to take care of it in a future CU.