I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort which Pablo Lozano and Efraim Sharon pushed hard internally and several database administrators and database engineers contributed to (though I wasn’t one of them).
From the summary page:
This is a collection of tools (comprised of a combination of views, procedures, functions…) developed using the Query Store functionality as a base to facilitate its usage and reports’ generation. These include but are not limited to:
– Implementations of SSMS’ GUI reports that can be invoked using T-SQL code, with added funcionalities (parameterization, saving results to tables) so they can be programmatically executed and used to send out mails. – Quick analysis of a server’s overall activity to identify bottlenecks and points of high pressure on the SQL instance at any given time, both in real time or in the past. – Cleanup of QDS’ cache with a smaller footprint than the internal one generates, with customization parameters to enable a customizable cleanup (such as removing information regarding dropped objects, cleaning details of ad-hoc or internal queries executed on the server as index maintenance operations).
The biggest of these is the third item. In our environment, Query Store could be a beast when trying to delete old data, and would often be the biggest performance problem on a given server.
In addition, several procedures exist as a way of aggregating data across databases. We have a sharded multi-tenant environment, where we might have 5-15 replicas of a database schema and assign customers to those databases. QDS Toolbox helps aggregate information across these databases so that you don’t need to look at each individually to understand performance problems. The database team has then created reports off of this to improve their understanding of what’s going on.
Check out the QDS Toolbox as a way to clean up data better than the built-in cleanup process and get additional information aggregated in a smart way.
In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.
If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.
Setting the Stage: Transactions and Modes
What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.
But before I get too far off course, let’s briefly lay out some of the basics around transactions.
When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction. A transaction is, according to Microsoft Docs, a single unit of work. Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing. The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.
What I do want to get to is that there are three kinds of transactions: autocommit transactions, explicit transactions, and implicit transactions. There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.
In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.
The easy recommendation is, don’t use implicit transactions. For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end. If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction. I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them. So let’s throw this one away as a recommendation.
My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit. I have a few reasons for this.
First, consistency. Sometimes you will need explicit transactions. For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction. That way, either all three operations succeed or none of them succeed. Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
Second, explicit transactions give you clarity around what is actually necessary in a transaction. Suppose you query a table and load the results into a temporary table. From there, you make some modifications, join to other tables, and reshape the data a bit. So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here. When you take the final results and update a real table, now we want to open a transaction. By using an explicit transaction, I make it clear exactly what I intend to have in the transaction: the update of a real table, but not the temp table shenanigans.
Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables. You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything at one statement or break it up into smaller chunks.
Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside. Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration. This will minimize the amount of time you block any single operation waiting to access this table. It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off you can only make if you use explicit transactions.
Rules of Thumb
First, if you have a stored procedure which is simply running a SELECT statement, use autocommit. There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.
Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data. Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.
As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.
Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction. If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.
Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it. In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible. Also, If one loop iteration fails, you’ll have less you need to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.
Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.
Sixth, watch out for nested transactions. In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them. Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:
CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
IF ( @@TRANCOUNT > 0 )
SET @AlreadyInTransaction = 1;
SET @AlreadyInTransaction = 0;
-- Note: this is where you'd put your data modification statements.
1.0 / @Divisor AS Quotient;
IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;
--Start an explicit transaction
EXEC dbo.GetFraction @Divisor = 5;
Finally, make sure you roll back the transaction on failure. If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.
What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.
Please note that all of the times are in New Zealand Standard Time (NZST). This is UTC+12, so if you’re on the Eastern Daylight Time zone like me (UTC-4), that’s 16 hours ahead. In other words, I present on Friday starting at 4:55 PM EDT.
What: Data Platform Summit 2020 Where: Online When: November 30th through December 8th Register at the Data Platform Summit site.
What I’m Presenting
Monday, December 7th and Tuesday, December 8th, 12 PM through 4 PM EST — Data Analysis in Python and SQL
This is a paid post-event training. I promise to spend no more than about 20 minutes complaining about how data analysis is a lot easier to do in R… But I’m really looking forward to this—two separate 4-hour days digging into data analysis techniques using T-SQL as well as Python.
I’m also scheduled to present a regular session, though that hasn’t been announced yet, so stay tuned for that.
Here is the PASS Summit writeup if you’d like more information. And if you do decide to register for this full-day training, you’ll save $200 off of what it would have been had we attended PASS Summit in person.
The Curated Data Platform
I am really excited about this talk. Several years ago, I had a talk called Big Data, Small Data, and Everything in Between, and the idea of the talk was to walk through various data platform technologies and see where they fit.
That talk isn’t that out of date, but I decided to revamp it entirely, taking advantage of my insanity dedication as a Curator to give it a better name and a better theme.
The idea now is, let’s take a fictional but realistic company, walk through the types of data problems it experiences, and see which data platform technologies solve its problems, along with the biggest players in those spaces, and some reference architectures to boot.
The talk is currently under development and I plan to revise it a fair bit between now and Summit, but here’s a sneak peek of the agenda:
6:00 PM — 7:30 PM EDT — IoT and Machine Learning in Azure
This won’t be a formal talk so much as it is a discussion of IoT strategies around Azure. I’ll talk about combining together several services in Azure, where the pain points are, and discuss a few alternative strategies around processing and analyzing data.
I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.
Resource Governor and Default Memory
If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:
By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).
Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
Resource Governor and CPU
Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.
Well, what happens if you accidentally change that? I found out the answer the hard way!
In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.
The trickiest part about diagnosing this was that the Launchpad logs error messages gave no indication what the problem was—the error message was a vague “could not connect to Launchpad” error and the Launchpad error logs didn’t have any messages about the failed queries. So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.