Upcoming Events: Data Platform Summit 2020

I’ve talked about Data Platform Summit before, but now that we’re close to go time, I wanted to mention it again.

This year, I will be presenting one breakout session and one training class.

The breakout session is entitled Optimizing Backup Performance using Data Science Techniques. In it, I walk through the configuration settings we have available to us for tuning backups (and restorations) of databases and then apply data science techniques to find the best set for a given environment. It’s a great way of applying data science skills to a classic database administration task and should be fun for both audiences. That session, along with more than 200 others, will run from December 2nd through the 4th.

The training, meanwhile, will take place over two days, December 7th and 8th. The training is entitled Applying Forensic Accounting Techniques to Data Analysis in Python and SQL and is a lengthy enough title that it’s self-explanatory. In it, I answer the perennial question for data platform specialists: what does it mean to know your data? The same processes and techniques forensic accountants use to analyze data sets can help us gain a detailed understanding of data in our own environments and become more proficient. Along the way, we uncover a bit of dirty dealing in an artificial data set because hey, where’s the fun in talking about forensic accounting if we can’t have a fraud-based example?

Save Some Money

Use the discount code TRIANGLESQL to save big on sessions and trainings, and also kick a little bit to the Triangle Area SQL Server Users Group.

Q&A: The Curated Data Platform

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.

During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Cross-Platform Data Strategies

The first question was, “What handles the translation between the relational truth system and the Document system?” The context of the question comes from a discussion about product catalogs, and specifically this slide.

Document databases are great for things like product catalogs, where we meet the following properties:

  • Data has a very high read-to-write ratio.
  • You generally look at one item per page—in this example, one product.
  • The set of data to appear on a page is complex and typically has nested items: a product has attributes (title, price, brand, description) but also collections of sub-items (product image links, current stock in different stores, top reviews, etc.).
  • The data is not mission-critical: if updates are delayed or even occasionally lost, that is acceptable.

But I do like keeping a “golden record” version of the data and my biases push me toward storing that golden record in a relational database. I mentioned two processes in my talk: a regular push on updates and an occasional full sync to true up the document database(s).

And that leads to the question of, how do we do that? There are products from companies like CData and Oracle which can handle this, or you can write your own. If your source is SQL Server, I’d push for a two-phase process:

  1. Enable Change Data Capture on the SQL Server instance and have a scheduled task query the latest changes and write them to your document database(s). You can use constructs like FOR JSON PATH in SQL Server to shape the documents directly, or pull in the source data and shape it in your application code; see the sketch after this list.
  2. Periodically (e.g., once an hour, once a day), grab all of the data, shape the documents, and perform a comparison with what’s out there. This will confirm that nothing slips through the cracks for longer than one time period and will keep disparate clusters of document data consistent with the golden record.
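To make the first step a bit more concrete, here’s a minimal sketch of what it might look like in T-SQL. The dbo.Product and dbo.ProductImage tables, their columns, and the LSN handling are all hypothetical; in a real job you’d also store the last LSN you processed so each run only picks up new changes.

-- Enable CDC on the database and on a hypothetical product table.
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'Product',
    @role_name = NULL;
GO

-- A scheduled task can then read the latest changes and shape documents with FOR JSON PATH.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Product');
DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT
    p.ProductID,
    p.Title,
    p.Price,
    p.Brand,
    (
        SELECT i.ImageUrl
        FROM dbo.ProductImage i
        WHERE i.ProductID = p.ProductID
        FOR JSON PATH
    ) AS Images
FROM cdc.fn_cdc_get_all_changes_dbo_Product(@from_lsn, @to_lsn, N'all') c
    INNER JOIN dbo.Product p
        ON p.ProductID = c.ProductID
FOR JSON PATH;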

Of course, this golden record doesn’t need to be in a relational database—you could store it in a document database and use replication there to push data to different clusters. If you use Cosmos DB, for example, you can replicate to other regions easily.

Document Databases: Scale-Out vs Replication

Another attendee asked about “Document databases and scale-out vs replication.” In retrospect, I think I misinterpreted the question as asked, as I mentioned that scale-out and replication are one and the same: you replicate data between nodes in a cluster to achieve scale-out.

But this time around, I’m going to answer the question, “How do I choose between making my current cluster bigger and replicating out to a new cluster?”

Here are some key considerations:

  • If the issue you are trying to solve is geographical in nature, replicate out to a new cluster closer to your users. In other words, suppose you have a cluster hosted in Virginia. Many of your users are in Japan, so they have to deal with the network latency of pulling data from a Virginia-based data center. If this is the problem, create another document database cluster in Japan and replicate to it from Virginia.
  • If your cluster is in Virginia and is getting hammered hard by relatively local users, scaling out (that is, adding more servers to the existing cluster) is generally a good option. Depending on your technology, there will be a maximum number of nodes or a maximum scale-out size available to you, so you’ll want to check those details.
  • If you’re getting close to that maximum scale-out size, it may make sense to replicate to another cluster in the same region and use a load balancer to shift load between the two. I have to be vague here because different technologies have different limits and I’m definitely not an expert on any document database technology’s systems architecture.

Cosmos DB and MongoDB

Another attendee asked, “I have heard that Azure Cosmos DB is built upon an older version of MongoDB – do you know if this is true?”

The answer is no, it’s not. The two platforms are different. I believe where the question comes from is around the MongoDB API for Cosmos DB. For a while, Cosmos DB supported an older version of the MongoDB API, specifically 3.2, which was released in December of 2015. New Cosmos DB accounts support the MongoDB 3.6 API, which is still active.

But something I want to point out is that the API is an interface, not an implementation. That Cosmos DB supports a specific MongoDB API version doesn’t mean that the code bases are similar; it only means that you can safely (well, presumably safely) interact with both and expect to get the same results when you perform the same set of API steps with the same inputs.

Graph Languages

My last question came from an attendee who mentioned, “I thought GraphQL was the common standard for graph querying.”

The context for this is in my discussion of graph databases, particularly the slide in which I talk about the key issues I have with graph databases. For a bit more background than what I had time to get into during the session, Mala Mahadevan and I have talked about graph databases in a little bit of detail on a couple of occasions, once on the SQL Data Partners Podcast and once on Shop Talk.

As for the question, the comment I had made was that there is no common graph language. We have SQL for relational databases (and other mature data platform technologies) but historically haven’t had a common language for graph platforms, meaning that you have to learn a new language each time you move to a different platform. The Gremlin language is an attempt at creating a single language for graph databases and it’s making enough strides that it may indeed become the standard. But it’s not there yet.

Meanwhile, GraphQL, despite the name, is not a language for graph databases. It’s actually a language for querying APIs. The key idea is that you ask for data from an API and you get back just the data you want. But behind the API, your data can be stored in any sort of data source—or even multiple data sources. In other words, I might expose a product catalog API which hits Cosmos DB, a finance API which hits SQL Server, and a product associations API which hits Neo4j. Those three APIs could all be queried using GraphQL, as it’s up to the API to interpret inputs and return the appropriate outputs.

Approaching Zero: The Video

Thanks to a bit of gentle nudging from Anthony Nocentino (b | t), I’ve finally put together a recording of my session entitled Approaching Zero: Minimizing Downtime During Deployments.

I tried to work the word “asymptotic” into the title, but my editor didn’t let me.

It’s a long video but there’s a lot of content that I hope you enjoy. Long-time blog readers might remember my series on the topic, which gives you a bit more detail. You can also get the slides, code, and links to additional resources on my presentation page.

Finally, one call-out to Carlos L. Chacon (w | t): I’m not saying you chose a bad color, but one of us has a problem here:

It’s not easy being wiped out by a chroma key.

PASS Virtual Summit is Next Week

I have two sessions at PASS Virtual Summit 2020 next week, one a full-day training and the other a general session.

SQL Server on the Edge: a Full-Day Training

Next Monday, I’m giving a pre-con on using SQL Server, .NET Core, and Azure to help solve IoT issues. We’ll walk through a realistic scenario involving building security via employee key cards, show how to use Azure IoT Edge and Azure SQL Edge to power edge devices, and perform machine learning on employee comings and goings to track behavioral changes.

Here is the PASS Summit writeup if you’d like more information. Use the discount code LGDISCWY7 to save a bit of cash along the way.

The Curated Data Platform

In addition, I’m going to give a regular session entitled The Curated Data Platform.

The idea is, let’s take a fictional but realistic company, walk through the types of data problems it experiences, and see which data platform technologies solve its problems, along with the biggest players in those spaces, and some reference architectures to boot.

Register for PASS Summit

Registration is still open for PASS Summit 2020, so join me for the virtual event. Use the discount code LGDISCWY7 for additional savings and I hope to see you there.

PolyBase and Postgres

Last month, I presented at year 4 of SQL Trail, a conference normally held in Richmond, Virginia but entirely online this year for obvious reasons. Before the conference, one of the attendees asked for a session on PolyBase, specifically around accessing Postgres data from SQL Server. I didn’t have an example on that, so I figured I could come up with one and (eventually) blog about it.

Preparing Postgres

I didn’t have a Postgres server lying around anywhere, so I decided to set up my own using a Docker container:

docker pull postgres
docker run --name pg -e POSTGRES_PASSWORD=pgtestpwd -p 5432:5432 -d postgres

This does two things: first, it pulls down the latest Postgres image from Docker Hub; then, it creates a container named pg based on the postgres image, forwarding requests on port 5432 to the Postgres container and setting the administrative password to pgtestpwd, clearly a secure password.

Next up, we need to retrieve the Postgres ODBC driver. You can find drivers on the Postgres website. I’m using Windows for this demo, so I grabbed the latest MSI from the MSI directory. We need to install the ODBC driver on the machine which is running SQL Server. In this example, I have SQL Server installed on my PC, not in a Docker container or VM, so I’ll install the ODBC driver directly on my PC. In a prod scenario, you’d install it on your production machine running SQL Server, not the production machine running PostgreSQL.

After that, we need to get some data for our database. I’m going to create a simple table called event which stores event data. Because the data itself doesn’t really matter, roll with me here. You can connect to your new Postgres server from Azure Data Studio, pgadmin4, or whatever other tool you’d like.

If you do connect to Postgres via Azure Data Studio, like I did, you want to make sure that you’ve set the connection type correctly:

Connect to a Postgres instance

If you don’t see this connection type, make sure you have the PostgreSQL extension for Azure Data Studio.

Make sure you have the PostgreSQL extension.

Once connected, we can run the following script. I decided to use the default postgres database, but you can certainly create your own database as well if you’d like to make this more realistic.

CREATE TABLE event(
    id serial,
    machine_id int,
    event_id varchar(40),
    event_type varchar(70),
    entity_type varchar(70),
    entity_id varchar(40),
    event_data json
);

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Run this several times to generate data.
INSERT INTO event(
    machine_id,
    event_id,
    event_type,
    entity_type,
    entity_id,
    event_data
)
VALUES
(
    21644,
    uuid_generate_v4(),
    'Telemetry Ingest',
    'MachineTelemetry',
    uuid_generate_v4(),
    '{ "Test": "Yes" }'
);

SELECT * FROM event;

If everything worked as expected, the final SELECT statement should return some results for you.

Connecting to Postgres

Now that we have some data, let’s go back to SQL Server. I assume you’ve already installed and configured PolyBase—if not, check out my presentation on PolyBase. Note that this requires SQL Server 2019 or later, as that’s the first version which supports PolyBase to ODBC. The scripts below assume a database named Scratch with a database master key whose password is <<SomeSecureKey>>.

First up, we want to create a database scoped credential so that we can authenticate to Postgres. This credential gets encrypted using the database master key.

USE [Scratch]
GO
OPEN MASTER KEY DECRYPTION BY PASSWORD = '<<SomeSecureKey>>';
GO

IF NOT EXISTS
(
    SELECT 1
    FROM sys.database_scoped_credentials
    WHERE
        name = N'PostgresCredential'
)
BEGIN
    CREATE DATABASE SCOPED CREDENTIAL PostgresCredential
    WITH IDENTITY = 'postgres',
    SECRET = 'pgtestpwd';
END
GO

After creating the database scoped credential, we will create a data source pointing to Postgres, connecting to localhost on port 5432, just like we set up with Docker. We’ll specify the PostgreSQL Unicode driver and point to the postgres database. Note that the PostgreSQL ODBC installer provides both ANSI and Unicode drivers, meaning that all of your strings will come across as either ANSI or Unicode. Given my preferences in SQL Server, it should be obvious which way I chose.

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_data_sources ds
    WHERE
        ds.name = N'PostgresEvents'
)
BEGIN
    CREATE EXTERNAL DATA SOURCE PostgresEvents WITH
    (
        LOCATION = 'odbc://localhost:5432',
        CONNECTION_OPTIONS = 'Driver={PostgreSQL Unicode}; Database=postgres',
        CREDENTIAL = PostgresCredential,
        PUSHDOWN = ON
    );
END
GO

Next, we’ll wrap up the external object creation by building out an external table. This external table will use NVARCHAR to represent both Postgres strings and the JSON data type.

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_tables t
    WHERE
        t.name = N'PostgresEvent'
)
BEGIN
    CREATE EXTERNAL TABLE dbo.PostgresEvent
    (
        id INT,
        machine_id INT,
        event_id NVARCHAR(40),
        event_type NVARCHAR(70),
        entity_type NVARCHAR(70),
        entity_id NVARCHAR(40),
        event_data NVARCHAR(255)
    )
    WITH
    (
        LOCATION = 'event',
        DATA_SOURCE = PostgresEvents
    );
END
GO

The final test comes when we query the data. Run the following and you should see the same set of rows that you saw when connecting directly to Postgres:

SELECT * FROM dbo.PostgresEvent;
GO

Postgres-Specific Notes

I am not particularly familiar with PostgreSQL, certainly not to the point where I could give you the ins and outs of compatibility issues. My simple tests were successful and I didn’t find a huge number of people complaining about compatibility issues with PolyBase, but Postgres supports multiple procedural languages and enough complex data types that there may very well be compatibility problems. That said, one of the problems I expected—the JSON data type—converted just fine. I did see an issue around the NUMERIC data type but didn’t dive in.

Shameless Advertising

In case you want to learn a lot more about PolyBase, check out PolyBase Revealed, the only book-length treatment on the topic. Not only that, but it also makes for a great monitor stand—it’s a thick enough book to make a difference, but thin enough that you can get more precise monitor adjustments with it. Buy several copies and relieve neck strain today!

Query Store: QDS Toolbox

I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort that Pablo Lozano and Efraim Sharon pushed hard internally and to which several database administrators and database engineers contributed (though I wasn’t one of them).

From the summary page:

This is a collection of tools (comprised of a combination of views, procedures, functions…) developed using the Query Store functionality as a base to facilitate its usage and reports’ generation. These include but are not limited to:

– Implementations of SSMS’ GUI reports that can be invoked using T-SQL code, with added functionalities (parameterization, saving results to tables) so they can be programmatically executed and used to send out mails.
– Quick analysis of a server’s overall activity to identify bottlenecks and points of high pressure on the SQL instance at any given time, both in real time or in the past.
– Cleanup of QDS’ cache with a smaller footprint than the internal one generates, with customization parameters to enable a customizable cleanup (such as removing information regarding dropped objects, or cleaning details of ad-hoc or internal queries executed on the server, like index maintenance operations).

The biggest of these is the third item. In our environment, Query Store could be a beast when trying to delete old data, and would often be the biggest performance problem on a given server.

In addition, several procedures exist as a way of aggregating data across databases. We have a sharded multi-tenant environment, where we might have 5-15 replicas of a database schema and assign customers to those databases. QDS Toolbox helps aggregate information across these databases so that you don’t need to look at each individually to understand performance problems. The database team has then created reports off of this to improve their understanding of what’s going on.

Check out the QDS Toolbox as a way to clean up data better than the built-in cleanup process and get additional information aggregated in a smart way.

Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.
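If you just want the flavor of the demo without watching, here’s a heavily condensed sketch of the three modes. The dbo.TransactionTest table is purely illustrative.

-- Hypothetical table for demonstration purposes.
CREATE TABLE dbo.TransactionTest
(
    Id INT IDENTITY(1,1) NOT NULL,
    Val VARCHAR(50) NOT NULL
);
GO

-- 1) Autocommit (the default): each statement is its own transaction.
INSERT INTO dbo.TransactionTest (Val) VALUES ('Autocommit');

-- 2) Explicit transaction: you control where it begins and ends.
BEGIN TRANSACTION;
    INSERT INTO dbo.TransactionTest (Val) VALUES ('Explicit');
COMMIT TRANSACTION;

-- 3) Implicit transactions: the first data modification quietly opens a
-- transaction, and it stays open until you explicitly commit or roll back.
SET IMPLICIT_TRANSACTIONS ON;
INSERT INTO dbo.TransactionTest (Val) VALUES ('Implicit');
SELECT @@TRANCOUNT AS OpenTransactions; -- returns 1, even though we never said BEGIN
COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;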

Recommendations

The easy recommendation is, don’t use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them.  So let’s throw this one away as a recommendation.

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
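As a sketch of that scenario, with three hypothetical tables standing in for A, B, and C:

DECLARE
    @OrderID INT = 1,
    @CustomerID INT = 42;

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.OrderArchive (OrderID) VALUES (@OrderID);  -- insert into B
    UPDATE dbo.Customer
    SET LastOrderDate = GETUTCDATE()
    WHERE CustomerID = @CustomerID;                            -- update C
    DELETE FROM dbo.OrderQueue WHERE OrderID = @OrderID;       -- delete from A

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- If any of the three fails, none of them sticks.
    IF (@@TRANCOUNT > 0)
        ROLLBACK TRANSACTION;
    THROW;
END CATCH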

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything in one statement or break it up into smaller chunks.
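Here’s a small sketch of those last two points together, using hypothetical product and pricing tables: the temp table work happens outside the transaction, and only the touch of the real table sits inside it.

-- Heavy lifting in a temp table: no explicit transaction needed yet.
SELECT
    p.ProductID,
    p.Price * 1.05 AS NewPrice
INTO #PriceChange
FROM dbo.Product p
WHERE p.Discontinued = 0;

-- Only the modification of the real table goes inside the transaction.
BEGIN TRANSACTION;

UPDATE p
SET p.Price = pc.NewPrice
FROM dbo.Product p
    INNER JOIN #PriceChange pc
        ON pc.ProductID = p.ProductID;

COMMIT TRANSACTION;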

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.
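For example, here is what a per-iteration transaction might look like for a hypothetical batched cleanup of a dbo.LogHistory table. Moving BEGIN TRANSACTION above the loop and COMMIT below it would instead make the entire cleanup one atomic (and potentially long-blocking) operation.

DECLARE @RowsAffected INT = 1;

WHILE (@RowsAffected > 0)
BEGIN
    -- One transaction per iteration: each batch commits and releases its locks quickly.
    BEGIN TRANSACTION;

    DELETE TOP (10000)
    FROM dbo.LogHistory
    WHERE LogDate < DATEADD(MONTH, -6, GETUTCDATE());

    SET @RowsAffected = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;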

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.
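A minimal sketch of that parent-child pattern, with hypothetical child procedures:

CREATE OR ALTER PROCEDURE dbo.SaveOrder
@OrderID INT
AS
BEGIN TRY
    -- The parent owns the transaction; the children just do their work inside it.
    BEGIN TRANSACTION;

    EXEC dbo.InsertOrderHeader @OrderID = @OrderID;
    EXEC dbo.InsertOrderLines @OrderID = @OrderID;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF (@@TRANCOUNT > 0)
        ROLLBACK TRANSACTION;
    THROW;
END CATCH
GO

If the child procedures use the @@TRANCOUNT check from the pattern at the end of this post, they will join the parent’s transaction rather than trying to open their own.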

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, if one loop iteration fails, you’ll have less you need to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.
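When I do run something risky ad hoc, the safety net looks roughly like this (the table and predicate here are made up):

BEGIN TRANSACTION;

UPDATE dbo.Product
SET Price = Price * 1.1
WHERE Brand = 'Contoso';

-- Check the damage before making it permanent.
SELECT @@ROWCOUNT AS RowsUpdated; -- expecting a couple hundred rows, not two million

-- If the number looks right:
-- COMMIT TRANSACTION;
-- If it does not:
-- ROLLBACK TRANSACTION;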

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
AS
 
DECLARE
    @AlreadyInTransaction BIT;

BEGIN TRY 
    IF ( @@TRANCOUNT > 0 )
    BEGIN
        SET @AlreadyInTransaction = 1;
    END
    ELSE
    BEGIN
        SET @AlreadyInTransaction = 0;
        BEGIN TRANSACTION;
    END;
 
    -- Note:  this is where you'd put your data modification statements.
    SELECT
        1.0 / @Divisor AS Quotient;
 
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        COMMIT TRANSACTION;
    END;
END TRY
BEGIN CATCH
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        ROLLBACK TRANSACTION;
    END;

    THROW;
END CATCH
GO

--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
SELECT @@TRANCOUNT;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.

Thoughts?

What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.

Upcoming Events: SQL Saturday Auckland

Key Details

What: SQL Saturday Auckland.
Where: New Zealand, but also the Internet.
When: Saturday, August 8th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

8:55 AM — 9:40 AM NZST — Applying Forensic Accounting Techniques with SQL and R

Please note that all of the times are in New Zealand Standard Time (NZST). This is UTC+12, so if you’re on the Eastern Daylight Time zone like me (UTC-4), that’s 16 hours ahead. In other words, I present on Friday starting at 4:55 PM EDT.