Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.

Recommendations

The easy recommendation is, don’t use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them.  So let’s throw this one away as a recommendation.
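To make the failure mode concrete, here’s a minimal sketch of what implicit transaction mode does; dbo.SomeTable is a made-up table for illustration.

SET IMPLICIT_TRANSACTIONS ON;

-- Under implicit transaction mode, this INSERT silently opens a transaction.
INSERT INTO dbo.SomeTable(SomeColumn) VALUES(1);

-- Returns 1:  we are sitting in an open transaction, holding locks,
-- until somebody remembers to COMMIT or ROLLBACK.
SELECT @@TRANCOUNT AS OpenTransactionCount;

COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;

Forget that COMMIT and head out to lunch, and everybody who needs that table gets to wait for you.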

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
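Here’s a rough sketch of that pattern; TableA, TableB, TableC, and StagingData are all invented names for the example.

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.TableB(SomeColumn) SELECT SomeColumn FROM dbo.StagingData;
    UPDATE dbo.TableC SET ProcessedDate = GETUTCDATE() WHERE ProcessedDate IS NULL;
    DELETE FROM dbo.TableA WHERE SomeColumn IN (SELECT SomeColumn FROM dbo.StagingData);

    -- All three statements succeeded, so they become permanent together.
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Any failure rolls back all three operations as a unit.
    IF (@@TRANCOUNT > 0)
        ROLLBACK TRANSACTION;
    THROW;
END CATCH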

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.
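In sketch form, with invented table names, that shape looks like the following:

-- Heavy lifting against tempdb:  no explicit transaction needed here.
SELECT s.KeyColumn, s.SomeValue
INTO #Staging
FROM dbo.SourceTable s
    INNER JOIN dbo.OtherTable o
        ON s.KeyColumn = o.KeyColumn;

UPDATE #Staging SET SomeValue = SomeValue * 2;

-- Only the modification of the real table goes inside the transaction.
BEGIN TRANSACTION;

UPDATE r
SET r.SomeValue = s.SomeValue
FROM dbo.RealTable r
    INNER JOIN #Staging s
        ON r.KeyColumn = s.KeyColumn;

COMMIT TRANSACTION;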

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything in one statement or break it up into smaller chunks.

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, if one loop iteration fails, you’ll have less to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.
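Here’s a sketch of the transaction-per-iteration version, using a made-up batched delete; move the BEGIN TRANSACTION and COMMIT outside the WHILE loop and you get the single-transaction version instead.

DECLARE @RowsAffected INT = 1;

WHILE (@RowsAffected > 0)
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (5000)
    FROM dbo.HistoryTable
    WHERE CreateDate < DATEADD(YEAR, -2, GETUTCDATE());

    SET @RowsAffected = @@ROWCOUNT;

    -- Committing each iteration releases locks quickly, at the cost of total runtime.
    COMMIT TRANSACTION;
END;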

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
AS
 
DECLARE
    @AlreadyInTransaction BIT;

BEGIN TRY 
    IF ( @@TRANCOUNT > 0 )
    BEGIN
        SET @AlreadyInTransaction = 1;
    END
    ELSE
    BEGIN
        SET @AlreadyInTransaction = 0;
        BEGIN TRANSACTION;
    END;
 
    -- Note:  this is where you'd put your data modification statements.
    SELECT
        1.0 / @Divisor AS Quotient;
 
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        COMMIT TRANSACTION;
    END;
END TRY
BEGIN CATCH
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        ROLLBACK TRANSACTION;
    END;

    THROW;
END CATCH
GO

--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
SELECT @@TRANCOUNT;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.

Thoughts?

What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.

With ML Services, Watch Those Resource Pools

I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.

Resource Governor and Default Memory

If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:

That’s a mighty fine cap you’re wearing.

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
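As a sketch, checking the current caps and raising the memory limit looks something like the following; the 50 is an arbitrary example, so pick a number which fits your server.

SELECT external_pool_id, name, max_cpu_percent, max_memory_percent
FROM sys.resource_governor_external_resource_pools;

-- Raise the external script memory cap from the default of 20% to 50%.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_MEMORY_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;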

Resource Governor and CPU

Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.

Well, what happens if you accidentally change that? I found out the answer the hard way!

In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.

Identities blurred to protect the innocent.
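Once you do track it down, the fix is a sketch along the same lines as the memory change above:

-- Put the external pool's CPU cap back to its default of 100%.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_CPU_PERCENT = 100);
ALTER RESOURCE GOVERNOR RECONFIGURE;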

The trickiest part about diagnosing this was that the error messages gave no indication of what the problem was: the client saw a vague “could not connect to Launchpad” error, and the Launchpad error logs didn’t have any messages about the failed queries.  So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.

NVARCHAR Everywhere: A Thought Experiment

Doubling Down on Madness

In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.

Here’s the short version. When you create a table and need a string column, you have a couple options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store this string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.

A Video from the Void

The Camps

Camp One: only use VARCHAR. Prior to SQL Server 2019, this is basically the set of people who never have to deal with internationalization. If you’re running solo projects or building systems where you know the complete set of users, and if there’s no need for Unicode, I can understand this camp. For projects of any significance, though, you usually have to go elsewhere.

Camp Two: default to VARCHAR, only use NVARCHAR when necessary. There are a lot of people in this camp, especially in the western world. Most of the companies I’ve worked at live in this camp.

Camp Three: default to NVARCHAR, but use VARCHAR when you know you don’t need Unicode. This is a fairly popular group as well, and outside of this thought experiment, I probably end up here.

Aaron Bertrand lays out the costs and benefits of Camps Two and Three (or, Camps VARCHAR Mostly and NVARCHAR Mostly), so I recommend reading his arguments and understanding that I am sympathetic to them.

But there is also Camp Four: NVARCHAR everywhere. And this camp is growing on me.

Why NVARCHAR Everywhere?

I see several benefits to this:

  • Developers and product owners don’t need to think about or guess whether a particular string value will ever contain Unicode data. Sometimes we guess wrong, and migrating from VARCHAR to NVARCHAR can be a pain.
  • NVARCHAR Everywhere avoids implicit conversion between string columns and parameters because you can assume that everything is NVARCHAR. Implicit conversion can be a nasty performance impediment (there’s a small sketch of this just after the list).
  • Furthermore, you can train developers to preface string literals with N, ensure that data access tools ask for Unicode strings (most ORMs either default to Unicode or know enough to do it right), and ensure that every stored procedure string parameter is NVARCHAR because there are no exceptions. That’s one less thing you ever have to think about when designing or tuning a database and one less area where ambiguity in design can creep in.
  • If somebody tries to store Unicode data in a VARCHAR column, that information is silently lost; with NVARCHAR everywhere, that failure mode goes away entirely.
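To make the implicit conversion point concrete, here’s a tiny sketch with invented table, column, and parameter names. Because the column, the parameter, and the literal are all NVARCHAR, there’s nothing to convert and an index on CustomerName stays fully usable.

DECLARE @SearchName NVARCHAR(50) = N'Grünwald';

SELECT c.CustomerID, c.CustomerName
FROM dbo.Customer c
WHERE c.CustomerName = @SearchName
    AND c.Region = N'Bavaria';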

Why Not NVARCHAR Everywhere?

The first thing you’ll hear from people about this is storage requirements: NVARCHAR characters are typically 2 bytes, whereas equivalent VARCHAR characters are typically 1 byte. For the nuanced version of this, Solomon Rutzky goes into great detail on the topic, but let’s stick with the simplistic version for now because I don’t think the added nuance changes the story any.

SQL Server has Unicode compression, meaning that, per row, if the data in a column can fit in your collation’s code page, the database engine can compress the Unicode data to take as much space as equivalent VARCHAR data would—maybe it’s a little bigger but we’re talking a tiny amount. Enabling row-level compression turns on Unicode compression as well and can provide additional compression benefits. And page-level compression does an even better job at saving space on disk. There are CPU costs, but my experience has been that compression will often be faster because I/O subsystems are so much slower than CPU, even with fancy all-flash arrays or direct-attached NVMe.

The exception is if you are using NVARCHAR(MAX) as your data type. In that case, Unicode and row-level compression won’t do anything and page-level compression only works if your data fits on a page rather than falling into LOB. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list here.

  • NVARCHAR(MAX) columns with overflow to LOB will be larger than their VARCHAR counterparts and we cannot use Unicode, Row, or Page compression to reduce storage.
  • If your max data length is between 4001 and 8000 characters, you know the column will never have Unicode characters, and the data is highly compressible, you will save a lot of space with VARCHAR plus page-level compression; in that range, you’d have to use an NVARCHAR(MAX) column and lose out on compression.
  • If you are in the unlucky situation where even row-level compression tanks your performance—something I’ve never seen but acknowledge it as a possibility—going with NVARCHAR becomes a trade-off between reducing storage and maximizing performance.

The Demo Code

In case you want to try out the demo code on your own, here it is:

USE [Scratch]
GO
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
GO
CREATE TABLE dbo.TestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn VARCHAR(150) NOT NULL,
    SomeOtherStringColumn VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED(Id)
);
GO
CREATE INDEX [IX_TestTable_SomeStringColumn] ON dbo.TestTable
(
    SomeStringColumn
);
GO
CREATE TABLE dbo.NTestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(150) NOT NULL,
    SomeOtherStringColumn NVARCHAR(30) NOT NULL,
    CONSTRAINT [PK_NTestTable] PRIMARY KEY CLUSTERED(Id)
);
CREATE INDEX [IX_NTestTable_SomeStringColumn] ON dbo.NTestTable
(
    SomeStringColumn
);
GO

-- Test 1:  It's All ASCII.
INSERT INTO dbo.TestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE('A', 150),
    REPLICATE('X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 150),
    REPLICATE(N'X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 2:  Unicode me.
SELECT DATALENGTH(N'🐭');
SELECT DATALENGTH(N'🪕');
GO
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'🐭', 75),
    REPLICATE(N'🪕', 15)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 3:  Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Check the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Row Compression includes Unicode compression.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = ROW);
GO
-- Test 3a:  Continue to Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Another check of the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Let's check the LOB
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
CREATE TABLE dbo.NTestTableLob
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(MAX) NOT NULL,
    SomeOtherStringColumn NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_NTestTableLob] PRIMARY KEY CLUSTERED(Id) WITH(DATA_COMPRESSION = ROW)
);
-- Can't use NVARCHAR(MAX) as a key column in an index...
/* CREATE INDEX [IX_NTestTableLob_SomeStringColumn] ON dbo.NTestTableLob
(
    SomeStringColumn
)  WITH(DATA_COMPRESSION = ROW); */
GO

-- No overflow necessary.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- What about page-level compression?
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- And to be fair, we'll see the same on NTestTable.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- My page runneth over.
TRUNCATE TABLE dbo.NTestTableLob;
-- Let's reset the data compression.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = NONE);
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(10000)
    -- Cast to NVARCHAR(MAX) first so REPLICATE doesn't truncate its output at 8,000 bytes.
    REPLICATE(CAST(N'🐭' AS NVARCHAR(MAX)), 14800),
    REPLICATE(CAST(N'X' AS NVARCHAR(MAX)), 28000) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Now we compress.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- Time to clean up.
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO

SELECT
    N'🐭' as Mouse,
    '🐭' as [Mouse?];

Final Thoughts…For Now

I think where I stand right now is, for greenfield database development, I heavily bias toward NVARCHAR and could even say NVARCHAR Everywhere. I think the benefits outweigh the costs here.

For brownfield database development, it’s a harder call to make because you almost certainly have a mix of VARCHAR and NVARCHAR data types. If you already have a solid system within a brownfield database, stick with that system. For example, you might use NVARCHAR for user-entry fields but VARCHAR for internal system fields like codes and levels. If that pattern works for you, that’s fine.

If you’re in a brownfield development mess, I can see the potential benefit of migrating to NVARCHAR Everywhere, but the work-to-benefit ratio is probably not going to be acceptable for most companies. The exception here is if you find out that you’re losing valuable customer data and need to go through an internationalization project. It might be tempting to change the minimum amount necessary, though my philosophy is that if you have the opportunity to make big improvements, take them.

But as I mention in the video, I’m interested in your thoughts as well. So add them to the video or drop them in here. Is there something big I’m missing which makes NVARCHAR Everywhere untenable? Have I shifted your thinking at least a little bit? Let me know.

Building a Docker Container of a SQL Server Database

Today, we’re going to go through the process of turning a database you’ve built into a Docker container. Before we get started, here are the expectations:

  1. I want a fully running copy of SQL Server with whatever database I’m using, as well as key components installed.
  2. I want this not to be on a persistent volume. In other words, when I destroy the container and create a new one from my image, I want to reset back to the original state. I’m using this for technical demos, where I want to be at the same starting point each time.
  3. I want this to be as easy as possible for users of my container. I consider the use of a container here as not particularly noteworthy in and of itself, so the more time I make people spend trying to set up my demo environment, the more likely it is that they will simply give up.

With that preamble aside, let’s get to work!

Step One: Have a Database

This might seem a little obvious, but I want to make it clear that we need the database set up exactly how we want it. This includes user configuration (usually with SQL authentication, given that we’re using Linux containers and passing them out to who-knows-where), database objects like tables and procedures, and whatever else we need. After that, I will take a database backup and save it to my local disk.

For this example, I’m going to talk through how I containerized the demos for my Approaching Zero talk. This is a fairly straightforward database with some pre-staged data and objects, but no features such as PolyBase, SQL Agent, or Machine Learning Services to throw a spanner in the works. Incidentally, if you do use those types of features, I was able to get them working in my Native Scoring demo, so it’s possible to do and the process actually isn’t that far removed from what we’re doing here.

Step Two: Spin Up a Container

It takes a container to beat a container. Well, not exactly, but this is the route that I like to take.

I’m going to spin up a container. Here’s the command I run:

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e SA_PASSWORD=YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! --name approaching-zero-db -p 51433:1433 mcr.microsoft.com/mssql/server:2019-latest

Taking this step by step, I want to run a docker container in the background (that’s what the -d flag means). I’m going to set several environment variables: MSSQL_PID (to tell SQL Server which edition to use), ACCEPT_EULA (to promise Microsoft we won’t sue them), and SA_PASSWORD (to throw off our enemies with deceptively bad passwords). My container will be named approaching-zero-db. You can, of course, name it whatever you’d like, and it doesn’t even have to be the same as what we’re going to push out to Docker Hub, so get creative if you really want.

The -p flag says that we’d like the container’s port 1433 to be represented on our host as port 51433. I already have SQL Server running on my host machine, so I’m selecting a totally different, unused port for the container. But as far as the container is concerned, it is listening on its port 1433, so it has that going for it.

Microsoft has their own container repository for SQL Server, and we’re getting the version tagged as 2019-latest, which is CU4 as of the time of this blog post going live. Downloading this may take a while, so try not to do this on a really slow internet connection.

Incidentally, you might get the following error:

C:\Program Files\Docker\Docker\resources\bin\docker.exe: Error response from daemon: Ports are not available: listen tcp 0.0.0.0:51433: bind: An attempt was made to access a socket in a way forbidden by its access permissions.

If you get this error, it means that something else is already listening on port 51433 on your host, so you’ll have to use a different port instead. Maybe 52433 or 51434 or something. You’ll need to run docker rm approaching-zero-db to clean up the mess before you try again, but don’t shed too many tears over the containers we have to slaughter along the path to glory.

Step Three: Restore the Dauphin

We have a container and a database backup, so let’s do what we do best: fitting square pegs into round holes.

To do this, first I will make a backup directory within our container:

docker exec -it approaching-zero-db mkdir /var/opt/mssql/backup

This command is pretty straightforward: we’re going to execute a command in interactive mode (-i) and allocate a pseudo-TTY (-t). Don’t know what a pseudo-TTY is? If it really matters, learn all about tty and then come back here.

Our backup has a home, so now let’s move it in and make it feel comfortable.

docker cp D:\SQLServer\Backup\ApproachingZero.bak approaching-zero-db:/var/opt/mssql/backup/ApproachingZero.bak

We’re using Docker to copy the backup from my local directory into the container. You would, of course, modify this to fit your system.

After the backup is up there, I like to run the following command in Powershell. The prior commands you could run in cmd, but this one’s a lot easier to read as a multi-liner:

docker exec -it approaching-zero-db /opt/mssql-tools/bin/sqlcmd -S localhost `
    -U SA -P YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! `
    -Q "RESTORE DATABASE ApproachingZero FROM DISK =
        '/var/opt/mssql/backup/ApproachingZero.bak' WITH MOVE 'ApproachingZero' TO '/var/opt/mssql/data/ApproachingZero.mdf', MOVE 'ApproachingZeor_Log' TO '/var/opt/mssql/data/ApproachingZero.ldf'"
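If you aren’t sure what logical file names to plug into those MOVE clauses for your own backup, you can ask the backup itself. Here’s a sketch of the query, which you can run through the same sqlcmd connection:

-- The LogicalName column holds the names to use in the MOVE clauses.
RESTORE FILELISTONLY
FROM DISK = '/var/opt/mssql/backup/ApproachingZero.bak';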

Some builds of SQL Server containers don’t have mssql-tools installed, so you might need to install them separately. Let’s talk about that in a sub-header.

Step Three Point A: What to Expect when you Expected mssql-tools

If, for some reason, your container does not have mssql-tools installed, that’s okay. As long as you have an internet connection, you can get this done.

We’re first going to open up a shell on the container:

docker exec -it approaching-zero-db /bin/bash

Why bash and not ksh? Because I’m not hardcore enough to live my life in vi. I’m a frequent vi tourist but not a native.

Next up, we’re going to install some stuff. Unlike in the Microsoft instructions, we are Master and Commander of this container and so sudo won’t do much.

curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list | tee /etc/apt/sources.list.d/msprod.list

apt-get update 
apt-get install mssql-tools unixodbc-dev

And now we have mssql-tools and we can return to step three, already in progress.

Step Four: Perform Sanity Tests

After you’ve restored the database, connect to it in SQL Server Management Studio, Azure Data Studio, or your tool of choice. Make sure the objects are there, that you can log in with the accounts you need, etc. You might need to create logins for those SQL authenticated accounts, for example—unless you’re like me and giving away sa access like it’s free candy at Halloween.
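If you do need to create one of those logins during the sanity check, it’s ordinary T-SQL; the login name and password below are placeholders.

CREATE LOGIN [ApproachingZeroApp] WITH PASSWORD = N'AnotherBadPassword1!';
GO
USE [ApproachingZero];
GO
CREATE USER [ApproachingZeroApp] FOR LOGIN [ApproachingZeroApp];
ALTER ROLE db_datareader ADD MEMBER [ApproachingZeroApp];
GO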

Here’s the important thing, though: any changes you make here will be on the container’s permanent record, so if you want a pristine experience every time you reconstitute that container, you’ll want to handle this step lightly.

Step Five: Admit Defeat on Sanity, Get Committed

We’ve made sure that the database is what we expected, that all of the pieces are there, and that queries are running as we expect. From there, it’s time to push this to Docker Hub.

The first step in pushing a container is to commit the current container as a new image with the docker commit command:

docker commit -m="Creating the Approaching Zero database." -a "{Your Name Here}" approaching-zero-db approaching-zero-db

We are running the commit command and passing in a message with -m, setting the author with -a, selecting the running container to commit (approaching-zero-db) and naming the local repository to which we want to commit this thing (approaching-zero-db).

At this point, you can spin up a local container off of your approaching-zero-db model, but nobody else can. So let’s fix that.

Step Six: Bust Out and Let the Whole World See You

We want to push this new image to Docker Hub. If you have not already, you’ll probably need to log in:

docker login

You might be able to authenticate with cached credentials, but if not you’ll enter your username and password and be good to go.

Now, we want to tag our local image and tell Docker what it will represent up in Docker Hub:

docker tag approaching-zero-db docker.io/feaselkl/presentations:approaching-zero-db

This takes my local repository named approaching-zero-db (with the default latest tag) and ties it back to Docker Hub, where in my feaselkl account I have a presentations repository and a tag called approaching-zero-db.

My last step is to push the fruits of my labor into Docker Hub for all to see.

docker push docker.io/feaselkl/presentations:approaching-zero-db

Step Seven: Using the Image

Now that I have the image up in Docker Hub, I can run the following commands from any machine with Docker installed and spin up a container based on my image:

docker pull docker.io/feaselkl/presentations:approaching-zero-db

docker run --name approaching-zero-db -p 51433:1433 docker.io/feaselkl/presentations:approaching-zero-db

And that’s it. I connect to port 51433 on localhost and authenticate with my extra-great sa username and password, and can run through whatever I need to do. When I’m done, I can stop and kill the image whenever I’d like:

docker stop approaching-zero-db
docker rm approaching-zero-db

Conclusion

In this post, we learned how to take an existing database in SQL Server—whether that be for Windows or Linux—and create a SQL Server container which includes this database.

Coda: Discourses and Distractions

I wanted to save a couple of notes here to the end in order not to distract you with too many tangents in the main body of the post.

Let’s do this as Q&A, as I haven’t done that in a while.

Couldn’t you just use Dockerfiles for all of this?

Short answer, yes. Long answer, I didn’t want to. Dropping a Dockerfile in my GitHub repo makes it easier for me, sure, but then makes it more difficult for people to follow along. As an experiment, I did include all of the steps in my Map to Success repository. Compare the section on “Run Docker Images” to “Build Docker Images” and tell me which one is easier.

If I were putting together a training on Docker, then it would make perfect sense. But I’m using containers as a background tool, and I want to get past it as soon as possible and with as little learner friction as possible.

Can’t you just script all of these steps?

Short answer, yes. Long answer, yes you can.

The only thing to watch out for in scripting is that I noticed a timing issue between when you copy the database backup to the container and when sqlcmd recognizes that the backup is there. I wasn’t able to get it all working in the same Powershell script, even when I did hacky things like adding Start-Sleep. But maybe you’ll have better luck.

Why did you label the local repository as approaching-zero-db when you knew you were going to push this as presentations:approaching-zero-db?

That’s a really good question.

Why not expose a volume in the container, put the database backup there, and commit the image without the database backup once you’re done?

Credit Bartosz Ratajzyk for the great question.

The short version is that I actually forgot about volumes when I was writing this up. But using docker cp isn’t the worst thing, especially when copying one file. Bartosz makes a great point though that we should remove the backup files before committing, regardless of the technique we use.

So why would you actually use this technique in real life?

Andragogy is real life.

Aside from that, this technique is quite useful for building automated test environments. Create a pre-defined database with a known beginning state, spin up a container, hammer that database with all of your automated tests, and destroy the container when you’re done. Unlike most cases, where you want to save the data permanently, these sorts of tests cry out for ephemeral databases.

If you’re looking at containers for most other purposes (like, say, production databases), you’d definitely be interested in persisted volumes and I’d hand you off to Andrew Pruski for that.

Why would you want to restore the Dauphin? You don’t seem like an Armagnac.

Yeah, if you pressed me on it I’d probably admit to Burgundian sympathies, but that’s mostly because of the role it played in the subsequent Dutch republic. But if you pressed me even further, I’ll admit that I was just rooting for injuries in that whole tete-a-tete.

The Importance of 0 in Regressions

Mala Mahadevan (b | t) brought an article to my attention regarding a regression between elevation and introversion/extroversion by state from a few years back. Before I get into this, I want to note that I haven’t read the linked journal article and am not casting aspersions at the blog post author or the journal article authors, but this was a good learning opportunity for an important concept.

Here is the original image:

I see a line. Definitely, definitely a line.

So boom, extroverts want to live on flat land and introverts in the mountains. Except that there are a few problems with this interpretation. Let’s go through them. I’ll focus entirely on interpreting this regression and try to avoid going down any tangential rabbit holes…though knowing me, I’ll probably end up in one or two.

The Line is NOT the Data

One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

Boy, is There a Lot of Noise

I don’t have the raw values but I think I can fake it well enough here to explain my point. If you look at a given elevation difference, there are some huge swings in values. For example, check out the four boxes I drew:

Mastery of boxes? Check.

In the left-most box, approximately the same elevation difference corresponds to scores ranging from roughly -0.6 to 1.8.  Considering that our actual data ranges from -2 to approximately 3, we’re talking about a huge slice.  The second box spans the two extremes.  The third and fourth boxes also take up well over half the available space.

This takes us to a problem with the thin line:

The Thin Black Line

When we draw a regression line, we typically draw a thin line to avoid overwhelming the visual. The downside to this is that it implies a level of precision which the data won’t support. We don’t see states clustered around this thin line; they’re all around it. Incorporating the variance in NEO E zscore for a given elevation difference, we have something which looks more like this:

That’s a thick line.

Mind you, I don’t have the actual numbers so I’m not drawing a proper confidence interval. I think it’d be pretty close to this, though, just from eyeballing the data and recognizing the paucity of data points.
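For reference, what I’m sketching is closer to a prediction interval for an individual state than a confidence interval for the line itself. In a simple linear regression, that interval at a given elevation difference x_0 is:

\hat{y}(x_0) \pm t_{\alpha/2,\,n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

The residual standard error s drives the width, and with this much scatter across only 51 observations, that band stays wide across the entire range of elevation differences.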

So what’s the problem here? The lines are all pointing in the same direction, so there’s definitely a relationship…right?

Zeroing in on the Problem

Looking at the vertical axis, we have a score which runs from -2 to 3(ish), where negative numbers mean introversion and positive numbers extroversion. That makes 0 the midpoint where people are neither introverted nor extroverted. This is important because we want to show not only that this relationship is negative, but that it is meaningful. A quick and dirty way we can check this is to see how much of our confidence interval is outside the zero line. After all, we’re trying to interpret this as “people who live in higher-elevation areas tend to be more introverted.”

The answer? Not much.

With our fat confidence interval guess, the confidence interval for all 50 states (plus one swamp) includes the 0 line, meaning that even though we can draw a line pointing downward, we can’t conclusively say that there is any sort of relationship between introversion/extroversion and differences in elevation because both answers are within our realm of possibility for the entire range of the visual.

But hey, maybe I’m way off on my confidence interval guess. Let’s tighten it up quite a bit and shrink it roughly in half. That gives us an image which looks like this:

Conclusive evidence that Alaskans are introverts.

Cutting that confidence interval roughly in half excludes enough states that those CI bars are probably too narrow. Even so, the conclusions we can draw include:

  • Any state with an elevation difference over ~16,000 is likely to have a NEO E zscore below 0.
  • Alaska is the only state with an elevation difference over 16,000.

For all of the other states, well, we still can’t tell.

Conclusion

Looking solely at this image, we can’t tell much about NEO E zscore versus elevation difference except that there appears to be a negative correlation which is meaningful for any state above 16,000 feet of difference in elevation. Based on the raw picture, however, your eyes want to believe that there’s a meaningful negative correlation. It’s just not there, though.

Bonus Round: Rabbit Holes I Semi-Successfully Avoided

I won’t get into any of these because they’re tangents and the further I get away from looking at the one picture, the more likely it is that I end up talking about something which the paper authors covered. Let me reiterate that I’m not trashing the underlying paper, as I haven’t read it. But here are a few things I’d want to think about:

  • This data is at the state level and shows elevation difference. When sampling, it seems like you’d want to sample at something closer to the county level in order to get actual elevation. After all, the conjecture is that there is a separating equilibrium between extroverts and introverts based on elevation.
  • Elevation difference is also a bit weird of a proxy to use by state. Not so weird that it’s hinky, but weird enough to make me think about it.
  • Looking at Alaska in particular, they had 710K people as of the 2010 census, but here are the top cities and their elevations:
City         Population    Elevation (feet)
Anchorage    291,826       102
Fairbanks    31,535        446
Juneau       31,275        56
Sitka        8,881         26
Ketchikan    8,050         0
Wasilla      7,831         341
Kenai        7,100         72
Kodiak       6,130         49
Bethel       6,080         3
I gave up after 9. Frankly, that’s 9 more than I expected to do.

This tells us that, at a minimum, ~56% of Alaska residents lived at or near sea level, despite Alaska being one of the most mountainous states. If introverts want to live in high-elevation areas, it’s a little weird that they’re flocking to the coastline, which is supposed to be a high-extroversion area based on the journal article’s summary. But again, I didn’t read the article (or even look for a non-gated copy), so take that with plenty of grains of salt.

Learning Goals for 2020

Unlike the last couple of years (e.g., 2019), I’m lopping off the “Presentation” portion and focusing more on what I want to learn. Presentations will follow from some of this but there are few guarantees. I’m going to break this up into sections because if I just gave the full list I’d demoralize myself.

The Wide World of Spark

It’s a pretty good time to be working with Apache Spark, and I’m interested in deepening my knowledge considerably this year. Here’s what I’m focusing on in this area:

  • Azure Databricks. I’m pretty familiar with Azure Databricks already and have spent quite a bit of time working with the Community Edition, but I want to spend more time diving into the product and gain expertise.
  • Spark.NET, particularly F#. Getting better with F# is a big part of my 2020 learning goals, and so this fits two goals at once.
  • SparkR and sparklyr have been on my radar for quite a while, but I’ve yet to get comfortable with them. That changes in 2020.
  • Microsoft is putting a lot of money into Big Data Clusters and Azure Synapse Analytics, and I want to be at a point in 2020 where I’m comfortable working with both.
  • Finally, Kafka Streams and Spark Streaming round out my list. Kafka Streams isn’t Spark-related, but I want to be proficient with both of these.

Azure Everywhere

I’m eyeing a few Azure-related certifications for 2020. Aside from the elements above (Databricks, Big Data Clusters, and Synapse Analytics), I’ve got a few things I want to get better at:

  • Azure Data Factory. I skipped Gen1 because it didn’t seem worth it. Gen2 is a different story and it’s about time I got into ADF for real.
  • Azure Functions, especially F#. F# seems like a perfect language for serverless operations.
  • Azure Data Lake Gen2. Same deal with ADF, where Gen1 was blah, but Gen2 looks a lot better. I’ve got the hang of data lakes but really want to dive into theory and practices.

Getting Functional

I released my first F# projects in 2019. These ranged from a couple F#-only projects to combination C#-F# solutions. I learned a huge amount along the way, and 2020 is a good year for me to get even more comfortable with the language.

  • Serverless F#. This relates to Azure Functions up above.
  • Fable and the SAFE Stack. This is a part of programming where I’ve been pretty weak (that’s the downside of specializing in the data platform side of things), so it’d be nice to build up some rudimentary skills here.
  • Become comfortable with .NET Core. I’ve been a .NET Framework kind of guy for a while, and I’m just not quite used to Core. That changes this year.
  • Become comfortable with computational expressions and async in F#. I can use them, but I want to use them without having to think about it first.
  • Finish Domain Modeling Made Functional and Category Theory for Programmers. I’ve been working through these two books and plan to complete them.
  • Get more active in the community. I’ve created one simple pull request for FSharp.Data.SqlClient. I’d really like to add a few more in areas where it makes sense.

Data Science + Pipelines

I have two sets of goals in this category. The first is to become more comfortable with neural networks and boosted decision trees, and get back to where I was in grad school with regression.

The other set of goals is all about data science pipelines. I think you’ll be hearing a lot more about this over the next couple of years, but the gist is using data science-oriented version control systems (like DVC), Docker containers governed by Kubernetes, and pipeline software like MLFlow to build repeatable solutions for data science problems. Basically, data science is going through the maturation phase that software development in general went through over the past decade.

Video Editing

This last set of goals pertains to video editing rather than a data platform or programming topic. I want to make 2020 the year of the Linux desktop video production, and that means sitting down and learning the software side of things. I’m including YouTube tutorials and videos as well as improving my use of OBS Studio for TriPASS’s Twitch channel. If I get good enough at it, I might do a bit more with Twitch, but we’ll see.

Conclusion

Looking back at the list, it’s a pretty ambitious set of goals. Still, these are the areas where I think it’s worth me spending those crucial off-work hours, and I’d expect to see three or four new presentations come out of all of this.

New Website Design

I’ve been working on this for a little while now, but I finally decided to flip over the Catallaxy Services website to its new design today.

The old design was something I came up with about seven years ago and it definitely showed its age. In particular, it had a crummy mobile experience, with different font sizes, needing to pinch-scroll to see anything, and other dodgy bits.

I like the new website a lot better. Granted, if I didn’t like it as much, I wouldn’t have made the switch, but there are a few things I want to point out.

Resources

I do a lot of stuff, and it’s nice to be able to sort out most of those things in one location. That’s the Resources section.

Some of the things I do.

I also broke out resources by section, so if you want an overview of my community resources, you can click the button to show just those resources. Or if you want to give me money in exchange for goods and/or services, I have a “Paid Resources” section too.

Mobile-First Experience

I wanted to make sure not only that the site would work on a phone, but that it looked good on a phone. When I present, this will often be the first way people check out my site, so I want it to be a good experience.

Images

I’m still working on this, but one of the things about the old site which made it fast but boring is that it was entirely textual. There were no images. Now, there are images on pretty much every page. I want to see what I can do to improve things a bit more, but it’s a start.

Presentations Page

Like the Resources section, you can see a list of presentations and also filter by topic.

These machines, they never learn.

For each presentation, I have the same contents that I had before: abstract, slides, demo code, additional media, and links/resources. I’ve shuffled things around a bit so that it looks reasonably good on a wide-screen monitor but also works on a phone or tablet. Here’s a presentation where everything’s nice and filled out. For other talks, where not everything is filled out, I at least have sensible defaults.

Services

Something I didn’t do a great job of calling out before was availability for consulting and paid services. With the new site, I emphasize this in a few ways, whether that’s consulting, paid training (though actual paid training is forthcoming), or books + third-party offerings like PolyBase Revealed.

What’s Next

I have a few more things around the site, namely around custom training, which I intend to work on during the winter. I also have a couple of smaller ideas which I plan to implement over the next several days, but won’t call them out specifically. For now, though, I think this is a pretty good start.

Category Theory: the Video Lectures

I just wrapped up watching Bartosz Milewski’s YouTube series on category theory. If you are at all interested in functional programming, this is an outstanding set of lectures.

The course covers category theory, a rather abstract branch of mathematics. Throughout the twenty videos, Milewski takes us through the landscape of category theory, grounding it as much as possible in the language and concepts of programming. Some of Milewski’s examples are in Haskell but you don’t need to know that language to understand what’s going on. Similarly, other examples are in C++, but it’s clear from the context what he means even if you’ve never seen a line of the language.

I came into this course with a fair knowledge of functional programming and a less-than-fair knowledge of category theory. Watching these videos has given me a much better understanding of the topic and has really cleared up some of the trickier concepts in functional programming like monads.

For further reading, Milewski has a blog full of posts on the topic and a group has (with permission) turned his posts into a book. There are also two more courses on category theory that Milewski has put together, helping us dig even further into the topic.

If you are interested in the series, don’t get too distracted by the intentionally-opaque examples. I’ve seen this in undergraduate courses I took on logic, where the professor wants to ensure that you don’t get stuck thinking about a specific example when explaining a general concept, as though the specific example were the only case. Milewski does a good job of working at the level of highly general drawings of categories and the functors mapping between them, but then brings it down to examples we’re more familiar with, particularly sets. There’s a balancing act involved in these examples and I think Milewski has that act pretty well covered.

It may take you a month or three to get through all of these videos, but I definitely recommend them. From here, I’m going to work through the book and fill in more gaps.

Today’s Events: PASS DBA Virtual Group

Key Details

What: PASS Database Administration Virtual Chapter
Where: From my house, but you can see me online instead of having to visit my house.
When: Today! August 28, 2019.
Admission is free. Register on the PASS website.

What I’m Presenting

12:00 PM — 1:00 PM Eastern — Approaching Zero

This isn’t a pure DBA talk—in fact, it’s more useful for developers than DBAs. But if you’re in a database development or hybrid DBA role, and you’re sick of needing to be awake for 3 AM database deployments, this talk is for you.

Machine Learning with .NET: Model Serialization

This is part five in a series on Machine Learning with .NET.

So far in this series, we’ve looked at training a model at runtime. Even in the prior post, where we looked at predictions, we first had to train a model. With a Naive Bayes model against a small data set, that’s not an onerous task—it’s maybe a second or two the first time we instantiate our objects and we can leave the prediction engine around for the lifetime of our application, so we are able to amortize the cost fairly effectively.

Suppose, however, that we have a mighty oak of a neural network which we have trained over the past two months. It finally completed training and we now have a set of weights. Obviously, we don’t want to retrain this if we can avoid it, so we need to find a way to serialize our model and save it to disk somewhere so that we can re-use it later.

Saving a Model to Disk

Saving a model to disk is quite easy. Here’s the method I created to do just that:

public void SaveModel(MLContext mlContext, ITransformer model, string modelPath)
{
	using (var stream = File.Create(modelPath))
	{
		mlContext.Model.Save(model, null, stream);
	}
}

We create a file stream and use the built-in Save() method to write our data to that stream. Here is an example of me calling that method:

string modelPath = "C:\\Temp\\BillsModel.mdl";
bmt.SaveModel(mlContext, model, modelPath);

This generates a binary .mdl file which contains enough information to reconstitute our model and generate predictions off of it.

Loading a Model from Disk

Here’s where things get annoying again. If you have any kind of custom mapping, you need to do a bit of extra work, as I mentioned in my rant.

Next, when it comes time to deserialize the model, we need to register assemblies. I have two custom mappings but they’re both in the same assembly. Therefore, I only need to register the assembly using one of them. Should I desire to migrate these mappings into separate assemblies later, I would need to update the code to include each assembly at least once. I’m not sure yet which is a better practice: include all custom assemblies regardless of whether you need them, or go back and modify calling code later. What I do know is that it’s a bit annoying when all I really wanted was a simple string or integer translation. Anyhow, here’s the code:

mlContext.ComponentCatalog.RegisterAssembly(typeof(QBCustomMappings).Assembly);
mlContext.ComponentCatalog.RegisterAssembly(typeof(PointsCustomMappings).Assembly);

Once I’ve registered those assemblies, I can reconstitute the model using a method call:

var newModel = bmt.LoadModel(mlContext, modelPath);

That method call wraps the following code:

public ITransformer LoadModel(MLContext mlContext, string modelPath)
{
	ITransformer loadedModel;
	using (var stream = File.OpenRead(modelPath))
	{
		DataViewSchema dvs;
		loadedModel = mlContext.Model.Load(stream, out dvs);
	}

	return loadedModel;
}

What we’re doing is building a model (of interface type ITransformer) as well as a DataViewSchema which I don’t need here. I get back my completed model and can use it like I just finished training. In fact, I have a test case in my GitHub repo which compares a freshly-trained model versus a saved and reloaded model to ensure function calls work and the outputs are the same:

private string GenerateOutcome(PredictionEngineBase<RawInput, Prediction> pe)
{
	return pe.Predict(new RawInput
	{
		Game = 0,
		Quarterback = "Josh Allen",
		Location = "Home",
		NumberOfPointsScored = 17,
		TopReceiver = "Robert Foster",
		TopRunner = "Josh Allen",
		NumberOfSacks = 0,
		NumberOfDefensiveTurnovers = 0,
		MinutesPossession = 0,
		Outcome = "WHO KNOWS?"
	}).Outcome;
}

[Test()]
public void SaveAndLoadModel()
{
	string modelPath = "C:\\Temp\\BillsModel.mdl";
	bmt.SaveModel(mlContext, model, modelPath);

	// Register the assembly that contains 'QBCustomMappings' with the ComponentCatalog
	// so it can be found when loading the model.
	mlContext.ComponentCatalog.RegisterAssembly(typeof(QBCustomMappings).Assembly);
	mlContext.ComponentCatalog.RegisterAssembly(typeof(PointsCustomMappings).Assembly);

	var newModel = bmt.LoadModel(mlContext, modelPath);

	var newPredictor = mlContext.Model.CreatePredictionEngine<RawInput, Prediction>(newModel);
	var po = GenerateOutcome(predictor);
	var npo = GenerateOutcome(newPredictor);

	Assert.IsNotNull(newModel);
	Assert.AreEqual(po, npo);
}

Results, naturally, align.

Conclusion

In today’s post, we looked at the ability to serialize and deserialize models. We looked at saving the model to disk, although we could save to some other location, as the Save() method on MLContext requires merely a Stream and not a FileStream. Saving and loading models is straightforward as long as you’ve done all of the pre-requisite work around custom mappings (or avoided them altogether).