In the last post, I talked about the problem we want to solve. The short version is, I want to build a process to migrate data from one node to another. To keep this simple, we’ll say that we can turn off warehouse ETL processes for a particular customer as we move their data.
Part of what makes BIML so powerful is that we can generalize a solution to a repetitive ETL problem. But before I can solve the general case, I typically want to solve a basic version of the problem. In this scenario, I want to sketch out how I can migrate data from a single table. My most basic migration would be a data flow with a SQL Server source and a SQL Server destination. Of course, if the migration fails partway through, I can't simply re-run that package: I would get primary key constraint violations (assuming I have primary keys!) or, even worse, end up with duplicate data. A simple fix is to delete the partially-migrated data from the destination instance before attempting to insert data from the source instance.
The first thing I need for my sample, minimum-required-functionality map is at least one parameter to determine which customer I'm migrating. Next, I created parameters to specify the source and destination instance and database names. Finally, to make debugging a little easier, I have a couple of parameters that control whether I want to delete partially-migrated data from the destination database and whether I want to perform the migration at all (which I might skip if I'm simply testing that the deletion works). This leads to a project parameters listing that looks something like this:
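To give a concrete (if hypothetical) picture, the listing boils down to something like the following; the names are placeholders rather than the exact ones from my project:

- CustomerKey (Int32): the customer whose data we are migrating
- SourceInstance, SourceDatabase (String): where the data lives today
- DestinationInstance, DestinationDatabase (String): the node receiving the data
- PerformDeletion (Boolean): whether to clear out partially-migrated data on the destination first
- PerformMigration (Boolean): whether to run the actual data flows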
From there, I created a DTSX package and began to sketch out my sample map. Here's what the control flow for such a map looks like:
From this sketch, you can get a feel for what the final product should look like and do a bit of design planning up front. The first thing I want to point out is that I have two connections: one for the source node and one for the destination node. Any automated package manager I use will need to make sure that those connections are available. We'll ignore the expression portion for now and move on to the control flow.
In the control flow itself, I have a Sequence container named “Delete From New Tables,” which deletes records from my new node’s tables. Inside there is one Execute SQL task which looks a little like this:
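In essence, that task executes a single parameterized statement along these lines, with the ? marker mapped to the customer parameter on the task's Parameter Mapping tab (the procedure name here is just a placeholder):

EXEC dbo.Migration_DeleteCustomerData @CustomerKey = ?;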
For a simple mock-up scenario, you might even have this execute TRUNCATE TABLE and be done. In my case, I set up a stored procedure which performs batched deletion of records for a given customer's ID. In either event, this takes care of partially-migrated data and the primary key violations that would otherwise follow on a re-run. For now, I'm going to stick with this design, but later on in the series, we'll make it a bit more robust.
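As a rough sketch of what such a batched deletion procedure could look like (the table and procedure names are made up, and production code would want error handling and probably a configurable batch size):

CREATE PROCEDURE dbo.Migration_DeleteCustomerData
    @CustomerKey INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Delete in chunks so we never hold one enormous transaction open or bloat the log.
    DECLARE @RowsDeleted INT = 1;

    WHILE @RowsDeleted > 0
    BEGIN
        DELETE TOP (10000)
        FROM dbo.MyDimension
        WHERE CustomerKey = @CustomerKey;

        SET @RowsDeleted = @@ROWCOUNT;
    END;
END;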
Below the deletion sequence, we have another sequence for migrating data. You can see from the control flow image that inside the primary sequence, I have several sub-sequences. The reason I have these is simple: some tables have foreign key constraints linking to other tables, meaning that I would need to insert rows into the “parent” table before inserting rows into the “child” table. Note that I would also need to have a reverse set for deletions—I need to delete from the child tables before deleting records from the parent tables. Although I only have one table in my model, this setup helps remind me that I need to address this fact of life in my design.
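To make that ordering concrete, here is a simplified pair of tables (placeholders, not my real schema) showing why the sequencing matters:

-- The parent row must exist before any child row that references it.
CREATE TABLE dbo.ParentDim
(
    ParentKey INT NOT NULL,
    CONSTRAINT PK_ParentDim PRIMARY KEY (ParentKey)
);

CREATE TABLE dbo.ChildFact
(
    ChildKey INT NOT NULL,
    ParentKey INT NOT NULL,
    CONSTRAINT PK_ChildFact PRIMARY KEY (ChildKey),
    CONSTRAINT FK_ChildFact_ParentDim FOREIGN KEY (ParentKey)
        REFERENCES dbo.ParentDim (ParentKey)
);

-- Migration order: load ParentDim, then ChildFact.
-- Deletion order: delete from ChildFact, then ParentDim.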
Each of the sub-sequences will be full of data flows, each of which migrates a single table. Inside the lone data flow in my example, things are pretty simple:
I have a source node and a destination node. I won't include a picture of the source settings here because they're trivial: the source simply selects data from the source table where the customer key equals the key I pass in, as sketched below. For my destination, I have a straight column mapping; the only interesting part is on the Connection Manager tab.
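For reference, that source query is nothing more than a parameterized SELECT along the lines of the following, where the ? marker maps to the customer parameter (the table and column names are placeholders):

SELECT
    d.MyDimensionID,
    d.CustomerKey,
    d.SomeAttribute
FROM dbo.MyDimension d
WHERE d.CustomerKey = ?;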
On the Connection Manager tab, I have made a few tweaks. First, I made sure that Table Lock is on: I want to perform bulk insertions quickly, and that means locking the table. To minimize the amount of time I hold that table lock, I set my commit size equal to the number of rows in my buffer, meaning that I should issue a commit (and temporarily rescind that table lock) with each buffer. This gives other statements a chance to sneak in and do their work without my bulk insertion blocking everything for what could be a very long time. In addition to table locks and a 10K commit size, I set the maximum insert commit size to 2 million, because in my data set, 2 million rows tends to be the largest number of records in this table for a single customer.

Finally, I have to decide whether or not to keep identity values. In my scenario, this decision came down to two things. First, some of the dimensions have their keys generated through an identity integer, and if I didn't keep those identity values, I would need to re-translate the facts as I moved them from one node to another. It's much easier (and faster!) to keep the identity values than to create a bunch of lookup components.

The other consideration is that the identity value needs to be tied to a specific customer as well. Imagine this scenario: MyTable ID 1 on node 1 ties to customer A, while MyTable ID 1 on node 2 ties to customer B. If customer A and customer B ever get migrated to the same node, that migration would fail unless the customer key is part of the dimension's primary key. For any table which does not include the customer key in its primary key, I can't keep the source's identity values and instead have to let the destination generate new auto-incremented values. In my case, all of the relevant dimensions included the customer key, but there were a few facts which did not.
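As a minimal sketch of that keying strategy (again, the names are placeholders), the customer key becomes part of the primary key, which is what lets the same identity value exist for two different customers once they land on the same node. The SSIS Keep identity option behaves much like IDENTITY_INSERT does here:

CREATE TABLE dbo.MyDimension
(
    MyDimensionID INT IDENTITY(1,1) NOT NULL,
    CustomerKey   INT NOT NULL,
    SomeAttribute VARCHAR(50) NOT NULL,
    CONSTRAINT PK_MyDimension PRIMARY KEY (CustomerKey, MyDimensionID)
);

-- Preserve the source identity values during migration.
SET IDENTITY_INSERT dbo.MyDimension ON;

INSERT INTO dbo.MyDimension (MyDimensionID, CustomerKey, SomeAttribute)
VALUES (1, 101, 'Customer A row, originally from node 1'),
       (1, 202, 'Customer B row, originally from node 2');

SET IDENTITY_INSERT dbo.MyDimension OFF;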
The last thing I want to do is check the data flow settings. To help optimize bulk insert performance, I want to insert in batches of 10,000 records. To make sure I can fit 10,000 rows in a single buffer, I bump the default buffer size up to 100 MB. The server which hosts SSIS has enough RAM to support 100 MB buffers, and some of my dimensions are pretty wide: some rows might be in the 8-10 KB range, meaning that at 10 KB per row and 10,000 rows per buffer, we're looking at…a 100 MB buffer.
This gives us an idea of what we want to do before ever writing a line of BIML and also gives us something against which we can compare our BIML outputs. When developing something generic, it always helps to have a concrete implementation in mind for comparison purposes, if only to ensure that you don’t forget any steps.
From here, my next step in this journey is to use BIML to create a hard-coded replica of the ideal package, proving that I can use BIML to generate equivalent SSIS code which I can use later. Stay tuned!