Tracing Foreign Key Dependencies In SQL Server

In my latest post on using BIML, I created a table called MigrationStep and populated it with data. Now I want to show how exactly I generated that data and how you can find foreign key constraints.

Let’s use the AdventureWorksDW2012 database as an example and look at the thirteen tables we identified as being related to DimCustomer in some way. Note that my diagram is a bit of a post hoc analysis—I already know which tables are involved—but our solution doesn’t depend upon this level of foreknowledge.

AdventureWorksDWDiagram

Looking at this diagram, we can see that DimCustomer has “parent” tables and “child” tables. By this, I mean that DimCustomer has foreign key constraints leading to other (parent) tables, and there are tables (children) which have foreign key constraints leading back to DimCustomer. In particular, DimCustomer has a GeographyKey which ties back to DimGeography, making DimGeography a “parent” table for DimCustomer. If we want to load a row into DimCustomer, we need to load that row’s GeographyKey into DimGeography first.

In addition, there are two tables which have CustomerKey attributes which tie back to DimCustomer: FactInternetSales and FactSurveyResponse. Looking at FactInternetSales, there are a number of foreign keys pointing to other dimensions: DimProduct, DimPromotion, DimCurrency, etc. We need to load all of these dimensions before we can populate FactInternetSales…but before we can load DimProduct, we need to load DimProductCategory and then DimProductSubcategory.
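
If you want to verify these direct relationships without a database diagram, a quick query against sys.foreign_keys will do; here is a minimal sketch listing DimCustomer’s parents and children:

SELECT
	OBJECT_SCHEMA_NAME(f.referenced_object_id) + '.' + OBJECT_NAME(f.referenced_object_id) AS ParentTable,
	OBJECT_SCHEMA_NAME(f.parent_object_id) + '.' + OBJECT_NAME(f.parent_object_id) AS ChildTable
FROM sys.foreign_keys f
WHERE
	f.referenced_object_id = OBJECT_ID(N'dbo.DimCustomer')
	OR f.parent_object_id = OBJECT_ID(N'dbo.DimCustomer');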

With this in mind, I want to use SQL Server metadata to determine all of the tables which tie, directly or indirectly, to some base table. I’m going to start out with a big blob of code and then I’ll explain it in detail.

DECLARE
	@BaseSchemaName SYSNAME = N'dbo',
	@BaseTableName SYSNAME = N'DimCustomer',
	@BasePrimaryKeyName SYSNAME = N'CustomerKey';

WITH Predecessors(SchemaName, TableName, TableID, Ordinal, Path, JoinCriteria, WhereCriteria, LookupColumns, LookupColumnXML) AS
(
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		0 AS Ordinal,
		CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
	WHERE
		t.type = 'U'
		AND t.name = @BaseTableName
		AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Predecessors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
),
Successors(SchemaName, TableName, TableID, Ordinal, Path, NumberOfMoves, CurrentDirection, JoinCriteria, WhereCriteria, LookupColumns, LookupColumnXML) AS
(
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		0 AS Ordinal,
		CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
		CAST(0 AS INT) AS NumberOfMoves,
		'' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
	WHERE
		t.type = 'U'
		AND t.name = @BaseTableName
		AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal + 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END AS NumberOfMoves,
		'D' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.parent_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Successors tt
			ON f.referenced_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
		--We can change direction twice:  once to go "down" the tree to get child tables, and once more to get those
		--child tables' parents.  We don't need any more moves than that; this way, we get the minimum number of tables
		--necessary to populate our base table.
		AND NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END < 2
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		NumberOfMoves + CASE WHEN CurrentDirection = 'U' THEN 0 ELSE 1 END AS NumberOfMoves,
		'U' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Successors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
		--No check here for NumberOfMoves because we want to be able to always move up the tree to find foreign key constraints.
)
SELECT
	a.SchemaName,
	a.TableName,
	MAX(Ordinal) AS Section,
	a.JoinCriteria,
	MAX(a.Path) AS Path,
	--We can hardcode DISTINCT x.* here, but there are some tables which don't need the DISTINCT
	'DISTINCT x.*' AS SelectCriteria,
	a.WhereCriteria,
	--We can hardcode 0 here because all tables have primary key constraints
	0 AS NeedToDelete,
	a.LookupColumns,
	--A side effect of FOR XML Path is that < becomes &lt; and > becomes &gt;, so we need to fix those
	REPLACE(REPLACE(a.LookupColumnXML, '&lt;', '<'), '&gt;', '>') AS LookupColumnXML
FROM
(
	SELECT
		p.SchemaName,
		p.TableName,
		p.TableID,
		p.Ordinal,
		p.JoinCriteria,
		p.Path,
		p.WhereCriteria,
		p.LookupColumns,
		p.LookupColumnXML
	FROM Predecessors p

	UNION ALL

	SELECT
		s.SchemaName,
		s.TableName,
		s.TableID,
		s.Ordinal,
		s.JoinCriteria,
		s.Path,
		s.WhereCriteria,
		s.LookupColumns,
		s.LookupColumnXML
	FROM Successors s
		LEFT OUTER JOIN Predecessors p
			ON s.TableID = p.TableID
	WHERE
		p.TableID IS NULL
		AND s.Path LIKE '%/' + @BaseSchemaName + '.' + @BaseTableName + '/%'
) a
GROUP BY
	a.SchemaName,
	a.TableName,
	a.JoinCriteria,
	a.WhereCriteria,
	a.LookupColumns,
	a.LookupColumnXML
ORDER BY
	Section,
	SchemaName,
	TableName;

Yeah, that’s 370 lines of code. Don’t worry—we’ll walk through it piece by piece.

My first code segment is the Predecessors CTE. This is a recursive common table expression, so we’ll break it apart into bite-sized pieces. Let’s start with the base case of our recursive CTE and hide the LookupColumns and LookupColumnXML work:

SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	0 AS Ordinal,
	CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'TODO' AS LookupColumns,
	'TODO' AS LookupColumnXML
FROM sys.tables t
WHERE
	t.type = 'U'
	AND t.name = @BaseTableName
	AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName

This query looks pretty simple: look for a table named [@BaseSchemaName].[@BaseTableName]. In our example, that translates to dbo.DimCustomer. Once we have that, we select the schema and table names, the table ID (for easy comparison), and start setting some values. Ordinal is a value which tells us how deep we are in the recursion chain. This is the base case, so we’re at level 0. I also generate a path to tell us the exact chain we followed to get to this point. In this case, it’s just /dbo.DimCustomer/. I put the slashes in to separate levels, and needed to cast as VARCHAR(8000) because both halves of a recursive CTE must return the exact same data types, including sizes. For the WHERE clause, I’m assuming my table will be aliased as “x” and has my base primary key. If your base table has a composite key…well, good luck with the modifications…

That part was pretty easy, so let’s look at LookupColumns and LookupColumnXML. You’ll see that they’re both really the same thing, just in a slightly different form:

(
	SELECT
		CAST(STUFF
		(
			(
				SELECT
					', x.' + c.name
				FROM sys.key_constraints kc
					INNER JOIN sys.index_columns ic
						ON kc.parent_object_id = ic.object_id
						AND kc.unique_index_id = ic.index_id
					INNER JOIN sys.columns c
						ON ic.column_id = c.column_id
						AND ic.object_id = c.object_id
				WHERE
					kc.type = 'PK'
					AND kc.parent_object_id = t.object_id
				FOR XML PATH ('')
			), 1, 2, ''
		) AS VARCHAR(8000))
) AS LookupColumns,
(
	SELECT
		CAST(STUFF
		(
			(
				SELECT
					'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
				FROM sys.key_constraints kc
					INNER JOIN sys.index_columns ic
						ON kc.parent_object_id = ic.object_id
						AND kc.unique_index_id = ic.index_id
					INNER JOIN sys.columns c
						ON ic.column_id = c.column_id
						AND ic.object_id = c.object_id
				WHERE
					kc.type = 'PK'
					AND kc.parent_object_id = t.object_id
				FOR XML PATH ('')
			), 1, 0, ''
		) AS VARCHAR(8000))
) AS LookupColumnXML

I’ll focus on LookupColumns for this description. What I’m doing here is finding all of the columns associated with this table’s primary key and turning them into a comma-delimited list with the FOR XML PATH trick. This will create a string like “, x.Column1, x.Column2, x.Column3” and then I use the STUFF function to strip off the first two characters, leaving “x.Column1, x.Column2, x.Column3” for us.
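
If the FOR XML PATH trick is new to you, here is a standalone sketch using a table variable and three made-up column names:

DECLARE @Columns TABLE (name SYSNAME NOT NULL);
INSERT INTO @Columns(name) VALUES (N'Column1'), (N'Column2'), (N'Column3');

SELECT
	STUFF
	(
		(
			SELECT ', x.' + name
			FROM @Columns
			FOR XML PATH ('')
		), 1, 2, ''
	) AS LookupColumns;
--Returns:  x.Column1, x.Column2, x.Column3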

LookupColumnXML is basically the same thing, except that instead of a comma-delimited list, I’m generating a series of <Column SourceColumn="ColumnName" TargetColumn="ColumnName" /> elements, one per primary key column.

Having walked through this, let’s now look at the second half of the Predecessors recursive CTE. Because LookupColumns and LookupColumnXML are populated the same way throughout the entire code base, I’m going to replace those references with “LOOKED AT” in the remaining snippets to keep them readable.

UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		'LOOKED AT' AS LookupColumns,
		'LOOKED AT' AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Predecessors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'

In this code snippet, we have the recursive case. Once we look at our base table, we want to see any tables to which our base table links, followed by any tables to which those tables link, and so on until we run out of links. The way that we do this is in the FROM clause: we join sys.tables to sys.foreign_keys to get any foreign keys in which the parent object ID is not the referenced object ID (to skip self-referencing keys), and then we join to the Predecessors common table expression to form the chain. In our concrete scenario, DimCustomer has a foreign key to DimGeography, so the second round of recursion would bring in that table. DimGeography has a foreign key to DimSalesTerritory, so the third round of recursion would bring that table in. We aren’t limited to bringing in just one table per round, but given this particular data model, that’s all we get.

In the SELECT clause, we get the schema and table information (SchemaName, TableName, and TableID) the same as in the base case. After that, several things have changed. First of all, I decremented the Ordinal value. This lets us build a number line for precedence: I know that DimSalesTerritory precedes DimGeography, which itself precedes DimCustomer. Thus, I could assign DimCustomer a value of 0, DimGeography a value of -1, and DimSalesTerritory a value of -2. If I have two tables at the same level, it means that I can load records into both concurrently without violating foreign key constraints, assuming I have already loaded the appropriate records for any tables with earlier levels.
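
For our AdventureWorksDW2012 example, the full number line eventually works out as follows (these match the Section values in the MigrationStep table from my BIML post):

	Section -2: DimProductCategory, DimSalesTerritory
	Section -1: DimGeography, DimProductSubcategory
	Section  0: DimCurrency, DimCustomer, DimDate, DimProduct, DimPromotion
	Section  1: DimSalesReason, FactInternetSales, FactSurveyResponse
	Section  2: FactInternetSalesReason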

Next up is Path, in which I append the current schema and table name to the previous recursion level’s Path. Path is important here because foreign key relationships form a directed graph which can contain cycles, and recursion around a cycle never terminates (in practice, SQL Server kills the statement once it blows past the MAXRECURSION limit, 100 by default). If I have tables A, B, and C, in which A has a key to B, which has a key to C, which has a key back to A, my recursion would build up A, B, C, A, B, C, etc. I don’t want any cycles in my report, not only to protect against runaway recursion, but also because I’m interested in the shortest path to a table, and that necessarily precludes a cycle.
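
Here is a simplified illustration of the cycle check, using a hypothetical Path value:

DECLARE @Path VARCHAR(8000) = '/dbo.DimCustomer/dbo.DimGeography/';	--A simplified Path value

SELECT
	CASE
		WHEN @Path NOT LIKE '%/dbo.DimGeography/%' THEN 'Visit dbo.DimGeography'
		ELSE 'Skip dbo.DimGeography; it is already on this path'
	END AS Decision;
--Returns the "Skip" branch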

Once I take care of Path, I set my WhereCriteria value. The “c.” is going to be the base table alias that I plan to use. Because every table in this set eventually ties back to the base table, I can build a valid set of joins to filter every table to include only the records I care about.

So, that’s the Predecessors common table expression, which takes care of any dependencies for our base table. From there, we want to branch down to the children of our base table, as well as those tables’ children, etc. To do this, I use the Successors common table expression. You’ll note that the Successors common table expression is very similar to Predecessors, but includes just a little bit more. Just like with Predecessors, let’s start with the base case.

SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	0 AS Ordinal,
	CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
	CAST(0 AS INT) AS NumberOfMoves,
	'' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
WHERE
	t.type = 'U'
	AND t.name = @BaseTableName
	AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName

This base case is almost exactly the same as Predecessors. The two differences are the attributes NumberOfMoves and CurrentDirection. For our base case, we don’t have a direction (either “up” the chain to parents or “down” the chain to children), and we haven’t made any directional changes yet, so NumberOfMoves is 0.

Let’s look at the next part of this CTE:

UNION ALL
SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	tt.Ordinal + 1 AS Ordinal,
	CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
	NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END AS NumberOfMoves,
	'D' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
	INNER JOIN sys.foreign_keys f
		ON f.parent_object_id = t.object_id
		AND f.parent_object_id <> f.referenced_object_id
	INNER JOIN Successors tt
		ON f.referenced_object_id = tt.TableID
WHERE
	t.type = 'U'
	AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
	--We can change direction twice:  once to go "down" the tree to get child tables, and once more to get those
	--child tables' parents.  We don't need any more moves than that; this way, we get the minimum number of tables
	--necessary to populate our base table.
	AND NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END < 2

This part selects the “children” of our base table, as well as those tables’ children, and so on. Thus, CurrentDirection is “down” the tree. To understand NumberOfMoves, let me walk through this concrete example. Remember that DimCustomer is the base table and has an Ordinal of 0. There are two tables which have CustomerKeys which tie back to DimCustomer: FactInternetSales and FactSurveyResponse. At this point, we still have the base case’s CurrentDirection, which is the empty string. Because that doesn’t match ‘D’, we set NumberOfMoves equal to the base case’s number of moves (0) plus 1. We have changed direction once at this point.

Now, let’s go to the next level. FactInternetSales has one child table: FactInternetSalesReason. This table will be picked up in the third run of this CTE, so its ordinal will be 2, and we check number of moves again. Because the CurrentDirection for FactInternetSales is D and we’re still going down the chain, NumberOfMoves = parent number of moves (1) + 0, or 1. We have not changed direction; we’ve simply moved down the chain.

This is great, but leads to a problem: FactInternetSalesReason has a relationship with FactInternetSales, but also one to DimSalesReason. We don’t have DimSalesReason in our set anywhere, so trying to insert solely into the tables we’ve gathered so far will fail! That’s where the next part of the Successors CTE comes in:

UNION ALL
SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	tt.Ordinal - 1 AS Ordinal,
	CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
	NumberOfMoves + CASE WHEN CurrentDirection = 'U' THEN 0 ELSE 1 END AS NumberOfMoves,
	'U' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
	INNER JOIN sys.foreign_keys f
		ON f.referenced_object_id = t.object_id
		AND f.parent_object_id <> f.referenced_object_id
	INNER JOIN Successors tt
		ON f.parent_object_id = tt.TableID
WHERE
	t.type = 'U'
	AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
	--No check here for NumberOfMoves because we want to be able to always move up the tree to find foreign key constraints.

Now we’re moving “up” the tree to find any parents. For FactInternetSalesReason, we’d move back up to see DimSalesReason (remember that we won’t see FactInternetSales again because the Path check forbids cycles). As we move up the tree, we change the current direction from ‘D’ to ‘U’ and increment the number of directional changes if we were previously headed down the tree or are starting out at the base.
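
To make the bookkeeping concrete, here is the full trace for the branch we just walked through:

	DimCustomer                  Ordinal 0   CurrentDirection ''   NumberOfMoves 0
	-> FactInternetSales         Ordinal 1   CurrentDirection 'D'  NumberOfMoves 1  (first change: '' to 'D')
	-> FactInternetSalesReason   Ordinal 2   CurrentDirection 'D'  NumberOfMoves 1  (no change)
	-> DimSalesReason            Ordinal 1   CurrentDirection 'U'  NumberOfMoves 2  (second change: 'D' to 'U')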

Note my comment at the bottom of this CTE: I do not check the number of directional changes because I always want to be able to drive up through the tree to include any tables required for loading. This leads to a final question on the topic: why check the number of moves at all? The reason is to prevent what I’ll call “filling the tree.” I might start from DimCustomer and drive down to FactInternetSales. From there, we need to go up the tree and load DimProduct. If I had an unlimited number of up-and-down moves, the recursive CTE would then move down the tree from DimProduct into FactProductInventory. But nothing in FactProductInventory really ties back to a particular customer—I don’t need that table loaded. This is why I allow a single downward leg but as much upward movement as the recursive CTE needs.

With all of this in mind, my final query starts with a subquery:

SELECT
	p.SchemaName,
	p.TableName,
	p.TableID,
	p.Ordinal,
	p.JoinCriteria,
	p.Path,
	p.WhereCriteria,
	p.LookupColumns,
	p.LookupColumnXML
FROM Predecessors p

UNION ALL

SELECT
	s.SchemaName,
	s.TableName,
	s.TableID,
	s.Ordinal,
	s.JoinCriteria,
	s.Path,
	s.WhereCriteria,
	s.LookupColumns,
	s.LookupColumnXML
FROM Successors s
	LEFT OUTER JOIN Predecessors p
		ON s.TableID = p.TableID
WHERE
	p.TableID IS NULL
	AND s.Path LIKE '%/' + @BaseSchemaName + '.' + @BaseTableName + '/%'

Yeah, this could be another common table expression, but for the purposes of demonstration, I wanted to separate my two recursive CTEs from the mechanics of combining results together.

There are two interesting bits in this subquery, both on the Successors side of the union. First of all, I join Successors back to Predecessors and filter out any tables which exist in Predecessors. This has one important use: the way that I derive my final Section number is to check the MAX(Ordinal). If I have a table in Predecessors, that number should be negative, but if it shows up in Successors, it will be positive. If the number is positive, then a process would try to load my base table’s parent table only after trying to load the base table! This would cause a foreign key constraint violation, so instead I have to take care of this scenario. As a note, this actually does appear in the data set: DimSalesTerritory is in the Predecessors list as well as the Successors list (where it ties to FactInternetSales).

The other interesting part is that I force Path to include my base table. This is technically unnecessary, but in case somebody starts monkeying with the recursive CTE and introduces invalid rows, it acts as a weak safety filter.

That all forms my derived table, which I alias as “a.” From there, I have my final SELECT, GROUP BY, and ORDER BY clauses:

SELECT
	a.SchemaName,
	a.TableName,
	MAX(Ordinal) AS Section,
	a.JoinCriteria,
	MAX(a.Path) AS Path,
	--We can hardcode DISTINCT x.* here, but there are some tables which don't need the DISTINCT
	'DISTINCT x.*' AS SelectCriteria,
	a.WhereCriteria,
	--We can hardcode 0 here because all tables have primary key constraints
	0 AS NeedToDelete,
	a.LookupColumns,
	--A side effect of FOR XML Path is that < becomes &lt; and > becomes &gt;, so we need to fix those
	REPLACE(REPLACE(a.LookupColumnXML, '&lt;', '<'), '&gt;', '>') AS LookupColumnXML
FROM
(LOOKED AT) a
GROUP BY
	a.SchemaName,
	a.TableName,
	a.JoinCriteria,
	a.WhereCriteria,
	a.LookupColumns,
	a.LookupColumnXML
ORDER BY
	Section,
	SchemaName,
	TableName;

As noted above, my Section is the largest value of Ordinal. This way, in the event that a table shows up at two separate points, we can ensure that all predecessors are loaded. I also show the path, which lets us see how the table is connected. The rest of this is simply hard-coding values that I need or displaying the set of lookup columns.

One important note on this is that I do not have join criteria created. Given that I know the tables and keys, I could create a subquery in my CTEs which defines valid join criteria. But because I did this for a small data set, I decided to hand-craft my join criteria. If I ever need to do this for a larger table set, I’ll probably write the query to do this correctly… In any event, we now have a table which describes our foreign key relationships between tables. Of course, if you have a database without foreign key constraints, you can’t walk up and down the chain this way, so that’s yet another reason to populate those foreign keys, whether it be a normalized system or Kimball-style warehouse.
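
If I ever do automate it, the building block would be sys.foreign_key_columns. Here is a minimal sketch for a single foreign key; the constraint name FK_DimCustomer_DimGeography is an assumption based on AdventureWorksDW’s naming convention, and chaining multiple hops with unique aliases is the part I hand-crafted instead:

--A sketch only:  x aliases the referenced (parent) table being loaded and c aliases
--the referencing (child) table.  The constraint name below is an assumption.
SELECT
	'INNER JOIN ' + OBJECT_SCHEMA_NAME(f.parent_object_id) + '.' + OBJECT_NAME(f.parent_object_id) + ' c ON '
		+ STUFF
		(
			(
				SELECT
					' AND c.' + pc.name + ' = x.' + rc.name
				FROM sys.foreign_key_columns fkc
					INNER JOIN sys.columns pc
						ON fkc.parent_object_id = pc.object_id
						AND fkc.parent_column_id = pc.column_id
					INNER JOIN sys.columns rc
						ON fkc.referenced_object_id = rc.object_id
						AND fkc.referenced_column_id = rc.column_id
				WHERE
					fkc.constraint_object_id = f.object_id
				FOR XML PATH ('')
			), 1, 5, ''
		) AS JoinCriteria
FROM sys.foreign_keys f
WHERE
	f.name = N'FK_DimCustomer_DimGeography';
--Returns:  INNER JOIN dbo.DimCustomer c ON c.GeographyKey = x.GeographyKey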

Generating An Empty Node

In BIML, Part 5, I talked about AdventureWorksDW2012 and AdventureWorksDW2012Node2. If you don’t know how to create an empty node for testing, it’s pretty easy. For this example, I’m going to assume that you have SQL Server 2012 or 2014 installed. If you’re still on 2008 R2 or earlier, this will still work but you’ll need to change all of the “AdventureWorksDW2012” references to be the correct version.

The first step is to download the AdventureWorksDW2012 database file. Once you have that, you can attach it to your SQL Server instance—but be sure to remove the invalid log file reference when you attach! The following screen shot shows you the Attach Databases screen:

AttachDatabase
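
If you would rather script the attach than use the dialog, here is a sketch. The file path is an assumption, so point it at wherever you saved the downloaded data file; ATTACH_REBUILD_LOG builds a fresh log file, which takes care of the invalid log file reference:

--The FILENAME path is an assumption; adjust it to your environment.
CREATE DATABASE [AdventureWorksDW2012]
ON (FILENAME = N'C:\Temp\AdventureWorksDW2012_Data.mdf')
FOR ATTACH_REBUILD_LOG;
GO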

Once you have AdventureWorksDW2012 set up, right-click on the database and go to Tasks –> Generate Scripts.

GenerateScripts

From there, you’ll walk through a wizard. In this scenario, we want to script out all objects and make the new database a replica of the old—except without the data.

ScriptAllObjects

Click Next and you’ll have the option to save this in a new query window, which is what I’m going to do.

SaveToNewQueryWindow

This gives you a fairly long file which looks a little bit like:

Scripts

There’s just one problem with this: I still have everything named AdventureWorksDW2012, but I want it named AdventureWorksDW2012Node2. So let’s hit Ctrl+H and bring up the Quick Replace window.

RenameObjects

Then, we’ll fill out the “Find what:” and “Replace with:” sections and hit the “Replace All” button. That renames all references of AdventureWorksDW2012 in our script to be AdventureWorksDW2012Node2.

RenamedObjects

After we have all of that done, hit F5 or the Execute button and you now have an empty node for your own testing purposes.

BIML, Part 5—Expanding To Multiple Tables

In my last post on the topic, we walked through the BIML necessary to generate an SSIS package. The end result is useful and allows us to treat SSIS packages as binary code rather than source code, but the real power of BIML is in automation. In this case, I want to automate package generation for every relevant table in my database.

From here on out, I’m going to use AdventureWorksDW2012 as my warehouse. My local SQL Server instance has two databases: AdventureWorksDW2012 and AdventureWorksDW2012Node2. AdventureWorksDW2012Node2 is the same as the primary database except that it contains only the structure and not any data. That way, we can migrate data from the primary node to Node2.

In our scenario, we want to move data for a customer from one node to another. This drives from DimCustomer.CustomerKey. To insert a DimCustomer record, we need DimGeography populated, and to populate that, we need to load DimSalesTerritory. The three fact tables which tie back to DimCustomer (directly or, in the case of FactInternetSalesReason, through FactInternetSales) are FactInternetSales, FactInternetSalesReason, and FactSurveyResponse. To get those facts’ data in, we need to have valid information for DimProduct, DimDate, DimPromotion, DimCurrency, and DimSalesReason. Before I can load DimProduct, I need to load DimProductCategory and then DimProductSubcategory. All in all, this means that I need to migrate data from 13 separate tables to migrate customer data to another node.

To maintain this information and know in which order I need to load each table, I’ve created a table called MigrationStep. In a future post, I’ll describe how to populate MigrationStep so we can keep this table in sync if the underlying fact-dimension structure changes. For now, I’ll offer this as a T-SQL script to create and populate the table:

USE [AdventureWorksDW2012]
GO
CREATE TABLE [dbo].[MigrationStep]
(
	[MigrationStepID] [int] IDENTITY(1,1) NOT NULL,
	[SchemaName] [sysname] NOT NULL,
	[TableName] [sysname] NOT NULL,
	[Section] [int] NULL,
	[JoinCriteria] [nvarchar](2000) NULL,
	[SelectCriteria] [nvarchar](2000) NULL,
	[WhereCriteria] [nvarchar](100) NULL,
	[NeedToDelete] [bit] NOT NULL,
	[LookupColumns] [nvarchar](512) NULL,
	[LookupColumnXML] [nvarchar](1024) NULL,
	CONSTRAINT [PK_MigrationStep] PRIMARY KEY CLUSTERED 
	(
		[MigrationStepID] ASC
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]
GO

INSERT INTO dbo.MigrationStep
(
	SchemaName,
	TableName,
	Section,
	JoinCriteria,
	SelectCriteria,
	WhereCriteria,
	NeedToDelete,
	LookupColumns,
	LookupColumnXML
)
VALUES
('dbo', 'DimProductCategory', -2, 'INNER JOIN dbo.DimProductSubcategory s ON s.ProductCategoryKey = x.ProductCategoryKey INNER JOIN dbo.DimProduct p ON p.ProductSubcategoryKey = s.ProductSubcategoryKey INNER JOIN dbo.FactInternetSales f ON p.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductCategoryKey', '<Column SourceColumn="ProductCategoryKey" TargetColumn="ProductCategoryKey" />'),
('dbo', 'DimSalesTerritory', -2, 'INNER JOIN dbo.DimGeography g ON x.SalesTerritoryKey = g.SalesTerritoryKey INNER JOIN dbo.DimCustomer c ON g.GeographyKey = c.GeographyKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesTerritoryKey', '<Column SourceColumn="SalesTerritoryKey" TargetColumn="SalesTerritoryKey" />'),
('dbo', 'DimProductSubcategory', -1, 'INNER JOIN dbo.DimProduct p ON p.ProductSubcategoryKey = x.ProductSubcategoryKey INNER JOIN dbo.FactInternetSales f ON p.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductSubcategoryKey', '<Column SourceColumn="ProductSubcategoryKey" TargetColumn="ProductSubcategoryKey" />'),
('dbo', 'DimGeography', -1, 'INNER JOIN dbo.DimCustomer c ON x.GeographyKey = c.GeographyKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.GeographyKey', '<Column SourceColumn="GeographyKey" TargetColumn="GeographyKey" />'),
('dbo', 'DimCurrency', 0, 'INNER JOIN dbo.FactInternetSales f ON x.CurrencyKey = f.CurrencyKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.CurrencyKey', '<Column SourceColumn="CurrencyKey" TargetColumn="CurrencyKey" />'),
('dbo', 'DimCustomer', 0, '', 'DISTINCT x.*', 'x.CustomerKey = ?', 0, 'x.CustomerKey', '<Column SourceColumn="CustomerKey" TargetColumn="CustomerKey" />'),
('dbo', 'DimDate', 0, 'INNER JOIN dbo.FactInternetSales f ON x.DateKey = f.OrderDateKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', '1 = 1', 0, 'x.DateKey', '<Column SourceColumn="DateKey" TargetColumn="DateKey" />'),
('dbo', 'DimProduct', 0, 'INNER JOIN dbo.FactInternetSales f ON x.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductKey', '<Column SourceColumn="ProductKey" TargetColumn="ProductKey" />'),
('dbo', 'DimPromotion', 0, 'INNER JOIN dbo.FactInternetSales f ON x.PromotionKey = f.PromotionKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.PromotionKey', '<Column SourceColumn="PromotionKey" TargetColumn="PromotionKey" />'),
('dbo', 'DimSalesReason', 1, 'INNER JOIN dbo.FactInternetSalesReason fsr ON x.SalesReasonKey = fsr.SalesReasonKey INNER JOIN dbo.FactInternetSales f ON fsr.SalesOrderNumber = f.SalesOrderNumber AND fsr.SalesOrderLineNumber = f.SalesOrderLineNumber INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesReasonKey', '<Column SourceColumn="SalesReasonKey" TargetColumn="SalesReasonKey" />'),
('dbo', 'FactInternetSales', 1, 'INNER JOIN dbo.DimCustomer c ON x.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesOrderNumber, x.SalesOrderLineNumber', '<Column SourceColumn="SalesOrderNumber" TargetColumn="SalesOrderNumber" /><Column SourceColumn="SalesOrderLineNumber" TargetColumn="SalesOrderLineNumber" />'),
('dbo', 'FactSurveyResponse', 1, 'INNER JOIN dbo.DimCustomer c ON x.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SurveyResponseKey', '<Column SourceColumn="SurveyResponseKey" TargetColumn="SurveyResponseKey" />'),
('dbo', 'FactInternetSalesReason', 2, 'INNER JOIN dbo.FactInternetSales f ON x.SalesOrderNumber = f.SalesOrderNumber AND x.SalesOrderLineNumber = f.SalesOrderLineNumber INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesOrderNumber, x.SalesOrderLineNumber, x.SalesReasonKey', '<Column SourceColumn="SalesOrderNumber" TargetColumn="SalesOrderNumber" /><Column SourceColumn="SalesOrderLineNumber" TargetColumn="SalesOrderLineNumber" /><Column SourceColumn="SalesReasonKey" TargetColumn="SalesReasonKey" />');

The one thing I want to point out is that the line which loads DimDate has a WHERE clause of 1 = 1—that’s the easier method to populate that dimension, as otherwise I would need to UNION together all of the date keys for all of the facts and dimensions which join to DimDate, and it’s just simpler to move everything.
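
For contrast, here is a sketch of what a stricter DimDate filter might look like against just one fact table (the customer key value is hypothetical). The full version would need to UNION every date key column across every related table, which is why 1 = 1 is the pragmatic choice:

DECLARE @CustomerKey INT = 11000;	--A hypothetical customer key

SELECT DISTINCT
	x.*
FROM dbo.DimDate x
	INNER JOIN dbo.FactInternetSales f
		ON x.DateKey IN (f.OrderDateKey, f.DueDateKey, f.ShipDateKey)
	INNER JOIN dbo.DimCustomer c
		ON f.CustomerKey = c.CustomerKey
WHERE
	c.CustomerKey = @CustomerKey;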

Now that we have a table, it’s time to integrate that into our BIML that we developed. Just like last time, I’ll give you the full file first and then walk through it piece by piece.

<#@ template language="C#" hostspecific="true" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<#
	string metadataConnectionString ="Provider=SQLNCLI10;Server=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;";
	DataTable tables = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");
	DataTable sections = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section");
	DataTable sectionsDesc = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section DESC"); #>
	<PackageProjects>
		<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
			<Parameters>
				<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
				<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="OldWarehouseDatabase" DataType="String">AdventureWorksDW2012</Parameter>
				<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="NewWarehouseDatabase" DataType="String">AdventureWorksDW2012Node2</Parameter>
				<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
				<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
			</Parameters>
			<Packages>
				<Package PackageName="MigrateCustomer" />
			</Packages>
		</PackageProject>
	</PackageProjects>
	<Connections>
		<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
		<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012Node2;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	</Connections>
	<Packages>
		<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
			<Connections>
				<Connection ConnectionName="OldWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitalCatalog">@[$Project::OldWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
					</Expressions>
				</Connection>
				<Connection ConnectionName="NewWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
					</Expressions>
				</Connection>
			</Connections>
			<Tasks>
				<Container Name="Delete From New Tables" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sectionsDesc.Rows) { #>
						<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<# foreach (DataRow trow in tables.Rows) {
							//For this round, I want to make sure EVERY table has a Delete action.  We'll fix that later...
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)) {#>
								<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
									<DirectInput>
										DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
									</DirectInput>
								</ExecuteSQL>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
				
				<Container Name="Migrate To New Table" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sections.Rows) { #>
						<Container Name="Migrate To New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<# foreach (DataRow trow in tables.Rows) {
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0])) {#>
								<Dataflow Name="Migrate <#=trow[1]#>" DefaultBufferSize="104857600">
									<Transformations>
										<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
											<DirectInput>
												SELECT 
													<#=trow[4]#>
												FROM [<#=trow[0]#>].[<#=trow[1]#>] x 
													<#=trow[3]#>
												WHERE
													<#=trow[5]#>;
											</DirectInput>
											<Parameters>
												<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
											</Parameters>
										</OleDbSource>
										<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity=<# if(!Convert.ToBoolean(trow[6])) { #> "true" <# } else { #> "false" <# } #> UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
											<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
										</OleDbDestination>
									</Transformations>
								</Dataflow>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Given 25 more lines of BIML, we now have a package which deletes from 13 tables and migrates data to 13 tables. The best part is that if I add more tables, I don’t need to modify this BIML script–I just generate a new set of packages from the BIML and I’m good to go. In contrast, adding tables the old-fashioned way involves modifying packages and can get time-consuming depending upon how many steps your workflow has.

At any rate, let’s look at the changes, starting at the top.

<#@ template language="C#" hostspecific="true" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<#
	string metadataConnectionString ="Provider=SQLNCLI10;Server=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;";
	DataTable tables = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");
	DataTable sections = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section");
	DataTable sectionsDesc = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section DESC"); #>

On the very first line, I tell BIML that my templating language is C#. After that, I use the “import” declaration to bring in the .NET framework’s System.Data namespace. By the way, this templating may look familiar if you’re used to classic ASP or ASP.NET Web Forms.

My first major block of code is another templated section. In this, I create a connection string to my AdventureWorksDW2012 node, as that’s where my MigrationStep table is. I create three separate DataTables, one to get all of the columns, one to get just a distinct list of Sections, and one to get a distinct list of Sections in descending order. Each Section here is a grouping of tables; I have different sections because of foreign key chain dependencies. With these sections, I’ll use the descending-order list to delete from blocks of tables at a time and the ascending-order list to insert into blocks of tables.

<PackageProjects>
	<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
		<Parameters>
			<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
			<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="OldWarehouseDatabase" DataType="String">AdventureWorksDW2012</Parameter>
			<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="NewWarehouseDatabase" DataType="String">AdventureWorksDW2012Node2</Parameter>
			<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
			<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
		</Parameters>
		<Packages>
			<Package PackageName="MigrateCustomer" />
		</Packages>
	</PackageProject>
</PackageProjects>
<Connections>
	<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012Node2;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
</Connections>
<Packages>
	<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
		<Connections>
			<Connection ConnectionName="OldWarehouseNode">
				<Expressions>
					<Expression ExternalProperty="InitalCatalog">@[$Project::OldWarehouseDatabase]</Expression>
					<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
				</Expressions>
			</Connection>
			<Connection ConnectionName="NewWarehouseNode">
				<Expressions>
					<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
					<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
				</Expressions>
			</Connection>
		</Connections>

This next segment of code is exactly the same as in my first example. I’m setting up my project parameters, telling the project that there’s one package, and setting up a list of connections. Then, I start work on my one package, called MigrateCustomer. Once I get through this part, we get to the next crazy bit.

<Tasks>
	<Container Name="Delete From New Tables" ConstraintMode="Linear">
		<Expressions>
			<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
		</Expressions>
		<Tasks>
			<# foreach (DataRow row in sectionsDesc.Rows) { #>
			<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
				<Tasks>
					<# foreach (DataRow trow in tables.Rows) {
				//For this round, I want to make sure EVERY table has a Delete action.  We'll fix that later...
				if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)) {#>
					<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
						<DirectInput>
							DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
						</DirectInput>
					</ExecuteSQL>
					<# }
				} #>
				</Tasks>
			</Container>
			<# } #>
		</Tasks>
	</Container>

These few lines of code generate all of our Delete containers and tasks. Inside the Tasks element, there is a templated snippet of code: foreach (DataRow row in sectionsDesc.Rows) {. This snippet loops through sectionsDesc, which will be a set containing the values { 2, 1, 0, -1, -2 } based on our example. For each of these sections, we create a new container called “Delete From New Table [Section]” with its own set of tasks. Inside that container’s Tasks element, we have a couple lines of C#. The first line loops through each of our tables. Then, for each of our tables, we hit the second line of C#: if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)). This IF block checks to see if the Section for our table matches the current Section. If so, we check to see if the NeedToDelete flag is set to true…or if 1 = 1… Yeah, for this scenario, I’m going to force deletion on every table. Don’t worry—I’ll show you how to do this a better way later on in the series.

So, if we have a matching Section (i.e., our table belongs in this container) and if we pass the NeedToDelete check, then we create an ExecuteSQL task which deletes all data from that table (as represented by columns 0 and 1 in the table row array). We can’t TRUNCATE because we have foreign key constraints, and remember that you really should be batching these deletes and making sure you don’t delete any data you still want hanging around. For our demo, Node2 is empty so I don’t feel bad about deleting everything, but you probably won’t have that same nonchalant attitude about your production environment.
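
For reference, here is a sketch of the batched-delete pattern in case your new node is not empty; the batch size of 10,000 is arbitrary:

--A sketch of batched deletion; adjust the batch size to suit your environment.
DECLARE @RowCount INT = 1;
WHILE @RowCount > 0
BEGIN
	DELETE TOP (10000) FROM dbo.FactInternetSalesReason;
	SET @RowCount = @@ROWCOUNT;
END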

One set of containers down; now we move on to the next.

<Container Name="Migrate To New Table" ConstraintMode="Linear">
	<Expressions>
		<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
	</Expressions>
	<Tasks>
		<# foreach (DataRow row in sections.Rows) { #>
		<Container Name="Migrate To New Table <#=row[0]#>" ConstraintMode="Linear">
			<Tasks>
				<# foreach (DataRow trow in tables.Rows) {
			if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0])) {#>
				<Dataflow Name="Migrate <#=trow[1]#>" DefaultBufferSize="104857600">
					<Transformations>
						<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
							<DirectInput>
								SELECT 
									<#=trow[4]#>
								FROM [<#=trow[0]#>].[<#=trow[1]#>] x 
									<#=trow[3]#>
								WHERE
									<#=trow[5]#>;
							</DirectInput>
							<Parameters>
								<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
							</Parameters>
						</OleDbSource>
						<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
							<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
						</OleDbDestination>
					</Transformations>
				</Dataflow>
				<# }
			} #>
			</Tasks>
		</Container>
		<# } #>
	</Tasks>
</Container>

In this, our final real segment of code, I create another set of containers, one for each Section in ascending order—that is, { -2, -1, 0, 1, 2 }. This lets me insert rows for “parent” tables before inserting rows into “child” tables. That way, for example, I load DimProduct and DimCustomer before trying to load FactInternetSales.

For each of my section containers, I do another loop through the tables variable. If the table belongs in this section, then I would create a data flow named “Migrate [TableName].” Of course, if you have two tables with the same name (but a different schema), you would probably need to change this to be “Migrate [SchemaName].[TableName]” but I didn’t have to worry about that with my table set.

Inside our data flow, we have an OLE DB Source named Old Node and an OLE DB Destination named New Node. The source connects back to our AdventureWorksDW2012 database and ties in with MigrationStep pretty closely: we use that table to tell our SELECT column list (trow[4]), our schema name (trow[0]), table name (trow[1]), join criteria (trow[3]), and WHERE clause (trow[5]). Looking back at the MigrationStep table, you’ll notice that all of our WHERE clauses look something like c.CustomerKey = ?;. We need to fill in the parameter, so that’s what the Parameters element does, and we tie that back to the CustomerID project parameter.
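
For concreteness, here is what the source query for DimGeography looks like once BIML substitutes in that table’s MigrationStep values:

SELECT 
	DISTINCT x.*
FROM [dbo].[DimGeography] x 
	INNER JOIN dbo.DimCustomer c ON x.GeographyKey = c.GeographyKey
WHERE
	c.CustomerKey = ?;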

Our OLE DB destination is pretty simple. BIML knows how to hook up my source and destination components, and as soon as I tell it that I need to connect to [SchemaName].[TableName] on AdventureWorksDW2012Node2, it can also connect the metadata together and load my columns list automatically.

Once I’m done, all I have to do is right-click the BIML file and select Generate SSIS packages. That creates a package which looks a bit like:

MigrateCustomerBIML

Inside each of those control flows, I see the following data flow:

MigrateCustomerDataFlowBIML

This is the power of BIML in action.

BIML, Part 4—Implementing The Ideal Package

My last post in the BIML series introduced an SSIS package which would form the basis of our correct solution. Given this baseline, I want to write some BIML to generate our single package; that way, we can eventually expand our package to include more complex logic and solve our problem.

For the rest of this series, I’m going to assume you have BIDS Helper installed. In Visual Studio, add a new BIML file by right-clicking your project and selecting the “Add New BIML File” option. This creates a nearly-empty BIML file named BimlScript.biml. Open that file and you’ll see the following code:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

To generate our SSIS package, we know that we need the following things:

  1. Seven project parameters: CustomerID, the old and new warehouse server and database names, and flags controlling migration and deletion.
  2. Two database connections, one called NewWarehouseNode and one called OldWarehouseNode.
  3. A sequence named “Delete From New Tables,” inside of which is an Execute SQL task named “Prepare Table” which truncates our table.
  4. A sequence named “Migrate To New Table,” which has three sub-sequences: Migrate to New Table 1, 2, and 3. Inside 3, we have a data flow named “Migrate Table.” Inside this data flow is an OLE DB source and an OLE DB destination, each of which we need to configure. On the data flow level, the destination batch size needs to be set to 10,000 and the default buffer size to 100MB.

Armed with these requirements, we can get to work writing BIML. Before going any further, here’s a script to generate two databases with a table named SomeTable, so that our BIML doesn’t fail and we can see our work in action:

IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'ReportingOld')
BEGIN
	CREATE DATABASE [ReportingOld]
END
GO
IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'ReportingNew')
BEGIN
	CREATE DATABASE [ReportingNew]
END
GO
USE [ReportingOld]
GO
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'SomeTable')
BEGIN
	CREATE TABLE SomeTable
	(
		Id BIGINT IDENTITY(1,1) NOT NULL,
		CustomerID INT NOT NULL,
		Details VARCHAR(30) NOT NULL
	);

	INSERT INTO dbo.SomeTable(CustomerID, Details)
	VALUES
	(1234, 'ABC'),
	(1234, 'ABCD'),
	(1234, 'ABCDE'),
	(1234, 'ABCDEF'),
	(2345, 'ABC'),
	(2345, 'ABC'),
	(3456, 'ABC'),
	(4567, 'ABC');
END
GO
USE [ReportingNew]
GO
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'SomeTable')
BEGIN
	CREATE TABLE SomeTable
	(
		Id BIGINT IDENTITY(1,1) NOT NULL,
		CustomerID INT NOT NULL,
		Details VARCHAR(30) NOT NULL
	);
END
GO

Here’s a full version of the BIML, followed by a breakdown in parts:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<PackageProjects>
		<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
			<Parameters>
				<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
				<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="OldWarehouseDatabase" DataType="String">ReportingOld</Parameter>
				<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="NewWarehouseDatabase" DataType="String">ReportingNew</Parameter>
				<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
				<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
			</Parameters>
			<Packages>
				<Package PackageName="MigrateCustomer" />
			</Packages>
		</PackageProject>
	</PackageProjects>
	<Connections>
		<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingOld;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
		<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingNew;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	</Connections>
	<Packages>
		<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
			<Connections>
				<Connection ConnectionName="OldWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitalCatalog">@[$Project::OldWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
					</Expressions>
				</Connection>
				<Connection ConnectionName="NewWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
					</Expressions>
				</Connection>
			</Connections>
			<Tasks>
				<Container Name="Delete From New Tables" ConstraintMode="Parallel">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
					</Expressions>
					<Tasks>
						<ExecuteSQL Name="Prepare Table" ConnectionName="NewWarehouseNode">
							<DirectInput>
								TRUNCATE TABLE dbo.SomeTable;
							</DirectInput>
						</ExecuteSQL>
					</Tasks>
				</Container>
				<Container Name="Migrate To New Table" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
					</Expressions>
					<Tasks>
						<Container Name="Migrate To New Table 1" ConstraintMode="Linear">
						</Container>
						<Container Name="Migrate To New Table 2" ConstraintMode="Linear">
						</Container>
						<Container Name="Migrate To New Table 3" ConstraintMode="Linear">
							<Tasks>
								<Dataflow Name="Migrate Table" DefaultBufferSize="104857600">
									<Transformations>
										<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
											<DirectInput>
												SELECT * FROM dbo.SomeTable WHERE CustomerID = ?;
											</DirectInput>
											<Parameters>
												<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
											</Parameters>
										</OleDbSource>
										<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
											<ExternalTableOutput Table="[dbo].[SomeTable]">
											</ExternalTableOutput>
										</OleDbDestination>
									</Transformations>
								</Dataflow>
							</Tasks>
						</Container>
					</Tasks>
				</Container>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Once you save this BIML file, right-click on BimlScript.biml and select “Generate SSIS Packages.” It will create MigrateCustomer.dtsx, which is our BIML-generated package. This package does what our original package did, and it takes us 92 lines of XML. That may not seem like a good trade-off right now, but the power of BIML is that with just a few lines of C# added in, we can create dozens or hundreds of data flows and packages without wearing out your button-clicking finger.
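
As a taste of where this is going, here is a rough sketch (with a made-up table list) of how a couple of BimlScript nuggets could stamp out one package per table:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<Packages>
		<# foreach (var table in new[] { "SomeTable", "SomeOtherTable" }) { #>
		<!-- One package per table, each following the MigrateCustomer pattern. -->
		<Package Name="Migrate<#= table #>" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
		</Package>
		<# } #>
	</Packages>
</Biml>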

So now let’s break this BIML file down into components and explain them more thoroughly, starting with package projects.

<PackageProjects>
	<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
		<Parameters>
			<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
			<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="OldWarehouseDatabase" DataType="String">ReportingOld</Parameter>
			<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="NewWarehouseDatabase" DataType="String">ReportingNew</Parameter>
			<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
			<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
		</Parameters>
		<Packages>
			<Package PackageName="MigrateCustomer" />
		</Packages>
	</PackageProject>
</PackageProjects>

We have one package project named “MigrateWarehouseData,” which stores no sensitive data like usernames or passwords. In that project, we have a set of parameters, one for each project parameter we created in the previous blog post. We need to define data types, and we can provide default values as well. After that, we have a Packages section which lists each package in our project. In this case, I only have one package: MigrateCustomer, which will create MigrateCustomer.dtsx. This way, my package can read project parameters.

Next up on the hit list is connections:

<Connections>
	<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingOld;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingNew;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
</Connections>

These are my common connections that I will use across all packages. There’s nothing special about this: it’s just a name and a connection string pointing to my local SQL Server instance.

The next step is the Packages section, in which I define one package: MigrateCustomer.

<Packages>
	<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">

There are two properties I have set: ConstraintMode and ProtectionLevel. Just like my project parameters, I have no sensitive data stored, so I don’t need to protect the package with a password or user key. As for ConstraintMode, I have two options: Linear or Parallel. For this sample, I’m typically going to stick with Linear constraints, but there is one spot in which I choose to go Parallel. The difference is pretty simple: in Linear mode, BIML generates precedence constraints between tasks in the order we list them in the BIML file, whereas in Parallel mode, all tasks within a container are allowed to run concurrently (whether they actually do or not is a totally different story!).
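
As a quick illustration (the task names and trivial queries here are invented), the same two tasks behave differently under each mode:

<!-- Linear: BIML generates a precedence constraint, so Step 2 waits for Step 1. -->
<Container Name="Linear Example" ConstraintMode="Linear">
	<Tasks>
		<ExecuteSQL Name="Step 1" ConnectionName="NewWarehouseNode"><DirectInput>SELECT 1;</DirectInput></ExecuteSQL>
		<ExecuteSQL Name="Step 2" ConnectionName="NewWarehouseNode"><DirectInput>SELECT 2;</DirectInput></ExecuteSQL>
	</Tasks>
</Container>
<!-- Parallel: no precedence constraints, so both steps are free to run at the same time. -->
<Container Name="Parallel Example" ConstraintMode="Parallel">
	<Tasks>
		<ExecuteSQL Name="Step 1" ConnectionName="NewWarehouseNode"><DirectInput>SELECT 1;</DirectInput></ExecuteSQL>
		<ExecuteSQL Name="Step 2" ConnectionName="NewWarehouseNode"><DirectInput>SELECT 2;</DirectInput></ExecuteSQL>
	</Tasks>
</Container>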

Inside each package, we have a set of connections. In this case, I want to take my connection strings as given in the Connections section, but I want to do something a little special. We’re collecting the real server and database names for our old and new warehouses, so let’s put that information to good use:

<Connections>
	<Connection ConnectionName="OldWarehouseNode">
		<Expressions>
			<Expression ExternalProperty="InitalCatalog">@[$Project::OldWarehouseDatabase]</Expression>
			<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
		</Expressions>
	</Connection>
	<Connection ConnectionName="NewWarehouseNode">
		<Expressions>
			<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
			<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
		</Expressions>
	</Connection>
</Connections>

This snippet tells us that BIML can tie in with SSIS expressions. In this case, we are setting the InitialCatalog and ServerName properties on our connection. We can access project parameters the same way you would in the SSIS GUI: @[$Project::MyParameterName].

Now that we have the connections on our package, we can start laying out the control flow. All control flow tasks go in the Tasks section, and because our package has its constraint mode set to Linear, we will execute each top-level container one at a time. Let’s start by looking at the first container:

<Tasks>
	<Container Name="Delete From New Tables" ConstraintMode="Parallel">
		<Expressions>
			<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
		</Expressions>
		<Tasks>
			<ExecuteSQL Name="Prepare Table" ConnectionName="NewWarehouseNode">
				<DirectInput>
					TRUNCATE TABLE dbo.SomeTable;
				</DirectInput>
			</ExecuteSQL>
		</Tasks>
	</Container>

This container will delete from new tables to avoid any primary key violations. Because it’s a toy example, I’m just going to truncate the table; a more realistic example would delete rows in batches, but I’m leaving that as an exercise for the reader.

One more thing to note before I move on is that we can have container-level expressions. In this case, I want to set the “Disable” flag to be equivalent to the opposite of my PerformDeletion project parameter. I personally don’t like how SSIS has properties named “Disable” rather than “Enable;” I think it confuses the matter a little bit, as you either need to have parameters with negative phrasing—such as “DoNotPerformDeletion”—or flip the boolean like I do here. Because “Enabled” makes a lot more sense to me intuitively, I’d rather let my end user have the equivalent of “Enabled” and handle the flip myself.

With that aside handled, we’ll move on to the next container. This one is a container of containers, and has a constraint mode of Linear, meaning that we run each sub-container one at a time.

<Container Name="Migrate To New Table" ConstraintMode="Linear">
	<Expressions>
		<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
	</Expressions>
	<Tasks>
		<Container Name="Migrate To New Table 1" ConstraintMode="Linear">
		</Container>
		<Container Name="Migrate To New Table 2" ConstraintMode="Linear">
		</Container>
		<Container Name="Migrate To New Table 3" ConstraintMode="Linear">
			<Tasks>
				<Dataflow Name="Migrate Table" DefaultBufferSize="104857600">

Just like before, I have my expression to check whether I want to perform this part of the process. Once we get past that, I have a Tasks collection for this container. Inside that, I have three sub-containers. Two of these containers are empty, but as I mentioned last time, they help remind us that we need to think about multiple layers of operation—there are some tables we will need to migrate first and some we can hold off on until later.

Looking inside the third container, we have a data flow. This data flow is where we can set our default buffer size to 100 MB. Our default row count per buffer is already 10,000, so I don’t have to specify that property. Once I handle the properties, I move on to the Transformations section, which is the data flow equivalent of the control flow’s Tasks section. This is a relatively easy data flow, so as you’d expect, our Transformations section isn’t very complex.

					<Transformations>
						<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
							<DirectInput>
								SELECT * FROM dbo.SomeTable WHERE CustomerID = ?;
							</DirectInput>
							<Parameters>
								<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
							</Parameters>
						</OleDbSource>
						<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
							<ExternalTableOutput Table="[dbo].[SomeTable]">
							</ExternalTableOutput>
						</OleDbDestination>
					</Transformations>
				</Dataflow>
			</Tasks>
		</Container>
	</Tasks>
</Container>

Inside this section, we start off with an OLE DB source which connects to OldWarehouseNode and gets all of the data from dbo.SomeTable where the Customer ID is set to the value that we have in the CustomerID project parameter. Note that the way we access project parameters from an OLE DB source is different from the way we specify them in an expression. We don’t need the @[$Project::] cruft, but at the same time, we need to specify the project name of MigrateWarehouseData.

Once data gets through the source, we push it through to the destination, which is connected to NewWarehouseNode and connects to that database’s SomeTable table. We set a few bulk insert settings, such as a 10K batch size, keeping identity (which is different from our SSIS example last time around), turning on fast load if it’s available, and setting a max commit size of 2 million rows.

One important thing to note here is that if the input columns for our OLE DB destination component have the same names as the columns on our target table, we don’t need to specify any column mappings. If we do need explicit mappings, there’s a Column tag we can use in our OleDbDestination component to specify source and target column names, as well as a sort order if the data is sorted.
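
Here is a sketch of what explicit mappings might look like if, hypothetically, the source column were named DetailText while the target column is Details:

<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode">
	<ExternalTableOutput Table="[dbo].[SomeTable]" />
	<Columns>
		<!-- Maps a source column to a target column with a different name. -->
		<Column SourceColumn="DetailText" TargetColumn="Details" />
	</Columns>
</OleDbDestination>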

Put all of this together and you have a way to describe an SSIS package. Even if you don’t do much more with this, I’d argue that you should still write the BIML because you can check this BIML file into source control and see changes. If I need to flip the KeepIdentity bit, I can make that change and check it in, letting code reviewers easily see what changed. Otherwise, SSIS packages themselves are little better than binaries: so many of the changes are difficult to read, even if you do literally nothing other than resize boxes.

This is a pretty big step: right now, we have a BIML file that we can use to generate an SSIS package. From here, I want to take the next step and show how we can use the power of C# to generate packages dynamically. Stay tuned!

Deleting Data In Batches

In my last post, I talked about deleting partially-migrated data from an instance. My ideal set-based solution to this problem would be a simple query like:

BEGIN TRANSACTION;

DELETE
FROM dbo.MyTable
WHERE
	MyCriteria = 1;
	
COMMIT TRANSACTION;

The upside is that this query is trivial to write, easy to understand, and set-based. The downside, however, is that if you’re deleting a lot of rows, your SQL Server transaction log can blow up in size to the point where you might run out of disk space. This process can also block other readers or writers if you are not in Read Committed Snapshot Isolation mode (or if your readers are not using, ugh, Read Uncommitted). On a busy production system, you might not have the ability to block for as long as it takes to delete a lot of rows.

One pattern that I’ve taken to is batching deletions:

DECLARE
	@RowCount INT = 0,
	@n INT = 1000;  -- Batch size; tune this for your workload.

WHILE (1=1)
BEGIN
	BEGIN TRANSACTION;

	-- Delete up to @n matching rows. Each batch is its own transaction,
	-- keeping blocking and log growth short-lived.
	DELETE TOP(@n) x
	FROM dbo.MyTable x
	WHERE
		x.MyCriteria = 1;

	SET @RowCount = @@ROWCOUNT;

	COMMIT TRANSACTION;

	-- A short batch means there are no more matching rows.
	IF (@RowCount < @n)
	BEGIN
		BREAK;
	END
END

This way, we delete 1000 records at a time (or whatever you set @n to be). That will reduce blocking, allowing readers and writers time to sneak in between batches. If you have frequent transaction log backups, you can also prevent the log from growing too much, as each deletion is its own operation, rather than everything being in one big transaction. I’ve found that performance will differ based on a few factors, but generally, a number between 100 and 10,000 is probably best. I usually do this by order of magnitude, so I’ll test with 10K, then 1K, and then 100, and see which performs best. Sometimes I’ll try 5K or 500 to see if there’s a significant difference, but I normally don’t go too much further.

The biggest downside to this is that if you are not deleting based on a nice index, you may need to scan your table once for each batch. If I need to delete 1 million rows from a billion-row table, SQL Server needs to scan that billion-row table 1000 times (when @n = 1000 rows per batch), and that’s costly. To compensate, you should try to have an appropriate index; at least then you’re seeking against the index.
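
As a sketch (the index name is hypothetical), an index like this would let each batch seek rather than scan:

-- Supports the batched delete's WHERE clause so each batch
-- seeks instead of scanning the entire table.
CREATE NONCLUSTERED INDEX IX_MyTable_MyCriteria
	ON dbo.MyTable (MyCriteria);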

BIML, Part 3—The Ideal Package

In the last post, I talked about the problem we want to solve. The short version is, I want to build a process to migrate data from one node to another. To keep this simple, we’ll say that we can turn off warehouse ETL processes for a particular customer as we move their data.

Part of what makes BIML so powerful is that we can generalize a solution to a repetitive ETL problem. But before I can solve the general case, I typically want to try to solve a basic version of the problem. In this scenario, I want to sketch out how I can migrate data from one table. My most basic migration would be a data flow with a SQL Server source and a SQL Server destination. Of course, if I fail in the middle of migration, I can’t re-run that package because I would get primary key constraint violations (assuming I have primary keys!) or, even worse, could end up with duplicate data. A simple fix is to delete the partially-migrated data from my destination instance before attempting to insert data from the source instance.

The first thing I need for my sample, minimum-required-functionality map is at least one parameter to determine which customer I’m migrating. Then, I created parameters to hold my source and destination instance and database names. Finally, to make my debugging a little easier, I have a couple of parameters that control whether I want to delete partially-migrated data from the destination database and whether I want to perform the migration (which I might not do if I’m simply testing that deletion works). This leads to a project parameters listing that looks something like this:

PackageParameters

From there, I created a DTSX package and began to sketch out my sample map. Here’s the control flow for such a map:

ModelControlFlow

In this case, you can get a feel for what the final product should look like, as well as plan ahead slightly on design. The first thing I want to point out is that I have two connections: one for the source node and one for the destination node. Any automated package manager I use will need to make sure that those connections are available. We’ll ignore the expression portion for now and move on to the control flow.

In the control flow itself, I have a Sequence container named “Delete From New Tables,” which deletes records from my new node’s tables. Inside there is one Execute SQL task which looks a little like this:

ModelDelete

For a simple mock-up scenario, you might even have this execute TRUNCATE TABLE and be done. In my case, I set up a stored procedure which performs batched deletion of records for a given customer’s ID. In either event, this takes care of partial migrations and subsequent primary key violations. For now, I’m going to stick with this design, but later on in the series, we’ll make it a bit more robust.
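
To give a flavor of that stored procedure, here is a hypothetical sketch using the batching pattern from the Deleting Data In Batches post above; the object names are stand-ins:

CREATE PROCEDURE dbo.DeleteCustomerData
	@CustomerID INT,
	@BatchSize INT = 1000
AS
BEGIN
	SET NOCOUNT ON;

	-- Start at @BatchSize so we enter the loop at least once.
	DECLARE @RowCount INT = @BatchSize;

	WHILE (@RowCount = @BatchSize)
	BEGIN
		-- Each batch commits on its own, keeping locks and log growth small.
		DELETE TOP(@BatchSize) x
		FROM dbo.SomeTable x
		WHERE
			x.CustomerID = @CustomerID;

		SET @RowCount = @@ROWCOUNT;
	END
END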

Below the deletion sequence, we have another sequence for migrating data. You can see from the control flow image that inside the primary sequence, I have several sub-sequences. The reason I have these is simple: some tables have foreign key constraints linking to other tables, meaning that I would need to insert rows into the “parent” table before inserting rows into the “child” table. Note that I would also need to have a reverse set for deletions—I need to delete from the child tables before deleting records from the parent tables. Although I only have one table in my model, this setup helps remind me that I need to address this fact of life in my design.
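
For instance, using the AdventureWorksDW tables from earlier in this series, the deletion side has to run child-first. This is just a sketch, with @CustomerKey standing in for however you pass the customer in:

DECLARE @CustomerKey INT = 11000;  -- Hypothetical customer key.

-- Child tables first: these have foreign keys pointing back to DimCustomer.
DELETE FROM dbo.FactInternetSales WHERE CustomerKey = @CustomerKey;
DELETE FROM dbo.FactSurveyResponse WHERE CustomerKey = @CustomerKey;

-- Parent table last: nothing references this DimCustomer row any longer.
DELETE FROM dbo.DimCustomer WHERE CustomerKey = @CustomerKey;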

Each of the sub-sequences will be full of data flows, each of which migrates a single table. Inside the lone data flow in my example, things are pretty simple:

ModelDataFlow

I have a source node and a destination node. I won’t include a picture of the source settings here because it’s trivial: my source selects data from the source table where the customer key is equal to the key I’m going to pass in. For my destination, I have a straight mapping. The only interesting part is on the Connection Manager tab.

DestinationSettings

In the Connection Manager tab, I have made a few tweaks. First, I made sure that Table Lock is on. I want to be able to do bulk insertions quickly, and that means locking the table. To minimize the amount of time I hold that table lock, I set my commit size equal to the number of rows in my buffer, meaning that I should issue a commit (and temporarily rescind that table lock) with each buffer. This will give other statements a chance to sneak in and work without my bulk insertion blocking everything for what could be a very long time.

In addition to table locks and a 10K commit size, I set the maximum insert commit size equal to 2 million. The reason for this is that, in my data set, 2 million rows tends to be the largest number of records for this table for a given customer.

Finally, I have to make a decision on whether I need to keep identity values or not. In my scenario, this decision came down to two things. First, some of the dimensions had their keys generated through an identity integer, and if I didn’t keep that identity value, I would need to re-translate the facts as I moved them from one node to another. It’s much easier (and faster!) for me to keep the identity value and not need to create a bunch of lookup components. The other consideration is that this identity value needs to be tied to a specific customer as well. Imagine this scenario: MyTable ID 1 on node 1 ties to customer A, and MyTable ID 1 on node 2 ties to customer B. If customer A and customer B ever get migrated to the same node, that migration would fail if we did not include the customer key as part of our dimension’s primary key. If there are any tables which do not include the customer key as part of the primary key, I can’t use that primary key as such and would need to generate a new row with an auto-incremented identity value. In my case, all of the relevant dimensions included the customer key, but there were a few facts which did not.
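
A hypothetical table definition makes that composite-key point concrete:

-- With CustomerKey in the primary key, MyTable ID 1 for customer A and
-- MyTable ID 1 for customer B can live on the same node without colliding.
CREATE TABLE dbo.MyTable
(
	MyTableID INT IDENTITY(1,1) NOT NULL,
	CustomerKey INT NOT NULL,
	CONSTRAINT PK_MyTable PRIMARY KEY (MyTableID, CustomerKey)
);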

DataFlowSettings

The last thing I want to do is check the data flow settings. To help optimize bulk insert performance, I want to do batch insertions of 10,000 records. To make sure I can push 10K rows in a batch, I want to bump my default buffer size up to 100 MB. The server which hosts SSIS has enough RAM to support 100 MB buffers, and some of my dimensions are pretty wide: some of the rows might be in the 8-10 KB range, meaning that at 10 KB per row and 10,000 rows per buffer, we’re looking at…a 100 MB buffer.

This gives us an idea of what we want to do before ever writing a line of BIML and also gives us something against which we can compare our BIML outputs. When developing something generic, it always helps to have a concrete implementation in mind for comparison purposes, if only to ensure that you don’t forget any steps.

After this, my next step in this journey is going to be to use BIML to create a hard-coded replica of the ideal package, proving that I can use BIML to generate equivalent SSIS code which I can use later. Stay tuned!