BIDSHelper Bug: NoMatch Issue With Visual Studio 2013

BIDS Helper is a fantastic tool, but it’s not perfect. If you’re using Visual Studio 2013, there is a bug in BIDS Helper that prevents you from implementing lookups with NoMatch criteria like I used in BIML Part 6.

Here’s a quick-and-easy BIML package which fails:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="Instance" ConnectionString="Data Source=LOCALHOST;Initial Catalog=master;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
    </Connections>
    <Packages>
        <Package Name="NoMatchError" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
            <Connections>
                <Connection ConnectionName="Instance"></Connection>
            </Connections>
            <Tasks>
                <Container Name="Test NoMatch Error" ConstraintMode="Parallel">
                    <Tasks>
                        <Dataflow Name="NoMatch DataFlow">
                            <Transformations>
                                <OleDbSource Name="Source" ConnectionName="Instance">
                                    <DirectInput>
                                        SELECT name FROM master.sys.tables;
                                    </DirectInput>
                                </OleDbSource>
                                <Lookup Name="Lookup" OleDbConnectionName="Instance" CacheMode="Full" NoMatchBehavior="RedirectRowsToNoMatchOutput">
                                    <DirectInput>
                                        SELECT name FROM master.sys.tables;
                                    </DirectInput>
                                    <Inputs>
                                        <Column SourceColumn="name" TargetColumn="name"></Column>
                                    </Inputs>
                                    <Outputs></Outputs>
                                </Lookup>
                                <OleDbDestination Name="Destination" ConnectionName="Instance">
                                    <InputPath OutputPathName="Lookup.NoMatch"></InputPath>
                                    <ExternalTableOutput Table="sys.tables"></ExternalTableOutput>
                                </OleDbDestination>
                            </Transformations>
                        </Dataflow>
                    </Tasks>
                </Container>
            </Tasks>
        </Package>
    </Packages>
</Biml>

The error message that you get from this is “Destination Input references output NoMatchError.Test NoMatch Error.NoMatch DataFlow.Lookup.NoMatch, which is not found in the existing outputs in the data flow.”

To try to get around this, you might explicitly declare the NoMatch output inside the lookup’s <Outputs> element. This lets you generate the BIML package, but then you get an error in the SSIS package itself which reads: “Error at NoMatch DataFlow [Lookup [19]]: The Lookup.Outputs[Lookup Match Output] has an invalid error or truncation row disposition.”

On the Advanced Editor, the Component Properties tab shows that NoMatch behavior is wrong: it reads “Treat rows with no matching entries as errors” rather than “Send rows with no matching entries to the no match output.” This is a sign that BIDS Helper is either not setting no-match behavior or is setting it to error behavior.

Incidentally, changing the destination’s InputPath from Lookup.NoMatch to Lookup.Error and setting NoMatchBehavior="RedirectRowsToErrorOutput" on the lookup generates a package which redirects failed lookups to the Error output and successfully hooks that Error output to the destination.

Please vote on the issue and help stamp out this bug. If you need to use this pattern in the meantime, I’ve confirmed that it works with Visual Studio 2012, and you can also use Mist if you’ve purchased a copy.

BIML, Part 6—Lookups

Last time around, we built a BIML package which generates one data flow per table we wish to migrate. Before loading the customer’s data, we needed to delete existing data for that customer. This works fine with a toy database, but once you’re talking about millions or billions of rows of data in fact tables, the delete-and-then-reload model doesn’t scale at all. Instead, I’d like to insert into the new node only if that record does not already exist. The way that I’m going to implement this is by using a Lookup component in each data flow, between the SQL data source and SQL destination.

Doing this is going to require a few changes to the BIML. Let’s start with the final product, and then I’ll walk through each of the changes individually.

<#@ template language="C#" hostspecific="true" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<#
	string metadataConnectionString ="Provider=SQLNCLI10;Server=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;";
	DataTable tables = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML, CASE WHEN WhereCriteria = '1=1' THEN '1=1 \"' ELSE REPLACE(WhereCriteria, '?', '\" + (DT_WSTR, 12)@[$Project::CustomerID]') END AS ExpressionWhereCriteria FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");
	DataTable sections = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section");
	DataTable sectionsDesc = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section DESC"); #>
	<PackageProjects>
		<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
			<Parameters>
				<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
				<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="OldWarehouseDatabase" DataType="String">AdventureWorksDW2012</Parameter>
				<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="NewWarehouseDatabase" DataType="String">AdventureWorksDW2012Node2</Parameter>
				<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
				<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
			</Parameters>
			<Packages>
				<Package PackageName="MigrateCustomer" />
			</Packages>
		</PackageProject>
	</PackageProjects>
	<Connections>
		<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
		<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012Node2;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	</Connections>
	<Packages>
		<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
			<Connections>
				<Connection ConnectionName="OldWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::OldWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
					</Expressions>
				</Connection>
				<Connection ConnectionName="NewWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
					</Expressions>
				</Connection>
			</Connections>
			<Tasks>
				<Container Name="Delete From New Tables" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sectionsDesc.Rows) { #>
						<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<Container Name="Start Deleting..." ConstraintMode="Linear"></Container>
								<# foreach (DataRow trow in tables.Rows) {
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && Convert.ToBoolean(trow[6])) { #>
								<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
									<DirectInput>
										DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
									</DirectInput>
									<Parameters>
										<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" DataType="Int32" />
									</Parameters>
								</ExecuteSQL>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
				
				<Container Name="Migrate To New Table" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sections.Rows) { #>
						<Container Name="Migrate To New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<# foreach (DataRow trow in tables.Rows) {
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0])) {#>
								<Dataflow Name="Migrate <#=trow[1]#>" DefaultBufferSize="104857600">
									<Transformations>
										<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
											<DirectInput>
												SELECT 
													<#=trow[4]#> 
												FROM [<#=trow[0]#>].[<#=trow[1]#>] x 
													<#=trow[3]#> 
												WHERE
													<#=trow[5]#>;
											</DirectInput>
											<Parameters>
												<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
											</Parameters>
										</OleDbSource>
										<# if(!Convert.ToBoolean(trow[6])) {#>
										<Lookup Name="Lookup" OleDbConnectionName="NewWarehouseNode" CacheMode="Full" NoMatchBehavior="RedirectRowsToNoMatchOutput">
											<DirectInput>
												SELECT
													<#=trow[7]#>
												FROM [<#=trow[0]#>].[<#=trow[1]#>] x
											</DirectInput>
											<Inputs>
												<#=trow[8]#>
											</Inputs>
											<Outputs></Outputs>
										</Lookup>
										<# } #>
										<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity=<# if(!Convert.ToBoolean(trow[6])) { #> "true" <# } else { #> "false" <# } #> UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
										<# if(!Convert.ToBoolean(trow[6])) {#>
											<InputPath OutputPathName="Lookup.NoMatch"></InputPath>
										<# } #>
											<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
										</OleDbDestination>
									</Transformations>
									<Expressions>
										<Expression ExternalProperty="[Lookup].[SqlCommand]">
											"SELECT DISTINCT
												<#=trow[7]#>
											FROM [<#=trow[0]#>].[<#=trow[1]#>] x
												<#=trow[3]#>
											WHERE
												<#=trow[9]#>;"
										</Expression>
									</Expressions>
								</Dataflow>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Comparing this BIML to the previous set, you can see that we’re up to 143 lines of code. That seems like a lot, but look at what we’re getting in return; I think the extra few dozen lines are completely worth it.

Our first major change starts in the Delete container. Previously, we forced deleting from each table. Now, we only want to delete if the NeedToDelete field has a value of 1 for each relevant table. Here’s what we do now:

<Container Name="Delete From New Tables" ConstraintMode="Linear">
	<Expressions>
		<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
	</Expressions>
	<Tasks>
		<# foreach (DataRow row in sectionsDesc.Rows) { #>
		<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
			<Tasks>
				<Container Name="Start Deleting..." ConstraintMode="Linear"></Container>
				<# foreach (DataRow trow in tables.Rows) {
			if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && Convert.ToBoolean(trow[6])) { #>
				<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
					<DirectInput>
						DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
					</DirectInput>
					<Parameters>
						<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" DataType="Int32" />
					</Parameters>
				</ExecuteSQL>
				<# }
			} #>
			</Tasks>
		</Container>
		<# } #>
	</Tasks>
</Container>

For each section row, we create a container called “Delete From New Table [Section #]” and loop through the table set. For each table, if that table’s section equals the current section and its NeedToDelete flag is set to 1, we add an ExecuteSQL task to delete data from that table; otherwise, we skip it.
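The container-and-filter logic is easy to check outside of BIML. Here is a minimal Python sketch of which DELETE statements get emitted and in what order; the MigrationStep rows are made up for the example:

```python
def deletion_statements(table_rows):
    """Mimic the BIML template: one pass per section (descending),
    emitting a DELETE only for tables flagged NeedToDelete."""
    statements = []
    for section in sorted({row[2] for row in table_rows}, reverse=True):
        for schema, table, table_section, need_to_delete in table_rows:
            if table_section == section and need_to_delete:
                statements.append(f"DELETE FROM [{schema}].[{table}];")
    return statements

# Hypothetical MigrationStep rows: (SchemaName, TableName, Section, NeedToDelete)
rows = [
    ("dbo", "DimGeography", 0, False),
    ("dbo", "DimCustomer", 1, False),
    ("dbo", "FactInternetSales", 3, True),
    ("dbo", "FactSurveyResponse", 3, True),
]
```

Because the sections iterate in descending order, fact tables get cleaned out before their parent dimensions ever come up, just as in the generated package.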

The way that I handle lookups is with code changes in a couple of places. Let’s look at them, starting first with the lookup itself.

<# if(!Convert.ToBoolean(trow[6])) {#>
	<Lookup Name="Lookup" OleDbConnectionName="NewWarehouseNode" CacheMode="Full" NoMatchBehavior="RedirectRowsToNoMatchOutput">
		<DirectInput>
			SELECT
				<#=trow[7]#>
			FROM [<#=trow[0]#>].[<#=trow[1]#>] x
		</DirectInput>
		<Inputs>
			<#=trow[8]#>
		</Inputs>
		<Outputs></Outputs>
	</Lookup>
<# } #>

So, if NeedToDelete is set to 0, I can support a lookup component. I want to call my component Lookup and have it look up against the new warehouse node. You’ll notice that I set CacheMode to “Full” and my no-match behavior to redirect records to the no-match output. I want a full cache because I’d rather not hit the database over and over again for every row; that would be a horrible performance problem.

You’ll also notice that I have a special select statement: I’m using the LookupColumns column of MigrationStep, which holds the set of key columns associated with this table. In AdventureWorksDW2012, that’s almost always a single column, but there is one table with a two-part key. Then, once I have that input, I need to load the Inputs section. To do that, I use the contents of LookupColumnXML, which is an XML representation of my primary key columns. This tells SSIS how to join each incoming row from the old node to my lookup component, with its new node data. Because this is a simple check, I don’t have any outputs, so that section can be blank.
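To make the two metadata columns concrete, here is a sketch of what a couple of MigrationStep rows might carry. The DimCustomer values follow from its single-column key; the two-part-key table and its column names are purely illustrative:

```python
# Illustrative (not actual) MigrationStep metadata driving the lookup:
lookup_metadata = {
    "DimCustomer": {
        # feeds the lookup's DirectInput SELECT list
        "LookupColumns": "x.CustomerKey",
        # feeds the lookup's <Inputs> section
        "LookupColumnXML": '<Column SourceColumn="CustomerKey" TargetColumn="CustomerKey" />',
    },
    "TwoPartKeyTable": {  # hypothetical name for the one table with a composite key
        "LookupColumns": "x.KeyColumn1, x.KeyColumn2",
        "LookupColumnXML": '<Column SourceColumn="KeyColumn1" TargetColumn="KeyColumn1" />'
                           '<Column SourceColumn="KeyColumn2" TargetColumn="KeyColumn2" />',
    },
}
```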

I have a problem with this bit as it currently exists: I’m selecting all of the rows from my new warehouse node’s table and putting them into cache. If I have a billion rows and my primary key is 8 bytes, that’s 8GB of memory usage for that cache. On a beefy server, 8GB of memory won’t bring a machine down, but suppose I only need 100,000 of those billion rows; in that case, I’ve wasted a lot of time on rows that I am guaranteed never to use.
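The back-of-the-envelope math is worth spelling out. A rough sketch, using decimal gigabytes and ignoring any per-row cache overhead:

```python
rows_in_table = 1_000_000_000   # rows in the new node's table
key_bytes = 8                   # e.g., a BIGINT surrogate key
rows_needed = 100_000           # rows for the one customer we care about

full_cache_gb = rows_in_table * key_bytes / 1e9    # cache size with no filter
needed_mb = rows_needed * key_bytes / 1e6          # what we actually need
```

Caching 8 GB to use less than a megabyte of it is the waste the filtered expression below eliminates.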

As such, I really want to use an expression which joins back to CustomerKey. Unfortunately, the Lookup component does not support parameters in this way. Yes, you could use a cache component, save the data that way, and use it in lookups, but there’s an easier way to do this: use the data flow’s expression.

<Expressions>
	<Expression ExternalProperty="[Lookup].[SqlCommand]">
		"SELECT DISTINCT
			<#=trow[7]#>
		FROM [<#=trow[0]#>].[<#=trow[1]#>] x
			<#=trow[3]#>
		WHERE
			<#=trow[9]#>;"
	</Expression>
</Expressions>

In this case, you’ll see that I introduce a WHERE clause along with the join criteria necessary to tie back to DimCustomer and its all-important CustomerKey. I should note that the expression is very finicky: you need to use “ExternalProperty” and the value needs to be exactly “[Lookup].[SqlCommand]” if “Lookup” is the name of your lookup component. If you drop the brackets, BIML won’t be able to make the connection, and you’ll either get an error or the expression won’t work as you expect. The positive, however, is that I only get the records I absolutely need.

In order to support this, I needed to change the metadata SQL query at the top of the file just a little bit:

"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML, CASE WHEN WhereCriteria = '1=1' THEN '1=1 \"' ELSE REPLACE(WhereCriteria, '?', '\" + (DT_WSTR, 12)@[$Project::CustomerID]') END AS ExpressionWhereCriteria FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");

This replaces the “x.CustomerKey = ?” pattern with x.CustomerKey = " + (DT_WSTR, 12)@[$Project::CustomerID], which closes the expression’s string literal and splices in the CustomerID parameter.
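The string surgery in that CASE expression is easy to verify outside of SQL. A quick Python sketch of what the REPLACE produces:

```python
where_criteria = "x.CustomerKey = ?"

# Mirrors REPLACE(WhereCriteria, '?', '" + (DT_WSTR, 12)@[$Project::CustomerID]')
expression_where = where_criteria.replace(
    "?", '" + (DT_WSTR, 12)@[$Project::CustomerID]'
)
# The result terminates the SSIS expression's string literal and
# appends the cast project parameter in its place.
```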

Now that I have my lookup set up and filtering out matches from non-matches, I want to hook the destination to my lookup component. The important thing here is that I need to connect to the no-match path, not the match path. The default is to use the match path, so I need to make a change to my OleDbDestination call:

<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity=<# if(!Convert.ToBoolean(trow[6])) { #> "true" <# } else { #> "false" <# } #> UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
<# if(!Convert.ToBoolean(trow[6])) {#>
	<InputPath OutputPathName="Lookup.NoMatch"></InputPath>
<# } #>
	<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
</OleDbDestination>

This was a very small change. Basically, if NeedToDelete is set to 0, then add an InputPath tag and tell BIML that you want the NoMatch path of your lookup component.

The end result of all of this, once you generate the BIML, is this thing of beauty:

[Image: LookupSuccess]

Tracing Foreign Key Dependencies In SQL Server

In my latest post on using BIML, I created a table called MigrationStep and populated it with data. Now I want to show how exactly I generated that data and how you can find foreign key constraints.

Let’s use the AdventureWorksDW2012 database as an example and look at the thirteen tables we identified as being related to DimCustomer in some way. Note that my diagram is a bit of a post hoc analysis—I already know which tables are involved—but our solution doesn’t depend upon this level of foreknowledge.

[Image: AdventureWorksDWDiagram]

Looking at this diagram, we can see that DimCustomer has “parent” tables and “child” tables. By this, I mean that DimCustomer has foreign key constraints leading to other (parent) tables, and there are tables (children) which have foreign key constraints leading back to DimCustomer. In particular, DimCustomer has a GeographyKey which ties back to DimGeography, making DimGeography a “parent” table for DimCustomer. If we want to load a row into DimCustomer, we need to load that row’s GeographyKey into DimGeography first.

In addition, there are two tables which have CustomerKey attributes which tie back to DimCustomer: FactInternetSales and FactSurveyResponse. Looking at FactInternetSales, there are a number of keys: DimProduct, DimPromotion, DimCurrency, etc. We need to load all of these dimensions before we can populate FactInternetSales…but before we can load DimProduct, we need to load DimProductCategory and then DimProductSubcategory.

With this in mind, I want to use SQL Server metadata to determine all of the tables which tie directly to some base table. I’m going to start out with a big blob of code and then I’ll explain it in detail.

DECLARE
	@BaseSchemaName SYSNAME = N'dbo',
	@BaseTableName SYSNAME = N'DimCustomer',
	@BasePrimaryKeyName SYSNAME = N'CustomerKey';

WITH Predecessors(SchemaName, TableName, TableID, Ordinal, Path, JoinCriteria, WhereCriteria, LookupColumns, LookupColumnXML) AS
(
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		0 AS Ordinal,
		CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
	WHERE
		t.type = 'U'
		AND t.name = @BaseTableName
		AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Predecessors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
),
Successors(SchemaName, TableName, TableID, Ordinal, Path, NumberOfMoves, CurrentDirection, JoinCriteria, WhereCriteria, LookupColumns, LookupColumnXML) AS
(
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		0 AS Ordinal,
		CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
		CAST(0 AS INT) AS NumberOfMoves,
		'' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
	WHERE
		t.type = 'U'
		AND t.name = @BaseTableName
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal + 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END AS NumberOfMoves,
		'D' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.parent_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Successors tt
			ON f.referenced_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
		--We can change direction twice:  once to go "down" the tree to get child tables, and once more to get those
		--child tables' parents.  We don't need any more moves than that; this way, we get the minimum number of tables
		--necessary to populate our base table.
		AND NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END < 2
	UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		NumberOfMoves + CASE WHEN CurrentDirection = 'U' THEN 0 ELSE 1 END AS NumberOfMoves,
		'U' AS CurrentDirection,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							', x.' + c.name
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 2, ''
				) AS VARCHAR(8000))
		) AS LookupColumns,
		(
			SELECT
				CAST(STUFF
				(
					(
						SELECT
							'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
						FROM sys.key_constraints kc
							INNER JOIN sys.index_columns ic
								ON kc.parent_object_id = ic.object_id
								AND kc.unique_index_id = ic.index_id
							INNER JOIN sys.columns c
								ON ic.column_id = c.column_id
								AND ic.object_id = c.object_id
						WHERE
							kc.type = 'PK'
							AND kc.parent_object_id = t.object_id
						FOR XML PATH ('')
					), 1, 0, ''
				) AS VARCHAR(8000))
		) AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Successors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
		--No check here for NumberOfMoves because we want to be able to always move up the tree to find foreign key constraints.
)
SELECT
	a.SchemaName,
	a.TableName,
	MAX(Ordinal) AS Section,
	a.JoinCriteria,
	MAX(a.Path) AS Path,
	--We can hardcode DISTINCT x.* here, but there are some tables which don't need the DISTINCT
	'DISTINCT x.*' AS SelectCriteria,
	a.WhereCriteria,
	--We can hardcode 0 here because all tables have primary key constraints
	0 AS NeedToDelete,
	a.LookupColumns,
	--A side effect of FOR XML Path is that < becomes &lt; and > becomes &gt;, so we need to fix those
	REPLACE(REPLACE(a.LookupColumnXML, '&lt;', '<'), '&gt;', '>')
FROM
(
	SELECT
		p.SchemaName,
		p.TableName,
		p.TableID,
		p.Ordinal,
		p.JoinCriteria,
		p.Path,
		p.WhereCriteria,
		p.LookupColumns,
		p.LookupColumnXML
	FROM Predecessors p

	UNION ALL

	SELECT
		s.SchemaName,
		s.TableName,
		s.TableID,
		s.Ordinal,
		s.JoinCriteria,
		s.Path,
		s.WhereCriteria,
		s.LookupColumns,
		s.LookupColumnXML
	FROM Successors s
		LEFT OUTER JOIN Predecessors p
			ON s.TableID = p.TableID
	WHERE
		p.TableID IS NULL
		AND s.Path LIKE '%/' + @BaseSchemaName + '.' + @BaseTableName + '/%'
) a
GROUP BY
	a.SchemaName,
	a.TableName,
	a.JoinCriteria,
	a.WhereCriteria,
	a.LookupColumns,
	a.LookupColumnXML
ORDER BY
	Section,
	SchemaName,
	TableName;

Yeah, that’s 370 lines of code. Don’t worry—we’ll walk through it piece by piece.

My first code segment is the Predecessors CTE. This is a recursive common table expression, so we’ll break it apart into bite-sized pieces. Let’s start with the base case of our recursive CTE and hide the LookupColumns and LookupColumnXML work:

SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	0 AS Ordinal,
	CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'TODO' AS LookupColumns,
	'TODO' AS LookupColumnXML
FROM sys.tables t
WHERE
	t.type = 'U'
	AND t.name = @BaseTableName
	AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName

This query looks pretty simple: look for a table named [@BaseSchemaName].[@BaseTableName]. In our example, that translates to dbo.DimCustomer. Once we have that, we select the schema and table names, the table ID (for easy comparison), and start setting some values. Ordinal is a counter which tells us how deep we are in the recursion chain; this is the base case, so we’re at level 0. I also generate a path to tell us the exact chain we followed to get to this point. In this case, it’s just /dbo.DimCustomer/. I put the slashes in to separate levels, and needed to cast as VARCHAR(8000) because all parts of a recursive CTE must use exactly the same column types, including data sizes. For the WHERE clause, I’m assuming my table will be aliased as “x” and has my base primary key. If your base table has a composite key…well, good luck with the modifications…
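The Path column is doing double duty: it records the recursion chain, and it is also how the CTE avoids cycles via the Path NOT LIKE filter in the recursive members. A small Python sketch of that check:

```python
def already_visited(path, schema_name, table_name):
    # Mirrors: Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(..) + '.' + OBJECT_NAME(..) + '/%'
    return f"/{schema_name}.{table_name}/" in path

base_path = "/dbo.DimCustomer/"
# Each recursive step appends '/' + schema.table + '/' to the running path:
next_path = base_path + "/dbo.DimGeography/"
```

Any table whose /schema.table/ token already appears in the path is skipped, so a foreign key loop can never recurse forever.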

That part was pretty easy, so let’s look at LookupColumns and LookupColumnXML. You’ll see that they’re both really the same thing, just in a slightly different form:

(
	SELECT
		CAST(STUFF
		(
			(
				SELECT
					', x.' + c.name
				FROM sys.key_constraints kc
					INNER JOIN sys.index_columns ic
						ON kc.parent_object_id = ic.object_id
						AND kc.unique_index_id = ic.index_id
					INNER JOIN sys.columns c
						ON ic.column_id = c.column_id
						AND ic.object_id = c.object_id
				WHERE
					kc.type = 'PK'
					AND kc.parent_object_id = t.object_id
				FOR XML PATH ('')
			), 1, 2, ''
		) AS VARCHAR(8000))
) AS LookupColumns,
(
	SELECT
		CAST(STUFF
		(
			(
				SELECT
					'<Column SourceColumn="' + c.name + '" TargetColumn="' + c.name + '" />'
				FROM sys.key_constraints kc
					INNER JOIN sys.index_columns ic
						ON kc.parent_object_id = ic.object_id
						AND kc.unique_index_id = ic.index_id
					INNER JOIN sys.columns c
						ON ic.column_id = c.column_id
						AND ic.object_id = c.object_id
				WHERE
					kc.type = 'PK'
					AND kc.parent_object_id = t.object_id
				FOR XML PATH ('')
			), 1, 0, ''
		) AS VARCHAR(8000))
) AS LookupColumnXML

I’ll focus on LookupColumns for this description. What I’m doing here is finding all of the columns associated with this table’s primary key and turning them into a comma-delimited list with the FOR XML PATH trick. This will create a string like “, x.Column1, x.Column2, x.Column3” and then I use the STUFF function to strip off the first two characters, leaving “x.Column1, x.Column2, x.Column3” for us.
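
Since the STUFF/FOR XML PATH pattern can look opaque, here is a quick Python sketch of the same string-building logic, using made-up column names:

```python
# A Python analogue of the FOR XML PATH + STUFF trick.
# Hypothetical primary key column names for illustration.
pk_columns = ["Column1", "Column2", "Column3"]

# FOR XML PATH ('') concatenates one fragment per row:
concatenated = "".join(", x." + c for c in pk_columns)

# STUFF(..., 1, 2, '') removes the first two characters (the leading ", "):
lookup_columns = concatenated[2:]

print(lookup_columns)  # x.Column1, x.Column2, x.Column3
```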

LookupColumnXML is basically the same thing, except that instead of a comma-delimited list, I’m generating a series of elements of the form <Column SourceColumn="name" TargetColumn="name" />.

Having walked through this, let’s now look at the second half of the Predecessors recursive CTE.

UNION ALL
	SELECT
		OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
		OBJECT_NAME(t.object_id) AS TableName,
		t.object_id AS TableID,
		tt.Ordinal - 1 AS Ordinal,
		CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
		CAST('' AS VARCHAR(8000)) AS JoinCriteria,
		'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
		'LOOKED AT' AS LookupColumns,
		'LOOKED AT' AS LookupColumnXML
	FROM sys.tables t
		INNER JOIN sys.foreign_keys f
			ON f.referenced_object_id = t.object_id
			AND f.parent_object_id <> f.referenced_object_id
		INNER JOIN Predecessors tt
			ON f.parent_object_id = tt.TableID
	WHERE
		t.type = 'U'
		AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'

We’ve already looked at how LookupColumns and LookupColumnXML get populated, and it’s the same throughout the entire code base, so I’m going to replace those references with “LOOKED AT” to simplify the code.

In this code snippet, we have the recursive case. Once we look at our base table, we want to see any tables to which our base table links, followed by any tables to which those tables link, and so on until we run out of links. The way that we do this is in the FROM clause: we join sys.tables to sys.foreign_keys to get any tables in which the parent object ID is not the referenced object ID (to skip self joins), and then we join to the Predecessors common table expression to form the chain. In our concrete scenario, DimCustomer has a foreign key to DimGeography, so the second round of recursion would bring in that table. DimGeography has a foreign key to DimSalesTerritory, so the third round of recursion would bring that table in. We aren’t limited to bringing in just one table per round, but given this particular data model, that’s all we get.

In the SELECT clause, we get the schema and table information (SchemaName, TableName, and TableID) the same as in the base case. After that, several things have changed. First of all, I decremented the Ordinal value. This lets us build a number line for precedence: I know that DimSalesTerritory precedes DimGeography, which itself precedes DimCustomer. Thus, I could assign DimCustomer a value of 0, DimGeography a value of -1, and DimSalesTerritory a value of -2. If I have two tables at the same level, it means that I can load records into both concurrently without violating foreign key constraints, assuming I have already loaded the appropriate records for any tables with earlier levels.

Next up is Path, in which I append the current schema and table name to the previous recursion level’s Path. Path is important here because foreign key relationships form a directed graph which can contain cycles, and recursing through a cycle leads to stack overflows. If I have tables A, B, and C, in which A has a key to B, which has a key to C, which has a key back to A, my recursion would build up A, B, C, A, B, C, etc. I don’t want any cycles in my report (not only to protect against stack overflows, but also because I’m interested in the shortest path to a table, and that necessarily precludes a cycle).
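
To see the Path check in action, here is a simplified Python rendering of the traversal over a hypothetical three-table cycle (A references B, B references C, C references back to A). This is a sketch, not the CTE itself, but the cycle-cutting logic is the same:

```python
# Hypothetical FK graph: table -> list of tables it references.
references = {"A": ["B"], "B": ["C"], "C": ["A"]}

def predecessors(base):
    results = []
    # Each work item mirrors one row of the recursive CTE: (table, ordinal, path).
    stack = [(base, 0, "/" + base + "/")]
    while stack:
        table, ordinal, path = stack.pop()
        results.append((table, ordinal, path))
        for ref in references.get(table, []):
            # The Path NOT LIKE '%/<table>/%' check: skip tables already on this path.
            if "/" + ref + "/" not in path:
                stack.append((ref, ordinal - 1, path + ref + "/"))
    return results

# The cycle C -> A is cut off because /A/ is already in the path.
print(predecessors("A"))
# [('A', 0, '/A/'), ('B', -1, '/A/B/'), ('C', -2, '/A/B/C/')]
```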

Once I take care of path, I set my WhereCriteria value. The “c.” is going to be the base table alias that I plan to use. Because every table in this set eventually ties back to the base table, I can build a valid set of joins to filter every table to include only the records I care about.

So, that’s the Predecessors common table expression, which takes care of any dependencies for our base table. From there, we want to branch down to the children of our base table, as well as those tables’ children, etc. To do this, I use the Successors common table expression. You’ll note that the Successors common table expression is very similar to Predecessors, but includes just a little bit more. Just like with Predecessors, let’s start with the base case.

SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	0 AS Ordinal,
	CAST('/' + @BaseSchemaName + '.' + @BaseTableName + '/' AS VARCHAR(8000)) AS Path,
	CAST(0 AS INT) AS NumberOfMoves,
	'' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'x.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
WHERE
	t.type = 'U'
	AND t.name = @BaseTableName
	AND OBJECT_SCHEMA_NAME(t.object_id) = @BaseSchemaName

This base case is almost exactly the same as Predecessors. The two differences are the attributes NumberOfMoves and CurrentDirection. For our base case, we don’t have a direction (either “up” the chain to parents or “down” the chain to children), and we haven’t made any directional changes yet, so NumberOfMoves is 0.

Let’s look at the next part of this CTE:

UNION ALL
SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	tt.Ordinal + 1 AS Ordinal,
	CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
	NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END AS NumberOfMoves,
	'D' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
	INNER JOIN sys.foreign_keys f
		ON f.parent_object_id = t.object_id
		AND f.parent_object_id <> f.referenced_object_id
	INNER JOIN Successors tt
		ON f.referenced_object_id = tt.TableID
WHERE
	t.type = 'U'
	AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
	--We can change direction twice:  once to go "down" the tree to get child tables, and once more to get those
	--child tables' parents.  We don't need any more moves than that; this way, we get the minimum number of tables
	--necessary to populate our base table.
	AND NumberOfMoves + CASE WHEN CurrentDirection = 'D' THEN 0 ELSE 1 END < 2

This part selects the “children” of our base table, as well as those tables’ children, and so on. Thus, CurrentDirection is “down” the tree. To understand NumberOfMoves, let me walk through this concrete example. Remember that DimCustomer is the base table and has an Ordinal of 0. There are two tables with CustomerKey columns which tie back to DimCustomer: FactInternetSales and FactSurveyResponse. At this point, we still have the base case’s CurrentDirection, which is the empty string. Because that doesn’t match ‘D’, we set NumberOfMoves equal to the base case’s number of moves (0) plus 1. We have changed direction once at this point.

Now, let’s go to the next level. FactInternetSales has one child table: FactInternetSalesReason. This table will be picked up in the third run of this CTE, so its ordinal will be 2, and we check number of moves again. Because the CurrentDirection for FactInternetSales is D and we’re still going down the chain, NumberOfMoves = parent number of moves (1) + 0, or 1. We have not changed direction; we’ve simply moved down the chain.
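
The NumberOfMoves arithmetic is compact enough to sketch outside of SQL. Here is a Python rendering of the CASE expression, walked through the example above (the direction values mirror the CTE’s '', 'D', and 'U'):

```python
# A sketch of the NumberOfMoves bookkeeping from the Successors CTE:
# a move is only counted when the direction of travel changes.
def next_moves(number_of_moves, current_direction, new_direction):
    # Mirrors: NumberOfMoves + CASE WHEN CurrentDirection = <new> THEN 0 ELSE 1 END
    return number_of_moves + (0 if current_direction == new_direction else 1)

# Base table (direction '', 0 moves) down to FactInternetSales:
first = next_moves(0, "", "D")        # 1: first change of direction
# FactInternetSales down to FactInternetSalesReason:
second = next_moves(first, "D", "D")  # still 1: no change of direction
# Turning back up (toward DimSalesReason) would count as a second change:
third = next_moves(second, "D", "U")  # 2

print(first, second, third)  # 1 1 2
```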

This is great, but leads to a problem: FactInternetSalesReason has a relationship with FactInternetSales, but also one to DimSalesReason. We don’t have DimSalesReason in our set anywhere, so trying to insert solely into the tables we’ve gathered so far will fail! That’s where the next part of the Successors CTE comes in:

UNION ALL
SELECT
	OBJECT_SCHEMA_NAME(t.object_id) AS SchemaName,
	OBJECT_NAME(t.object_id) AS TableName,
	t.object_id AS TableID,
	tt.Ordinal - 1 AS Ordinal,
	CAST(Path + '/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/' AS VARCHAR(8000)) AS Path,
	NumberOfMoves + CASE WHEN CurrentDirection = 'U' THEN 0 ELSE 1 END AS NumberOfMoves,
	'U' AS CurrentDirection,
	CAST('' AS VARCHAR(8000)) AS JoinCriteria,
	'c.' + @BasePrimaryKeyName + ' = ?' AS WhereCriteria,
	'LOOKED AT' AS LookupColumns,
	'LOOKED AT' AS LookupColumnXML
FROM sys.tables t
	INNER JOIN sys.foreign_keys f
		ON f.referenced_object_id = t.object_id
		AND f.parent_object_id <> f.referenced_object_id
	INNER JOIN Successors tt
		ON f.parent_object_id = tt.TableID
WHERE
	t.type = 'U'
	AND Path NOT LIKE '%/' + OBJECT_SCHEMA_NAME(t.object_id) + '.' + OBJECT_NAME(t.object_id) + '/%'
	--No check here for NumberOfMoves because we want to be able to always move up the tree to find foreign key constraints.

Now we’re moving “up” the tree to find any parents. For FactInternetSalesReason, we’d move back up to see DimSalesReason (remember that we won’t see FactInternetSales again because the Path check forbids cycles). With this moving up the tree, we change the current direction from ‘D’ to ‘U’ and increment the number of directional changes if we were headed downstream or are starting out at the base.

Note my comment at the bottom of this CTE: I do not check the number of directional changes because I always want to be able to drive up through the tree to include any tables required for loading. This leads to a final question on the topic: why check for number of moves at all? The reason is to prevent what I’ll call “filling the tree.” I might start from DimCustomer and drive down to FactInternetSales. From there, we need to go up the tree and load DimProduct. If I had an unlimited number of up-and-down moves, the recursive CTE would then move down the tree from DimProduct into FactProductInventory. But nothing in FactProductInventory really ties back to a particular customer—I don’t need that table loaded. This is why I allow only one downward change of direction but all the upward movement the recursive CTE needs.

With all of this in mind, my final query starts with a subquery:

SELECT
	p.SchemaName,
	p.TableName,
	p.TableID,
	p.Ordinal,
	p.JoinCriteria,
	p.Path,
	p.WhereCriteria,
	p.LookupColumns,
	p.LookupColumnXML
FROM Predecessors p

UNION ALL

SELECT
	s.SchemaName,
	s.TableName,
	s.TableID,
	s.Ordinal,
	s.JoinCriteria,
	s.Path,
	s.WhereCriteria,
	s.LookupColumns,
	s.LookupColumnXML
FROM Successors s
	LEFT OUTER JOIN Predecessors p
		ON s.TableID = p.TableID
WHERE
	p.TableID IS NULL
	AND s.Path LIKE '%/' + @BaseSchemaName + '.' + @BaseTableName + '/%'

Yeah, this could be another common table expression, but for the purposes of demonstration, I wanted to separate my two recursive CTEs from the mechanics of combining results together.

There are two interesting bits in this subquery, both on the Successors side of the union. First of all, I join Successors back to Predecessors and filter out any tables which exist in Predecessors. This has one important use: the way that I derive my final Section number is to check the MAX(Ordinal). If I have a table in Predecessors, that number should be negative, but if it shows up in Successors, it will be positive. If the number is positive, then a process would try to load my base table’s parent table only after trying to load the base table! This would cause a foreign key constraint violation, so instead I have to take care of this scenario. As a note, this actually does appear in the data set: DimSalesTerritory is in the Predecessors list as well as the Successors list (where it ties to FactInternetSales).
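
To make the combination logic concrete, here is a Python sketch using a hypothetical subset of the (table, ordinal) rows; it mimics the LEFT JOIN anti-filter and the MAX(Ordinal) grouping from the final query:

```python
# Sample rows (table, ordinal); DimSalesTerritory appears in both CTEs.
predecessors = [("DimSalesTerritory", -2), ("DimGeography", -1), ("DimCustomer", 0)]
successors = [("DimCustomer", 0), ("FactInternetSales", 1), ("DimSalesTerritory", 1)]

# LEFT OUTER JOIN ... WHERE p.TableID IS NULL: drop Successors rows whose
# table already appears in Predecessors.
pred_tables = {t for t, _ in predecessors}
combined = predecessors + [(t, o) for t, o in successors if t not in pred_tables]

# GROUP BY table, taking MAX(Ordinal) as the Section.
sections = {}
for table, ordinal in combined:
    sections[table] = max(sections.get(table, ordinal), ordinal)

# DimSalesTerritory keeps its Predecessors ordinal (-2) instead of being
# bumped to +1, so it still loads before the base table.
print(sorted(sections.items(), key=lambda kv: kv[1]))
# [('DimSalesTerritory', -2), ('DimGeography', -1), ('DimCustomer', 0), ('FactInternetSales', 1)]
```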

My other interesting part is that I force Path to include my base table. As an aside, this is technically unnecessary, but just in case somebody starts monkeying with the recursive CTE and introduces invalid rows, this is a weak filter.

That all forms my base table, which I alias as “a.” From there, I have my final SELECT, GROUP BY, and ORDER BY clauses:

SELECT
	a.SchemaName,
	a.TableName,
	MAX(Ordinal) AS Section,
	a.JoinCriteria,
	MAX(a.Path) AS Path,
	--We can hardcode DISTINCT x.* here, but there are some tables which don't need the DISTINCT
	'DISTINCT x.*' AS SelectCriteria,
	a.WhereCriteria,
	--We can hardcode 0 here because all tables have primary key constraints
	0 AS NeedToDelete,
	a.LookupColumns,
	--A side effect of FOR XML Path is that < becomes &lt; and > becomes &gt;, so we need to fix those
	REPLACE(REPLACE(a.LookupColumnXML, '&lt;', '<'), '&gt;', '>') AS LookupColumnXML
FROM
(LOOKED AT) a
GROUP BY
	a.SchemaName,
	a.TableName,
	a.JoinCriteria,
	a.WhereCriteria,
	a.LookupColumns,
	a.LookupColumnXML
ORDER BY
	Section,
	SchemaName,
	TableName;

As noted above, my Section is the largest value of Ordinal. This way, in the event that a table shows up at two separate points, we can ensure that all predecessors are loaded. I also show the path, which lets us see how the table is connected. The rest of this is simply hard-coding values that I need or displaying the set of lookup columns.

One important note on this is that I do not have join criteria created. Given that I know the tables and keys, I could create a subquery in my CTEs which defines valid join criteria. But because I did this for a small data set, I decided to hand-craft my join criteria. If I ever need to do this for a larger table set, I’ll probably write the query to do this correctly… In any event, we now have a table which describes our foreign key relationships between tables. Of course, if you have a database without foreign key constraints, you can’t walk up and down the chain this way, so that’s yet another reason to populate those foreign keys, whether it be a normalized system or Kimball-style warehouse.

Generating An Empty Node

In BIML, Part 5, I talked about AdventureWorksDW2012 and AdventureWorksDW2012Node2. If you don’t know how to create an empty node for testing, it’s pretty easy. For this example, I’m going to assume that you have SQL Server 2012 or 2014 installed. If you’re still on 2008 R2 or earlier, this will still work, but you’ll need to change all of the “AdventureWorksDW2012” references to the correct version.

The first step is to download the AdventureWorksDW2012 database file. Once you have that, you can Attach it to your database—but be sure to remove the invalid log file reference when you attach! The following screen shot shows you the Attach Databases screen:

AttachDatabase

Once you have AdventureWorksDW2012 set up, right-click on the database and go to Tasks –> Generate Scripts.

GenerateScripts

From there, you’ll walk through a wizard. In this scenario, we want to script out all objects and make the new database a replica of the old—except without the data.

ScriptAllObjects

Click Next and you’ll have the option to save this in a new query window, which is what I’m going to do.

SaveToNewQueryWindow

This gives you a fairly long file which looks a little bit like:

Scripts

There’s just one problem with this: I still have everything named AdventureWorksDW2012, but I want it named AdventureWorksDW2012Node2. So let’s hit Ctrl+H and bring up the Quick Replace window.

RenameObjects

Then, we’ll fill out the “Find what:” and “Replace with:” sections and hit the “Replace All” button. That renames all references of AdventureWorksDW2012 in our script to be AdventureWorksDW2012Node2.

RenamedObjects

After we have all of that done, hit F5 or the Execute button and you now have an empty node for your own testing purposes.

BIML, Part 5—Expanding To Multiple Tables

In my last post on the topic, we walked through the BIML necessary to generate an SSIS package. The end result is useful and allows us to treat SSIS packages as binary code rather than source code, but the real power to BIML is in automation. In this case, I want to automate every relevant table in my database.

From here on out, I’m going to use AdventureWorksDW2012 as my warehouse. My local SQL Server instance has two databases: AdventureWorksDW2012 and AdventureWorksDW2012Node2. AdventureWorksDW2012Node2 is the same as the main instance except that I only have database structure and not any data. That way, we can migrate data from the primary node to Node2.

In our scenario, we want to move data for a customer from one node to another. This drives from DimCustomer.CustomerKey. To insert a DimCustomer record, we need DimGeography populated, and to populate that, we need to load DimSalesTerritory. The three facts which tie to DimCustomer are FactInternetSales, FactInternetSalesReason, and FactSurveyResponse. To get those facts’ data in, we need to have valid information for DimProduct, DimDate, DimPromotion, DimCurrency, and DimSalesReason. Before I can load DimProduct, I need to load DimProductCategory and DimProductSubcategory. All in all, this means that I need to migrate data from 13 separate tables to migrate customer data to another node.

To maintain this information and know in which order I need to load each table, I’ve created a table called MigrationStep. In a future post, I’ll describe how to populate MigrationStep so we can keep this table in sync if the underlying fact-dimension structure changes. For now, I’ll offer this as a T-SQL script to create and populate the table:

USE [AdventureWorksDW2012]
GO
CREATE TABLE [dbo].[MigrationStep]
(
	[MigrationStepID] [int] IDENTITY(1,1) NOT NULL,
	[SchemaName] [sysname] NOT NULL,
	[TableName] [sysname] NOT NULL,
	[Section] [int] NULL,
	[JoinCriteria] [nvarchar](2000) NULL,
	[SelectCriteria] [nvarchar](2000) NULL,
	[WhereCriteria] [nvarchar](100) NULL,
	[NeedToDelete] [bit] NOT NULL,
	[LookupColumns] [nvarchar](512) NULL,
	[LookupColumnXML] [nvarchar](1024) NULL,
	CONSTRAINT [PK_MigrationStep] PRIMARY KEY CLUSTERED 
	(
		[MigrationStepID] ASC
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]
GO

INSERT INTO dbo.MigrationStep
(
	SchemaName,
	TableName,
	Section,
	JoinCriteria,
	SelectCriteria,
	WhereCriteria,
	NeedToDelete,
	LookupColumns,
	LookupColumnXML
)
VALUES
('dbo', 'DimProductCategory', -2, 'INNER JOIN dbo.DimProductSubcategory s ON s.ProductCategoryKey = x.ProductCategoryKey INNER JOIN dbo.DimProduct p ON p.ProductSubcategoryKey = s.ProductSubcategoryKey INNER JOIN dbo.FactInternetSales f ON p.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductCategoryKey', '<Column SourceColumn="ProductCategoryKey" TargetColumn="ProductCategoryKey" />'),
('dbo', 'DimSalesTerritory', -2, 'INNER JOIN dbo.DimGeography g ON x.SalesTerritoryKey = g.SalesTerritoryKey INNER JOIN dbo.DimCustomer c ON g.GeographyKey = c.GeographyKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesTerritoryKey', '<Column SourceColumn="SalesTerritoryKey" TargetColumn="SalesTerritoryKey" />'),
('dbo', 'DimProductSubcategory', -1, 'INNER JOIN dbo.DimProduct p ON p.ProductSubcategoryKey = x.ProductSubcategoryKey INNER JOIN dbo.FactInternetSales f ON p.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductSubcategoryKey', '<Column SourceColumn="ProductSubcategoryKey" TargetColumn="ProductSubcategoryKey" />'),
('dbo', 'DimGeography', -1, 'INNER JOIN dbo.DimCustomer c ON x.GeographyKey = c.GeographyKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.GeographyKey', '<Column SourceColumn="GeographyKey" TargetColumn="GeographyKey" />'),
('dbo', 'DimCurrency', 0, 'INNER JOIN dbo.FactInternetSales f ON x.CurrencyKey = f.CurrencyKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.CurrencyKey', '<Column SourceColumn="CurrencyKey" TargetColumn="CurrencyKey" />'),
('dbo', 'DimCustomer', 0, '', 'DISTINCT x.*', 'x.CustomerKey = ?', 0, 'x.CustomerKey', '<Column SourceColumn="CustomerKey" TargetColumn="CustomerKey" />'),
('dbo', 'DimDate', 0, 'INNER JOIN dbo.FactInternetSales f ON x.DateKey = f.OrderDateKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', '1 = 1', 0, 'x.DateKey', '<Column SourceColumn="DateKey" TargetColumn="DateKey" />'),
('dbo', 'DimProduct', 0, 'INNER JOIN dbo.FactInternetSales f ON x.ProductKey = f.ProductKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.ProductKey', '<Column SourceColumn="ProductKey" TargetColumn="ProductKey" />'),
('dbo', 'DimPromotion', 0, 'INNER JOIN dbo.FactInternetSales f ON x.PromotionKey = f.PromotionKey INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.PromotionKey', '<Column SourceColumn="PromotionKey" TargetColumn="PromotionKey" />'),
('dbo', 'DimSalesReason', 1, 'INNER JOIN dbo.FactInternetSalesReason fsr ON x.SalesReasonKey = fsr.SalesReasonKey INNER JOIN dbo.FactInternetSales f ON fsr.SalesOrderNumber = f.SalesOrderNumber AND fsr.SalesOrderLineNumber = f.SalesOrderLineNumber INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesReasonKey', '<Column SourceColumn="SalesReasonKey" TargetColumn="SalesReasonKey" />'),
('dbo', 'FactInternetSales', 1, 'INNER JOIN dbo.DimCustomer c ON x.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesOrderNumber, x.SalesOrderLineNumber', '<Column SourceColumn="SalesOrderNumber" TargetColumn="SalesOrderNumber" /><Column SourceColumn="SalesOrderLineNumber" TargetColumn="SalesOrderLineNumber" />'),
('dbo', 'FactSurveyResponse', 1, 'INNER JOIN dbo.DimCustomer c ON x.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SurveyResponseKey', '<Column SourceColumn="SurveyResponseKey" TargetColumn="SurveyResponseKey" />'),
('dbo', 'FactInternetSalesReason', 2, 'INNER JOIN dbo.FactInternetSales f ON x.SalesOrderNumber = f.SalesOrderNumber AND x.SalesOrderLineNumber = f.SalesOrderLineNumber INNER JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey', 'DISTINCT x.*', 'c.CustomerKey = ?', 0, 'x.SalesOrderNumber, x.SalesOrderLineNumber, x.SalesReasonKey', '<Column SourceColumn="SalesOrderNumber" TargetColumn="SalesOrderNumber" /><Column SourceColumn="SalesOrderLineNumber" TargetColumn="SalesOrderLineNumber" /><Column SourceColumn="SalesReasonKey" TargetColumn="SalesReasonKey" />');

The one thing I want to point out is that the line which loads DimDate has a WHERE clause of 1 = 1—that’s the easier method to populate that dimension, as otherwise I would need to UNION together all of the date keys for all of the facts and dimensions which join to DimDate, and it’s just simpler to move everything.

Now that we have a table, it’s time to integrate that into our BIML that we developed. Just like last time, I’ll give you the full file first and then walk through it piece by piece.

<#@ template language="C#" hostspecific="true" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<#
	string metadataConnectionString ="Provider=SQLNCLI10;Server=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;";
	DataTable tables = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");
	DataTable sections = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section");
	DataTable sectionsDesc = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section DESC"); #>
	<PackageProjects>
		<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
			<Parameters>
				<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
				<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="OldWarehouseDatabase" DataType="String">AdventureWorksDW2012</Parameter>
				<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="NewWarehouseDatabase" DataType="String">AdventureWorksDW2012Node2</Parameter>
				<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
				<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
			</Parameters>
			<Packages>
				<Package PackageName="MigrateCustomer" />
			</Packages>
		</PackageProject>
	</PackageProjects>
	<Connections>
		<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
		<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012Node2;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	</Connections>
	<Packages>
		<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
			<Connections>
				<Connection ConnectionName="OldWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::OldWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
					</Expressions>
				</Connection>
				<Connection ConnectionName="NewWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
					</Expressions>
				</Connection>
			</Connections>
			<Tasks>
				<Container Name="Delete From New Tables" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sectionsDesc.Rows) { #>
						<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<# foreach (DataRow trow in tables.Rows) {
							//For this round, I want to make sure EVERY table has a Delete action.  We'll fix that later...
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)) {#>
								<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
									<DirectInput>
										DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
									</DirectInput>
								</ExecuteSQL>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
				
				<Container Name="Migrate To New Table" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
					</Expressions>
					<Tasks>
						<# foreach (DataRow row in sections.Rows) { #>
						<Container Name="Migrate To New Table <#=row[0]#>" ConstraintMode="Linear">
							<Tasks>
								<# foreach (DataRow trow in tables.Rows) {
							if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0])) {#>
								<Dataflow Name="Migrate <#=trow[1]#>" DefaultBufferSize="104857600">
									<Transformations>
										<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
											<DirectInput>
												SELECT 
													<#=trow[4]#>
												FROM [<#=trow[0]#>].[<#=trow[1]#>] x 
													<#=trow[3]#>
												WHERE
													<#=trow[5]#>;
											</DirectInput>
											<Parameters>
												<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
											</Parameters>
										</OleDbSource>
										<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity=<# if(!Convert.ToBoolean(trow[6])) { #> "true" <# } else { #> "false" <# } #> UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
											<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
										</OleDbDestination>
									</Transformations>
								</Dataflow>
								<# }
							} #>
							</Tasks>
						</Container>
						<# } #>
					</Tasks>
				</Container>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Given 25 more lines of BIML, we now have a package which deletes from 13 tables and migrates data to 13 tables. The best part is that if I add more tables, I don’t need to modify this BIML script–I just generate a new set of packages from the BIML and I’m good to go. In contrast, adding tables the old-fashioned way involves modifying packages and can get time-consuming depending upon how many steps your workflow has.

At any rate, let’s look at the changes, starting at the top.

<#@ template language="C#" hostspecific="true" #>
<#@ import namespace="System.Data" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<#
	string metadataConnectionString ="Provider=SQLNCLI10;Server=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Integrated Security=SSPI;";
	DataTable tables = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT SchemaName, TableName, Section, JoinCriteria, SelectCriteria, WhereCriteria, NeedToDelete, LookupColumns, LookupColumnXML FROM dbo.MigrationStep ORDER BY Section, SchemaName, TableName");
	DataTable sections = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section");
	DataTable sectionsDesc = ExternalDataAccess.GetDataTable(metadataConnectionString,
	"SELECT DISTINCT Section FROM dbo.MigrationStep ORDER BY Section DESC"); #>

On the very first line, I tell BIML that my templating language is C#. After that, I use the “import” declaration to bring in the .NET framework’s System.Data namespace. By the way, this templating may look familiar if you’re used to classic ASP or ASP.NET Web Forms.

My first major block of code is another templated section. In this, I create a connection string to my AdventureWorksDW2012 node, as that’s where my MigrationStep table is. I create three separate DataTables, one to get all of the columns, one to get just a distinct list of Sections, and one to get a distinct list of Sections in descending order. Each Section here is a grouping of tables; I have different sections because of foreign key chain dependencies. With these sections, I’ll use the descending-order list to delete from blocks of tables at a time and the ascending-order list to insert into blocks of tables.
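
The reason for keeping two orderings of the same Section list is easier to see in miniature. This Python sketch uses the Section values from our example:

```python
# Deletes must run from the highest (leaf-most) section down, and inserts
# from the lowest (parent-most) section up, so that foreign key constraints
# are never violated mid-migration.
sections = [-2, -1, 0, 1, 2]

delete_order = sorted(sections, reverse=True)  # children first
insert_order = sorted(sections)                # parents first

print(delete_order)  # [2, 1, 0, -1, -2]
print(insert_order)  # [-2, -1, 0, 1, 2]
```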

<PackageProjects>
	<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
		<Parameters>
			<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
			<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="OldWarehouseDatabase" DataType="String">AdventureWorksDW2012</Parameter>
			<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="NewWarehouseDatabase" DataType="String">AdventureWorksDW2012Node2</Parameter>
			<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
			<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
		</Parameters>
		<Packages>
			<Package PackageName="MigrateCustomer" />
		</Packages>
	</PackageProject>
</PackageProjects>
<Connections>
	<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=AdventureWorksDW2012Node2;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
</Connections>
<Packages>
	<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
		<Connections>
			<Connection ConnectionName="OldWarehouseNode">
				<Expressions>
					<Expression ExternalProperty="InitialCatalog">@[$Project::OldWarehouseDatabase]</Expression>
					<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
				</Expressions>
			</Connection>
			<Connection ConnectionName="NewWarehouseNode">
				<Expressions>
					<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
					<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
				</Expressions>
			</Connection>
		</Connections>

This next segment of code is exactly the same as in my first example. I’m setting up my project parameters, telling the project that there’s one package, and setting up a list of connections. Then, I start work on my one package, called MigrateCustomer. Once I get through this part, we get to the next crazy bit.

<Tasks>
	<Container Name="Delete From New Tables" ConstraintMode="Linear">
		<Expressions>
			<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
		</Expressions>
		<Tasks>
			<# foreach (DataRow row in sectionsDesc.Rows) { #>
			<Container Name="Delete From New Table <#=row[0]#>" ConstraintMode="Linear">
				<Tasks>
					<# foreach (DataRow trow in tables.Rows) {
				//For this round, I want to make sure EVERY table has a Delete action.  We'll fix that later...
				if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)) {#>
					<ExecuteSQL Name="Prepare <#=trow[1]#>" ConnectionName="NewWarehouseNode">
						<DirectInput>
							DELETE FROM [<#=trow[0]#>].[<#=trow[1]#>];
						</DirectInput>
					</ExecuteSQL>
					<# }
				} #>
				</Tasks>
			</Container>
			<# } #>
		</Tasks>
	</Container>

These few lines of code generate all of our Delete containers and tasks. Inside the Tasks element, there is a templated snippet of code: foreach (DataRow row in sectionsDesc.Rows) {. This snippet loops through sectionsDesc, which will be a set containing the values { 2, 1, 0, -1, -2 } based on our example. For each of these sections, we create a new container called “Delete From New Table [Section]” which contains one Execute SQL task per matching table. Inside that container’s Tasks element, we have a couple of lines of C#. The first line loops through each of our tables. Then, for each table, we hit the second line of C#: if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0]) && (Convert.ToBoolean(trow[6]) || 1 == 1)). This IF block checks whether the Section for our table matches the current Section. If so, we check whether the NeedToDelete flag is set to true…or whether 1 == 1… Yeah, for this scenario, I’m going to force deletion on every table. Don’t worry: I’ll show you how to do this a better way later on in the series.

So, if we have a matching Section (i.e., our table belongs in this container) and if we pass the NeedsDeleted check, then we create an ExecuteSQL task which deletes all data from that table (as represented by columns 0 and 1 in the table row array). We can’t TRUNCATE because we have foreign key constraints, and remember that you really should be batching these deletes and making sure you don’t delete any data you still want hanging around. For our demo, Node2 is empty so I don’t feel bad about deleting everything, but you probably won’t have that same nonchalant attitude about your production environment.

One set of containers down; now we move on to the next.
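To make the expansion concrete: for a hypothetical Section 2 table named dbo.FactInternetSales (the table name here is made up for illustration), the nested loops above would emit plain BIML along these lines:

```xml
<Container Name="Delete From New Table 2" ConstraintMode="Linear">
	<Tasks>
		<!-- One ExecuteSQL task per table whose Section value is 2 -->
		<ExecuteSQL Name="Prepare FactInternetSales" ConnectionName="NewWarehouseNode">
			<DirectInput>
				DELETE FROM [dbo].[FactInternetSales];
			</DirectInput>
		</ExecuteSQL>
	</Tasks>
</Container>
```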

<Container Name="Migrate To New Table" ConstraintMode="Linear">
	<Expressions>
		<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
	</Expressions>
	<Tasks>
		<# foreach (DataRow row in sections.Rows) { #>
		<Container Name="Migrate To New Table <#=row[0]#>" ConstraintMode="Linear">
			<Tasks>
				<# foreach (DataRow trow in tables.Rows) {
			if (Convert.ToInt32(trow[2]) == Convert.ToInt32(row[0])) {#>
				<Dataflow Name="Migrate <#=trow[1]#>" DefaultBufferSize="104857600">
					<Transformations>
						<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
							<DirectInput>
								SELECT 
									<#=trow[4]#>
								FROM [<#=trow[0]#>].[<#=trow[1]#>] x 
									<#=trow[3]#>
								WHERE
									<#=trow[5]#>;
							</DirectInput>
							<Parameters>
								<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
							</Parameters>
						</OleDbSource>
						<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
							<ExternalTableOutput Table="[<#=trow[0]#>].[<#=trow[1]#>]"></ExternalTableOutput>
						</OleDbDestination>
					</Transformations>
				</Dataflow>
				<# }
			} #>
			</Tasks>
		</Container>
		<# } #>
	</Tasks>
</Container>

In this, our final real segment of code, I create another set of containers, one for each Section in ascending order—that is, { -2, -1, 0, 1, 2 }. This lets me insert rows for “parent” tables before inserting rows into “child” tables. That way, for example, I load DimProduct and DimCustomer before trying to load FactInternetSales.

For each of my section containers, I do another loop through the tables variable. If the table belongs in this section, I create a data flow named “Migrate [TableName].” Of course, if you have two tables with the same name (but in different schemas), you would need to change this to “Migrate [SchemaName].[TableName],” but I didn’t have to worry about that with my table set.

Inside our data flow, we have an OLE DB Source named Old Node and an OLE DB Destination named New Node. The source connects back to our AdventureWorksDW2012 database and ties in closely with MigrationStep: we use that table to supply our SELECT column list (trow[4]), schema name (trow[0]), table name (trow[1]), join criteria (trow[3]), and WHERE clause (trow[5]). Looking back at the MigrationStep table, you’ll notice that all of our WHERE clauses look something like c.CustomerKey = ?;. We need to fill in that parameter, so that’s what the Parameters element does, tying it back to the CustomerID project parameter.

Our OLE DB destination is pretty simple. BIML knows how to hook up my source and destination components, and as soon as I tell it that I need to connect to [SchemaName].[TableName] on AdventureWorksDW2012Node2, it can also connect the metadata together and load my columns list automatically.

Once I’m done, all I have to do is right-click the BIML file and select Generate SSIS packages. That creates a package which looks a bit like:

[Screenshot: the generated MigrateCustomer package’s control flow]

Inside each of those control flows, I see the following data flow:

[Screenshot: the generated MigrateCustomer data flow]

This is the power of BIML in action.

BIML, Part 4—Implementing The Ideal Package

My last post in the BIML series introduced an SSIS package which would form the basis of our correct solution. Given this baseline, I want to write some BIML to generate that single package; that way, we can eventually expand it to include more complex logic and solve our problem.

For the rest of this series, I’m going to assume you have BIDS Helper installed. In Visual Studio, add a new BIML file by right-clicking your project and selecting the “Add New BIML File” option. This creates a nearly-empty BIML file named BimlScript.biml. Open that file and you’ll see the following code:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

To generate our SSIS package, we know that we need the following things:

  1. Seven project parameters: CustomerID, server and database names for the old and new warehouses, and two boolean flags (PerformMigration and PerformDeletion).
  2. Two database connections, one called NewWarehouseNode and one called OldWarehouseNode.
  3. A sequence named “Delete From New Tables,” inside of which is an Execute SQL task named “Prepare Table” which truncates our table.
  4. A sequence named “Migrate To New Table,” which has three sub-sequences: Migrate to New Table 1, 2, and 3. Inside 3, we have a data flow named “Migrate Table.” Inside this data flow is a SQL Server source and a SQL Server destination, and we need to configure each of these components as well. On the data flow level, the maximum rows per buffer also needs to be set to 10,000 and the default buffer size to 100MB.

Armed with these requirements, we can get to work writing BIML. Before I go any further writing the BIML, here’s a script to generate two databases with a table named SomeTable, so that our BIML doesn’t fail and we can see our work in action:

IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'ReportingOld')
BEGIN
	CREATE DATABASE [ReportingOld]
END
GO
IF NOT EXISTS (SELECT 1 FROM sys.databases WHERE name = 'ReportingNew')
BEGIN
	CREATE DATABASE [ReportingNew]
END
GO
USE [ReportingOld]
GO
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'SomeTable')
BEGIN
	CREATE TABLE SomeTable
	(
		Id BIGINT IDENTITY(1,1) NOT NULL,
		CustomerID INT NOT NULL,
		Details VARCHAR(30) NOT NULL
	);

	INSERT INTO dbo.SomeTable(CustomerID, Details)
	VALUES
	(1234, 'ABC'),
	(1234, 'ABCD'),
	(1234, 'ABCDE'),
	(1234, 'ABCDEF'),
	(2345, 'ABC'),
	(2345, 'ABC'),
	(3456, 'ABC'),
	(4567, 'ABC');
END
GO
USE [ReportingNew]
GO
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = 'SomeTable')
BEGIN
	CREATE TABLE SomeTable
	(
		Id BIGINT IDENTITY(1,1) NOT NULL,
		CustomerID INT NOT NULL,
		Details VARCHAR(30) NOT NULL
	);
END
GO

Here’s a full version of the BIML, followed by a breakdown in parts:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
	<PackageProjects>
		<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
			<Parameters>
				<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
				<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="OldWarehouseDatabase" DataType="String">ReportingOld</Parameter>
				<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
				<Parameter Name="NewWarehouseDatabase" DataType="String">ReportingNew</Parameter>
				<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
				<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
			</Parameters>
			<Packages>
				<Package PackageName="MigrateCustomer" />
			</Packages>
		</PackageProject>
	</PackageProjects>
	<Connections>
		<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingOld;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
		<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingNew;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	</Connections>
	<Packages>
		<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">
			<Connections>
				<Connection ConnectionName="OldWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::OldWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
					</Expressions>
				</Connection>
				<Connection ConnectionName="NewWarehouseNode">
					<Expressions>
						<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
						<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
					</Expressions>
				</Connection>
			</Connections>
			<Tasks>
				<Container Name="Delete From New Tables" ConstraintMode="Parallel">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
					</Expressions>
					<Tasks>
						<ExecuteSQL Name="Prepare Table" ConnectionName="NewWarehouseNode">
							<DirectInput>
								TRUNCATE TABLE dbo.SomeTable;
							</DirectInput>
						</ExecuteSQL>
					</Tasks>
				</Container>
				<Container Name="Migrate To New Table" ConstraintMode="Linear">
					<Expressions>
						<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
					</Expressions>
					<Tasks>
						<Container Name="Migrate To New Table 1" ConstraintMode="Linear">
						</Container>
						<Container Name="Migrate To New Table 2" ConstraintMode="Linear">
						</Container>
						<Container Name="Migrate To New Table 3" ConstraintMode="Linear">
							<Tasks>
								<Dataflow Name="Migrate Table" DefaultBufferSize="104857600">
									<Transformations>
										<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
											<DirectInput>
												SELECT * FROM dbo.SomeTable WHERE CustomerID = ?;
											</DirectInput>
											<Parameters>
												<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
											</Parameters>
										</OleDbSource>
										<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
											<ExternalTableOutput Table="[dbo].[SomeTable]">
											</ExternalTableOutput>
										</OleDbDestination>
									</Transformations>
								</Dataflow>
							</Tasks>
						</Container>
					</Tasks>
				</Container>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Once you save this BIML file, right-click on BimlScript.biml and select “Generate SSIS Packages.” It will create MigrateCustomer.dtsx, which is our BIML-generated package. This package does what our original package did, and it takes us 92 lines of XML. That may not seem like a good trade-off right now, but the power of BIML is that with just a few lines of C# added in, we can create dozens or hundreds of data flows and packages without wearing out your button-clicking finger.

So now let’s break this BIML file down into components and explain them more thoroughly, starting with package projects.

<PackageProjects>
	<PackageProject Name="MigrateWarehouseData" ProtectionLevel="DontSaveSensitive">
		<Parameters>
			<Parameter Name="CustomerID" DataType="Int32">1234</Parameter>
			<Parameter Name="OldWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="OldWarehouseDatabase" DataType="String">ReportingOld</Parameter>
			<Parameter Name="NewWarehouseServer" DataType="String">LOCALHOST</Parameter>
			<Parameter Name="NewWarehouseDatabase" DataType="String">ReportingNew</Parameter>
			<Parameter Name="PerformMigration" DataType="Boolean">True</Parameter>
			<Parameter Name="PerformDeletion" DataType="Boolean">True</Parameter>
		</Parameters>
		<Packages>
			<Package PackageName="MigrateCustomer" />
		</Packages>
	</PackageProject>
</PackageProjects>

We have one package project, named “MigrateWarehouseData,” which stores no sensitive data like usernames or passwords. In that project, we have a set of parameters, one for each project parameter we created in the previous blog post. We need to define data types, and we can provide default values as well. After that, we have a Packages section which lists each package in our project. In this case, I only have one package: MigrateCustomer, which will create MigrateCustomer.dtsx. This way, my package can read project parameters.

Next up on the hit list is connections:

<Connections>
	<Connection Name="OldWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingOld;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
	<Connection Name="NewWarehouseNode" ConnectionString="Data Source=LOCALHOST;Initial Catalog=ReportingNew;Provider=SQLNCLI11.1;Integrated Security=SSPI;Auto Translate=False;" />
</Connections>

These are my common connections, which I will use across all packages. There’s nothing special here: it’s just a name and a connection string pointing to my local SQL Server instance.

The next step is the Packages section, in which I define one package: MigrateCustomer.

<Packages>
	<Package Name="MigrateCustomer" ConstraintMode="Linear" ProtectionLevel="DontSaveSensitive">

There are two properties I have set: ConstraintMode and ProtectionLevel. Just like my project parameters, I have no sensitive data stored, so I don’t need to protect the package with a password or user key. As for ConstraintMode, I have two options: Linear or Parallel. For this sample, I’m typically going to stick with Linear constraints, but there is one spot in which I choose to go Parallel. The difference is pretty simple: in Linear mode, SSIS draws precedence constraints between components in the order that we include them in the BIML file, whereas in Parallel mode, all tasks within a container are allowed to run concurrently (whether they actually do or not is a totally different story!).
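As a quick sketch of the difference (the task names and queries here are made up for illustration), these two containers generate different control flows: the Linear container chains First into Second with a precedence constraint, while the Parallel container leaves both tasks unconstrained:

```xml
<Container Name="Linear Example" ConstraintMode="Linear">
	<Tasks>
		<!-- Second gets a precedence constraint from First because it appears second -->
		<ExecuteSQL Name="First" ConnectionName="NewWarehouseNode">
			<DirectInput>SELECT 1;</DirectInput>
		</ExecuteSQL>
		<ExecuteSQL Name="Second" ConnectionName="NewWarehouseNode">
			<DirectInput>SELECT 2;</DirectInput>
		</ExecuteSQL>
	</Tasks>
</Container>
<Container Name="Parallel Example" ConstraintMode="Parallel">
	<Tasks>
		<!-- No constraints drawn; both tasks are eligible to run concurrently -->
		<ExecuteSQL Name="First" ConnectionName="NewWarehouseNode">
			<DirectInput>SELECT 1;</DirectInput>
		</ExecuteSQL>
		<ExecuteSQL Name="Second" ConnectionName="NewWarehouseNode">
			<DirectInput>SELECT 2;</DirectInput>
		</ExecuteSQL>
	</Tasks>
</Container>
```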

Inside each package, we have a set of connections. In this case, I want to take my connection strings as given in the Connections section, but I want to do something a little special. We’re collecting the real server and database names for our old and new warehouses, so let’s put that information to good use:

<Connections>
	<Connection ConnectionName="OldWarehouseNode">
		<Expressions>
			<Expression ExternalProperty="InitialCatalog">@[$Project::OldWarehouseDatabase]</Expression>
			<Expression ExternalProperty="ServerName">@[$Project::OldWarehouseServer]</Expression>
		</Expressions>
	</Connection>
	<Connection ConnectionName="NewWarehouseNode">
		<Expressions>
			<Expression ExternalProperty="InitialCatalog">@[$Project::NewWarehouseDatabase]</Expression>
			<Expression ExternalProperty="ServerName">@[$Project::NewWarehouseServer]</Expression>
		</Expressions>
	</Connection>
</Connections>

This snippet tells us that BIML can tie in with SSIS expressions. In this case, we are setting the InitialCatalog and ServerName properties on our connection. We can access project parameters the same way you would in the SSIS GUI: @[$Project::MyParameterName].

Now that we have the connections on our package, we can start laying out the control flow. All control flow tasks go in the Tasks section, and because our package has its constraint mode set to Linear, we will execute each top-level container one at a time. Let’s start by looking at the first container:

<Tasks>
	<Container Name="Delete From New Tables" ConstraintMode="Parallel">
		<Expressions>
			<Expression ExternalProperty="Disable">!(@[$Project::PerformDeletion])</Expression>
		</Expressions>
		<Tasks>
			<ExecuteSQL Name="Prepare Table" ConnectionName="NewWarehouseNode">
				<DirectInput>
					TRUNCATE TABLE dbo.SomeTable;
				</DirectInput>
			</ExecuteSQL>
		</Tasks>
	</Container>

This container will delete from new tables to avoid any primary key violations. Because it’s a toy example, I’m just going to truncate the table; a more realistic example would delete rows in batches, but I’m leaving that as an exercise for the reader.
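If you do want batched deletes, a common pattern looks something like the following sketch. The 10,000-row batch size is an arbitrary starting point, and you would replace the bare DELETE with whatever predicate identifies the rows you actually want gone:

```sql
-- Delete in chunks to limit transaction log growth and lock escalation.
DECLARE @rc INT = 1;
WHILE (@rc > 0)
BEGIN
	DELETE TOP (10000)
	FROM dbo.SomeTable;

	SET @rc = @@ROWCOUNT;
END
```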

One more thing to note before I move on is that we can have container-level expressions. In this case, I want to set the “Disable” flag to be equivalent to the opposite of my PerformDeletion project parameter. I personally don’t like how SSIS has properties named “Disable” rather than “Enable;” I think it confuses the matter a little bit, as you either need to have parameters with negative phrasing—such as “DoNotPerformDeletion”—or flip the boolean like I do here. Because “Enabled” makes a lot more sense to me intuitively, I’d rather let my end user have the equivalent of “Enabled” and handle the flip myself.

With that aside handled, we’ll move on to the next container. This one is a container of containers, and has a constraint mode of Linear, meaning that we run each sub-container one at a time.

<Container Name="Migrate To New Table" ConstraintMode="Linear">
	<Expressions>
		<Expression ExternalProperty="Disable">!(@[$Project::PerformMigration])</Expression>
	</Expressions>
	<Tasks>
		<Container Name="Migrate To New Table 1" ConstraintMode="Linear">
		</Container>
		<Container Name="Migrate To New Table 2" ConstraintMode="Linear">
		</Container>
		<Container Name="Migrate To New Table 3" ConstraintMode="Linear">
			<Tasks>
				<Dataflow Name="Migrate Table" DefaultBufferSize="104857600">

Just like before, I have my expression to check whether I want to perform this part of the process. Once we get past that, I have a Tasks element for this container. Inside that, I have three sub-containers. Two of these containers are empty, but like I mentioned last time, it helps remind us that we need to think about multiple layers of operation: there are some tables we will need to migrate first and some we can hold off on until later.

Looking inside the third container, we have a data flow. This data flow is where we set our default buffer size to 100 MB (DefaultBufferSize is measured in bytes, so 104857600 = 100 MB). Our default row count per buffer is already 10,000, so I don’t have to specify that property. Once I handle the properties, I move on to the Transformations element, which is the data flow equivalent of the control flow’s Tasks. This is a relatively easy data flow, so as you’d expect, our Transformations section isn’t very complex.

					<Transformations>
						<OleDbSource Name="Old Node" ConnectionName="OldWarehouseNode">
							<DirectInput>
								SELECT * FROM dbo.SomeTable WHERE CustomerID = ?;
							</DirectInput>
							<Parameters>
								<Parameter Name="0" VariableName="MigrateWarehouseData.CustomerID" />
							</Parameters>
						</OleDbSource>
						<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode" BatchSize="10000" KeepIdentity="true" UseFastLoadIfAvailable="true" MaximumInsertCommitSize="2000000">
							<ExternalTableOutput Table="[dbo].[SomeTable]">
							</ExternalTableOutput>
						</OleDbDestination>
					</Transformations>
				</Dataflow>
			</Tasks>
		</Container>
	</Tasks>
</Container>

Inside this section, we start off with an OLE DB source which connects to OldWarehouseNode and gets all of the data from dbo.SomeTable where the Customer ID matches the value in the CustomerID project parameter. Note that the way we access project parameters from an OLE DB source is different from the way we specify them in an expression. We don’t need the @[$Project::] cruft, but we do need to prefix the parameter with the project name, MigrateWarehouseData.

Once data gets through the source, we push it through to the destination, which is connected to NewWarehouseNode and connects to that database’s SomeTable table. We set a few bulk insert settings, such as a 10K batch size, keeping identity (which is different from our SSIS example last time around), turning on fast load if it’s available, and setting a max commit size of 2 million rows.

One important thing to note here is that if the input columns for our OLE DB destination component have the same names as the columns on our table, we don’t need to specify any column mappings. If we do need them, there’s a Columns collection we can use in the OleDbDestination component, with Column tags specifying source and target column names, as well as sort order if the data is sorted.
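For example, if the source column were named CustomerID but the destination column were CustomerKey (hypothetical names for illustration), the mapping would look something like:

```xml
<OleDbDestination Name="New Node" ConnectionName="NewWarehouseNode">
	<ExternalTableOutput Table="[dbo].[SomeTable]"></ExternalTableOutput>
	<Columns>
		<!-- Map a source column to a differently-named destination column -->
		<Column SourceColumn="CustomerID" TargetColumn="CustomerKey" />
	</Columns>
</OleDbDestination>
```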

Put all of this together and you have a way to describe an SSIS package. Even if you don’t have to do much more with this, I’d argue that you should still write the BIML because you can check this BIML file into source control and see changes. If I need to flip the KeepIdentity bit, I can make that change and check it in, letting code reviewers easily see what changed. Otherwise, SSIS packages themselves are little better than binaries: so many of the changes are difficult to read, even if you do literally nothing other than re-sizing boxes.

This is a pretty big step: right now, we have a BIML file that we can use to generate an SSIS package. From here, I want to take the next step and show how we can use the power of C# to generate packages dynamically. Stay tuned!