36 Chambers – The Legendary Journeys: Execution to the max!

April 15, 2014

Gaps And Islands

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

Not too long ago, I helped a co-worker solve a problem.  The quick version of the problem is that we want to group together consecutive occurrences of an event.  In this case, it’s log records by IP address.  If the same IP address shows up 3 times in a row, we want to know that there were 3 events, followed by the next IP, and so on.  If that first IP shows up again later, it’s a new chain.

This is another way of formulating the island problem.  Itzik Ben-Gan covers it wonderfully in his chapter on gaps and islands in SQL Server MVP Deep Dives, a chapter Manning has made available for free.

In my case, here’s a sample and the solution:

CREATE TABLE #Temp
(
	LogID INT,
	ErrorDate DATETIME,
	IPAddress VARCHAR(50)
);

INSERT INTO #Temp
(
	LogID,
	ErrorDate,
	IPAddress
)
VALUES
	(173371, '2014-04-07 09:47:35', 'IP 1'),
	(173372, '2014-04-07 09:47:36', 'IP 1'),
	(173373, '2014-04-07 09:47:37', 'IP 1'),
	(173374, '2014-04-07 10:01:35', 'IP 2'),
	(173375, '2014-04-07 10:02:35', 'IP 3'),
	(173376, '2014-04-07 10:03:04', 'IP 3'),
	(173377, '2014-04-07 10:03:05', 'IP 3'),
	(173378, '2014-04-07 10:03:18', 'IP 3'),
	(173379, '2014-04-07 10:18:26', 'IP 1'),
	(173380, '2014-04-07 10:18:26', 'IP 1'),
	(173381, '2014-04-07 10:18:26', 'IP 1'),
	(173382, '2014-04-07 10:18:27', 'IP 3');

WITH records AS
(
	SELECT
		LogID,
		ErrorDate,
		IPAddress,
		ROW_NUMBER() OVER (ORDER BY LogID) - ROW_NUMBER() OVER (ORDER BY IPAddress, LogID) AS grp
	FROM
		#Temp
)
SELECT
	MIN(LogID) AS MinimumID,
	MAX(LogID) AS MaximumID,
	COUNT(*) AS NumberOfRows,
	IPAddress
FROM
	records
GROUP BY
	IPAddress,
	grp
ORDER BY
	MinimumID;

The trick here is that we use two separate ROW_NUMBER calls to generate a difference in values.  The first call (ROW_NUMBER() OVER (ORDER BY LogID)) generates a unique row number for each record, ordered by LogID.  The second call also generates a unique row number for each record, but ordered by IPAddress first and LogID second, which means the six records with an IP address of ‘IP 1’ cluster together and get numbered 1 through 6.  The difference between these two numbers is constant within each island and, combined with the IP address, identifies that island uniquely.  That difference (which we call grp) lets us separate each IP address cluster out, solving the island problem with a single read of the data, far more efficiently than any row-by-row or self-joining solution could manage.
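
To see why the difference stays constant within an island, you can expose both row numbers side by side.  Here’s a diagnostic version of the query, run against the same sample data:

SELECT
	LogID,
	IPAddress,
	ROW_NUMBER() OVER (ORDER BY LogID) AS OverallRowNumber,
	ROW_NUMBER() OVER (ORDER BY IPAddress, LogID) AS PerIPRowNumber,
	ROW_NUMBER() OVER (ORDER BY LogID) - ROW_NUMBER() OVER (ORDER BY IPAddress, LogID) AS grp
FROM
	#Temp
ORDER BY
	LogID;

For the sample data, the first ‘IP 1’ island gets row number pairs (1, 1), (2, 2), and (3, 3), so grp = 0, while the second ‘IP 1’ island gets (9, 4), (10, 5), and (11, 6), so grp = 5.  Note that grp values can repeat across different IP addresses (the last ‘IP 3’ record also lands on grp = 0), which is why the main query groups by IPAddress as well as grp.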

April 13, 2014

New Presentation: SQL Injection

Filed under: (In)Security, Database Administration — Kevin Feasel @ 6:00 pm

I will be presenting for the PASS Security virtual chapter on Thursday, April 17, 2014 at 1 PM Eastern.  Attendees can go to the link to register for the webinar.  The talk is entitled “Understanding and Eliminating SQL Injection” and here is the abstract:

Over the past several years, hacktivists, criminals, and people just “out for lulz” have managed to find sensitive data owned by organizations like Sony, Yahoo, NASA, and the U.S. Army, among many others. In all of these cases, the attackers exploited websites using SQL injection attacks.

SQL injection is at the top of the Open Web Application Security Project (OWASP) top 10 list and is an important part of one of the SANS 20 critical security controls. This talk will go into what SQL injection is, how attackers can use it, and how to secure your sites so that your CIO and CISO never show up on the evening news.

Although the talk will focus on using the Microsoft stack (IIS, ASP.Net, and SQL Server), the lessons will apply to all web systems everywhere.

April 10, 2014

SQL Saturday Update

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

SQL Saturday #320 will officially be at Wake Tech North.  We’ve signed the paperwork and will have room for 10 simultaneous sessions.

I’m quite excited about the event and will probably have a couple more updates as we get closer to the go-live date.

April 1, 2014

Happy Release Day

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

SQL Server 2014 is out.  See what’s new and start preparing for upgrades.

March 23, 2014

SQL Saturday Richmond Roundup

Filed under: Database Administration, Where's Poochy? — Kevin Feasel @ 7:27 pm

SQL Saturday #277 in Richmond, Virginia was yesterday.  The RVA PASS chapter put on a great event and I can only hope we do as good a job in Raleigh come September.

My session seemed to go pretty well.  I had a (small) room filled with people and I definitely brought the frenetic energy that I wanted to.  The next presentation will be at the TriNUG Data SIG in April.

March 4, 2014

T-SQL Anti-Patterns: Wrapup

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

Over on the right-hand bar, I have a page listing all of the T-SQL anti-patterns in this series.  There are additional anti-patterns, code smells, and things to avoid, but I think 14 is enough for one presentation.  Speaking of presentations, check it out at SQL Saturday #277 in Richmond, Virginia on Saturday, March 22nd, 2014.

March 3, 2014

T-SQL Anti-Patterns: Clustered GUIDs

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

The final T-SQL anti-pattern I’m going to discuss is using a GUID as a clustered index key.  This is a particularly pernicious problem because it’s so easy for developers to fall into it.

A GUID, or globally unique identifier, is a 16-byte field which is (supposed to be) truly unique.  You can run into duplicates when dealing with large enough data sets on enough servers, but let’s not worry about that here.  Instead, I’m going to focus on the performance implications of this.  Let’s say that we have a web application which hits three tables:  Contract, Contact, and ContractContact.  We have people (Contacts), we have agreements (Contracts), and we have a listing of which people are associated with which agreements (ContractContacts).  Our web developer came up with a great page in which you can enter the contract details and tie a person to a contract on the same page at the same time.  That web developer knows about identity integer columns, but because he wants all of the logic to happen in a C# business layer while still executing in a single T-SQL transaction, he can’t get the identity integer value ahead of time.  As such, he calls Guid.NewGuid() in .NET and passes the GUID in as the unique identifier for Contract.  Because this GUID is unique, our developer creates a primary key on the GUID.  He tests the logic out and it works fine in development.

Then he moves it out to prod and failure occurs.  Paul Randal and Kim Tripp pointed out that this could have been a plausible explanation for why the Windows 7 RC download process was so problematic.  The reason is that, if you use a GUID as a primary key, by default, SQL Server also makes that the clustered key.  And having a GUID as a clustered key is a terrible idea.  Given how unique and random GUIDs are, they’re a perfect way of maximizing page splits and I/O operations whenever you insert new rows.  This introduces additional, unnecessary table fragmentation and hurts performance of SELECT queries.
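
If you want to see the fragmentation for yourself, sys.dm_db_index_physical_stats will show it.  Here’s a quick check against a hypothetical dbo.Contract table clustered on a GUID (swap in your own table name):

SELECT
	index_type_desc,
	avg_fragmentation_in_percent,
	page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Contract'), NULL, NULL, 'LIMITED');

Under a steady insert load, avg_fragmentation_in_percent climbs quickly on a clustered GUID table, whereas a comparable identity-keyed table stays near zero because new rows always land at the end of the index.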

So what alternatives do we have here?  Let’s list a few:

  1. Use identity integer columns.  In our Contract example, the web developer didn’t want to do that, mostly because he didn’t want to create a stored procedure with Contract inputs followed by ContractContact_ContactID1, ContractContact_ContactID2, etc.  Avoiding this is totally reasonable, and if you’re using SQL Server 2008 or later, writing a stored procedure like this is totally unnecessary.  Instead, use a table-valued parameter to store your list of contact IDs.  The great news is that you can build them in .NET quite easily.  Then, in your creation stored procedure, you can get the contract ID’s identity value and insert ContractContact records within the same transaction, so if there’s a problem during insertion, you don’t have to worry about backing part of the data out.  (See the sketch after this list.)
  2. Use a sequence instead of an identity integer value.  Starting in SQL Server 2012, we can use sequences rather than identity integers.  The main advantage here is that, by following a standard method, we can get guaranteed unique values before inserting rows into the Contract table.  That way, we can assign ContractID on each ContractContact object and still perform all of our .NET operations in a single T-SQL transaction.
  3. Use a non-clustered GUID as your primary key.  This is a last resort, for when the other two just won’t work.  In this case, you can create a GUID in the business layer and use it for web operations, but have an identity column as your clustered key.  There are a couple of problems with this solution as opposed to #1 and #2.  First of all, you have an additional 16 bytes per row, so with a large enough table, that adds up:  1 billion rows means 16 GB of additional storage.  On top of that base storage, you’re also responsible for a non-clustered index on the GUID, so there’s at least another 16 GB.  Finally, whenever you look up a row by its GUID value, you’ll perform a bookmark lookup, meaning that the number of reads to grab a row will double.  This will make SELECT operations slower.
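
Here is a minimal sketch of option #1, assuming simplified Contract and ContractContact tables with an identity ContractID; the type and procedure names are mine, not from any real codebase:

CREATE TYPE dbo.ContactIDList AS TABLE
(
	ContactID INT NOT NULL PRIMARY KEY
);
GO

CREATE PROCEDURE dbo.Contract_Create
	@ContractName VARCHAR(100),
	@ContactIDs dbo.ContactIDList READONLY
AS
BEGIN
	SET NOCOUNT ON;

	BEGIN TRANSACTION;

	INSERT INTO dbo.Contract
	(
		ContractName
	)
	VALUES
	(
		@ContractName
	);

	-- Grab the new identity value and stamp it on every contact link
	-- in one set-based statement, all within the same transaction.
	DECLARE @ContractID INT = SCOPE_IDENTITY();

	INSERT INTO dbo.ContractContact
	(
		ContractID,
		ContactID
	)
	SELECT
		@ContractID,
		ContactID
	FROM @ContactIDs;

	COMMIT TRANSACTION;
END

On the .NET side, @ContactIDs maps to a SqlDbType.Structured parameter, so the business layer can pass the entire list of contact IDs in a single call and the whole operation still happens in one T-SQL transaction.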

Getting off of GUIDs can be an arduous process, but the advantage is that your database will now scale a lot better.

March 2, 2014

T-SQL Anti-Patterns: Linked Server Joins

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

Yesterday’s anti-pattern, non-SARGable ORs in joins, was all about writing code which the SQL Server database engine can optimize.  Today’s anti-pattern is in that vein, but this time we’re looking at cross-server queries.

Linked servers are a powerful feature within SQL Server, and I don’t want to throw them out entirely.  What I do want to do, however, is limit how often they are used in regular, business-critical code.  The reason is that, prior to SQL Server 2012 SP1, table statistics were not available to linked server logins without sysadmin, db_owner, or db_ddladmin rights.  The number of cases in which I would want a linked server login to have sysadmin, db_owner, or db_ddladmin rights is very small, and I would definitely not want to give those rights to a web application or anything else user-facing.  Without those table statistics, SQL Server would make poor choices with linked server queries, and so joins against remote tables would get very expensive.

Even with those statistics, cross-server joins are going to be expensive.  The reason is that in order to join row sets from two servers, you need to pass all of one server’s row set over to the other server, perform the join operation, and then (potentially) pass the full result set back to the first server.  If you’re joining two tables with a billion rows, that could be extremely expensive, even if you have great indexes on both sides.

So how do we get around this?  Here are a few potential methods:

  1. Get rid of the linked servers altogether and use another method, such as replication or AlwaysOn Availability Groups, to propagate the data you need from your remote server.  This makes a lot of sense if you’re looking at relatively static lookup data.
  2. For truly static data, you might keep non-replicated copies on both servers.  Replication can be a pain to troubleshoot and manage, and availability groups require SQL Server 2012 Enterprise Edition, so they can be very expensive.  Thus, if the data never changes or changes only once a year, it might make more sense to create the yearly change script and execute it on each server.
  3. If you absolutely need the remote data, try to re-write queries to avoid joining local tables to remote tables.  Create a temporary table on your local server, pull in as few rows from the remote server as possible, and then join your local tables to the temp table you created (see the sketch after this list).  You still take the linked server hit, but there’s a much smaller chance of that query blowing up.
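
To make option #3 concrete, here is a sketch; the linked server, database, and table names are all placeholders:

CREATE TABLE #RemoteCustomer
(
	CustomerID INT NOT NULL PRIMARY KEY,
	CustomerName VARCHAR(100) NOT NULL
);

-- Cross the wire exactly once, pulling only the columns and rows we need.
INSERT INTO #RemoteCustomer
(
	CustomerID,
	CustomerName
)
SELECT
	CustomerID,
	CustomerName
FROM [REMOTESERVER].RemoteDB.dbo.Customer
WHERE IsActive = 1;

-- The join is now local, so the optimizer has real statistics to work with.
SELECT
	o.OrderID,
	rc.CustomerName
FROM dbo.Orders o
	INNER JOIN #RemoteCustomer rc
		ON o.CustomerID = rc.CustomerID;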

March 1, 2014

T-SQL Anti-Patterns: Non-SARGable ORs In Joins

Filed under: Database Administration — Kevin Feasel @ 6:00 pm

Our last anti-pattern had us discussing unnecessary repetition of queries, but this time, I’m going to hit the opposite problem:  trying to merge two queries which should remain separate.

Let’s start with an example of a query.  This is a simplified version of a query which I ran into at a prior job:

SELECT
	ACCT.[Id],
	ACCT.[TypeId],
	ACCT.[EntityId],
	ACCT.[Name],
	ACCT.[Purpose],
	TF.Amount AS TransferAmount
FROM
	dbo.Account ACCT
	INNER JOIN dbo.Transfer TF
		ON ACCT.Id = TF.AccountToId
		OR ACCT.Id = TF.AccountFromId;

There’s nothing objectionable in this query, per se…at least until you try to run it. The problem here is that the join from Account to Transfer happens on one of two keys: ID to AccountToID, or ID to AccountFromID. If we have indexes on neither AccountToID nor AccountFromID, we’re going to scan the Transfer table once for each account. So, our DBA looks at the scan and says, “Hey, let’s put an index on this.” Unfortunately, the indexes do nothing:

  • If you put an index on AccountToID, we still need to scan to get the rows with a matching AccountFromID.
  • If you put an index on AccountFromID, we still need to scan to get rows with a matching AccountToID.
  • If you put a single composite index on both AccountToID and AccountFromID, that still doesn’t help:  rows matching on AccountToID typically won’t be the same as those matching on AccountFromID (you don’t transfer money from one account to the same account, after all).

But let’s say you put separate indexes on AccountToID and AccountFromID.  In that case, one of the indexes might be useful, but SQL Server won’t be able to use both of them effectively in a single query.
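
For reference, those two separate indexes might look like the following; the index names and the Amount INCLUDE column are my assumptions based on the query above:

CREATE NONCLUSTERED INDEX IX_Transfer_AccountFromId
	ON dbo.Transfer (AccountFromId)
	INCLUDE (Amount);

CREATE NONCLUSTERED INDEX IX_Transfer_AccountToId
	ON dbo.Transfer (AccountToId)
	INCLUDE (Amount);

With the OR join, at most one of these gets used effectively.  After the rewrite below, each branch of the UNION ALL can seek its own index, and the INCLUDE column means neither branch needs a lookup to retrieve Amount.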

The performance-improving correction I made was as follows:

SELECT
	ACCT.[Id],
	ACCT.[TypeId],
	ACCT.[EntityId],
	ACCT.[Name],
	ACCT.[Purpose],
	TF.Amount AS TransferAmount
FROM
	dbo.Account ACCT
	INNER JOIN dbo.Transfer TF
		ON ACCT.Id = TF.AccountFromId

UNION ALL

SELECT
	ACCT.[Id],
	ACCT.[TypeId],
	ACCT.[EntityId],
	ACCT.[Name],
	ACCT.[Purpose],
	TF.Amount AS TransferAmount
FROM
	dbo.Account ACCT
	INNER JOIN dbo.Transfer TF
		ON ACCT.Id = TF.AccountToId;

In our case, we had a check constraint making sure that AccountToID and AccountFromID were not the same on a single transfer, so I could use a UNION ALL. By separating this out into two queries, we can now take full advantage of the two separate indexes. This leads to two separate accesses of the Account table rather than one, but cuts down on I/O tremendously, especially as Account and Transfer get large.
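
For what it’s worth, that check constraint is a one-liner (the constraint name here is made up):

ALTER TABLE dbo.Transfer
ADD CONSTRAINT CK_Transfer_AccountsDiffer
	CHECK (AccountFromId <> AccountToId);

This constraint is exactly what makes UNION ALL safe:  because no transfer can name the same account in both columns, no account-transfer pair can appear in both branches, so there are no duplicates to remove and we avoid the sort that a plain UNION would force.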

So when is it OK to use OR in a join? In the work that I’ve done, I’ve seen three major cases:

  1. TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnB = y.ColumnB) — Two separate columns with two separate joins are OK.
  2. TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnA = 6) — Join on a match or when one table matches a constant.  That second part could be ColumnB as well.
  3. TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR z.ColumnA = y.ColumnA) — Join on another table match, as long as z was brought in on a join earlier in the query.  Watch out for accidental Cartesian products here.

In contrast, I would simplify our scenario as:  TableX x INNER JOIN TableY y ON (x.ColumnA = y.ColumnA OR x.ColumnB = y.ColumnA).  Whenever we have two separate columns matching to one column in the other table, we will want to look at an alternative form of joining these tables together.  Otherwise, we run the risk of terrible performance.

February 28, 2014

T-SQL Anti-Patterns: Query By Agonizing Query (QBAQ)

Filed under: Database Administration — Kevin Feasel @ 10:05 pm

Row By Agonizing Row (RBAR, pronounced “re-bar”) was our last anti-pattern.  Today, I’m going to extend it to a term I’m coining, Query By Agonizing Query (QBAQ, pronounced “cue-back”).  According to a quick Google search, nobody has used that term before, so that’s my claim to fame.

What I mean by QBAQ is the following anti-pattern:

UPDATE dbo.SomeTable
SET ColumnA = @Value
WHERE KeyColumn = @ID;

UPDATE dbo.SomeTable
SET ColumnB = @SomeOtherValue
WHERE KeyColumn = @ID;

UPDATE dbo.SomeTable
SET ColumnC = ColumnA + ColumnB
WHERE KeyColumn = @ID;

In this case, we see multiple queries where just one would do nicely:

UPDATE dbo.SomeTable
SET
	ColumnA = @Value,
	ColumnB = @SomeOtherValue,
	ColumnC = @Value + @SomeOtherValue -- not ColumnA + ColumnB, which would read the pre-update values
WHERE KeyColumn = @ID;

The top query hits SomeTable three times. Even though each of these is a table seek (because we’re filtering on the primary key column), that’s still two more seeks than necessary. If you run this code on a regular basis, performance will be worse than it needs to be. Also, if these statements run in a single transaction, you’re going to hold locks on pages in SomeTable for longer than necessary. One subtlety in the combined version: referencing ColumnA and ColumnB in the ColumnC assignment would give you their pre-update values, which is why the single statement uses the variables directly.

Often, QBAQ gets combined with RBAR, so that the full query might be something like:

WHILE EXISTS(SELECT * FROM #RelevantValues)
BEGIN
	SELECT TOP 1 @ID = ID
	FROM #RelevantValues;

	UPDATE dbo.SomeTable
	SET ColumnA = @Value
	WHERE KeyColumn = @ID;

	UPDATE dbo.SomeTable
	SET ColumnB = @SomeOtherValue
	WHERE KeyColumn = @ID;

	UPDATE dbo.SomeTable
	SET ColumnC = ColumnA + ColumnB
	WHERE KeyColumn = @ID;

	DELETE FROM #RelevantValues
	WHERE ID = @ID;
END

Here is a piece of code which you could optimize quite nicely, removing all kinds of table hits:

UPDATE st
SET
	ColumnA = @Value,
	ColumnB = @SomeOtherValue,
	ColumnC = @Value + @SomeOtherValue
FROM dbo.SomeTable st
	INNER JOIN #RelevantValues rv ON st.KeyColumn = rv.ID;

This T-SQL-specific form of the UPDATE statement eliminates the loop, the N scans (or seeks) of #RelevantValues, the N DELETE statements against #RelevantValues, and two-thirds of our update hits against SomeTable. It’s also shorter and easier to understand.

UPDATE statements aren’t the only place I see QBAQ, but they’re typically a good place to start looking. I also sometimes see it with UNION abuse:

SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 1
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 2
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeValue = 3
UNION ALL
SELECT A, B, C FROM dbo.SomeTable WHERE SomeOtherValue = 'A'

In this case, we have four seeks or scans against SomeTable, and we can get that down to one:

SELECT 
	A, 
	B, 
	C 
FROM dbo.SomeTable 
WHERE
	SomeValue IN (1, 2, 3) 
	OR SomeOtherValue = 'A';

If you’re concerned about SARGability, you can also try that as two separate queries (sketched below); in any event, we cut the number of table seeks/scans at least in half, and that adds up.
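
That two-query variant might look like the following; the NOT IN guard is my addition, there to keep rows satisfying both predicates from appearing twice, and it assumes SomeValue is non-nullable:

SELECT A, B, C
FROM dbo.SomeTable
WHERE SomeValue IN (1, 2, 3)

UNION ALL

SELECT A, B, C
FROM dbo.SomeTable
WHERE
	SomeOtherValue = 'A'
	AND SomeValue NOT IN (1, 2, 3);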

Most of the time, QBAQ comes from being stuck in a procedural mindset: “I need to perform operation X, then operation Y, then operation Z.” Once you think in those terms, you’re biasing your answers toward performing discrete operations, rather than seeing if there is a good way of combining X, Y, and Z together in one query. Being able to do this is a great indicator that you truly understand and can apply set-based concepts.
