I’ve started spending some time over at the Ask SQL Server Central forum. It’s a lot of fun, but more importantly, trying to answer people’s questions helps you focus on what you do know and opens up new lines of inquiry into what you don’t. This post is an example of the former.
Linked servers are a really cool idea. With very little extra work, you can pretend that the data on one SQL Server instance is in the same location as your local instance. In theory, you could reduce a lot of redundancy if you have the same type of data in a lot of different locations. Set up a reference data server and all of your other servers connect to it whenever they need that data. Unfortunately, reality intrudes in this scheme…
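The setup really is minimal. A sketch of the usual T-SQL, where the instance name REFSRV, the database RefDB, and the remote login ref_reader are all placeholders for illustration:

```sql
-- Register a remote SQL Server instance as a linked server.
-- REFSRV is a hypothetical network name for the reference data server.
EXEC sp_addlinkedserver
    @server = N'REFSRV',
    @srvproduct = N'SQL Server';

-- Map all local logins to one remote login (placeholder credentials).
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'REFSRV',
    @useself = N'FALSE',
    @locallogin = NULL,
    @rmtuser = N'ref_reader',
    @rmtpassword = N'********';

-- Four-part names now behave as if the data were local:
-- SELECT * FROM REFSRV.RefDB.dbo.Country;
```

That login mapping, incidentally, is where a lot of the trouble below starts: the rights you give the remote login determine what the optimizer can see.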
We use linked servers pretty heavily at work. One major use case is keeping reference data in production and writing queries which join to that production reference data. This doesn’t affect us in production, but it does affect our development and training environments because they use linked servers to connect to the reference data. With that in mind, here are some bits of experience we’ve earned pertaining to linked server connections.
In my experience with linked servers, they’re useful for certain types of queries but can become major performance problems very quickly. Linchi Shea has a series of articles (1, 2, 3) on linked servers and performance that you should look at before beginning anything. My work with them pretty much corroborates his points, too.
Here’s a quick point list of things I’ve seen:
- If you can, you may want to use an account with sysadmin, db_owner, or db_ddladmin rights across the linked server. Otherwise, your linked server queries will not get access to the remote SQL Server’s statistics, and this can lead to some very ugly plans. In our case, we can’t use an account with that level of privilege (rather, we’re not willing to accept the risk) and performance tuning can be a bear as a result.
- Pull as little data as possible from the linked server. Ideally, you only want to get filtered records from a where clause (but if you’re doing that, you probably should just do this at the business/data object level in your application). Inter-server joins can be a bit trickier because, in order to do the join, you need to send the entire data set across the wire to one of the servers–by default, the remote server sends data to the local server, though if you use an INNER REMOTE JOIN, you can reverse that scenario and have the local server send data to the remote server for processing. This isn’t bad if you’re pulling 3 records from the remote server, but if you’re joining millions of rows on one server to millions of rows on another server, that’s trouble.
- I’ve never had a situation in which INNER REMOTE JOIN actually makes things better. Maybe it’s our particular queries and loads, but it just hasn’t come up.
- Keep your queries as simple as possible. Try to avoid non-indexed views or anything which brings over more tables than you absolutely need. For example, we had a situation where we had two remote tables, Big and Little (not their real names, obviously). Little only had a couple thousand records, but Big had about 200,000 records. If you join your local table to Little and put a where clause in, it can be okay–even if you need to ship over the couple thousand records, that’s not too bad if the two servers are on a fast enough connection. The problem was that the query we had involved a left outer join to Big, so in order to get the few rows from Big that we actually cared about, we needed to get all 200,000 rows across the wire before the joins and filters could actually work. You could imagine the kind of performance trouble this query had.
- Following up on the previous point, I’ve also seen cases where the optimizer gets tricked with remote queries. If you join to Big three times, instead of pulling the remote data once into a local worktable and reusing it, I’ve seen the optimizer fetch the entire table three separate times.
- A simple SELECT * FROM dbo.Coordinator c INNER JOIN [Remote].[DB].dbo.[User] u ON c.UserId = u.Id can have much worse results than you would expect. In the case of one plan I ended up with, the remote server creates a temporary prepared statement (using sp_prepexec) with a parameter. Then, for each UserId the query passes along from the Coordinator table, it runs sp_execute with the prepared statement and the current user ID. So with 60 distinct user IDs in the Coordinator table, it runs sp_execute on the remote server 60 times with 60 different IDs. With 500,000? Built-in coffee break.
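To make the join-location points concrete, here is a sketch using the table and server names from the examples above (so they’re illustrative, not real):

```sql
-- Default behavior: the remote server ships its rows to the local
-- server for the join, or (as in the plan described above) the local
-- server drives a parameterized remote lookup per distinct UserId.
SELECT *
FROM dbo.Coordinator c
    INNER JOIN [Remote].[DB].dbo.[User] u
        ON c.UserId = u.Id;

-- REMOTE join hint (valid only on INNER JOIN): perform the join on
-- the right-hand (remote) server instead, shipping the local
-- Coordinator rows across the wire to it.
SELECT *
FROM dbo.Coordinator c
    INNER REMOTE JOIN [Remote].[DB].dbo.[User] u
        ON c.UserId = u.Id;
```

As I said above, the hint has never actually helped in our environment, but it’s worth knowing the lever exists; compare the actual execution plans before trusting either form.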
With this in mind, here are a few take-aways:
- It is possible to use linked servers in a production situation. I would, however, recommend limiting your dependence upon this bit of functionality.
- Whenever possible, use some form of replication/transference strategy to get the data where you need it to go. Performance for queries within a single instance is significantly better than for cross-instance queries.
- Sometimes, it’s actually faster to do a “join” in your application itself, especially if you have relatively narrow filters on two large tables. Say you have two 1-million row tables, one on server A and one on server B, where server A is your local server. Suppose that you want to get 10 specific records from A, and these join to roughly 10-20 rows in B by A’s primary key (which would be B’s foreign key, if you could enforce foreign keys across servers). Rather than sending 1 million rows across the pipe, it would be a lot faster (even with the overhead of opening and closing a connection) to grab your 10 specific records from A and then get only the records from B where the “foreign key” is in the set of A primary keys you currently have in memory. Put good indexes on both sides and I’ll bet that you’ll have your results displayed much faster.
- If you do use linked server connections, I highly recommend regularly reviewing the performance, especially if you don’t use a linked server account which can access the remote server’s statistics.
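The “join in the application” idea from the third take-away can also be approximated in T-SQL by materializing the small key set first, so the handful of local keys drives the remote access rather than the remote table’s million rows. Table and column names here (TableA, TableB, AId, SomeFilter) are hypothetical:

```sql
-- Step 1: grab the ~10 local rows you actually need.
SELECT a.Id, a.Name
INTO #LocalRows
FROM dbo.TableA a
WHERE a.SomeFilter = 'narrow';

-- Step 2: pull only the matching remote rows into a local temp table.
SELECT b.*
INTO #RemoteRows
FROM [Remote].[DB].dbo.TableB b
WHERE b.AId IN (SELECT Id FROM #LocalRows);

-- Step 3: do the real join entirely locally, where the optimizer has
-- full statistics and no wire transfer is involved.
SELECT l.Name, r.*
FROM #LocalRows l
    INNER JOIN #RemoteRows r
        ON r.AId = l.Id;
```

No guarantees the optimizer handles step 2 gracefully in every case (it’s still a distributed query), so as with everything else here: check the actual plan and the row counts coming across the wire.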