Warehousing On The Cheap

Decision Support Systems (DSS) are excellent mechanisms for getting important information to business users in a fast and efficient manner.  These warehouses serve three vital purposes.  First, they reduce strain on On-Line Transactional Processing (OLTP) systems.  OLTP systems are designed for efficient insertion, updating, and deletion of data, as well as quick retrieval of a limited number of rows.  When a user wants to build reports off of a large number of rows, however, that can cause blocking on the OLTP system, preventing other users from modifying data quickly.

Secondly, warehouses allow you to collect data from disparate systems.  Often times, users might pull data from one system and mix it with data from other systems.  The finance department may take employee data from an HR system, budgeting data from a finance system, expenses data from the accounting system, not to mention hidden Excel spreadsheets or Access databases which contain vital business information.  A well-designed and maintained warehouse can pull from all of these systems, conform the data to business standards, and present it to end users as a single, unified data stream.  This makes building effective reports much easier.

Finally, warehouses simplify the view of the data for business users.  Well-designed warehouses (following the Kimball model) use schemas which minimize the number of joins necessary for reporting, and those joins make intuitive sense to end users.  An end user doesn’t need to understand bridge tables (which are how we model many-to-many relationships in a transactional system); can ignore business-inessential metadata like created & modified times and users; and can easily understand that the EmployeeID key ties to the Employee dimension, which contains all essential employee information.

At this point, most companies are sold on having warehouses, but not every company has the people, time, and money to do it “right.”  The ideal would be to have a separate warehousing team which has significant resources (including C-level support) available to build and maintain these separate systems.  I’ve never worked at a company of that scale; historically, I’ve worked at companies in which one or two database administrators or database engineers are responsible for “taking care of the warehouse” in addition to normal job duties.  And right now, I’m working at a client with no dedicated database people and little domain knowledge, meaning that even low-intensity warehousing may be out of the question.  In this blog post, I’m going to talk about a few ideas for how to put together a warehousing solution which requires relatively little maintenance.

The first thing to think about is tooling.  I love SQL Server Integration Services (SSIS) and will happily use it for warehousing ETL.  I also acknowledge that SSIS is not trivial to learn and can be difficult to master.  I’m pretty sure I would not want the average web developer to be responsible for maintaining an SSIS project, as there are just too many things that can go wrong, too many hidden options to toggle and properties to set to get it right; and developing an SSIS project from scratch sounds even scarier.  In an enterprise scenario, I’ll happily recommend SSIS, but for a small-scale company, that’s overkill.

The good news about most small-scale companies is that one of the big reasons for warehousing—collecting data from disparate systems—typically is inoperative.  Startups and small businesses tend to have relatively few data systems and almost no need to connect them together.  Yes, you might make the finance guy (who may also be the CEO) slightly more efficient, but they normally aren’t pushing the boundaries of software and Excel or some independent single product might actually be the best solution, even if it means double-entering some data.  They also tend not to need as many types of reports as people in a larger company might require, so you can scale down the solution.

Instead, for a smaller business, the main benefit to having a warehouse tends to come in improving application performance.  In the scenario I’m dealing with now, the company’s flagship application needs to perform an expensive calculation on a significant number of pages.  This calculation comes from aggregating data from several disparate tables, and their data model—although not perfect—is reasonably well-suited for their OLTP system.  To me, this says that a warehousing solution which pre-calculates this expensive calculation would improve application performance significantly.  But without any dedicated database people to support a warehouse, I want to look for things which are easy to implement, easy to debug, and easy to maintain as the application changes.  What I’m willing to trade off is ease of connecting multiple data sources and writing to a separate server—if I can improve query performance, the current production hardware is capable of handling the task without a dedicated warehouse instance.

Given this, I think the easiest solution is to build out a separate reporting schema with warehouse tables.  You don’t need to use a separate schema, but I like to take advantage of SQL Server’s schemas as a way of logically separating database functionality.  As far as design goes, I can see three potentially reasonable solutions:

  1. A miniature Kimball-style data model inside the reporting scheme.  You create facts and dimensions and load them with data.  The upside to this is that it’s the most extensible option, but the downside is that it requires the most maintenance, as you’ll need to create ETL processes for each fact and dimension and keep those processes up to date as the base tables change.
  2. A single table per report.  For people just starting out with reporting, or for people who only need one or two report tables, this could be the best option.  You would get the report results and store that data denormalized in a single table.  The major downside to this is extensibility.  If you start getting a large number of reports, or if several reports use the same base tables, you quickly duplicate the report table loading process and this can be disastrous for performance when modifying data.
  3. A miniature quasi-Inmon-style data model inside the reporting scheme.  You only include information relevant to the reports and your data model might be a bit more denormalized or munged-together than the base tables.  For example, suppose that you have a report which is the union of three separate tables.  Instead of storing those three separate tables, you would store the common outputs of those three tables in a manner which is somewhat-normalized but not perfectly so.  For example, let’s say that we’re dealing with travel information.  You want to report on some core details like departure and arrival times, method of travel, etc.  In the transactional system, we might care about details specific to airline flights (was a drink offered?  Which seat number?) that won’t apply to travel by taxi or train.  To solve this problem in the transactional system, we probably have child tables to store data for air flights, taxi rides, and train rides.  But in our reporting schema, we may only have a single Travel table which includes all of the necessary data.  The advantage to this is that it is extensible like the Kimball-style model and is still a simplification of the OLTP data model.  The downside is that it is more difficult to maintain than a single reporting table, and changes to the OLTP system may necessitate more reporting system change than the single report model.

None of these is the wrong solution; it all depends upon requirements.  In my case, I really only need one single reporting table, so option #2 is probably the best for me.  If it turns out that I need more tables in the future, I can migrate the data model to a proper Kimball-style model, and hopefully by that time, my client will have at least one data professional on staff to support the design.

Now that we’ve selected a model, it’s time to figure out the best way to get data into that reporting table.  I’ve struggled with this and I only see three realistic options:

  1. Have a separate ETL process which runs periodically.  If users are okay with relatively stale data, this solution can work well.  If you only need to update once every 6 hours or once a day, your ETL process could query the transactional tables, build today’s results, and load the reporting table.  This is a simple process to understand, a simple process to maintain, and a simple process to extend.  The big disadvantage is that you won’t get real-time reporting, so end users really need to be okay with that.
  2. Update stored procedures to dual-write.  The advantage to this is that you can easily see what is happening.  The disadvantage is that procedures are now writing out to two separate sources, meaning that procedures will take longer to write data and it is still possible for warehouse readers to block writers.  Also, if somebody creates a procedure (or has ad hoc code) which does not write to the warehouse, we lose those updates and our reporting table is now inconsistent.  Depending upon the size of data we’re dealing with, we might be able to put in compensating controls—like a job which runs regularly and synchronizes reporting data with transactional data—but that’s kind of a hacky solution and end users can see bad data in the meantime, making the application less trustworthy for users.
  3. Use database triggers to write.  The advantage to this is that code which inserts into the relevant OLTP tables does not need to know of the existence of these reporting tables, so there is no chance of accidentally missing an important insertion.  The biggest disadvantage is that database triggers are certainly not easily maintainable—it’s hard to remember that they exist and debugging them can be painful.  Also, with database triggers inserting into reporting tables, it is possible to have report readers block writers.  Allen White has an excellent presentation in which he uses Service Broker in conjunction with database triggers to populate warehouses, and that’s an excellent way to solve this problem.  I fully support using Service Broker, but for a company without data professionals on hand, maintaining Service Broker might be a bit too much for them.

In my case, we need real-time data and so I’m going to use database triggers.  I’m not sure yet if I’ll use Service Broker or not; I really want to, but I don’t want to get late-night troubleshooting calls asking me why the reporting table is out of date and things are failing.  I need a solution that web developers can maintain without significant domain expertise.

Your environment may differ, but a combination of stored procedures (if you can afford reporting latency) or database triggers (if you cannot) and SQL Agent jobs or Service Broker can provide a small IT team with limited SQL Server troubleshooting knowledge the ability to scale reporting calls.  In my client’s case, we are looking at cutting out almost 90% of the server’s resource requirements by going from nested table-valued and scalar functions to a simpler to use, simpler to maintain solution.

Advertisements

One thought on “Warehousing On The Cheap

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s