Compañero Conference


I am extremely pleased to have the opportunity to speak at Compañero Conference. Compañero Conference (or CompaCon, as I can’t pronounce “Compañero” because I can’t roll that ñ) is a paid conference in Norfolk, Virginia, starting the evening of October 3rd and running through the 5th.

The purpose of Compañero Conference is to give lone DBAs, accidental DBAs, and new DBAs a place to meet compatriots and learn some useful skills.  Carlos and Steve have a pattern in mind for an intimate affair:  a single track of sessions, social events to give people a chance to get to know someone in a similar situation, and unfettered access to great speakers (and me).

I’m going to do two sessions at CompaCon:  one on database security and one giving a gentle introduction to the Hadoop ecosystem.  Because it’s a small group, you’ll have plenty of chances to get one-on-one discussion time with any of the speakers, so if you need help convincing your boss to let you attend, there’s a built-in benefit:  you have some really sharp people (and me—oh, wait, I did that one already) trapped in a room, and you’ll get to pick our brains about work topics.  And unlike a SQL Saturday, you’ll still get the whole weekend to yourself.

Tickets are currently available at a discounted price of $400, but that’s only for a limited time.

What Makes A Data Platform Professional?

Eugene Meidinger hates the broadness of the term “data platform professional”:

During PASS Summit, I wrote a post about the broadening data platform. I talked about the term Data Professional, and how I feel it describes the changes going on in the SQL space. Here’s the problem: It’s a terrible guide. It’s a great description, it’s a wonderful attitude, it’s an ambitious goal; but it’s a terrible guide.

Being a data professional means being a jack of all trades. It means being a renaissance man (or woman). It means a career plan that looks like this:

Here’s my summary of Eugene’s argument:

  1. The concept of “data platform” is too broad to be meaningful, because
  2. nobody can gain expertise in the entire data platform.  Therefore,
  3. focus on a particular slice of the platform.

Before I go on to make a counter-argument, let me start by saying that I agree almost whole-heartedly with this summary.  But that never stopped me from arguing before…

So here’s my counter-argument:  the concept of “data platform” is quite broad and nobody will master it all.  Within that group, there are core skills, position-specific core skills, secondary skills, and tertiary skills.  I recommend one of two paths:

  1. The Polyglot:  be great at the core and position-specific core skills, good at secondary skills, and aware of tertiary skills.  Be in the top 25% of your field at one to two skills, ideally one core and one secondary.
  2. The Specialist:  be outstanding (in the top 1-3% of your field) at one to two skills.

With that in mind, let’s flesh this argument out.

Gauss and Mathematics

Carl Friedrich Gauss was the last true polyglot mathematician.  He was able to make significant contributions to pretty much every branch of mathematics at the time, something no mathematician has been able to do since.  The reason for this is not that Gauss was that much smarter than any other person since him, but rather that Gauss himself helped expand the world of mathematics considerably, so a 21st century Gauss would need to know about everything Gauss did plus what his contemporaries did plus what their chronological successors did.  This spider-webbed growth of knowledge makes it impossible for one person to repeat what Gauss did.

Even though nobody can be a “true” polyglot mathematician—in the sense of knowing everything about mathematics at the present time—anymore, it doesn’t mean that “mathematics” is so broad a term as to be meaningless as a career plan.  Instead, it means that we all have to specialize to some increasingly greater extent relative to the entire body of knowledge.

What’s The Right Level Of Specialization?

One of my colleagues, Brian Carrig, was working the SQL Clinic at SQL Saturday Raleigh.  When he lost his original “I can help you with…” badge, he created his own.

This…might not be the right level of specialization.  Brian’s a sharp enough guy that he knows more than the average practitioner on a wide range of topics, but I don’t think the choices are to be Brian or to be replaced by robo-developers; there are few enough people who can reach Brian’s level of skill that if these were the only choices, it’d be a dystopian nightmare for IT practitioners (and I’m not just saying that because I want Brian to provision me some more SQL Server instances).

So there has to be a middle ground between “know everything” and “exit the industry.”  I agree with Eugene that we have to specialize, and here’s what I see, at least based off the current landscape.

Sub-Data Platform Job Categories

To save this from being a 5000-word essay, let’s pick four very broad categories for data platform jobs.  These share some overlap and there are certainly people who don’t fit in any of these roles, so this is not a complete taxonomy.  It should serve as a guide for us, however.

The four broad categories I have in mind are as follows:  database developer, database administrator, Business Intelligence specialist, and data analyst.  Database developers focus on writing and tuning queries and tend to specialize in performance tuning.  Database administrators focus on backup and recovery, dealing with database corruption, and availability; they tend to specialize in process automation.  Business Intelligence specialists build warehouses and migrate data from different systems into warehouses; this is a broad enough term that it’s hard to say what they specialize in, but pick one piece of the puzzle (cubes, warehouse modeling, ETL) and you’ll find people who specialize there.  Finally, data analysts apply concepts of statistical analysis to business problems and come up with explanations or predictions of behavior.

Choosing Your Skills

I see four major categories of skill, but the specific details of what fits into each category will differ based on the role.  Again, this is not intended to be a taxonomy but rather a conceptual description.  We have the following concepts:  core skills, position-specific core skills, secondary skills, and tertiary skills.

Core skills are skills which are common to all data platform professionals.  There are relatively few of them, but they tend to be fundamental to every position.  Think of things such as an understanding of SQL and relatively basic query tuning (which includes figuring out when to tune a query and what information is available on your platform for tuning queries).  But really, when we think of core skills, we’re thinking of position-specific core skills.

As an example of a position-specific core skill, administrators need to know how to back up and restore the databases under their care.  How you do this will differ based on the product, but if you administer a database without knowing how to recover it, you’re running a major risk and have a major skill gap.  So basically, position-specific core skills are the things that you train juniors to do and expect mid-levels to know already.

Secondary and tertiary skills are even more nebulous, but I see them as skills which are ever-more-distant from the position-specific core skills.  For a database administrator, the ability to write .NET code is a secondary skill:  relatively few employers or practitioners think of a database administrator as someone who needs to write C# or F# code, but they can see how it’d apply to the position.  A language like R would be a tertiary skill:  a skill which the average practitioner has trouble tying back to day-to-day life.  Most DBAs never think of using R for anything (although I’m trying to change that in my own small way).

Now, skills move over time.  As Eugene points out in his post, I sincerely believe that administrators who don’t understand PowerShell are at a serious disadvantage and that there will come a time when database administrators entirely lacking in PowerShell scripting will be superfluous.  We’re probably the better part of a decade away from containerization technologies like Docker having the same impact, but that shift is ramping up as well.  On the other side, an example of a technique that differentiated good from great database administrators a long time ago was the ability to lay out files on disk to minimize drive latency.  SANs and later SSDs killed that skill altogether.

I wouldn’t describe these skill shifts as fluid, but rather tectonic; they don’t change overnight but they do significantly alter the landscape when they happen, and you don’t want to be on the wrong side of that tectonic shift.

So What’s The Answer?

In my eyes, the answer is to build your skills along one of two paths:  the polyglot or the specialist.  The polyglot knows a little about a lot but has a few major points of depth.  A polyglot database developer might know a lot about writing PL/SQL and tuning Postgres queries, but also has enough experience to query Lucene, write some basic Informatica scripts, and maintain a Python-based ETL project.  At many companies, a broad slice with good depth in a couple skills and relatively little depth in several skills is enough, and for our polyglot developer, it keeps doors open in case the market for Postgres developers flattens out for a few years or our developer wants to go down a new road.

In contrast to the polyglot, a specialist developer is elite at certain skills and knowingly ignorant of most others.  A specialist SQL Server query tuner is in the top 1-3% of all developers at tuning SQL Server queries and knows all kinds of language and configuration tricks to squeeze percentages off of queries which take milliseconds or even microseconds, but might not know (or care) much about the right way to automate taking backups.  You go to the polyglot to solve general, overarching problems but go to the specialist because you have a particular problem which is beyond the polyglot’s skill level.

In case the parallel isn’t completely clear, this model fits with the model for medical doctors:  you have Primary Care Physicians/General Practitioners (PCPs or GPs) and you have specialists.  The PCP knows how to diagnose issues and sees patients with a wide range of maladies.  Sometimes, the PCP refers a patient to a specialist for further diagnosis or action. As an example, a PCP might stitch up a patient with a nasty gash, but that same PCP won’t rebuild a shattered femur; that PCP will refer the patient to a specialist in that area.

Is This Really The Right Direction?

A couple days before Eugene’s post, I had a discussion with a person about this topic.  She was getting back into development after a couple years doing something a bit different, and one thing she noticed was the expectation of employees being more and more polyglot.  Her argument is that we, as IT professionals, have a lot to do with this, as there’s a bit of a race to the bottom with developers wanting to learn more and willing to spend more and more time learning things.  This makes IT jobs feel like running on a treadmill:  you expend a lot of effort just trying to keep up.  And this shows in how job titles and job role expectations have changed, including the concept of a data scientist (which I’ll touch upon at the end).

I’m not sure I agree with this assessment, but it does seem that more positions require (or at least request) knowledge of a range of skills and technologies, that it’s not enough to be “just” a T-SQL stored procedure developer in most shops.  So to that extent, there seems to be a combination of developers innately moving this direction as well as job roles shifting in this direction.

To the extent that she is correct, there’s a good question as to how sustainable this strategy is, as the platform is expanding ever-further but we don’t have any more hours in the day.  But at the same time, take a step back and this is nothing new:  database developers are already a subset of all developers (much as we are loath to admit this sometimes), so these categories are themselves cases of specialization.  But let’s shelve that for a moment.

Anecdote:  ElasticSearch And Me

It’s time for an anecdote.  A few months ago, I started running a predictive analytics team.  Our first project was to perform predictions of disk growth based on historical data.  No big deal at all, except that all the data was stored in ElasticSearch and our DBA team wanted the results in ElasticSearch as well.  My experience with ElasticSearch prior to this assignment was practically nil, but I went into it eager…at the beginning…

There were days that I wasted just figuring out how to do things that would take me five or ten minutes in SQL Server (particularly around aggregating data).  In that sense, it was a complete waste of time to use ElasticSearch, and throughout that time period I felt like an idiot for struggling so hard to do things that I intuitively knew were extremely simple.  It took a while, but I did muddle through the project, which means that I picked up some ElasticSearch experience.  I’m definitely not good at writing ElasticSearch queries, but I’m better than I was three months ago, and that experience can help me out elsewhere if I need to use ElasticSearch in other projects or even to give me an idea of other ways to store and query data.

This is one of the most common ways that people learn:  they muddle through things because they need to, either because the work requires it or because it’s a challenge or growth opportunity.  If you’re able to take the results of that muddling through and apply it to other areas, you’ve got a leg up on your competition.  And I think it’s easier to form quality systems when you have a few examples—it’s easier to reason accurately from several scenarios rather than just one scenario.

Conclusion

Summarizing a rather long blog post, I do agree with Eugene that “data platform” is a very broad space, and expecting someone to know everything about it would be folly.  But that’s not unique.  “Programmer” is extremely broad as well, but we don’t expect embedded systems developers to write databases (or even write database queries) or design responsive web applications.  Doctors and lawyers specialize to extreme degrees, as do plenty of other professionals, and I see no reason to expect anything different from data platform professionals.  I do believe that unless you are at the extreme right end of the distribution for certain skills (and can thus be a top-end specialist), you want to err in the direction of going broader rather than deeper, as that reduces the chances of getting caught in a sudden paradigm shift (remember how cool Web Forms was for about 4 years?) and risking your livelihood as a result.

One other point I want to make is that the broadness of this space shows the power of teamwork and complementary skills.  There’s an argument that a good scrum team is made up of a bunch of generalists who can all fill in each other’s roles on demand.  I think that concept’s garbage for several reasons, one of which is that you often need specialists because specialists fix problems that generalists can’t.  So instead of having a team of generalists, you have a team of people with different skills, some of which overlap and some of which complement each other:  you have one or two data specialists, one or two UI specialists, one or two “backbone” specialists (usually .NET or Java developers), one or two QA specialists, etc.  This says to me that there’s less a difference in kind than a difference in degree, even between the polyglot type and the specialist type:  you can be a polyglot with respect to other data professionals (because you’re using several data platform technologies and are working across multiple parts of the stack) while being a specialist with respect to your development team (because you’re the database person).

Coda:  Data Science

One bit at the end of Eugene’s post is that he’s interested in digging into data science.  For a post criticizing the impossibility of living up to a definition, this did elicit a knowing chuckle.  The problem is that the term “data scientist” is a microcosm of the issues with “data platform professional.”  To be a data scientist, you should have development skills (preferably in multiple languages, including but not limited to SQL, R, and Python), a strong background in statistics (ideally having worked through PhD level courses), and a deep knowledge of the context of data (as in spending years getting to know the domain).  I saw the perfect t-shirt today to describe these people.

[Image: “Trust me, I’m a unicorn” t-shirt]

There are very few people who have all three skill sets and yet that’s what being a data scientist requires.  It’s the same problem as “data platform professional” but at a slightly smaller scale.

The Benefits Of Technological Differentiation

Dave Mason has some early qualms about “big data” technologies like Hadoop and Spark.  Paraphrasing his argument (and splicing it with arguments I’ve heard from other SQL Server specialists I’ve talked to over the past few years), it seems that a good practitioner can replicate the examples and demos that are out there for technologies like Spark and Hadoop, sometimes doing it even better than the original Spark or Hadoop demo.  This is especially true whenever someone comes in with a laptop and runs a demo off of a virtual machine (like, uh, me).  But I think that there is an important caveat to this critique.

Trade-Offs

For me, Hadoop & Spark aren’t about “things you simply cannot do otherwise.” There are some companies where the scope and scale work out that there might not be another practical choice, but I’ve never worked at one of those companies. I think that it’s more a question of value versus cost on two levels. First, I can buy ten mid-spec servers a lot cheaper than a single server with 8-10x the power. Second, SQL Server Enterprise Edition is pricey. So if it’s equally easy to build a solution with Spark vs SQL Server, then that—ceteris paribus—is a potential reason to use Spark.

There’s a lot of overlap in data platform technologies, and good people can create workable solutions with reasonable data sizes and reasonable response time requirements using quite a few platforms. It’s when you start relaxing some of these assumptions that platform technologies really differentiate themselves.  For example, suppose I need sub-millisecond query response times for frequently-updated data while retaining ACID properties; if so, I’d lean toward a powerful relational database like SQL Server.  If I need sub-second response times for large warehouse queries, I’d look toward Teradata or maybe Azure SQL Data Warehouse.  If I need to ingest millions of data points per second in near-real-time, I probably want to combine Kafka with Spark.  If I need to process petabytes of non-relational genomic data, Hadoop is king of that show.  On the other side, if I need to put a small relational database on millions of embedded devices, I’ll use SQLite or maybe SQL Server Compact Edition.  In each of these cases, it’s not so much that it’s literally impossible to envision using Tech Stack A instead of Tech Stack B, or that people who start using Tech Stack B will come up with entirely different ways of solving business problems than professionals familiar with Tech Stack A; rather, the relative trade-offs can make one a more viable alternative than the other.

As a totally different example, I can use Full-Text Indexing and various tricks (like n-grams) to perform quick text search in SQL Server. For some data size, that’ll even work, and if it meets my data and response time requirements, great. But if I’m expected to do full-text search of the Library of Congress in milliseconds, I’m probably at a point where I need something tailored to this specific problem, like Solr.
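
As a rough illustration of that first approach, here is a minimal full-text search sketch.  The table, catalog, and column names are hypothetical, and this assumes the Full-Text Search feature is installed; it is a sketch of the technique, not a drop-in solution:

-- Hypothetical documents table and full-text catalog.
CREATE FULLTEXT CATALOG ftDocuments;

CREATE TABLE dbo.Documents
(
    DocumentID INT IDENTITY(1,1) NOT NULL,
    Title NVARCHAR(200) NOT NULL,
    DocumentText NVARCHAR(MAX) NOT NULL,
    CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentID)
);

CREATE FULLTEXT INDEX ON dbo.Documents (DocumentText LANGUAGE 1033)
    KEY INDEX PK_Documents ON ftDocuments;

-- A simple prefix search.  For reasonable data sizes and response time
-- requirements, this can be plenty fast without a dedicated search engine.
SELECT d.DocumentID, d.Title
FROM dbo.Documents d
WHERE CONTAINS(d.DocumentText, N'"librar*"');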

Nothing New Under The Sun(?)

Aside from restrictive constraints, I want to address in a little more detail the architecture point I made above.  Based on my reading of Dave’s post, it sounds like he’s expecting New Ways Of Doing Things across the board.  That’s because many of the proponents tend to blur the lines between techniques, architectures, and solving business problems at the highest level.

Technique changes over time.  The Google MapReduce paper has spawned a generation of distributed computing techniques and has indirectly led to Resilient Distributed Datasets, kappa and lambda architectures, and plenty more.  But if you look at it at a high enough level, the concepts have stayed very similar.  We’re still building text processing engines, aggregation engines, and lots of servers which take data from one place and put it into a different place (sometimes making it look a bit different in the process).  At that level, there’s nothing new under the sun.

But I don’t consider that a shortcoming; it is an acknowledgement that conceptual solutions to business problems are independent of their technological implementations.  Part of our job as data specialists is to translate conceptual solutions to a particular set of technological tools available under given constraints for a particular problem.  The new tools won’t open up a previously-undiscovered world of conceptual solutions; instead, they shift the constraints and potentially open doors when we previously would have said “Sorry, not possible.”

That said, I think that there is a benefit in people knowing multiple tech stacks, because that helps us delay or defer the “Sorry, not possible” mentality.  That’s because “Sorry, not possible” really means “Sorry, not possible given my best expectation of what servers, processes, technological stacks, potential solutions, the budget, and business requirements look like at this moment.”  That’s a lot of hidden assumptions.

Hadoop and Spark Specialties

Wrapping up what was supposed to be a three-line response to Dave on the SQL Server Community Slack, the closest thing I have off-hand to a “thing you simply cannot do otherwise” with Spark is distributed analytics with SparkR/sparklyr or Pandas.  You can use a scale-up approach with an R server or a Microsoft R Server instance, but analyze a large enough data set and you’ll eventually run out of RAM.  With the enterprise-level version of Microsoft R Server, you can page to disk so the query won’t outright fail (like it will when you’re using the open-source R client or Microsoft R Open), but performance will not be great.

But even then, the mindset isn’t so much “How do we come up with a brand new concept to solve the problem?” as much as it is “How can we relax binding constraints on existing problems?”  That, I think, is the critical question to answer, and where you start to see value in these platforms.

Loading Into Columnstore: Avoid Trickle Loads

I’m going to tell this one in story format, so here’s the short of it up front:

tl;dr — Clustered columnstore indexes don’t like the combination of wipe-and-replace with multi-threaded trickle loaders.  Avoid that pattern.

The Setup

In the olden days, we had a large fact table with a standard clustered index and some standard non-clustered indexes.  Because the primary use of this fact table was to aggregate fairly large amounts of data, performance was less than stellar.

Then came SQL Server 2014, with its promises of updatable, clustered columnstore indexes.  I jumped on this immediately and replaced my large fact table with one built off of a clustered columnstore index.  Initial testing looked great, giving me a 2-5X performance gain depending upon the query, and that was good enough for our PM to authorize moving forward with the project.

After that project went live, all was well…for about a week.  Then things started slowing down.  It took a while before we were able to tell that there was a problem and, from there, find the root cause.

The Root Cause

To understand the problem, let’s talk a little about our infrastructure.  We load the warehouse on a per-client, per-day basis and we have a number of processes which load data concurrently.  In other words, one process loads data for Client A on June 1st while another process may load data for Client B on June 1st and a third process loads data for Client C on June 2nd, all at the same time.  Loading data includes two steps:  deleting current data and loading new data.  The first step of deleting current data happens because it turned out to be much more efficient in our scenario to delete old records and then insert new records rather than trying to merge data in (either using the MERGE keyword or combination INSERT/UPDATE/DELETE statements).
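
As a rough sketch of that wipe-and-replace step, here is what one customer-day load looks like; the table, column, and staging object names are illustrative rather than our actual objects:

-- Wipe-and-replace for a single customer-day combination (illustrative names).
DECLARE @CustomerWK INT = 78296, @DateWK INT = 20170601;

BEGIN TRANSACTION;

-- Step 1:  remove whatever is currently in the fact table for this customer-day.
DELETE FROM dbo.FactTable
WHERE CustomerWK = @CustomerWK
  AND DateWK = @DateWK;

-- Step 2:  insert the freshly-processed rows from staging.
INSERT INTO dbo.FactTable (CustomerWK, DateWK, Revenue, Cost)
SELECT s.CustomerWK, s.DateWK, s.Revenue, s.Cost
FROM dbo.StagingFactTable s
WHERE s.CustomerWK = @CustomerWK
  AND s.DateWK = @DateWK;

COMMIT TRANSACTION;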

There are two major issues that we experienced with this pattern against a clustered columnstore index in SQL Server 2014.  First, there was no way to reorganize or rebuild the index online in SQL Server 2014, meaning that the only way I could clean up deleted records would be to rebuild an entire partition.  Given that our hot load partitions are also hot access partitions (specifically, the current and prior months) and we’re a 24/7 company, rebuilding those partitions is pretty much out of the question.  This meant I couldn’t clean out partitions full of deleted records, leaving my compressed columnstore rowgroups woefully under-populated.

At the same time, we experienced large numbers of open rowgroups in the deltastore, many of which contained just a few records.  My best understanding of why this happened is as follows:  when a process goes to delete records for a customer-day combination, that process can lock multiple deltastore rowgroups.  If other processes are trying to insert data into the deltastore while that first process tries to delete records, they’ll open new rowgroups because the current ones are locked.  After a while, we’d end up with hundreds or thousands of open rowgroups in the deltastore, many of which contained well under 10,000 rows apiece but which added up to tens of millions of records in total.  Given the way the deltastore works (it’s a big heap), having to scan a giant heap made our queries slower.  The worst part is that because these rowgroups tended not to grow much in size, the tuple mover wouldn’t do anything with them, so they’d just accumulate as new deltastore rowgroups get created and populated.
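
If you want to see how bad the rowgroup situation is on your own tables, the sys.column_store_row_groups DMV (available in both 2014 and 2016) shows open rowgroups and deleted-row counts.  A quick check, with an illustrative table name:

-- Rowgroup health for a clustered columnstore table (illustrative table name).
SELECT
    csrg.partition_number,
    csrg.row_group_id,
    csrg.state_description,   -- OPEN / CLOSED / COMPRESSED
    csrg.total_rows,
    csrg.deleted_rows
FROM sys.column_store_row_groups csrg
WHERE csrg.object_id = OBJECT_ID(N'dbo.FactTable')
ORDER BY
    csrg.partition_number,
    csrg.row_group_id;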

SQL Server 2016 gave me the ability to reorganize indexes online, which was a great addition:  it allowed us to keep those columnstore tables online while reorganizing the partitions, closing out all of those open rowgroups, and combining the mostly-empty rowgroups.  But that’s not a real solution to the problem; it just buys a bit of time and masks the symptoms.

Possible Solutions

Now we’re in the architecture portion of today’s post.  There are three potential solutions I want to bring up, two of which I’ve implemented in production at one time or another.

Rowstore Front-End

The first architecture involves putting a rowstore table in front of the columnstore.

[Image: diagram of the rowstore front-end architecture]

In this design, I have data coming from the transactional system, undergoing some ETL processing, and going into a staging table on the warehouse.  From there, I perform the remainder of the ETL work and insert into a rowstore table.  This rowstore table has the same attribute names and data types as the columnstore table, but instead of having a clustered columnstore index, it has a standard B-tree index and can have additional non-clustered indexes.  From there, I expose the combination table using a view which simply unions the two sets of data so the application doesn’t have to see rowstore versus columnstore tables.

To move data from the rowstore to the columnstore, I have an external migration process.  This migration process waits until one of two conditions is met:  either there are at least 250,000 records in a single partition, or there is data from at least 4 partitions ago.  In other words, for the last three months (including the current one), I’d hold off on migrating data until I hit the magic number of 250K records, so that I could migrate that as a single batch and bulk insert the results, bypassing the deltastore altogether.  For older data, my expectation was that these are typically one-off or smaller data moves, so waiting for 250K records was folly, as that point might never come.  Instead, move those immediately to keep the rowstore table compact.  The migration process I wrote looked at data by partition, so I could pull in data from 6 months ago while still waiting for the current partition to accumulate enough records to make that move worthwhile.
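
Here is a minimal sketch of the idea, with illustrative object names; the real migration procedure was considerably more involved (per-partition logic, error handling, and so on):

-- The application queries this view and never sees the two underlying tables.
CREATE VIEW dbo.FactTableCombined
AS
    SELECT f.CustomerWK, f.DateWK, f.Revenue, f.Cost
    FROM dbo.FactTableColumnstore f
    UNION ALL
    SELECT r.CustomerWK, r.DateWK, r.Revenue, r.Cost
    FROM dbo.FactTableRowstore r;
GO

-- Simplified migration step:  once a partition's worth of rowstore data passes the
-- batch threshold, move it in one insert so it bypasses the deltastore.
DECLARE @MinDateWK INT = 20170601, @MaxDateWK INT = 20170630;

IF (SELECT COUNT(*) FROM dbo.FactTableRowstore r
    WHERE r.DateWK BETWEEN @MinDateWK AND @MaxDateWK) >= 250000
BEGIN
    BEGIN TRANSACTION;

    INSERT INTO dbo.FactTableColumnstore (CustomerWK, DateWK, Revenue, Cost)
    SELECT r.CustomerWK, r.DateWK, r.Revenue, r.Cost
    FROM dbo.FactTableRowstore r
    WHERE r.DateWK BETWEEN @MinDateWK AND @MaxDateWK;

    DELETE FROM dbo.FactTableRowstore
    WHERE DateWK BETWEEN @MinDateWK AND @MaxDateWK;

    COMMIT TRANSACTION;
END;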

Advantages
  1. I get immediate access to the data once it makes its way into the rowstore.  This gets me access to the data earlier than the other alternatives.
  2. It solves one of the pain points I expressed above.  Hot records don’t go into the deltastore, so we don’t see a proliferation of open rowgroups.  Also, depending upon how quickly we reload data, it might solve the other problem as well:  if data doesn’t get reloaded very frequently, letting it sit in the rowstore table for a day or so means that if we delete and reinsert data, we aren’t deleting from the clustered columnstore index.
Disadvantages
  1. This is a relatively complex design to implement, especially with a zero down-time release and a significant amount of existing code looking for tables.
  2. If a huge number of rows get into the rowstore table, query performance won’t be that great because we’re unioning rowstore values and columnstore values and performing calculations later (which negates some of the effect of having a columnstore index).
  3. If I have to reprocess months’ worth of data, that columnstore gets hit hard.
  4. It seems harder to follow.  I had other database engineers regularly ask about these tables and some of our software engineers and testers found them vexing.
  5. You have to keep two tables in sync, so whenever I add a column to one side, I have to add the column to the other side and to the migration procedure and to the view.

I used this model for SQL Server 2014, but then removed it after we moved to SQL Server 2016 and went back to direct columnstore insertion.  My hope was that I would not need to re-implement something for 2016, but that ended up not being the case.

Staging Table Direct Insertion

Once I started experiencing the same problems in SQL Server 2016, I had to act.  Instead of once more putting a rowstore in front of my columnstore table, I decided to increase complexity in the ETL process to simplify application querying.  To wit, I created a two-step load process.  The first step of the load process, moving data from our transactional system into the warehouse, remains a trickle load, inserting records for a customer-date combination into a memory-optimized staging table.  Once that load process is complete, the ETL process inserts a row into a warehouse queue table, letting the next step know that this customer-day combo is ready to go.

[Image: diagram of the staging table direct insertion architecture]

From there, I let the staging table grow a bit and run a job periodically to bulk move the staging table rows into the columnstore.  Now I delete and insert larger chunks of rows, usually in the 250-800K range.  This means that I avoid the deltastore completely and get a pretty decent row count in each compressed columnstore rowgroup.
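
A simplified sketch of that periodic job follows.  Object names are illustrative, and for brevity I treat the staging table as a regular disk-based table (the real one is memory-optimized, which brings its own isolation-level requirements inside explicit transactions):

-- Capture the customer-day combinations the ETL process has marked as ready.
SELECT q.CustomerWK, q.DateWK
INTO #ReadyToMove
FROM dbo.WarehouseQueue q;

BEGIN TRANSACTION;

-- Wipe and replace in bulk, bypassing the deltastore.
DELETE f
FROM dbo.FactTable f
    INNER JOIN #ReadyToMove r
        ON f.CustomerWK = r.CustomerWK AND f.DateWK = r.DateWK;

INSERT INTO dbo.FactTable (CustomerWK, DateWK, Revenue, Cost)
SELECT s.CustomerWK, s.DateWK, s.Revenue, s.Cost
FROM dbo.StagingFactTable s
    INNER JOIN #ReadyToMove r
        ON s.CustomerWK = r.CustomerWK AND s.DateWK = r.DateWK;

-- Clean up staging and the queue for the rows we just moved.
DELETE s
FROM dbo.StagingFactTable s
    INNER JOIN #ReadyToMove r
        ON s.CustomerWK = r.CustomerWK AND s.DateWK = r.DateWK;

DELETE q
FROM dbo.WarehouseQueue q
    INNER JOIN #ReadyToMove r
        ON q.CustomerWK = r.CustomerWK AND q.DateWK = r.DateWK;

COMMIT TRANSACTION;

DROP TABLE #ReadyToMove;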

Advantages
  1. Seamless post-ETL process.  No new tables and no views.
  2. Calculations don’t need to hit the rowstore table, so I can take full advantage of columnstore aggregations.
Disadvantages
  1. Data loads are no longer “instantaneous.”  I have to wait a bit longer for the second step of the process to commit before records show up in the warehouse.
  2. Additional ETL complexity means there are more things that could break.
  3. There is a possibility of weird data issues.  For example, if I’m trying to load a customer-day combo while the first phase of the ETL process is trying to re-insert that data, I could get inconsistent results.  I needed to add in checks preventing this from happening.

This is my current architecture.  I’m happier with it than with the original rowstore table architecture.

Bulk Insertion

The third architecture is simple:  don’t do trickle load at all.  For many companies, it’s okay to do a major load of data once a day or once every several hours.  If that’s the case, I’d recommend doing a bulk insertion over trying to implement a trickle load.  Just like the other two methods, bulk insertion bypasses the deltastore when you’re loading enough records.

Ideally, this bulk insertion would be a straight insert, never updating or deleting data.  If you can get away with it, the ideal pattern would be something like this:

[Image: diagram of the bulk insertion with partition swap]

If I need to load data for July 5th, I’m going to load all of the data for the partition which contains July 5th into a new table with that partition, and then I’m going to swap out the corresponding partition on the columnstore side.  I would want to partition by the normal load unit—for example, if we load data monthly, I’d partition by month; if we load data daily, I’d partition by day if possible.
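
Here is a rough sketch of the swap, with hypothetical object names, partition number, and boundaries; the real thing needs matching table structures, check constraints, and filegroups for the switches to succeed:

-- Assume dbo.FactTable is partitioned by month on DateWK and partition 42 covers July 2017.

-- 1) Load the full month into a standalone table with the same structure,
--    including the clustered columnstore index.
TRUNCATE TABLE dbo.FactTable_July2017;

INSERT INTO dbo.FactTable_July2017 (CustomerWK, DateWK, Revenue, Cost)
SELECT s.CustomerWK, s.DateWK, s.Revenue, s.Cost
FROM dbo.StagingFactTable s
WHERE s.DateWK BETWEEN 20170701 AND 20170731;

-- 2) Optionally rebuild to get fully-packed rowgroups; nothing reads this table yet.
ALTER INDEX CCI_FactTable_July2017 ON dbo.FactTable_July2017 REBUILD;

-- 3) Swap:  move the old partition out to a scratch table, then switch the new data in.
ALTER TABLE dbo.FactTable SWITCH PARTITION 42 TO dbo.FactTable_July2017_Old;
ALTER TABLE dbo.FactTable_July2017 SWITCH TO dbo.FactTable PARTITION 42;

-- 4) Clean up the old data at leisure.
TRUNCATE TABLE dbo.FactTable_July2017_Old;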

Advantages
  1. No deletes from the columnstore means no fragmentation.
  2. Columnstore rowgroups are as packed as possible.  If there’s any doubt, we can run an index rebuild on the new partition before swapping it, as nothing reads that table.
  3. Just like the staging table direct insert, I don’t need to make any application changes or create new supporting objects outside of ETL.
Disadvantages
  1. Data loading must be periodic, and will probably be slower than trickle loads.  You’re probably loading once a day or once a month at this point.
  2. If just one row changes, you have to rebuild the entire partition.  That can be a time sink when partitions get very large.

This is the “ideal” solution, but making it work when customers expect data in less than 20 minutes is tricky.  The staging table direct insert architecture seems to be a reasonable compromise between spray loading data and waiting a long time for data.

Use SQL Server 2016

Regardless of the architecture, SQL Server 2016 is a must-have for clustered columnstore indexes.  The ability to reorganize indexes online is a life-saver.  These reorganizations can block queries for short periods of time, but that ability is a huge benefit if you do find yourself deleting a lot of data in the columnstore.
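
For reference, a hedged example of the reorganization itself (index and table names are hypothetical):

-- Reorganize a clustered columnstore index online.  COMPRESS_ALL_ROW_GROUPS forces
-- open and closed deltastore rowgroups into compressed columnstore format.
ALTER INDEX CCI_FactTable ON dbo.FactTable
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- A second pass (without the option) lets SQL Server merge together mostly-empty
-- compressed rowgroups left behind by deletes.
ALTER INDEX CCI_FactTable ON dbo.FactTable REORGANIZE;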

Conclusion

In this post, we looked at three architectures for loading data into columnstore indexes, with a special focus on trickle loading data.  The common denominator for all of these is good staging tables to absorb the first wave of changes and move data into the columnstore in bulk.

R For The DBA: Graphing Rowcounts

Something I am trying to harp upon is that R isn’t just a language for data analysts; it makes sense for DBAs to learn the language as well.  Here’s a really simple example.

The Setup

I have a client data warehouse which holds daily rollups of revenue and cost for customers.  We’ve had some issues with the warehouse lately where data was not getting loaded due to system errors and timeouts, and our services team gave me a list of some customers who had gaps in their data due to persistent processing failures.  I figured out the root cause behind this (which will show up as tomorrow’s post), but I wanted to make sure that we filled in all of the gaps.

My obvious solution is to write a T-SQL query, getting some basic information by day for each customer.  I could scan through that result set, but the problem is that people aren’t great at reading tables of numbers; they do much better looking at pictures.  This is where R comes into play.

The Solution

My solution is just a few lines of R code, as well as a few lines of T-SQL.  I’m using SQL Server 2016 but we don’t use SQL Server R Services (yet!), so I’m doing this the “old-fashioned” way by pulling data across the wire.  Here’s the code:

install.packages("ggplot2")
install.packages("RODBC")
library(RODBC)
library(ggplot2)

conn <- odbcDriverConnect("Driver=SQL Server;Server=MYSERVER;Initial Catalog=MYDATABASE;Provider=SQLNCLI11.1;Integrated Security=SSPI")

wh <- sqlQuery(conn, "SELECT fc.CustomerWK, fc.DateWK, COUNT(1) AS NumRecords, SUM(fc.Revenue) AS Revenue, SUM(fc.Cost) AS Cost FROM MYDATABASE.dbo.FactTable fc WHERE fc.CustomerWK IN (78296,80030,104098,104101,104104,108371) AND fc.DateWK > 20160901 GROUP BY fc.CustomerWK, fc.DateWK;")

wh$CustomerWK <- as.factor(wh$CustomerWK)
wh$DateWK <- as.Date(as.character(wh$DateWK), "%Y%m%d")
str(wh)

ggplot(wh, aes(x=DateWK, y=NumRecords, colour=CustomerWK, group=CustomerWK)) +
  geom_line() +
  xlab("Date") +
  ylab("Number Of Records")

Let’s walk through this step by step.  After installing and loading the two relevant packages (RODBC to connect to SQL Server and ggplot2 to help us create a pretty graph), I open a connection to my server (with the name replaced to protect the innocent).

Next, I create a data frame called wh and populate it with the results of a query to the warehouse.  This is a pretty simple SQL query which gets the number of rows by customer by day, and also shows me revenue and cost.  I’m not using revenue and cost in this graph, but did look at them as part of my sanity checks.

Next up, I want to “fix” a couple data types.  CustomerWK is an int and represents the customer’s surrogate key.  Although this is an integer type, it’s really a factor.  I have a small, unique set of categories; there’s no mathematical relationship between CustomerWK 1 and CustomerWK 2.  Anyhow, I replace CustomerWK with this new, factorized attribute.

After taking care of the CustomerWK factor, I convert DateWK to a date type.  DateWK is an integer representation of the date in ISO format, so January 15th, 2016 would be represented as 20160115.  I need to convert this from an integer to a character string, and then I can convert it to a date.  I replace the DateWK value with this date type.  I included the str(wh) call to show that my data frame really does have the correct types.

Finally, I call ggplot, passing in my warehouse data frame.  I create an aesthetic, which tells the graphing engine what I want to see on the screen.  I want to see the number of records in the fact table per day for each customer, so my Y coordinate is defined by NumRecords, my X coordinate by DateWK, and my group by CustomerWK.  To make it easier to read, I color-code each customer.

After creating the aesthetic, I plot the results as a line graph using the geom_line() function, and then give meaningful X and Y axis labels.

The Results

What I get in return is a decent-enough looking graph:

[Image: line graph of the number of records by day for each customer]

I can easily see that customer 108371 experienced a major dropoff sometime in mid-October, and fell off the cliff in early November.  The other customers have been fairly stable, leading me to believe that just one customer (in this cohort) has an issue.  I was able to investigate the issue and determine the root cause of the falloff—that the customer stopped sending us data.

Conclusion

This is another example where knowing a little bit of R can be very helpful.  Even if you aren’t building predictive models or performing advanced regressions, the ability to throw results into a plot and quickly spot outliers makes the job much easier.  If I had to discern results from a Management Studio result set, I could still do the job, but I might have been thrown for a loop with customer 78296, whose counts fluctuated over a fairly wide band.

SSIS And HDInsight

Not too long ago, I decided to try connecting to HDInsight using SSIS.  One of the Integration Services updates for SQL Server 2016 is Hadoop support.  This Hadoop support comes via the Hadoop Connection Manager, which allows for two connection methods:  WebHCat and WebHDFS.  My preference is WebHDFS, but it appears that we cannot use WebHDFS at all with HDInsight.  So that leaves WebHCat.

What Is WebHCat?

WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS.  If you’re looking for configuration settings for WebHCat, you’ll generally want to look for “templeton” in config files, as Templeton was the project name before WebHCat.  In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings.  For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.

[Image: Ambari screenshot showing the templeton.port setting of 30111]

Connect To Port 30111

Now that we know this, it’s as easy as putting your cluster name into the WebHCat host, putting in the Hadoop user account, and hitting the Test Connection button:

[Image: Hadoop Connection Manager configured with WebHCat port 30111]

And, if you’re following along at home, prepare to wait about 2 minutes and have it fail.

It turns out that HDInsight clusters are accessible through port 443 via URL rewrites, which is pretty cool:  you always go through SSL connections, and based on the URL you go to, you’ll get forwarded on to the correct service.  In practice, this makes it fairly easy to connect to WebHCat via curl.

Port 443

Given that we need to connect on port 443 via an HTTPS connection (thus SSL), it seems that we should set the WebHCat port to 443, turn on HTTPS, and be good to go:

[Image: Hadoop Connection Manager configured with WebHCat port 443 and HTTPS enabled]

This also failed to connect when I hit Test Connection, giving me a weird System.AggregateException error at that (though I don’t have a screenshot of it).  So what do we do now?

Capture Those Packets!

Wireshark is your friend, assuming that you’re allowed to install it in your environment.  Make sure of it first; people have been fired for less…

So let’s fire up a packet capture and see what exactly is happening.  I’ve removed any IP addresses or other identifiers, as it’s going out to a public IP, but we can easily see what’s going on here.

[Image: Wireshark capture showing the failed connection attempts]

The red boxes show me trying to connect directly to port 30111 and we can see why we get errors:  we’re sending SYN packets (to try to open a connection) but not receiving an ACK (acknowledgement) packet from the server.  After a while, we have a connection timeout.  Our HDInsight cluster is not listening to public requests on port 30111.

The blue box shows what’s happening when I use port 443 instead of 30111.  First, we have SYN and ACK packets, which establish the connection.  Next, SSIS tries to connect to the server on port 443 and…wait, it sends an HTTP request instead of HTTPS!  Notice that we try to re-transmit a few times, assuming that the server simply lost track of time or got busy or left the phone on vibrate or something, when in reality, port 443 is studiously ignoring our packet because it’s not over SSL.  Eventually our client gets the hint and sends a FIN packet to tell the server that we’re just going to shaddup about that whole HTTP request thing.

How Should It Look?

By contrast, I’ve taken a screenshot of what a successful SSL connection looks like when I use my browser to connect to the cluster:

[Image: Wireshark capture showing a successful SSL connection]

We start with SYN and ACK, followed by an SSL Hello, which the server acknowledges.  Then we have a key exchange and the server is happy, so it responds to our request and sends the data we want.  That’s how it should look.

Fixes And Workarounds

Right now, there is no fix for Test Connection.  At PASS Summit, I had a chance to discuss this issue with members of the SSIS team and it sounds like they are aware of the bug and are working on a fix.

In the meantime, there is a workaround:  if you hit OK and create the connection manager, the SSIS processes will actually connect via SSL.

PASS Summit Follow-Up

This was quite a busy Summit…  My original plan was to write one PASS Summit-related post per day (hence the Wednesday through Friday gap last week), but there was too much to do and not enough time.

The good news is that PASS Summit just defined the next three months for me.

Polybase

I already have several posts on Polybase, but this year’s Summit convinced me that I need to dig deeper into it.  Much deeper.  My plan is to have a comprehensive series on Polybase, covering not just on-prem Hadoop (like I’ve done so far), but also other experiences like HDInsight, Azure Blob Storage, and Azure SQL Data Warehouse.

Re-Introducing Hadoop

Barely two years ago, I wrote my Much Ado About Hadoop presentation and ran it through 2015.  Looking back at this talk, I still think it was a good intro for its time, but two years is huge in the Hadoop space.  For 2017, I plan to rebuild that talk and focus on a true overview.  I think it made sense to discuss the concepts behind MapReduce and Hive then, but today, I’m going to start with a higher-level overview of streaming and integration technologies which weren’t generally available at that point.

This series will probably run alongside my Polybase posts.

Driving User Group And SQL Saturday Attendance

Hi, I’m Kevin, and I’m terrible at marketing!  Despite that, I’m going to look for ways to engage people in the Raleigh-Durham community, getting them interested in going to local user groups (like the SQL Server User Group) and attending events (like SQL Saturday).  I picked up some great ideas talking to people at PASS Summit and I want to implement them.  In the course of this, the plan is to figure out good metrics and see what drives these metrics.  I also plan to dig through Andy Warren’s posts, as he is a font of knowledge.

This includes a two-pronged approach:  first, I want to reach out to DBAs and data engineers who either have never heard of the user group or who haven’t really given it much thought.  Second, I want to reach out to people who deal with SQL but don’t think of themselves as DBAs:  software developers, QA engineers, system administrators, etc.

Of course, if you have ideas, I absolutely want to hear them.