Course Reviews: Fundamentals of Bayesian Data Analysis in R

This is a review of Rasmus Baath’s Fundamentals of Bayesian Data Analysis in R DataCamp course.

I really enjoyed this course. Rasmus takes us through an intuitive understanding of Bayesian data analysis without introducing Bayes’s Theorem until the 4th chapter. The best part is, that’s not even a criticism: by the time he introduces the theorem, you already know the parts and the theorem is more a formalization of what you have already done.

If you’re new to Bayesian thought, give this course a try. The examples are clear and interesting, and Rasmus does a good job of mixing tabular results with histograms and other methods for visualizing results.

One of the nicest things I can say about the course is that during the exercises, I almost never had to look things up outside the course materials. Pretty much every concept was already on the slides or in the instructions, so it was a matter of putting the pieces together rather than spending an hour researching some function that might get you through the problem at hand. I did have to read the help files to figure out parameters for a couple of functions, but that's fine; the real problem comes when an instructor expects you to know something never mentioned anywhere in the course. In setting up these exercises, Rasmus does a great job.

If there's one thing I would have liked, it would be a bit more detail on BEST (Bayesian Estimation Supersedes the t-test) and other Bayesian estimation tools. Fortunately, there are a couple of courses dedicated to Stan and JAGS, so those should satisfy my curiosity.

SQL Saturday Raleigh Call For Speakers Closes Soon

If you have not already submitted a session for the best SQL Saturday in the Research Triangle in the year 2019, your time and chances are dwindling, so act now! Go to the SQL Saturday Raleigh website and submit a session or four today! Or tomorrow, but no later than Friday because that’s when our call for speakers closes.

The PolyBase Papers

This week’s PolyBase Revealed post is all about a couple of academic papers that Microsoft Research published a few years back.

In their first paper, entitled Split Query Processing in Polybase (note the pre-marketing capitalization), David DeWitt, et al., take us through some of the challenges and competitive landscape around PolyBase circa 2013. One of my favorite pieces from this paper was the discussion of where, specifically, the C#- and C++-heavy SQL Server infrastructure differs from the Java-based world of Hadoop. This includes things such as null checking, implicit conversions, and join mechanics. One thing they didn't point out in the paper but which is important is rounding: different languages and products round differently, such that you might get a result of $499.99 when using SQL Server's rounding methods but $500 or $500.02 when bouncing out to a MapReduce job. In cases where an estimate is good enough and being off by a hundredth of a percent is no big deal, that's fine, but this kind of seemingly non-deterministic behavior can lead to erroneous bug reports. Ensuring that these rounding rules remain consistent across platforms is important for that reason, but it is also complex given how many subtly different ways languages handle rounding.
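As a minimal sketch of that mismatch (the values and column names here are mine, not from the paper), SQL Server's ROUND() on a DECIMAL rounds midpoints away from zero, while Java code using RoundingMode.HALF_EVEN (banker's rounding) or Math.rint() rounds the same midpoint to the nearest even value:

    -- SQL Server rounds 2.5 up to 3 and 3.5 up to 4 (away from zero).
    -- A Java job using HALF_EVEN would return 2 and 4 for the same inputs,
    -- which is one way two platforms can disagree on "the same" number.
    SELECT
        ROUND(CAST(2.5 AS DECIMAL(3,1)), 0) AS RoundsAwayFromZero,  -- 3.0
        ROUND(CAST(3.5 AS DECIMAL(3,1)), 0) AS AlsoAwayFromZero;    -- 4.0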

One other thing of note in the first paper is that they were thinking of this with regard to Parallel Data Warehouse, so you can see some of the differences when they brought PolyBase to Enterprise Edition in 2016 around naming semantics, available data formats (no ORC or Parquet in the 2013 paper), etc. It was also during the Hadoop 1.0 days, so although there are a couple of mentions of YARN at the end, none of those improvements had hit the mainstream yet.

The second paper, from Vinitha Reddy Gankidi, et al., is called Indexing HDFS Data in PDW: Splitting the data from the index and came out in 2014. The impetus for the paper is something we're quite familiar with today: companies want to store and query massive amounts of data, but waiting minutes or hours for batch queries to finish just isn't in the cards. So what can we do to improve the performance of queries against data stored in HDFS? Their answer in the paper is to create indexes in the relational database for data in HDFS and query those indexes whenever possible. This works great if you need to do point lookups or pull a couple of columns from a giant dataset, and the paper explains how they keep the index in sync with an external Hadoop file system.

What’s really interesting to me is that this is a feature which never made it to Enterprise/Standard Edition. We have the ability to create statistics on external tables but no ability to create indexes. If you prefer a good slide deck over reading the paper, they’ve got you covered.
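For reference, here's roughly what the statistics side looks like today; the table and column names are hypothetical:

    -- We can create statistics on a PolyBase external table, but not indexes.
    -- FULLSCAN forces SQL Server to import the data to build the statistics,
    -- so on a huge HDFS-backed table you may prefer a sample instead.
    CREATE STATISTICS st_SaleAmount
        ON dbo.ExternalSales (SaleAmount)
        WITH FULLSCAN;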

Approaching Zero: Making Life Simpler

This is part three in a series on near-zero downtime deployments.

In the last post, I walked through several assumptions, particularly around source control, deployment processes, and testing. In this post, we will look at a few things you can do to make (near) zero-downtime deployments easier.

Use Enterprise Edition

SQL Server 2016 Service Pack 1 expanded the programming surface area significantly for Standard Edition, but there are still some things you can only do with Enterprise Edition. For example, you can change indexes, including clustered indexes, online. Online index operations don't mean that end users will see absolutely no effect, but they do significantly reduce the likelihood and amount of blocking users experience. On extremely busy systems users might notice a slowdown, but for the most part, you can get away with these changes. One note on indexes: up through SQL Server 2017, creating a non-clustered index on a table with a clustered columnstore index is an offline operation, which makes it much harder to deploy new indexes on those tables after users have started accessing them.
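As a quick sketch of an online index operation (the table and index names are made up):

    -- Enterprise Edition lets us build or rebuild indexes online, which keeps
    -- the table available to readers and writers for most of the operation.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
        ON dbo.Orders (CustomerID)
        INCLUDE (OrderDate)
        WITH (ONLINE = ON);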

Another thing Enterprise Edition historically offered over Standard is table partitioning, though as of SQL Server 2016 SP1, partitioning is available in Standard Edition as well. Partitioning large tables can be great for maintenance operations: you can delete old data by shunting a partition off to another table and then dropping that table. Or, starting with SQL Server 2016, you can simply truncate a partition, provided you could normally truncate that table.
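Here is a hedged sketch of both cleanup techniques, with invented object names:

    -- Option 1: switch the oldest partition out to a staging table (which must
    -- match the source table's schema), then drop the staging table.
    ALTER TABLE dbo.SalesHistory
        SWITCH PARTITION 1 TO dbo.SalesHistory_Staging;
    DROP TABLE dbo.SalesHistory_Staging;

    -- Option 2 (SQL Server 2016 and later): truncate the partition directly.
    TRUNCATE TABLE dbo.SalesHistory WITH (PARTITIONS (1));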

There are other advantages as well to using Enterprise Edition over Standard Edition, but I think those are the two biggest when it comes to making database deployments easier.

Use Read Committed Snapshot Isolation

By default, SQL Server uses pessimistic locking, meaning that readers can block writers, writers can block readers, and writers can block writers. In most circumstances, you can switch from Read Committed to Read Committed Snapshot Isolation (RCSI) and gain several benefits. RCSI has been in the product since SQL Server 2005, which is certainly long enough to vet the code, and Oracle has defaulted to an optimistic concurrency model for as long as I can remember.
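Turning it on is a one-liner, though it needs a brief moment of exclusive access to the database, so schedule it for a quiet window (the database name here is hypothetical):

    -- Switch the database's default behavior to Read Committed Snapshot.
    -- ROLLBACK IMMEDIATE kicks out open transactions so the change can complete.
    ALTER DATABASE CurrentAppDb
        SET READ_COMMITTED_SNAPSHOT ON
        WITH ROLLBACK IMMEDIATE;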

The downtime-reducing benefit of RCSI is that big write operations won't block your readers: end users see the old version of the data until your transactions commit, so inserts, updates, and deletes don't make readers wait. Writers can still block other writers, though, so you will want to batch your operations: open a transaction, perform a relatively small operation, and commit that transaction. I will go into batching in some detail in a later post in the series, so my intent here is just to prime you for it and emphasize that Read Committed Snapshot Isolation is great.
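As a preview, here is a minimal sketch of that batching pattern, with a made-up table, date, and batch size:

    -- Delete old rows in small transactions so each one commits quickly and
    -- holds locks for only a short time.
    DECLARE @RowsAffected INT = 1;

    WHILE @RowsAffected > 0
    BEGIN
        BEGIN TRANSACTION;

        DELETE TOP (5000)
        FROM dbo.OrdersArchive
        WHERE OrderDate < '20150101';

        SET @RowsAffected = @@ROWCOUNT;

        COMMIT TRANSACTION;
    END;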

Use Stored Procedures

There is a longstanding debate around whether to use stored procedures or ad hoc SQL in applications. My general preference is to use stored procedures, but I do understand that there are people who prefer ad hoc SQL queries. In this case, however, stored procedures have a huge benefit that you won’t get from ad hoc SQL: the interface effect.

When done right, stored procedures act as an interface layer between the database and the application. You define the input parameters and build expectations about the output, but your application code doesn’t care how you get from inputs to outputs. Particularly with complicated database operations, having a stored procedure gives you the ability to tune database code without changing application code.
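A small, hypothetical example of that interface (none of these names come from a real system):

    -- The application only knows the procedure name, its parameters, and the
    -- shape of its result set; how we get from inputs to outputs is ours to tune.
    CREATE OR ALTER PROCEDURE dbo.Customer_GetRecentOrders
        @CustomerID INT,
        @StartDate DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT o.OrderID, o.OrderDate, o.TotalDue
        FROM dbo.Orders o
        WHERE o.CustomerID = @CustomerID
            AND o.OrderDate >= @StartDate;
    END;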

Another advantage to having all of your database code in stored procedures is that you can change the underlying tables without changing application code. This makes table modifications much easier to pull off and helps ensure that you make the change everywhere you need to make it.

If you use weakly typed retrieval code (for example, reading from a DataSet and referencing columns by name), you can add new columns to a stored procedure's result set without breaking existing code. This lets you separate the database work from the application work, so that when the application is ready to use that new column, it will already be there. Granted, this is something you can do with ad hoc queries as well, so it's not a benefit in and of itself. If you use a strongly typed process which expects an exact set of columns and only that set of columns, you can paint yourself into a corner by limiting your ability to act. I'd recommend avoiding that strategy, as it reduces the number of options you have for managing your code.

One last advantage of using stored procedures is that managing backward compatibility is clearer: you can have different versions of procedures which provide different behavior depending upon which version(s) of the code are still running. For example, V1 of a procedure might insert values for 8 columns into a table. Then we add a nullable 9th column that our new code will be able to use. We can create a new version of the procedure which accepts 9 parameters and writes all 9 columns. But during our blue-green window, both sets of code will be inserting rows, so we need both versions of the procedure at the same time.
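Here's a hedged sketch of that idea, trimmed down to a couple of columns and with invented names:

    -- V1 (already in production) inserts the original columns and keeps serving
    -- blue code. V2 adds the new, nullable column for green code; both versions
    -- exist during the release window.
    CREATE OR ALTER PROCEDURE dbo.Customer_Insert_v2
        @FirstName NVARCHAR(50),
        @LastName NVARCHAR(50),
        @LoyaltyTier TINYINT = NULL  -- stand-in for the nullable 9th column above
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Customer (FirstName, LastName, LoyaltyTier)
        VALUES (@FirstName, @LastName, @LoyaltyTier);
    END;

The old version sticks around until the blue code is fully retired, at which point it becomes a post-release cleanup item.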

Use Database Tests

I pointed out the benefits of database tests in the prior post, so I won’t belabor them here. Tests give you the ability to ensure that your database code works at different steps in the deployment process, giving you a safety net. Writing test code is alternately complicated and tedious, but any bug your tests find before release is a bug your users didn’t find after release.
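If you want something concrete, here is a minimal sketch using tSQLt, one popular unit testing framework for SQL Server; the procedure under test is the hypothetical one from the previous section:

    -- Create a test class and a test that exercises the insert procedure
    -- against a faked (empty, constraint-free) copy of the table.
    EXEC tSQLt.NewTestClass @ClassName = N'CustomerTests';
    GO
    CREATE OR ALTER PROCEDURE CustomerTests.[test Customer_Insert_v2 adds one row]
    AS
    BEGIN
        EXEC tSQLt.FakeTable @TableName = N'dbo.Customer';

        EXEC dbo.Customer_Insert_v2
            @FirstName = N'Test',
            @LastName = N'Person',
            @LoyaltyTier = 1;

        DECLARE @ActualRows INT = (SELECT COUNT(*) FROM dbo.Customer);
        EXEC tSQLt.AssertEquals @Expected = 1, @Actual = @ActualRows;
    END;
    GO
    EXEC tSQLt.Run @TestName = N'CustomerTests';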

Use Views (Maybe)

Views are a tricky call for me. On the one hand, views can act as an interface layer and give us the ability to abstract away table changes, similar to stored procedures. You can also tune views without changing application code. That's the positive side when it comes to reducing deployment downtime. On the other side of the ledger, nested views can lead to massive performance problems, which makes views a risky proposition for an application. If you never join views to views and never nest views, you get some of the advantages of stored procedures; but at that point, why not use stored procedures and get all of the advantages?
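For completeness, the view-as-interface version of the earlier hypothetical example looks like this:

    -- A view that hides the underlying table structure from the application.
    -- If dbo.Customer changes shape, we adjust the view and the application
    -- keeps selecting from dbo.vCustomer as before.
    CREATE OR ALTER VIEW dbo.vCustomer
    AS
    SELECT c.CustomerID, c.FirstName, c.LastName
    FROM dbo.Customer c;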

Conclusion

In this post, we looked at several ways that you can make your life better when working to minimize downtime. The goal of each of these is to increase the number of developer degrees of freedom you have. The more degrees of freedom you have, the less likely you are to get stuck in a situation where the only solution is to accept an extended amount of downtime.

In the next post, I will cover working with stored procedures and how we can perform common stored procedure actions with near-zero downtime.

TriPASS Incorporation Update: The Board

It's been a month, so I wanted to provide an update on incorporation. After our articles of incorporation went up in flames (the opposite of going down in flames), I opened a call for board members among community members. I was able to get enough volunteer (and quasi-volunteer) support to nominate a full slate, but no additional volunteers beyond that. Had more people been interested in running than there were positions, I would have held individual votes, but with just five people, I decided to hold a single up-or-down vote on the entire slate.

As of Sunday night when I closed the polls, the final vote was 28-0 in favor of the board.

Congratulations to the new Triangle SQL Server Users Group board:

President – Me. Because I’m a control freak.
Vice President of Marketing – Rick Pack ( blog | twitter )
Treasurer – Mala Mahadevan ( blog | twitter )
Vice President of Membership – Tom Norman ( blog | twitter )
Vice President of Programs – Tracy Boggiano ( blog | twitter )

Now that we have our board, we can begin to take the next official steps toward incorporation, and I’ll have an update on that when someone tells me what they are. That’s my executive leadership in action.

Approaching Zero: Assumptions and Flow

This is part two in a series on near-zero downtime deployments.

As we begin to approach zero-downtime database deployments, I’m going to lay out a few assumptions which will make your life much easier. Then, I will cover the idea of the release flow.

Key Assumptions

There are three major things which will greatly increase the likelihood of successful (near) zero-downtime deployments: source control, automated releases, and continuous integration. Let’s look at each in order, starting with source control.

Source Control

You really want to have source control in place for a number of reasons. It’s not absolutely required, but it is quite helpful. Source control is one of your safety nets as you make changes: you have a history of what the code looked like before and have the possibility of reverting code in the event of failure. When it comes to minimizing downtime, a stable source control repository helps us put into place the other two key assumptions.

What should you use for source control? Pick whatever you want. Git is pretty much the standard these days, but if you want to use Mercurial, SVN, Team Foundation Version Control, or pretty much any other source control product, go for it. The advantage to Git is that it’s pretty easy to hook into a Git repo and automate processes, but at the end of the day, a source control system is primarily a place where you put code so that you don’t lose it.

A Release Process, Preferably Automated

The next assumption is that you have a release process in place as opposed to “developers push F5 to run scripts whenever they remember to.”

Like source control, that you have something is more important than what you have. Maybe you use Azure DevOps (née Team Foundation Server). Maybe it's Octopus Deploy or Jenkins or CruiseControl or something else. Maybe you wrote your own build process with Ant (or NAnt) or MSBuild or just a set of PowerShell scripts. I'm not here to shame you into using a particular product; you do you. But you really need a release process. Ideally, this is an automated release process where it takes zero (or just a few) clicks to deploy code. If you have a person running scripts manually, that introduces a lot of uncertainty into the process: did that person run all of the scripts? In the right order? At the right time? A manual process also tends to be much slower than an automated one, where you at least have the possibility of concurrency.

You also have choices about how your release scripts end up in the right place. There are a couple of alternatives here: automatically generate deployment scripts or hand-write deployment scripts. SQL Server Data Tools database projects, for example, can generate a diff file for schema changes. One warning with that, however, is that the auto-generated diff file will not necessarily avoid downtime. I find it really useful for relatively simple operations: creating and modifying stored procedures, adding new tables, that kind of thing. But my preference is to hand-write deployment scripts, as I’m less likely to do something like drop and recreate an index offline because the casing changed. That comes with the downside of being quite a bit more effort, though, so maybe a combination of the two is best.
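When I do hand-write deployment scripts, I make them guarded and re-runnable, something like this (the object names are invented); this also matters for the continuous integration point below:

    -- A re-runnable deployment step: only add the column if it isn't already
    -- there, so the script can run repeatedly in QA and once more in production.
    IF NOT EXISTS
    (
        SELECT 1
        FROM sys.columns c
        WHERE c.object_id = OBJECT_ID(N'dbo.Orders')
            AND c.name = N'LoyaltyTier'
    )
    BEGIN
        ALTER TABLE dbo.Orders ADD LoyaltyTier TINYINT NULL;
    END;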

Once you have a release process in place, you can get to the last key assumption.

Continuous Integration

This is the least important of the three but still plenty important in its own right. You should have a QA environment where your continuous integration process can keep deploying scripts. That way, you can make sure that as you develop during a sprint (or project or whatever you use), you can still deploy all of your code successfully. This means your deployment process has to be re-runnable.

Any of the processes above, even rolling your own PowerShell scripts, can be used in a continuous integration scenario, but some of the tools are better suited for it.

Now that we have these key assumptions in place, let’s talk about the flow of operations.

Going with the Flow

We have an environment with source control, an automated release process, and continuous integration. With that structure in mind, let’s talk about the timings of release as phases. There are four phases in total: pre-release, database release, code release, database post-release.

Pre-Release

When we start pre-release, we have a stable system. We know what the code looks like and what the database objects look like. The pre-release process starts…well, whenever you want it to start. Unlike the rest of the phases, pre-release has no fixed starting point.

I like to use pre-release for three things:

  • Scheduling things that will take a long time. For example, building a new index (online, to prevent blocking as much as I can) on a big table or migrating data from one table to another.
  • Making changes which need to happen before the rest of the process. This might be a preparatory step like shutting off a back-end service temporarily in order to make some changes to it.
  • Phase 1 of a very complex set of database changes where I need multiple steps. We’re going to see an example of this in my case study as part 8 in this series. Yeah, it’s a long series.

One of the key considerations here is that during the pre-release process, we typically do not want to degrade the customer experience. Keep those changes behind the scenes.

After pre-release is over, we can move to the database release phase.

Database Release

The database release phase is the first “primary” phase. It usually starts on a schedule, maybe 2 PM on a Wednesday or maybe “every day at 9 AM, 1 PM, 6 PM, and 10 PM” for more mature shops. Depending upon how much of an effect our release process normally has on end users, we might alert them that we expect to see a degradation in services starting at this point.

This phase of the release has us push out our database changes. This can involve creating or altering database objects but will not involve dropping existing objects.

Our database changes should support the blue-green deployment model. At this point in the process, all of the application code is “blue”—that is, the current production code. Our procedure changes need to be able to support that code without breaking. If we need to drop a column from a stored procedure, for example, we would not want to do it here. If we need to add a column to a stored procedure, we might do it here as long as it doesn’t break the calling code.

Once we’ve taken care of these non-disruptive database changes, we can move to the next phase of the process: the application code release.

Application Code Release

Here is where we get to the classic stateless blue-green deployment scenario. Ideally, we have multiple application servers running or can spin up new ones easily (like in AWS or Azure). At the risk of oversimplification, let's say that we have 4 on-prem application servers. When we start out, all of our servers have the blues.


I cover the waterfront / Watchin’ the ship go by

We take two of the four servers out of our load balancer, so they become grey.


I could see everybody’s baby / But I couldn’t see mine

While they’re offline, we load the new code on those servers and bring them back into the load balancer. They’re running new code, which we represent as green.


I could see the ships pullin’ in / To the harbor

At this point, we have both sets of code running concurrently: some users are hitting old code and others are hitting new code. Our database needs to support both versions of the code for long enough for us to know that this release will be successful.

Once we’re confident that the new code is working as expected, we bring the remaining blue servers down.


I could see the people / Meeting their loved one

We then load the new code on our remaining servers and re-enlist them into the load balancer.


Shakin’ hand / I sat there
So all alone / Coverin’ the waterfront

Once this happens, the customer experience degradation should be over: users should now continue on, blissfully unaware that a bunch of stuff changed around them.

We aren’t out of the woods yet, though, because there’s still one more phase to go, and it’s my favorite.

Database Post-Release

The database post-release phase is where we get to do all that destructive stuff: dropping obsolete tables, dropping columns, dropping everything. Burn the house down.

At this point, we don’t have any code relying on old database objects, so they’re fair game. Now, as we go through this series, I’ll add some important nuance to the slightly overboard version above. But at its core, post-release is about cleaning up the old stuff.

Database post-release begins after the code release finishes and it goes on until you are completely done. This is also non-disruptive in the sense that we’re not running anything which should impede end users or block tables for an extended time.
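A tiny, hypothetical example of what those cleanup scripts tend to look like (the SQL Server 2016+ IF EXISTS variants keep them re-runnable):

    -- Post-release cleanup: nothing references the old objects anymore, so drop them.
    DROP PROCEDURE IF EXISTS dbo.Customer_Insert;               -- superseded by _v2 during the release
    DROP INDEX IF EXISTS IX_Orders_LegacyLookup ON dbo.Orders;  -- replaced by a better index
    ALTER TABLE dbo.Orders DROP COLUMN IF EXISTS LegacyStatusCode;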

Conclusion

In today’s post, we covered some of the key assumptions and process behind (near) zero-downtime deployments. With these in mind, we’ll spend the rest of the series working through different scenarios and seeing what we can and cannot do. Because that depends upon a few factors, our next post will look at ways we can make our lives simpler.

PolyBase Revealed: CDH Quickstart And PolyBase

All of my experience with Hadoop has been with the Hortonworks Data Platform. But to make sure that the stuff I’m advising in PolyBase Revealed works for Cloudera too, I grabbed the Cloudera QuickStart VM and took it for a spin. The two products are pretty similar overall but there are a few things I ended up figuring out along the way.

YARN Moved Out From Here, Buddy

My experience with Hortonworks has me thinking of port 8050 whenever someone asks for the YARN resource manager port. But if you go knocking on port 8050 on the Cloudera QuickStart VM, you get nothing back. That's because YARN is listening on port 8032 by default.

If you want to check your installation to see what your YARN resource manager address is, there's an easy way to do it. Navigate to /etc/hadoop/conf and run the following: grep -n1 yarn.resourcemanager.address yarn-site.xml. That will return three lines, and the third line contains your port.

Getting the YARN port in nineteen easy steps.
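Where this matters for PolyBase: the YARN resource manager address (along with the name node) ends up in the external data source definition. Here's a sketch with a made-up IP address, assuming the default CDH name node port of 8020:

    -- PolyBase external data source pointing at the Cloudera QuickStart VM.
    -- RESOURCE_MANAGER_LOCATION enables predicate pushdown via MapReduce jobs,
    -- which is why the YARN port matters.
    CREATE EXTERNAL DATA SOURCE ClouderaQuickstart WITH
    (
        TYPE = HADOOP,
        LOCATION = 'hdfs://192.168.100.125:8020',
        RESOURCE_MANAGER_LOCATION = '192.168.100.125:8032'
    );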

Host Bodies

Cloudera's QuickStart VM tries to make things easy for you. One area where it does that is auto-generating /etc/hosts. If you read the hosts file, you can see a comment at the top warning that the file is regenerated automatically:

And that’s when the police said that the ping was coming from inside the house!

This setup is all well and good if you're just playing with Cloudera locally, but if you want to access CDH remotely (especially when working with PolyBase), we need to make a couple of changes. First, we need to set a valid IP address for quickstart.cloudera and quickstart, not just 127.0.0.1. If you need help figuring out which IP address to use, check the output of ifconfig, which will tell you what IP addresses the VM has registered.

But we aren’t done yet. Like the comment states, Cloudera regenerates /etc/hosts each time you reboot the machine or restart the QuickStart services. To avoid this, we will need to open up /etc/init.d/cloudera-quickstart-init and comment out the line which calls cloudera-quickstart-ip. Then, you manage /etc/hosts yourself. You could also modify the quickstart IP script if you’d still like Cloudera to do the work of updating for you, but because I tend to set static IP addresses for VMs, I’m okay managing it myself.

My Data Node Has a First Name, It’s O-S-C-A-R

Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini.

Because PolyBase needs to have direct access to the data nodes, having a node called localhost is a bit of a drag.

Conclusion

Today, we looked at a few non-obvious ways in which the Cloudera Distribution of Hadoop QuickStart VM differs from a Hortonworks Data Platform installation. There are other differences as well, but these were a few of the most apparent ones.