TIL: Installing Spark On Windows

I’m going through Justin Pihony’s Pluralsight course on Spark, and he covers installation.  I’m glad that he did, because there are a couple of nuances involved.

First, you can download Spark without having Hadoop installed, but if you do that, you’ll want to follow Sangeet Chourey’s instructions.  If you’re running 64-bit Windows, you’ll want to get his version of winutils (which is 106 KB); if you’re running 32-bit Windows, you’ll need a different executable, which is something like 43 or 46 KB.

If you do not follow Sangeet’s instructions, you’ll end up with the error described in SPARK-10528, and because it’s technically not a Spark error, the Spark team can’t do much to help you out there.
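For reference, Sangeet’s instructions boil down to something like the following, run from an administrator command prompt.  I’m assuming here that winutils.exe lives in C:\winutils\bin; the exact location is up to you, as long as HADOOP_HOME points at its parent directory:

mkdir C:\tmp\hive
setx HADOOP_HOME C:\winutils
C:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive

The chmod call is the part that matters for SPARK-10528:  Spark’s Hive support wants \tmp\hive to be writable, and winutils is what lets you set those permissions on Windows.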

I also recommend going into your Spark\conf\ directory and opening log4j.properties.template.  Change INFO and WARN to ERROR and save the file as log4j.properties.  That will turn logging down to errors only, which gets rid of a lot of the noise Spark inundates you with when you hit the console.
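Once you’ve made the change, the relevant line should look something like this (the template’s exact contents vary a bit from version to version):

log4j.rootCategory=ERROR, console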

TIL: Hive And C#

I work in a .NET shop, meaning that our developers focus primarily on writing C# code.  I’m introducing Hadoop more and more into this environment, but I know it’s an easier road if I can reduce the number of unfamiliar languages.

To that end, I was happy to learn that Randy Gelhausen has a project which integrates C# code (either Microsoft C# or Mono) with Hive.
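I haven’t dug into the project’s internals yet, so take the following as a sketch of the general idea rather than the project’s actual syntax.  Hive has long been able to stream rows through an external process using TRANSFORM, and a C# executable (run through Mono on the cluster) can serve as that process.  The table, columns, and executable here are all made up:

ADD FILE MyTransform.exe;

SELECT TRANSFORM (id, amount)
USING 'mono MyTransform.exe'
AS (id, amount_category)
FROM sales;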

My function-writing skills in Hive are weak, but the knowledge that this exists is pretty cool.

TIL: Installing Jupyter On Windows

Installing Jupyter on Windows isn’t too difficult.  Here is a product-by-product guide.  Each program is just simple next-next-next clicking, so I’m not going to include screenshots or step-by-step installation instructions.

Install R

There are two versions of R available, CRAN R and Microsoft R Open.  You can grab either of them at the provided links.  If you do install Microsoft R Open, you should probably also install the mathematical libraries.

Install R Studio

R Studio isn’t technically required, but it’s extremely useful to have a good IDE installed.  Jupyter is fantastic for giving people a repeatable method of running your code, but you’ll still want something you can use for quick tests.  Grab R Studio.  You can also try R Tools for Visual Studio.

Install Anaconda

The easiest way to install Jupyter is to install Anaconda.  If you don’t have Python installed on your machine already, you can grab and install this full-featured suite.
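If you want to make sure the installation succeeded, open a console and check that conda responds:

conda --version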

Install Jupyter

Once you have Anaconda installed, you can install Jupyter by running the following command at your console:

conda install jupyter

Install R Support

The last step is to include R support in Jupyter.  This is pretty simple:

conda install -c r ipython-notebook r-irkernel

Once you’ve done all of these, you have Jupyter installed!  To kick off Jupyter, run the following:

jupyter notebook

If you want to test your Jupyter installation, I have a couple of notebooks available that I’m going to use in my Polyglot .NET Meetup tonight:

  1. Basic R Test
  2. Wake County Restaurant Inspections

TIL: Power BI

Last night, I had my first hands-on experience with Power BI.  I’ve read quite a bit about it (especially given all of the Curated SQL articles on the topic), but this was my first chance to try it myself, thanks to the newly-formed Power BI user group.

Using data from the first lab (an Access database with US sales, a set of CSVs with international sales, and an Excel spreadsheet with population data), I was able to put together a couple of simple dashboard components:

[Image: DashboardInAnHour.png]

In this simplistic dashboard, I have a stacked bar chart with populations from five countries over time.  Of interest is that the German population has dropped slightly between 1999 and 2014.  I also have a couple of text widgets with total revenue and total number of units sold in the US, followed by a pie chart (shaddup) showing revenue across the five foreign countries in this data set.

I have a long way to go before I can pretend to be proficient in Power BI, but considering that I did this in about 20 minutes while hacking away at a user group, I’m pretty happy with how gentle the learning curve is.

TIL: RunningValue In SSRS

I created a new SQL Server Reporting Services report for the first time in a long while and ran into a problem:  I’m re-implementing an Excel report in SSRS and want to get its alternating row colors right.  The report uses a Matrix, so a “row” is a little more complicated than in a normal table.

To do this, I read Ghanesh Prasad’s article on the topic.  Of particular interest was his use of the RunningValue function:

=RunningValue(Fields!City.Value, CountDistinct, Nothing)

RunningValue is a function which, surprisingly, I never used in my prior SSRS days.  It looks extremely useful for building running totals, sub-group calculations, and the like.
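The payoff for my alternating-color problem is to feed that running count into a background color expression.  Something like the following (the field name and colors are just examples) goes into the BackgroundColor property of the textboxes in the row:

=IIf(RunningValue(Fields!City.Value, CountDistinct, Nothing) Mod 2 = 0, "Gainsboro", "White")

Because the running count increments once per distinct city, every other “row” of the matrix gets the alternate color.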

TIL: Kafka Queueing Strategy

I had the privilege to sit down with a couple of Hortonworks solution engineers and discuss a potential Hadoop solution in our environment.  During that time, I learned an interesting strategy for handling data in Kafka.

Our environment uses MSMQ for queueing.  We add items to a queue, and then consumers pop items off of the queue and process them.  The advantage to this is that you can easily see how many items are currently on the queue, and multiple consumer threads can interact while playing nice with the queue.  Queue items last for a certain amount of time; in our case, we typically expire them after one hour.

With Kafka, however, queues are handled a bit differently.  The queue manager does not know or care about which consumers read what data when (making it impossible for a consumer to tell how many items are left on the queue at a certain point in time), and the consumers have no ability to pop items off of the queue.  Instead, queue items fall off after they expire.

Our particular scenario has some web servers which need to handle incoming clicks.  Ideally, we want to handle that click and dump it immediately onto a queue, freeing up the web server thread to handle the next request.  Once data gets into the queue, we want it to live until our internal systems have a chance to process that data—if we have a catastrophe and items fall off of this queue, we lose revenue.

The strategy in this case is to take advantage of multiple queues and multiple stages.  I had thought of “a” queue, into which the web server puts click data and out of which the next step pulls clicks for processing.  Instead, a better strategy (given what we do and our requirements) is to put the data onto a queue immediately and have consumers pull from that queue, perform some internal “enrichment” processes, and put the enriched data onto a second queue.  That second queue collects data, and an occasional batch job pulls it off to write to HDFS.  This way, you don’t take the hit of streaming rows into HDFS one at a time.

As far as maintaining data goes, we’d need to set our TTL to last long enough that we can survive an internal processing engine catastrophe, but not so long that we run out of disk space holding messages.  That’s a fine line, and one we’ll need to figure out as we go along.
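Here is a sketch of what the enrichment hop might look like in code, using the kafka-python client.  The topic names, broker address, and enrich function are all hypothetical stand-ins, not anything we’ve actually built:

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('raw-clicks',
                         bootstrap_servers='broker:9092',
                         group_id='click-enrichers')
producer = KafkaProducer(bootstrap_servers='broker:9092')

# Pull raw click events off the first queue, enrich them, and push them
# onto the second queue.  A separate batch job drains 'enriched-clicks'
# into HDFS on its own schedule.
for message in consumer:
    enriched = enrich(message.value)  # enrich() is a stand-in for our internal logic
    producer.send('enriched-clicks', enriched)

The TTL portion maps to Kafka’s per-topic retention settings (retention.ms and friends), which is where the disk space trade-off shows up.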

Summing things up, Kafka is quite a different product from MSMQ, and a workable architecture will look different depending upon which queueing product you use.

Reorganize Columnstore Indexes

I have created a script to help figure out when to reorganize columnstore indexes in SQL Server 2016.

Note that this script requires SQL Server 2016 (or later), because the database engine team made some great changes to columnstore indexes, allowing us to use REORGANIZE to clear out deleted rows and compact row groups together, in addition to its previous job of marking open delta stores as available for compression.
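For anyone who hasn’t seen the 2016 syntax, the command itself looks like the following (the index and table names are hypothetical); the COMPRESS_ALL_ROW_GROUPS option is what forces open delta stores to compress:

ALTER INDEX CCI_FactSales ON dbo.FactSales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);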

The code is available as a Gist for now, at least until I decide what to do with it.  Comments are welcome, especially if I’m missing a major reorganize condition.

Incidentally, with CTP 3.3, I’ve noticed that there’s no reason to prefer index rebuilds over index reorganizations, at least in my environment.  It’s possible that there are plenty of environments in which it makes sense to do regular rebuilds, but in my tests, reorganization has been more efficient at compacting row groups together, and it’s an online operation to boot.