GitPitch: Revamping My Slide Decks

Over the past few months, I’ve started transitioning my slide decks to GitPitch.  I’ve used reveal.js for years and enjoyed being able to put together presentations in HTML format.  I can run these slides in any browser (or even on a phone) and don’t have to worry about PowerPoint misbehaving when I switch to presentation mode.  I also get the ability to transition up and down as well as left and right, which lets me “drill into” topics.  For example, from my Genetic Algorithms talk:

If I want to go into a topic, I can slide down; if I need to skip it for time, I can bypass the section.

GitPitch also uses Reveal, so many of the skills that I’ve built up still work for me.  But there are a few compelling reasons for me to migrate.

Reasons For Moving To GitPitch

So why did I move to GitPitch?  It started with an offer to Microsoft MVPs of a free year of GitPitch Pro.  In the interest of transparency, I am taking advantage of that free year; after the year is up, I’m going to start paying for the service, but I did get it for free.

So why did I make the jump?  There are a few reasons.

Background Images

One of the best features of GitPitch (and a reason I went with Pro over the Free edition) is the ability to use a background image and set the transparency level.  This lets me add flavor to my slide decks without overwhelming the audience:

There are a number of good sources for getting free images you can use for these purposes, including Unsplash, Creative Commons Search, and SkyPixel for drone photos.  I’m also working to introduce my own photography when it makes sense.

Markdown

Reveal.js uses HTML to build slides, but GitPitch uses Markdown.  Markdown is a fairly easy syntax to pick up and is a critical part of Jupyter Notebooks.

To build the slide above, the markdown looks like this:

---?image=presentation/assets/background/Background_13.jpg&size=cover&opacity=15

### Outlook

|Outlook|Yes|No|P(Yes)|P(No)|
|-------|---|--|------|-----|
|Sunny|2|3|2/9|3/5|
|Overcast|4|0|4/9|0/5|
|Rainy|3|2|3/9|2/5|
|Total|9|5|100%|100%|

No HTML, and it’s pretty easy to make changes.  Because it’s a text-based format, major changes are also pretty easy to handle with find-and-replace, something that’s hard to do with PowerPoint.

Math

Recently, I did a presentation on Naive Bayes classifiers.  To display math, we can use MathJax.  Here’s an example slide:

---

### Applying Bayes' Theorem

Supposing multiple inputs, we can combine them together like so:

`$P(B|A) = \dfrac{P(x_1|B) * P(x_2|B) * ... * P(x_n|B)}{P(A)}$`

This is because we assume that the inputs are **independent** from one another.

Given `$B_1, B_2, ..., B_N$` as possible classes, we want to find the `$B_i$` with the highest probability.

And here’s how it looks:

I’m definitely not very familiar with MathJax, but there are online editors to help you out.
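
For reference, here is the general form of the naive Bayes classifier written out in MathJax-friendly LaTeX.  This is my own sketch of the standard textbook formula rather than a slide from the deck, so treat the notation as illustrative:

```latex
% General naive Bayes formula for classes B_1, ..., B_N and features x_1, ..., x_n.
% The features are assumed to be independent given the class.
P(B_i \mid x_1, \ldots, x_n) = \frac{P(B_i) \prod_{j=1}^{n} P(x_j \mid B_i)}{P(x_1, \ldots, x_n)}

% The denominator is the same for every class, so classification just picks
% the class with the largest numerator.
\hat{B} = \arg\max_{B_i} P(B_i) \prod_{j=1}^{n} P(x_j \mid B_i)
```

To drop either formula onto a slide, I’d wrap it in the same backtick-and-dollar-sign delimiters shown in the slide source above.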

Code Samples

Another cool feature is the ability to include code samples:

What’s really cool is the ability to walk through step by step:

---

```r
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)

set.seed(1773)
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)
#plot(nb, ask=TRUE)

iris.output <- cbind(iris.test,
                     prediction = predict(nb, iris.test))
table(iris.output$Species, iris.output$prediction)
confusionMatrix(iris.output$prediction, iris.output$Species)
```

@[1-4](Install the naivebayes package to generate a Naive Bayes model.)
@[5-8](Install the caret package to generate a confusion matrix.)
@[12-14](Pseudo-randomize the data set. This is small so we can do it by hand.)
@[16-17](Generate training and test data sets.)
@[19](Building a Naive Bayes model is as simple as a single function call.)
@[22-25](Generate predictions and analyze the resulting predictions for accuracy.)

I haven’t used this as much as I’d like, since my talks historically have not included much code; I save that for the demos.  But the ability to walk through code a section at a time makes it much easier to present code directly.

Up On GitHub

Something I wasn’t a huge fan of at first, but grew to be okay with, is that the markdown files and images are stored in your GitHub repo.  For example, my talk on the data science process is just integrated into my GitHub repo for the talk.  This has upsides and downsides.  The biggest upside is that I don’t have to store the slides anywhere else; I’ll describe the downside in the next section.

Tricky Portions

There are a few things that I’ve had to work out in the process.

Handling Updates

Testing things out can be a bit annoying because you have to push changes to GitHub first.  Granted, this isn’t a big deal: commit and push, test, commit and push, test, rinse and repeat.  But previously I could save the HTML file, open it locally, and test the results, which was a smoother process.  Now that I’ve built up a couple of decks, though, I have patterns I can follow, and that reduces the pain quite a bit.

Online Only (Mostly)

There is an offline slideshow mode in GitPitch Pro and a desktop version, but I haven’t really gotten into those.  It’s easier for me just to work online and push to my repos.

When presenting, I do need to be online to grab the presentation, but that’s something I can queue up in my hotel room the night before if I want—just keep the browser window open and I can move back and forth through the deck.

Down Transitions And Background Images

One thing I’ve had to deal with when migrating slide decks is that, although I can still use the same up-down-left-right mechanics from Reveal.js, semi-transparent background images tend to bleed together when transitioning downward.  I’ve dealt with the problem by going back to a “classic” left-right-only slide transition scheme.

Wrapping Up

I’ve become enough of a fan of GitPitch to decide that I’m going to use it as my default.  I think the decks are more visually compelling:  for example, compare my original data science process slide deck with my GitPitch version.  As far as content goes, both decks tell the same story, but I think the GitPitch version holds the audience’s interest a bit better.  As I give these talks in the new year, I’ll continue transitioning to GitPitch, and new talks will go there by default.


TIL: Docker

Last night, I went to a local .NET User Group meetup and got my first taste of Docker.

In my case, I ended up running on Elementary OS rather than Windows, but working through the tutorials was a good experience.  In the end, I installed Solr and was able to load a document for indexing.  Installing nginx was also easy.

If you are interested in installing Docker in a Windows environment, start with Mano Marks’s article on the topic. You can also look at the Docker Tools for Visual Studio.

TIL: Kafka Queueing Strategy

I had the privilege to sit down with a couple of Hortonworks solution engineers and discuss a potential Hadoop solution in our environment.  During that time, I learned an interesting strategy for handling data in Kafka.

Our environment uses MSMQ for queueing.  What we do is add items to a queue, and then consumers pop items off of the queue and consume them.  The advantage to this is that you can easily see how many items are currently on the queue, and multiple consumer threads can interact with the queue without stepping on one another.  Queue items last for a certain amount of time; in our case, we typically expire them after one hour.

With Kafka, however, queues are handled a bit differently.  The queue manager does not know or care about which consumers read what data when (making it impossible for a consumer to tell how many items are left on the queue at a certain point in time), and the consumers have no ability to pop items off of the queue.  Instead, queue items fall off after they expire.

Our particular scenario has some web servers which need to handle incoming clicks.  Ideally, we want to handle that click and dump it immediately onto a queue, freeing up the web server thread to handle the next request.  Once data gets into the queue, we want it to live until our internal systems have a chance to process that data—if we have a catastrophe and items fall off of this queue, we lose revenue.

The strategy in this case is to take advantage of multiple queues and multiple stages.  I had thought of “a” queue, into which the web server puts click data and out of which the next step pulls clicks for processing.  Instead of that, a better strategy (given what we do and our requirements) is to immediately put the data into a queue and then have consumers pull from the queue, perform some internal “enrichment” processes, and finally put the enriched data back onto a new queue.  That new queue will collect data and an occasional batch job pulls it off to write to HDFS.  This way, you don’t take the hit of streaming rows into HDFS.  As far as maintaining data goes, we’d need to set our TTL to last long enough that we can deal with an internal processing engine catastrophe but not so long that we run out of disk space holding messages.  That’s a fine line trade-off we’ll need to figure out as we go along.
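
To make the two-stage idea a bit more concrete, here is a minimal sketch of what the middle (enrichment) stage could look like in Python with the kafka-python library.  The topic names, broker address, and enrichment logic are all hypothetical placeholders rather than anything from our actual environment:

```python
# Rough sketch of the enrichment stage: read raw clicks from one topic,
# enrich them, and publish the results to a second topic that a periodic
# batch job later drains into HDFS.  Names and settings are hypothetical.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "clicks-raw",                          # topic the web servers write to
    bootstrap_servers="localhost:9092",
    group_id="click-enrichment",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enrich(click):
    """Stand-in for our internal enrichment process."""
    click["enriched"] = True
    return click

for message in consumer:
    # Enriched records go onto a second topic; writing to HDFS happens later,
    # in bulk, rather than streaming individual rows in.
    producer.send("clicks-enriched", enrich(message.value))
```

The TTL piece lives in the topic retention settings on the broker side rather than in consumer code like this, which is where the disk space trade-off comes in.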

Summing things up, Kafka is quite a different product than MSMQ, and a workable architecture is going to look different depending upon which queue product you use.

TIL: Bots

Last night, Jamie Dixon (whose book you should buy) talked about his experience at Build this year.  His main takeaway is that Microsoft is pushing their Bot Framework pretty hard.  Jamie showed how to create a stock lookup bot and deploy it to Slack using F# code and an Azure account.

About a month ago, I went to F8, and my main takeaway from that conference was that Facebook is pushing bots on their Messenger platform pretty hard.

Based on my (extremely) limited knowledge of both, it seems that the Facebook bot platform is a bit easier to deal with, as there’s a user interface for writing messages and tokenizing, but right now, Microsoft’s platform is a bit more customizable and developer-friendly.  What’s particularly interesting about both of these is that the analytics and intelligence engines are both closed-source—neither company will let you see the wizard behind the curtain.

TIL: Auditing, Monitoring, Alerting

I’m giving a presentation on monitoring this Monday.  As part of that, I want to firm up some thoughts on the differences between auditing, monitoring, and alerting.  All three of these are vital for an organization, but they serve entirely different functions and have different requirements.  I’ll hit a bunch of bullet points for each.

Auditing

Auditing is all about understanding a process and what went on.  Ideally, you would audit every business-relevant action, in order, and be able to “replay” that business action.  Let’s say we have a process which grabs a flat file from an FTP server somewhere, dumps data into a staging table, and then performs ETL and puts rows into transactional tables.  Our auditing process should be able to show what happened, when.  We want to log each activity (grab flat file, insert rows into staging table, process ETL) down to its most granular level.  If we make an external API call for each row as part of the ETL process, we should log the exact call.  If we throw away a row, we should note that.  If we modify attributes, we should note that.

Of course, this is a huge amount of data and depending upon processing requirements and available storage space, you probably have to live with something much less thorough.  So here are some thoughts:

  • Keep as much information around errors as you can, including stack traces, full parameter listings, and calling processes.
  • Build in (whenever possible) the full logging mentioned, but leave it as a debug/trace flag in your app.  You could get creative and have custom tracing—maybe turn on debugging just for one customer.  You might also think about automatically switching that debug mode back off after a certain amount of time.
  • Add logical “process run” keys.  If there are three or four systems which process the same data in a pipeline, it makes sense to track those chunks of data separately from the individual pipeline steps.  In an extreme case, you might want to see how an individual row in a table somewhere got there, with a lineage ID that traces back to specific flat files, specific API calls, or specific processes and tells you everything that happened to get to that point (see the sketch after this list).  Again, this is probably more of an ideal than a practical scenario, but dream big…
  • Build an app to read your audit data.  Reading text files is okay, but once you get processes interacting with one another, audit files can get really confusing.

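To illustrate the “process run” key idea, here is a minimal sketch in Python.  The field names and pipeline steps are hypothetical, and a real implementation would write to an audit table rather than printing JSON, but the key point is that every step shares one run ID so you can stitch a row’s history back together:

```python
# Minimal sketch of audit records tied together by a single process-run ID.
# Steps and field names are hypothetical; the real thing would insert into
# an audit table instead of printing JSON lines.
import json
import uuid
from datetime import datetime, timezone

def audit(run_id, step, detail):
    """Write one audit record; every step in the pipeline shares run_id."""
    record = {
        "run_id": run_id,
        "step": step,
        "detail": detail,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(record))

run_id = str(uuid.uuid4())  # one logical "process run" across all systems
audit(run_id, "fetch_flat_file", {"file": "daily_extract.csv", "rows": 1000})
audit(run_id, "load_staging", {"rows_inserted": 1000})
audit(run_id, "etl", {"rows_loaded": 997, "rows_rejected": 3})
```

With something like this in place, finding everything that happened to a given batch of data is a matter of filtering the audit records on its run ID.
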
Monitoring

Monitoring is all about seeing what’s going on in your system “right now.”  You want nice visualizations which give you relevant information about currently-running processes, and I put “right now” in quotation marks because you can be monitoring a process which only updates once every X minutes.

There are a couple of important things to consider with monitoring:

  • Track what’s important.  Don’t track everything “just in case,” but focus on metrics you know are important.  As you investigate problems and find new metrics which can help, add them in, but don’t be afraid to start small.
  • Monitoring should focus on aggregations, streams, and trends.  It’s your 50,000-foot view of your world.  Ideally, your monitoring system will let you drill down to more detail, but at the very least, it should let you see if there’s a problem.
  • Monitors are not directly actionable.  In other words, the purpose of a monitor is to display information so a human can observe, orient, decide, and act.  If you have an automated solution to a problem, you don’t need a monitor; you need an automated process to fix the issue!  You can monitor the automated solution to make sure it’s still running and track how frequently it’s fixing things, of course, but the end consumer of a monitor is a human.
  • Ideally, a monitor will display enough information to weed out cyclical noise.  If you have a process which runs every 60 minutes and which always slams your SAN the top 5 minutes of each hour, maybe graph the last 2 or 3 hours so you can see the cycles.  If you have enough data, you can also build baselines of “normal” behavior and plot those against current behavior to make it easier for people to see if there is a potential issue.
  • Monitors are a “pull” technology.  You, as a consumer, go to the monitor application and look at what’s going on.  The monitor does not jump out and send you messages and popups and try to get your attention.

Alerting

Alerting is all about sending messages and getting your attention.  This is because an alert is telling you something that you (as a trained operator) need to act upon.  I think alerting is the hardest thing on the list to get right because there are several important considerations here:

  • Alerts need to be actionable.  If I page the guy on call at 3 AM, it’d better be because I need the guy on call to do something.
  • Alerts need to be “complete.”  The alert should provide enough information that a sleep-deprived technician can know exactly what to do.  The alert can provide links to additional documentation, how-to guides, etc.  It can also show the complete error message and even some secondary diagnostic stuff which is (potentially) related.  In other words, the alert definitely needs to be more than an e-mail alert which reads “Error:  object reference not set to an instance of an object.”
  • Alerts need to be accurate.  If you start throwing false positive alerts—alerting when there is no actual underlying problem—people will turn off the alert.  If you have false negatives—not alerting when there is an underlying problem—your technicians are living under a false sense of security.  In the worst case scenario, technicians will turn off (or ignore) the alerts and occasionally remember to check a monitor which lets them know that there was a problem two hours ago.
  • Alerts need human intervention.  If I get an alert saying that something failed and an automated process has kicked in to fix the problem, I don’t need that alert!  If the automated process fails and I need to perform some action, then I should get an alert.  Otherwise, just log the failure, have the automated process run to fix the problem, and let the technicians sleep.  If management needs figures or wants to know what things looked like overnight, create reports and digests of this information and pass it along to them, but don’t bother your technicians.
  • On a related note, alerts need to be for non-automatable issues.  If you can automate a problem away, do so.  Even if it takes a fair amount of time, there’s a lot less risk in a documented, tested, automated process than in waking up some groggy technician.  People at 3 AM make mistakes, even when they have how-to documents and clear processes.  People at all hours of the day make mistakes; we get distracted and miss steps, mis-type something, click the wrong button, follow the wrong process, think we have everything memorized but (whoopsie) forget a piece.  Computers are less likely to have these problems.

Wrapping Up

Auditing, monitoring, and alerting solve three different sets of problems.  They also have three different sets of requirements for what kind of data to use, how frequently to refresh this data, and how people interact with them.  It’s important to keep these clearly delineated for that reason.

Alongside this, I’m also working on some toy monitoring stuff, so I hope that’ll be tomorrow’s TIL.

TIL: Installing Jupyter And R Support

I recently got through some difficulties installing Jupyter and incorporating R support, so I wanted to write up a quick installation post for a Linux installation.

First, install Jupyter through Anaconda.  Notes:

  1. I grabbed the Python 3.5 version.  I don’t intend to write too much Python code here, so it shouldn’t make a huge difference to me.
  2. When you install, do not run sudo.  Just run the bash script.  It will install in your home directory by default, and for a one-off installation on a VM (like my scenario), this is fine.  It also makes future steps easier.

When running Jupyter, I started by following this guide.  Notes:

  1. Starting Jupyter is as easy as running “jupyter notebook” and navigating to http://localhost:8888 in a browser.
  2. Midori does not appear to be a good browser for Jupyter; when I tried to open a new notebook, I got an error message in the console and the browser seemed not to want to open up the notebook.  Firefox worked just fine.  Maybe I’m doing this wrong, though.

As for installing R support, Andrie de Vries has a nice post on the topic.  Notes:

  1. Here’s where not running sudo above pays off.  If you did run sudo, you’ll get an error saying that you can’t install in the home directory and that you should run a command to make a copy of the files…in your home directory.  If you accidentally ran sudo, you can change the owner of all of the files in your anaconda3/ directory using “chown -R [user]:[user] anaconda3/” and correct the issue.
  2. Installation is as simple as running “conda install -c r ipython-notebook r-irkernel”.  Again, note that I’m not running sudo here.