I’m Writing A Book

Due to having far too much free time (I mean, there are hours a day where I’m doing nothing but sleeping! More than one!), I’ve taken to the idea of writing a book on PolyBase. The working title is PolyBase Revealed, and it has a scheduled release date of “That’s way in the future so I don’t have to worry…wait, what month is it now? Oh crap!”

Long-time readers may recall my series on PolyBase (I still prefer “Polybase” and so does about a quarter of the Microsoft documentation, but I’ll capitalize the middle B for marketing purposes) focusing on SQL Server 2016. Well, SQL Server 2019 is coming and with it, we have a lot of new PolyBase-related functionality, including external connections to a slew of new resources. The book will focus on SQL Server 2019 with a chapter on Azure SQL Data Warehouse. My goal is to make it easy for someone getting started with data virtualization and with a T-SQL background to integrate with the different supported systems, pointing out bugaboos along the way. Because there are plenty of bugaboos.

As I write the book, I intend to add to this blog the odds and ends that I find during research but which don’t make their way into the book. I’ll probably spam my blog with way too much info about the book because I expect to build a house made out of gold using my royalties. A house made out of gold is utterly impractical, but how else could I spend that much in royalty checks?


What’s On My Work PC?

I just had to rebuild my work machine, so I figured I’d get a cheap blog post out of it and discuss the tools that I use regularly. I’ve broken these down by category. As a quick note, these are the tools on my main work machine, so there are some things which I use on other PCs but don’t use here; part of that is the nature of the job. This also isn’t everything on my machine, but it does cover most of the day-to-day tools I use.

Got something you really like but I don’t have? Let me know in the comments.

Connecting To Databases

  • SQL Server Management Studio 17. I don’t have any SQL Server 2019 instances yet, so no need to move to SSMS 18 just yet. Plugins I rely on include:
    • devart SQL Complete. Their snippet management, intellisense improvements, and (most importantly) document formatting work great for me. This plugin also works with Visual Studio.
    • Tabs Studio. I tend to have dozens of SSMS tabs open at once. Tabs Studio lets me use the vertical space on my monitor instead of having 4 visible tabs and a drop-down with a couple dozen others. This plugin also works with Visual Studio.
    • SentryOne Plan Explorer. The best way to view execution plans.
  • Azure Data Studio. I don’t use this quite as much as SSMS, but I’m moving in that direction. A couple of plugins and I’d be there. Key plugins:
    • SQL Server 2019 support. This adds some nice functionality including notebooks in Azure Data Studio. That’s a feature differentiator between ADS and SSMS.
    • SSMS keymap. I have too much muscle memory tied up in Ctrl-R and the like to give it up.
    • High Color Queries. I like the color sets. Because I am a monster, I prefer the light theme.
  • Aginity Redshift Workbench. I don’t use Redshift often, but when I do, this is my go-to app.
  • pgAdmin 4. Same as Aginity: I don’t get into Postgres very often, but when I need to go there, this is how I do it.
  • Power BI Desktop and Power BI Report Server. We use Power BI Report Server internally rather than deploying to Azure. I also use Power BI Desktop for my own personal dashboards that don’t make it to the outside world.

Writing Code

  • Visual Studio 2017. Pretty much all of my .NET development happens in Visual Studio. Visual Studio 2017 also installs Anaconda and Jupyter if you install the Data Science tools, so I include them here rather than as separate line items. Key plugin:
    • BimlExpress. If you’re doing SQL Server Integration Services development, you really need to know Biml. There’s a lot more you can do with Biml as well.
  • Visual Studio Code. I haven’t gotten quite as far into VS Code, but I’m starting to use it for Python development and other non-.NET languages.
  • R Studio. It’s still the standard for R development.

Checking In Code

  • TortoiseGit. This is a tool that I have mostly because a lot of other developers at the company have it, so sometimes it’s easier just to have it installed when working through issues.
  • SourceTree. This is my primary Git client.

Wrangling Text

  • OneNote. I really don’t like the new version and much prefer the Office 2016 version. The only thing I like about the Windows app version is its superior touchscreen support, but on my non-touchscreen devices it’s a pain. It’s also nowhere close to feature-complete. So it’s the Office 2016 version for me.
  • Notepad++. I’m using it a bit less and VS Code a bit more, but it’s the first thing I look for when I right-click a text file.
  • Liquid Studio. This is a tool that I used to have. I used it specifically for its large file support, being able to read multi-gigabyte text files without choking. Honestly, I just want good versions of head, tail, and the like on Windows. And I’m still not really sold on Cygwin though I might install it yet again.
  • KDiff3. Yeah, the last update was 4 1/2 years ago, but it’s still my go-to diff and merge tool.

Dealing With Files

  • MultiCommander. The dual-pane layout is great for my purposes, as it’s sort of like having two Windows Explorer windows open, except a lot more functional.
  • S3 Browser. This is one of the better Amazon S3 file browsers that I’ve seen, particularly given the price.
  • Azure Storage Explorer. Free and deals with Azure storage. That works for me.
  • WinDirStat. Ever wonder where all of your disk space went? WinDirStat will tell you exactly where it is and give you a treemap to visualize whether that’s one giant file or a bunch of smaller ones.
  • 7-Zip. The 7z format is rather efficient, though I tend not to share those files. I do use it for compressing files local to my machine, and I like its interface for extracting files. I’d like creating a .tar.gz to be a one-step process instead of two, though.
  • Sysinternals Utilities. I include this here because I didn’t want to create a “miscellany” section just for it. I use ZoomIt frequently, especially during code reviews. Presentations aren’t always in front of large groups.

TriPASS Incorporation, Part One

As the Triangle Area SQL Server Users Group (TriPASS) grows ever-larger, we have decided to incorporate as a not-for-profit corporation. This is something which several of the members, including Tom Norman (blog | twitter) and Tracy Boggiano (blog | twitter) have been pushing for a while, and they sicced Jamie Wick (blog | twitter)—who wrote an excellent blog post on the process—on me at a SQL Saturday. Well, my slow-rolling is over and we’re moving ahead.

As we move along in the process of incorporation, I’m going to post updates along the way. Hopefully these will help other organizations looking to incorporate.

Step One: Write Up Bylaws

Tom Norman put together our bylaws (PDF format), taking them from his time in incorporating the Colorado Springs PASS chapter. We leaned on those bylaws quite heavily, making some minor changes to fit our group but otherwise leaving them as they were.

From there, we had a one-week period for review so members of the community could review the bylaws and suggest improvements before the vote.

Step Two: Vote On Bylaws

After putting together the bylaws, we submitted them to a vote. The vote passed handily, though I’m glad we did not have a minimum number of votes needed to pass in our bylaws. That is one of the ongoing concerns I have: most of our attendees will be pleasantly neutral on most things and few will reach the point where they invest additional time in the group. Any endeavor which relies on a majority or even a large minority stepping up will necessarily fail because the benefit structure is not there. Perhaps as we go forward, we can figure out ways to align benefits to participation and increase the number of avid participants. Perhaps.

Step Two Point Five: Continue the Process

At this point, we are continuing to move forward with incorporation and obtaining not-for-profit status. Our next step is to elect a board, which we will do during the February main meeting and online (for people who can’t make the meeting). Our board will have five members: President, Vice President of Membership, Vice President of Marketing, Vice President of Programs, and Treasurer. My hope is that we’ll have enough interest in people becoming board members that we’ll have more than five total nominees, or at the very least that I won’t need to take hostages to get “volunteers.”

Solving Naive Bayes With R

This is part four in a series on classification with Naive Bayes.

Classification Of Features With R

So far, we’ve walked through the Naive Bayes class of algorithms by hand, learning the details of how it works. Now that we have a good understanding of the mechanics at that level, let’s let the computer do what it does better than us: mindless calculations.

Of Course It’s The Iris Data Set

Our first example is a classic: the iris data set in R. We are going to use the naivebayes R package to implement Naive Bayes for us and classify this iris data set. To give us an idea of how good the classifier is, I’m going to use 80% of the data for training and reserve 20% for testing. I’ll set a particular seed so that if you want to try it at home, you can end up with the same results as me.

Here’s the code, which I’ll follow up with some discussion.

if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)

set.seed(1773)
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)

plot(nb)

iris.output <- cbind(iris.test, prediction = predict(nb, iris.test))

caret::confusionMatrix(iris.output$prediction, iris.output$Species)

All in all, a couple dozen lines of code to do the job. The first two if statements load our packages: naivebayes and caret. I could use caret to split my training and test data, but because it’s such a small data set, I figured I’d shuffle it in place and assign the first 80% to iris.train and leave the remaining 20% for iris.test.
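
If you do prefer the caret route, it is also only a couple of lines. Here is a minimal sketch of that alternative, assuming the same iris data set and an 80/20 split; the train.index name is just for illustration. createDataPartition performs stratified sampling by Species rather than a plain shuffle, so the class balance carries over into both sets.

# Alternative 80/20 split using caret's createDataPartition.
train.index <- caret::createDataPartition(iris$Species, p = 0.8, list = FALSE)
iris.train <- iris[train.index,]
iris.test <- iris[-train.index,]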

The key function is naive_bayes in the naivebayes package. In this case, we are predicting Species given all of the other inputs on iris.train.

If you do use the default seed that I’ve set, you’ll end up with four plots, one for each feature. Here they are:

(Four plots from plot(nb), one per feature: sepal length, sepal width, petal length, and petal width, each showing that feature’s distribution by species.)

Looking at these images, sepal length and sepal width aren’t very helpful for us: what we want is a clean separating equilibrium, where the distributions for each class barely overlap. Petal length and petal width are better: setosa is clearly different from the others, though there is some overlap between versicolor and virginica, which will lead to some risk of ambiguity.

Maximizing Confusion

Once we have our output, we can quickly generate a confusion matrix using caret. I like using this a lot more than building my own with e.g. table(iris.output$Species, iris.output$prediction). The reason I prefer what caret has to offer is that it also includes statistics like positive predictive value and negative predictive value. These tend to be at least as important as accuracy when performing classification, especially for scenarios where one class is extremely likely and the other extremely unlikely.

Here is the confusion matrix output from caret. After that, I’ll explain positive and negative predictive values.

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          6          0         0
  versicolor      0         11         1
  virginica       0          1        11

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 1.181e-09       
                                          
                  Kappa : 0.8958          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0            0.9167           0.9167
Specificity                    1.0            0.9444           0.9444
Pos Pred Value                 1.0            0.9167           0.9167
Neg Pred Value                 1.0            0.9444           0.9444
Prevalence                     0.2            0.4000           0.4000
Detection Rate                 0.2            0.3667           0.3667
Detection Prevalence           0.2            0.4000           0.4000
Balanced Accuracy              1.0            0.9306           0.9306

Positive predictive value for a category is: if my model predicts that a particular set of inputs matches a particular class, what is the probability that this judgement is correct? For example, we have 12 versicolor predictions (read across the “versicolor” Prediction row and sum the values). 11 of those 12 predictions were actually versicolor, so our positive predictive value is 11/12 = 0.9167.

Negative predictive value for a category is: if my model predicts that a particular set of inputs does not match a particular class, what is the probability that this judgement is correct? For example, we have 18 predictions which were not versicolor (sum up all of the values across the rows except for the versicolor row). Of those 18, 1 was actually versicolor (read the versicolor column and ignore the point where the prediction was versicolor). Therefore, 17 of our 18 negative predictions for versicolor were correct, so our negative predictive value is 17/18 = 0.9444.
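
If you’d rather not do that arithmetic by hand, the object caret’s confusionMatrix function returns already carries these per-class statistics; the printed output above is just a formatted view of them. A minimal sketch, reusing the iris.output data frame from earlier:

# Store the confusion matrix object instead of only printing it.
cm <- caret::confusionMatrix(iris.output$prediction, iris.output$Species)
# byClass is a matrix with one row per class; pull out the predictive values.
cm$byClass[, c("Pos Pred Value", "Neg Pred Value")]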

This is a small data set with relatively little variety and only one real place for ambiguity, so it’s a little boring. So let’s do something a bit more interesting: sentiment analysis.

Yes, Mr. Sherman, Everything Stinks

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which comes in at under 3MB zipped and contains 2000 reviews in total.

Unlike last time, I’m going to break this out into sections with commentary in between. If you want the full script and notebook, check out the GitHub repo I put together for this talk.

First up, we load some packages. I’ll use naivebayes to perform classification and tm for text mining. If you’re a tidytext fan, you can certainly use that for this work too.

if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if(!require(tm)) {
  install.packages("tm")
  library(tm)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

We’ll next load the data and split it into training and test data sets.

df <- readr::read_csv("../data/movie-pang02.csv")

set.seed(1)
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]
df$class <- as.factor(df$class)

corpus <- tm::Corpus(tm::VectorSource(df$text))
corpus

I’m going to stop here and lay out a warning: this will leak information. If your test data set includes words your training data set does not, the trained model will still gain knowledge of those additional words and of the fact that they don’t appear in the training set, because we built the corpus from the full data set. In a real project, I’d build a corpus off of the training data alone and then apply those rules to the test set, using Laplace Smoothing or a similar technique to deal with any test words not in the training set.
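
For what it’s worth, a leak-free approach might look something like the following sketch (cleaning steps omitted for brevity, and the .noleak names are just for illustration): split the data first, build the vocabulary from the training corpus only, and then restrict the test document term matrix to that dictionary.

# Split before building any vocabulary so the test set cannot influence it.
df.train <- df[1:1500,]
df.test <- df[1501:2000,]

corpus.train <- tm::Corpus(tm::VectorSource(df.train$text))
corpus.test <- tm::Corpus(tm::VectorSource(df.test$text))

# The dictionary comes from the training data alone.
dtm.train.noleak <- tm::DocumentTermMatrix(corpus.train)
# Words which appear only in the test set simply drop out here.
dtm.test.noleak <- tm::DocumentTermMatrix(corpus.test,
                                          control = list(dictionary = tm::Terms(dtm.train.noleak)))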

With that warning said, I’m now going to clean up the data by converting everything to lower-case, removing punctuation and numbers, removing stopwords, and stripping out any extraneous whitespace. This reduces the total document space and gives us a more consistent set of words.

corpus.clean <- corpus %>%
  tm::tm_map(tm::content_transformer(tolower)) %>% 
  tm::tm_map(tm::removePunctuation) %>%
  tm::tm_map(tm::removeNumbers) %>%
  tm::tm_map(tm::removeWords, tm::stopwords(kind="en")) %>%
  tm::tm_map(tm::stripWhitespace)

dtm <- tm::DocumentTermMatrix(corpus.clean)

Then we turn our words into features using the bag of words technique. It’s not the fanciest or best, but it’s quick-and-easy—sort of like Naive Bayes.

Once we have the document term matrix, we can build out our training and test data. I already shuffled at the beginning, so we split out our elements into training and test, reserving 25% for test.

df.train <- df[1:1500,]
df.test <- df[1501:2000,]

dtm.train <- dtm[1:1500,]
dtm.test <- dtm[1501:2000,]

corpus.clean.train <- corpus.clean[1:1500]
corpus.clean.test <- corpus.clean[1501:2000]

After doing this, our training data set includes 38,957 unique terms, but many of these only appear in one review. That’s great for pinpointing a specific document (a particular review), but not as great for classification: those terms won’t help me pick a good class and they take up memory, so let’s get rid of them. I’ll throw away any term which appears in fewer than 5 documents. This gets me down to 12,144 terms, or just under a third of the original total.

After that, I will rebuild the document term matrices for training and testing, as we want to take advantage of that smaller domain.

fiveFreq <- tm::findFreqTerms(dtm.train, 5)

dtm.train.nb <- tm::DocumentTermMatrix(corpus.clean.train, control=list(dictionary = fiveFreq))
dtm.test.nb <- tm::DocumentTermMatrix(corpus.clean.test, control=list(dictionary = fiveFreq))

From here, I am going to create a function which helps me determine whether a term has appeared in a document, which is more important than how many times a term has appeared in the document. This prevents one document making heavy use of a term from biasing us too much toward that document’s class. Going back to our baseball versus business example, it’d be like a single business article writing about “going to the bullpen” over and over, using that as a metaphor for something business-related. Most business documents will not use the term bullpen (whereas plenty of baseball documents will), so a single business document applying a baseball metaphor shouldn’t ruin our model.

After doing that, I’ll run the naive_bayes function with Laplace Smoothing turned on (laplace = 1) and predict what our test values will look like.

convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}

trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)

classifier <- naivebayes::naive_bayes(trainNB, df.train$class, laplace = 1)
pred <- predict(classifier, newdata=testNB)

conf.mat <- caret::confusionMatrix(pred, df.test$class, positive="Pos")
conf.mat

Then we can look at the confusion matrix. Here’s how it looks:

Confusion Matrix and Statistics

          Reference
Prediction Neg Pos
       Neg 224  54
       Pos  41 181
                                          
               Accuracy : 0.81            
                 95% CI : (0.7728, 0.8435)
    No Information Rate : 0.53            
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6174          
 Mcnemar's Test P-Value : 0.2183          
                                          
            Sensitivity : 0.7702          
            Specificity : 0.8453          
         Pos Pred Value : 0.8153          
         Neg Pred Value : 0.8058          
             Prevalence : 0.4700          
         Detection Rate : 0.3620          
   Detection Prevalence : 0.4440          
      Balanced Accuracy : 0.8077          
                                          
       'Positive' Class : Pos     

Overall, our classifier has an accuracy of 81%. But this example lets us look at four important measures: sensitivity, specificity, positive predictive value, and negative predictive value. This is a two-class, binary classifier, so these definitions are fairly simple. The one tricky part is that the confusion matrix in caret orders classes alphabetically, whereas ideally you want the “positive” result first and the “negative” result last; that’s why the call above specifies positive="Pos".

Sensitivity captures, of the reviews which are actually positive, how many our model predicts as positive: count(Prediction = Pos and Reference = Pos) / count(Reference = Pos). That is, 181/(181+54), or 0.7702.

Specificity captures, of the reviews which are actually negative, how many our model predicts as negative: count(Prediction = Neg and Reference = Neg) / count(Reference = Neg). That is, 224/(224+41), or 0.8453.

Positive predictive value looks at all cases where the prediction was positive (read the “Pos” row): count(Prediction = Pos and Reference = Pos) / count(Prediction = Pos). That is, 181/(181+41), or 0.8153.

Negative predictive value looks at all cases where the prediction was negative (read the “Neg” row): count(Prediction = Neg and Reference = Neg) / count(Prediction = Neg). That is, 224/(224+54), or 0.8058.
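
As a sanity check, here is a quick sketch which recomputes those four measures straight from the 2x2 counts above:

# Counts from the confusion matrix: rows are predictions, columns are reference values.
tp <- 181  # predicted Pos, actually Pos
fp <- 41   # predicted Pos, actually Neg
fn <- 54   # predicted Neg, actually Pos
tn <- 224  # predicted Neg, actually Neg

tp / (tp + fn)  # Sensitivity: 0.7702
tn / (tn + fp)  # Specificity: 0.8453
tp / (tp + fp)  # Positive predictive value: 0.8153
tn / (tn + fn)  # Negative predictive value: 0.8058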

Overall, our Naive Bayes classifier was in the 75-85% range for all five of our major measures. If we need to get to 85-90%, this is a good sign: Naive Bayes is getting us most of the way there, so better classifier algorithms should get us over the top.

Conclusion

In today’s post, we dug into the naivebayes R package and showed how we could solve for Naive Bayes with and without Laplace Smoothing in just a few lines of code.

If you want to learn more, check out Classification with Naive Bayes, a talk I’ve put together on the topic.

Southern Fried F# Conference In April

I’m part of the committee putting together Southern Fried F#, a conference which will take place in Raleigh, North Carolina on Saturday, April 13th.

No conference is complete without a great logo.

The Call for Papers is currently open and will remain open through February 15th, so if you are interested in presenting, get a move on—that way people won’t need to see me give a talk…

The idea of Southern Fried F# is to provide a full-day community conference dedicated to the F# programming language. We are aiming at introductory-level talks to build up interest in the language and show areas in which F# can thrive, whether that be using type providers to integrate external sources quickly, building business rules which are impossible to break and easy to understand, or building production-quality machine learning models.

We will host the 2019 edition of Southern Fried F# at Red Hat’s office in downtown Raleigh. More details will be available on the main website as we get closer to the event. You can also sign up for the event now—tickets are free.

R Training In Redmond

I had to wait until this was official, but the long wait is over: I am giving my full-day R training in Redmond, Washington on Friday, February 8th. Tickets are priced at $125 for a jam-packed day full of learning and excitement and a little bit of snark hosted on Microsoft’s Redmond campus.

Course Description

In this day-long training, you will learn about R, the premier language for data analysis. We will approach the language from the standpoint of data professionals: database developers, database administrators, and data scientists. We will see how data professionals can translate existing skills with SQL to get started with R. We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development. We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code. In addition, we will use ggplot2 to create production-quality data visualizations.

Over the course of the day, we will look at several problem domains. For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases. We will also look at areas of general interest, including analysis of open source data sets.

No experience with R is necessary. The only requirements are a laptop and an interest in leveling up your data professional skillset.

Quick Notes

We will use Azure Notebooks for the entire training, so there is no need to install anything on personal laptops. I’ve found that this saves about 45 minutes at the beginning of the pre-con. I also have secret bonus material (read: stuff I realized I couldn’t fit into an eight-hour session). So it’s like getting 9.5 hours of me droning on for the price of eight! If that doesn’t make you want to sign yourself and your extended family up right now, I don’t know what will.

Text Classification By Hand With Naive Bayes

This is part three in a series on classification with Naive Bayes.

Last Time On…The People’s Court

In our last post, we learned the math behind Naive Bayes and felt sorry for Nate Barkerson, the quarterback with Derek Anderson’s accuracy, Matt Barkley’s scrambling abilities, and Nate Peterman’s innate ability to throw the ball to the wrong team. But let’s leave all of that behind and go to a completely different sport: baseball. And business, because that’s how we roll.

The Business of Baseball

I now want to classify texts as belonging to one of two categories: baseball texts or business texts. Here is my input data set:

Text                                           Tag
Stock prices fell                              Business
Shares were up thirty percent                  Business
Pitched out of a tough situation               Baseball
Bullish investors seized on the opportunity    Business
Threw a no hitter                              Baseball
Runners on second and third with nobody out    Baseball

With these categorized snippets, I want to build a model which can tell me whether the phrase “Threw out the runner” is a baseball text or a business text. As textual experts, we already know that this is a baseball phrase, so we can test our model versus that expected result.

Rounding the Bases in Order

As a reminder, there are three steps to solving the problem:

  1. Find the prior probability: the overall likelihood of a text being a business or a baseball phrase.
  2. Find the probability that a set of words is a business phrase or a baseball phrase.
  3. Plug values from our new test data into our formula to obtain the posterior probability for each test phrase.

Setting Our Priors

Step one is simple. In our sample set, we have 3 business texts and 3 baseball texts out of six texts in total. Therefore, the prior probability of a phrase being a business phrase is 50% and the probability of it being a baseball phrase is 50%. Note that this is a two-class problem, so the only texts we care about are baseball texts and business texts; if you have some other kind of text, get out of here with your crazy talk words.

Determine Probabilities

Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique words does it have? Is there a pile of sunflower seeds near the phrase? But there’s an easier way.

Remember the “naive” part of Naive Bayes: all features are independent. And in this case, we can use as features the individual words. Therefore, the probability of a word being a baseball-related word or a business-related word is what matters, and we cross-multiply those probabilities to determine if the overall phrase is a baseball phrase or a business phrase.
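
Putting that in formula terms: for each class C, we take the prior probability P(C) and multiply it by the probability of each word x_i appearing in that class, then pick whichever class ends up with the larger value. This is the same calculation we will work through by hand below.

P(C|x) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i|C)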

For example, let’s calculate the probability of the word “threw” being a baseball phrase or a business phrase. First, we count the number of times “threw” appears in our baseball texts and divide by the total number of baseball words. There are 18 total words belonging to baseball texts and the word “threw” appears once, so our probability is 1/18.

Then, we do the same for the business side: threw appears as 0 of the 14 total words in the sample, so its probability is 0/14.

Instead of doing this for every possible word, I’m going to look only at the four words in our test data phrase: “threw out the runner.” “Threw” appears once, “out” appears twice, but “the” and “runner” don’t appear at all in our baseball corpus. “Runners” does but that’s not the same word.

Therefore, our probability looks like this:

P(BB|x) = \dfrac{1}{18} \cdot \dfrac{2}{18} \cdot \dfrac{0}{18} \cdot \dfrac{0}{18} \cdot \dfrac{3}{6} = 0

That gives us a probability of 0. How about on the business side? Well, the only word which appears here is the word “the” so our probabilities look like:

P(BUS|x) = \dfrac{0}{14} \cdot \dfrac{0}{14} \cdot \dfrac{1}{14} \cdot \dfrac{0}{14} \cdot \dfrac{3}{6} = 0

Well, that’s not very helpful: our model gives us zero percent probability that this is either a baseball text or a business text. Before throwing up your hands in disgust and returning to your life as a goat dairy farmer, however, let’s try doing something else.

Laplace? He Played For the A’s, Right?

It turns out that there’s a way to fix this zero probability problem, and it’s called Laplace Smoothing. The idea is to add 1 to each word’s numerator so that we never multiply by zero. But to even things out, we need to add N (the count of unique words) to each denominator. There are 29 unique words in the data set above—you’re welcome to count them if you’d like. I’ll still be here.
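
In formula terms, for a word w and class C, where count(C) is the total number of words across that class’s texts and N is the number of unique words in the whole data set, the smoothed probability looks like this:

P(w|C) = \dfrac{\text{count}(w, C) + 1}{\text{count}(C) + N}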

Now that you’ve counted (if there’s one thing I can trust, it’s that somebody on the Internet will be pedantic enough to count), let’s build a quick table of probabilities for each word. I won’t do this in LaTeX so it’ll look a bit uglier, but we’ll get beyond this, you and me.

Word      P(Baseball)        P(Business)
threw     (1+1) / (18+29)    (0+1) / (14+29)
out       (2+1) / (18+29)    (0+1) / (14+29)
the       (0+1) / (18+29)    (1+1) / (14+29)
runner    (0+1) / (18+29)    (0+1) / (14+29)

As a quick reminder, “runner” and “runners” are still distinct words.

Now that we have our probabilities by word, let’s plug them back into the formulas, cross-multiplying the word probabilities and then multiplying by our prior probability. First for baseball:

P(BB|x) = \dfrac{2}{47} \cdot \dfrac{3}{47} \cdot \dfrac{1}{47} \cdot \dfrac{1}{47} \cdot \dfrac{3}{6} = 6.15 \times 10^{-7}

Then for business:

P(BUS|x) = \dfrac{1}{43} \cdot \dfrac{1}{43} \cdot \dfrac{2}{43} \cdot \dfrac{1}{43} \cdot \dfrac{3}{6} = 2.93 \times 10^{-7}

According to this, the phrase is more than twice as likely to be a baseball phrase as a business phrase.

Increasing Your Model’s Launch Angle

There are a few things we can do to improve our prediction quality:

  • Remove stopwords. These are extremely frequent words with little predictive meaning. In most English texts, they would include words like { a, on, the, of, and, but }. That is, prepositions, definite and indefinite articles, conjunctions, and the like. You may have custom stopwords as well which appear in all texts and have a very low predictive value.
  • Lemmatize words, grouping together inflections of the same word. For example, I pointed out twice that “runner” is not “runners.” But they both reduce to the same base form, so if we group those forms together, we’ll have more hits.
  • Use n-grams as features instead of individual words. N-grams are combinations of words in order. For example, “threw out the” and “out the runner” are the two 3-grams we can make from our test input. This works best with long works, like if you’re classifying novels or pamphlets or other multi-page documents.
  • Use Term Frequency – Inverse Document Frequency (TF-IDF). This is a process which penalizes words which appear in a large number of texts. The idea is akin to stopwords: words which appear across a broad range of texts are less likely to identify a specific text accurately, but you don’t need to pre-specify the terms. There’s a quick sketch of TF-IDF weighting right after this list.
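
Here is that TF-IDF sketch. It uses the tm package (the same one from the movie review example) and the six phrases from the table above; it’s a quick illustration of the weighting rather than a full solution. Note that TF-IDF produces continuous weights instead of counts, so you wouldn’t run these through the yes/no conversion we used earlier.

if(!require(tm)) {
  install.packages("tm")
  library(tm)
}

phrases <- c("Stock prices fell",
             "Shares were up thirty percent",
             "Pitched out of a tough situation",
             "Bullish investors seized on the opportunity",
             "Threw a no hitter",
             "Runners on second and third with nobody out")

corpus <- tm::Corpus(tm::VectorSource(phrases))

# Weight terms by TF-IDF instead of raw counts; words which appear in many
# of the documents get weights near zero.
dtm.tfidf <- tm::DocumentTermMatrix(corpus,
                                    control = list(weighting = tm::weightTfIdf))
tm::inspect(dtm.tfidf)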

These techniques are not specific to Naive Bayes classifiers and get more into natural language processing as a whole. Using different combinations of these techniques can help you boost classification quality, especially as you begin to introduce more classes.

Conclusion

In today’s post, we looked at using Naive Bayes for natural language processing (NLP), classifying phrases into being baseball-related or business-related. We also introduced the concept of Laplace Smoothing, which helps us deal with new words or relatively small dictionaries by ensuring that we do not multiply by 0. Finally, we looked at a few techniques for improving Naive Bayes and other NLP algorithms.

In the next post, we’re going to offload some of this math onto computers and solve some problems in R.

Five Minutes of Silence, Then a Bonus Track

Now that you know about Laplace Smoothing, you might want to go back and determine just how much of a drag having Benjamin/Clay lead the team in receiving yardage was. If we apply Laplace Smoothing only to the last feature (top receiver) and use the set of inputs { QB = Allen, Home Game, 14+ Points, Top Receiver = Benjamin/Clay }, we end up with the following. First, the partial probability of a win:

P(W|x'_3) = \dfrac{5}{6} \cdot \dfrac{4}{6} \cdot \dfrac{5}{6} \cdot \dfrac{1}{10} \cdot \dfrac{6}{16} = 0.0174

And then the partial probability for a loss:

P(L|x_3) = \dfrac{4}{10} \cdot \dfrac{4}{10} \cdot \dfrac{3}{10} \cdot \dfrac{5}{14} \cdot \dfrac{10}{16} = 0.0107
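
To get from these two partial probabilities to the win probability mentioned below, normalize the win value against the sum of the two:

\dfrac{0.0174}{0.0174 + 0.0107} \approx 0.62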

It turns out that if everything else went right (like Josh Allen rushing for 200 yards and scoring 3 touchdowns on his own), there’d be about a 62% chance of the Bills pulling off a victory according to this model. That’s a marginal drop of about 35 percentage points versus Robert Foster. It’s not entirely the fault of Benjamin and Clay, but this isn’t exactly making me miss them.