Text Classification By Hand With Naive Bayes

This is part three in a series on classification with Naive Bayes.

Last Time On…The People’s Court

In our last post, we learned the math behind Naive Bayes and felt sorry for Nate Barkerson, the quarterback with Derek Anderson’s acccuracy, Matt Barkley’s scrambling abilities, and Nate Peterman’s innate ability to throw the ball to the wrong team. But let’s leave all of that behind and go to a completely different sport: baseball. And business, because that’s how we roll.

The Business of Baseball

I now want to classify texts as belonging to one of two categories: baseball texts or business texts. Here is my input data set:

TextTag
Stock prices fellBusiness
Shares were up thirty percentBusiness
Pitched out of a tough situationBaseball
Bullish investors seized on the opportunityBusiness
Threw a no hitterBaseball
Runners on second and third with nobody outBaseball

With these categorized snippets, I want to build a model which can tell me whether the phrase “Threw out the runner” is a baseball text or a business text. As textual experts, we already know that this is a baseball phrase, so we can test our model versus that expected result.

Rounding the Bases in Order

As a reminder, there are three steps to solving the problem:

  1. Find the prior probability: the overall likelihood of a text being a business or a baseball phrase.
  2. Find the probability that a set of words is a business phrase or a baseball phrase.
  3. Plug values from our new test data into our formula to obtain the posterior probability for each test phrase.

Setting Our Priors

Step one is simple. In our sample set, we have 3 business terms and 3 baseball terms out of six terms in total. Therefore, the prior probability of a phrase being a business phrase is 50% and the probability of it being a baseball phrase is 50%. Note that this is a two-class problem, so the only texts we care about are baseball texts and business texts; if you have some other kind of text, get out of here with your crazy talk words.

Determine Probabilities

Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique words does it have? Is there a pile of sunflower seeds near the phrase? But there’s an easier way.

Remember the “naive” part of Naive Bayes: all features are independent. And in this case, we can use as features the individual words. Therefore, the probability of a word being a baseball-related word or a business-related word is what matters, and we cross-multiply those probabilities to determine if the overall phrase is a baseball phrase or a business phrase.

For example, let’s calculate the probability of the word “threw” being a baseball phrase or a business phrase. First, we count the number of times “threw” appears in our baseball texts and divide by the total number of baseball words. There are 18 total words belonging to baseball texts and the word “threw” appears once, so our probability is 1/18.

Then, we do the same for the business side: threw appears as 0 of the 14 total words in the sample, so its probability is 0/14.

Instead of doing this for every possible word, I’m going to look only at the four words in our test data phrase: “threw out the runner.” “Threw” appears once, “out” appears twice, but “the” and “runner” don’t appear at all in our baseball corpus. “Runners” does but that’s not the same word.

Therefore, our probability looks like this:

P(BB|x) = \dfrac{1}{18} \cdot \dfrac{2}{18} \cdot \dfrac{0}{18} \cdot \dfrac{0}{18} \cdot \dfrac{3}{6} = 0

That gives us a probability of 0. How about on the business side? Well, the only word which appears here is the word “the” so our probabilities look like:

P(BUS|x) = \dfrac{0}{14} \cdot \dfrac{0}{14} \cdot \dfrac{1}{14} \cdot \dfrac{0}{14} \cdot \dfrac{3}{6} = 0

Well, that’s not very helpful: our model gives us zero percent probability that this is either a baseball text or a business text. Before throwing up your hands in disgust and returning to your life as a goat dairy farmer, however, let’s try doing something else.

Laplace? He Played For the A’s, Right?

It turns out that there’s a way to fix this zero probability problem, and it’s called Laplace Smoothing. The idea is to add 1 to each word’s numerator so that we never multiply by zero. But to even things out, we need to add N (the count of unique words) to each denominator. There are 29 unique words in the data set above—you’re welcome to count them if you’d like. I’ll still be here.

Now that you’ve counted (if there’s one thing I can trust, it’s that somebody on the Internet will be pedantic enough to count), let’s build a quick table of probabilities for each word. I won’t do this in LaTeX so it’ll look a bit uglier, but we’ll get beyond this, you and me.

WordP(Baseball)P(Business)
threw(1+1) / (18+29)(0+1) / (14+29)
out(2+1) / (18+29)(0+1) / (14+29)
the(0+1) / (18+29)(1+1) / (14+29)
runner(0+1) / (18+29)(0+1) / (14+29)

As a quick reminder, “runner” and “runners” are still distinct words.

Now that we have our probabilities by word, let’s plug them back into the formulas, cross-multiplying the word probabilities and then multiplying by our prior probability. First for baseball:

P(BB|x) = \dfrac{2}{47} \cdot \dfrac{3}{47} \cdot \dfrac{1}{47} \cdot \dfrac{1}{47} \cdot \dfrac{3}{6} = 6.15 \times 10^-7

Then for business:

P(BUS|x) = \dfrac{1}{43} \cdot \dfrac{1}{43} \cdot \dfrac{2}{43} \cdot \dfrac{1}{43} \cdot \dfrac{3}{6} = 2.93 \times 10^-7

According to this, the phrase is more than twice as likely to be a baseball term than a business term.

Increasing Your Model’s Launch Angle

There are a few things we can do to improve our prediction quality:

  • Remove stopwords. These are extremely frequent words with little predictive meaning. In most English texts, they would include words like { a, on, the, of, and, but }. That is, prepositions, definite and indefinite articles, conjunctions, and the like. You may have custom stopwords as well which appear in all texts and have a very low predictive value.
  • Lemmatize words, grouping together inflections of the same word. For example, I pointed out twice that “runner” is not “runners.” But they both have the same stem, so if we focus on stems, we’ll have more hits.
  • Use n-grams as features instead of individual words. N-grams are combinations of words in order. For example, “threw out the” and “out the runner” are the two 3-grams we can make from our test input. This works best with long works, like if you’re classifying novels or pamphlets or other multi-page documents.
  • Use Term Frequency – Inverse Document Frequency (TF-IDF). This is a process which penalizes words which appear in larger numbers of texts. The idea is akin to stopwords, where words which appear in a broad number of texts are less likely to identify a specific text accurately, but without you needing to pre-specify the terms.

These techniques are not specific to Naive Bayes classifiers and get more into natural language processing as a whole. Using different combinations of these techniques can help you boost classification quality, especially as you begin to introduce more classes.

Conclusion

In today’s post, we looked at using Naive Bayes for natural language processing (NLP), classifying phrases into being baseball-related or business-related. We also introduced the concept of Laplace Smoothing, which helps us deal with new words or relatively small dictionaries by ensuring that we do not multiply by 0. Finally, we looked at a few techniques for improving Naive Bayes and other NLP algorithms.

In the next post, we’re going to offload some of this math onto computers and solve some problems in R.

Five Minutes of Silence, Then a Bonus Track

Now that you know about Laplace Smoothing, you might want to go back and determine just how much of a drag having Benjamin/Clay lead the team in receiving yardage was. If we apply Laplace smoothing only to the last feature (top receiver) and use the set of inputs { QB = Allen, Home Game, 14+ Points, Top Receiver = Benjamin/Clay }, we end up with. First, the partial probability of a win:

P(W|x'_3) = \dfrac{5}{6} \cdot \dfrac{4}{6} \cdot \dfrac{5}{6} \cdot \dfrac{1}{10} \cdot \dfrac{6}{16} = 0.0174

And then the partial probability for a loss:

P(L|x_3) = \dfrac{4}{10} \cdot \dfrac{4}{10} \cdot \dfrac{3}{10} \cdot \dfrac{5}{14} \cdot \dfrac{10}{16} = 0.0107

It turns out that if everything else went right (like Josh Allen rushing for 200 yards and scoring 3 touchdowns on his own), there’d be about a 62% chance of the Bills pulling off a victory according to this model. That’s a marginal drop of about 35 percentage points versus Robert Foster. It’s not entirely the faults of Benjamin and Clay, but this isn’t exactly making me miss them.

Advertisements

One thought on “Text Classification By Hand With Naive Bayes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s