This is part one in a series on classification with Naive Bayes.

What Is Naive Bayes?

Let me fix that right away: Naive Bayes isn’t a thing; it’s a class of things.

You do realize that collective nouns are typically referred to in the singular, right?

You probably shouldn’t editorialize in the headings, me. Nonetheless, let’s talk about the Naive Bayes class of algorithms and whistle past the pluralization of collective nouns. If you want, every time I write “Naive Bayes,” assume I wrote “the class of algorithms which we have defined as Naive Bayes.”

So what is Naive Bayes? [Isn’t that what I asked? – ed]

Naive Bayes is a class of algorithms designed to solve classification problems. Naive Bayes algorithms share three primary characteristics:

  1. They are probabilistic. This means that we calculate the probability of each output category and, to determine the best answer, choose the category with the highest probability.
  2. Probabilities are derived from Bayes’ Theorem. That’s the “Bayes” in Naive Bayes, and we’ll get into it in a bit.
  3. Features are independent. That’s the “naive” part of “Naive Bayes.”

The plan in this post is to cover the basics of this class of algorithms, and in follow-up posts, we will look at implementations by hand and in code. But first, the $64,000 question:

Why Should We Use Naive Bayes? Is It the Best Classifier Out There?

Probably not, no. In fact, it’s typically a mediocre classifier—it’s the one you strive to beat with your fancy algorithm. So why even care about this one?

Because it’s fast, easy to understand, and it works reasonably well. In other words, this is the classifier you start with to figure out if it’s worth investing your time in a problem. If you need to hit 90% category accuracy and Naive Bayes is giving you 70%, you’re probably in good shape; if it’s giving you 20% accuracy, you might need to take another look at whether you have a viable solution given your data.

The Foundation of Naive Bayes: Bayes’ Theorem

Bayes’ theorem is pretty simple:

P(B|A) = \dfrac{P(A|B) * P(B)}{P(A)}

In case you’re not familiar with the notation, here’s a quick breakdown, followed by a short code sketch:

  • P(B|A) is called the posterior probability. It represents the probability of hypothesis B being correct given input data A. Let’s cover the other elements and loop back to this at the end.
  • P(A|B) is the probability of us seeing input data A if we know that hypothesis B is true. In other words, if Bob always wears red hats on Thursdays, P(Bob is wearing a red hat | Today is Thursday) is 1. If Bob wears red hats on alternating Thursdays, P(Bob is wearing a red hat | Today is Thursday) is 0.5. And so on.
  • P(B) is called the prior probability. It represents the likelihood that hypothesis B is true regardless of any specific input data. Continuing with the example above, it is the probability that today is Thursday, which is 1/7 unless you are re-living Groundhog Day.
  • P(A) is the probability of us seeing data A in our sample. In our simple example, it is the likelihood that Bob will wear a red hat.
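To make the notation a bit more concrete, here is a tiny R sketch of the theorem as a function. This is just my own illustration; the names bayes_posterior, likelihood, prior, and evidence aren’t from any package, they are simply labels for P(A|B), P(B), and P(A).

    # A minimal sketch of Bayes' Theorem as a function.
    # likelihood = P(A|B), prior = P(B), evidence = P(A)
    bayes_posterior <- function(likelihood, prior, evidence) {
      likelihood * prior / evidence
    }

We’ll put some actual numbers into this in just a moment.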

Now that I’ve gone into the elements, let me wrap back around to the term posterior probability. Continuing with the example, our goal is to figure out what day it is based solely on Bob’s headgear. In other words, finding P(Today is Thursday | Bob is wearing a red hat), read as “the probability that today is Thursday given that Bob is wearing a red hat.”

We started out with a prior: the information we expect (or know) beforehand. Our prior is that, knowing nothing else, it is equally likely to be any day of the week, so P(Today is Thursday) is 1/7.

From there, we update our prior by adding new information into the mix. We know that the probability of Bob wearing a red hat is 50% when it is Thursday—for lore-building, let’s say that Bob is a high school football nut and always wears his kids’ high school cap for some Thursday night action. Oh, and that there are 26 games a year because shaddup. So P(Bob is wearing a red hat | Today is Thursday) is 0.5, or 50%.

But knowing that Bob wears red hats on half of all Thursdays isn’t enough for us—we need a bit more. We need to know how often Bob wears his red hat in general. So let’s say that we have 52*7=364 data points and Bob wore his red hat 36 times. That means P(Bob is wearing a red hat) is 0.099 in our data set.

Armed with this information, we can calculate the posterior probability:

\dfrac{(0.5)(0.143)}{(0.099)} \approx 0.722

In other words, given what we know, there is a 72% chance that, if we see Bob wearing a red hat, it is Thursday.
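Here is that same calculation in R. The variable names are just my own shorthand for the three quantities we worked out above, and the result matches the hand calculation.

    # Bob's red-hat example, using the same numbers as the text above.
    p_hat_given_thu <- 0.5       # P(red hat | Thursday): alternating Thursdays
    p_thu           <- 1 / 7     # P(Thursday): flat prior over the week
    p_hat           <- 36 / 364  # P(red hat): 36 red-hat days out of 364 observed

    p_thu_given_hat <- p_hat_given_thu * p_thu / p_hat
    p_thu_given_hat
    ## [1] 0.7222222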

If you decided to ignore all of the rest of the details and just focus on the fact that Bob wore his hat on 26 Thursdays and 36 times in total, you could also take 26/36 and get to .722 as well. So this seems like a lot of calculation for nothing, at least until we introduce the real kicker.

One Naive Assumption

What makes Naive Bayes so useful is something which seems absurd at first blush: we assume that all inputs are independent. This sounds like a bad idea: there are complex interactions between different inputs, so simply ignoring them seems like we’re throwing away valuable information. But it turns out that this simplification retains most of our valuable information while making the calculations much easier.

Here’s the new formula:

P(B|A) = \dfrac{P(x_1|B) * P(x_2|B) * \ldots * P(x_n|B) * P(B)}{P(A)}

Now A is a specific collection of {x1, x2, …, xn} inputs. Let’s say this could be things like Bob’s head covering, how surly your co-workers are today (on a scale from 1 to 5, so now it’s totally scientific), and the amount of junk mail you received the day before. Some of these things may have some overlap, but for the purposes of Naive Bayes, we assume all of that away and calculate these probabilities (red hat, blue hat, no hat for Bob; 1-5 on the Surl Scale; pounds of junk mail) in our data set separately.

We’ve also changed the nature of B a little bit. Now B is a collection of {B1, B2, …, Bn} different choices. So now, instead of asking if today is Thursday, we try to figure out which day of the week it is based on the specific values of our inputs. We might end up with a 45% chance of it being Thursday, a 29% chance of Tuesday, a 17% chance of Wednesday, a 6% chance of Friday, and a 3% chance of Monday. In that case, our model predicts that a day with those particular inputs is Thursday.

That’s what we mean by the model being probabilistic: even though we end up with a result like “The model predicts that the day is Thursday given those inputs,” we really end up with a series of likelihoods, and it’s up to us to figure out any cutoffs between “This day is Thursday” versus “This day could be Tuesday or Thursday” to “I have no idea what day it is and basing that off of some random garbage you pulled together is absurd.” Naturally, these algorithms tend to be a bit too polite to give you the third answer directly, but you can infer it if you look hard enough.
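To make that concrete, here is a toy sketch of the scoring step in R. Every number below is invented purely for illustration: the feature set, the likelihood table, all of it. The point is just the mechanics: for each candidate day, multiply the per-feature likelihoods together, multiply by the prior, and normalize.

    # A toy sketch of Naive Bayes scoring across all seven days.
    # All of the likelihood values below are made up for illustration only.
    days   <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
    priors <- rep(1 / 7, 7)      # flat prior over the days of the week

    # One row per observed feature value: P(feature value | day)
    likelihoods <- rbind(
      red_hat   = c(0.05, 0.05, 0.05, 0.50, 0.10, 0.05, 0.05),
      surly_4   = c(0.30, 0.25, 0.20, 0.15, 0.05, 0.02, 0.03),
      junk_mail = c(0.10, 0.20, 0.25, 0.30, 0.10, 0.03, 0.02)
    )
    colnames(likelihoods) <- days

    # The naive assumption: treat the features as independent and multiply.
    scores <- priors * apply(likelihoods, 2, prod)

    # P(A) is the same for every candidate day, so normalizing the scores
    # gives us the posterior probability for each day.
    posteriors <- scores / sum(scores)
    round(posteriors, 3)

    # The model's prediction is the day with the highest posterior.
    names(which.max(posteriors))

Note that we never calculate P(A) directly here: because it is identical for every candidate day, dividing each day’s score by the sum of all of the scores gets us to the same place.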

Conclusion

In today’s post, we looked at the basics of a Naive Bayes model. We walked through Bayes’ Theorem and solved a simple probability question. Finally, we introduced the concept of independence. In the next post (which actually will drop on a Thursday), we will solve a simple but realistic two-class classification problem by hand with Naive Bayes.
