*This is part four in a series on **classification with Naive Bayes**.*

### Classification Of Features With R

So far, we’ve walked through the Naive Bayes class of algorithms by hand, learning the details of how it works. Now that we have a good understanding of the mechanics at that level, let’s let the computer do what it does better than us: mindless calculations.

#### Of Course It’s The Iris Data Set

Our first example is a classic: the iris data set in R. We are going to use the `naivebayes` R package to implement Naive Bayes for us and classify this iris data set. To give us an idea of how good the classifier is, I’m going to use 80% of the data for training and reserve 20% for testing. I’ll set a particular seed so that if you want to try it at home, you can end up with the same results as me.

Here’s the code, which I’ll follow up with some discussion.

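```r
# Install each package on first use, then load it.
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)
set.seed(1773)

# Shuffle the rows. The second shuffle adds no extra randomness, but it
# stays so that the seed reproduces the results shown in this post.
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

# First 80% (120 rows) for training, remaining 20% (30 rows) for testing.
iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

# Fit Naive Bayes: predict Species from all of the other columns.
nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)
plot(nb)

# Attach predictions to the test set and compare them to the actual species.
iris.output <- cbind(iris.test, prediction = predict(nb, iris.test))
caret::confusionMatrix(iris.output$prediction, iris.output$Species)
```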

All in all, a couple dozen lines of code to do the job. The first two `if` statements load our packages: `naivebayes` and `caret`. I could use `caret` to split my training and test data, but because it’s such a small data set, I figured I’d shuffle it in place and assign the first 80% to `iris.train` and leave the remaining 20% for `iris.test`.

The key function is `naive_bayes` in the `naivebayes` package. In this case, we are predicting `Species` given all of the other inputs on `iris.train`.
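To connect this back to the by-hand work, here is a rough sketch of what a Gaussian Naive Bayes fit boils down to for numeric features like these, using only base R. It assumes each feature is normally distributed within each class and, to keep things short, it trains and scores on the full `iris` data set rather than the 80/20 split. The names `mu`, `sigma`, and `score_row` are mine for illustration; this is not the internal code of `naive_bayes`.

```r
data(iris)
features <- names(iris)[1:4]
classes  <- levels(iris$Species)

# Class priors, plus the mean and standard deviation of every feature
# within every class (rows = classes, columns = features).
priors <- table(iris$Species) / nrow(iris)
mu     <- sapply(features, function(f) tapply(iris[[f]], iris$Species, mean))
sigma  <- sapply(features, function(f) tapply(iris[[f]], iris$Species, sd))

# Log of prior times the product of per-feature Gaussian densities,
# computed for one observation against every class.
score_row <- function(x) {
  sapply(classes, function(k) {
    log(priors[[k]]) +
      sum(dnorm(as.numeric(x[features]), mu[k, ], sigma[k, ], log = TRUE))
  })
}
scores <- t(apply(iris[, features], 1, score_row))

# Normalize the scores into posterior probabilities that sum to 1 per row.
posterior <- exp(scores - apply(scores, 1, max))
posterior <- posterior / rowSums(posterior)

# Predict the class with the highest score.
pred <- factor(classes[max.col(scores, ties.method = "first")], levels = classes)
mean(pred == iris$Species)
```

Because this scores the same rows it was fit on, the accuracy it prints is optimistic compared to a held-out test set.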

If you use the seed that I’ve set, you’ll end up with the same four plots I did, one for each feature. Here they are:

Looking at these images, sepal length and sepal width aren’t very helpful for us: what we want is a great separating equilibrium, that is, one where the class distributions barely overlap. Petal length and petal width are better: setosa is clearly different from the others, though there is some overlap between versicolor and virginica, which will lead to some risk of ambiguity.
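One quick numeric check on how much the classes overlap is to compare the range of a feature per species. This sketch uses the full `iris` data set rather than `iris.train`, and `feature_range` is a helper name I made up for illustration.

```r
data(iris)

# Smallest and largest value of one feature within each species.
feature_range <- function(f) {
  r <- t(sapply(split(iris[[f]], iris$Species), range))
  colnames(r) <- c("min", "max")
  r
}

petal_length <- feature_range("Petal.Length")
sepal_width  <- feature_range("Sepal.Width")

petal_length
sepal_width
```

Setosa’s petal lengths stop short of where versicolor’s begin, while the versicolor and virginica ranges overlap. The sepal width ranges overlap for all three species.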

#### Maximizing Confusion

Once we have our output, we can quickly generate a confusion matrix using `caret`. I like using this a lot more than building my own with e.g. `table(iris.output$Species, iris.output$prediction)`. The reason I prefer what `caret` has to offer is that it also includes statistics like positive predictive value and negative predictive value. These tend to be at least as important as accuracy when performing classification, especially for scenarios where one class is extremely likely and the other extremely unlikely.

Here is the confusion matrix output from `caret`. After that, I’ll explain positive and negative predictive values.

```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          6          0         0
  versicolor      0         11         1
  virginica       0          1        11

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.4
    P-Value [Acc > NIR] : 1.181e-09

                  Kappa : 0.8958

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0            0.9167           0.9167
Specificity                    1.0            0.9444           0.9444
Pos Pred Value                 1.0            0.9167           0.9167
Neg Pred Value                 1.0            0.9444           0.9444
Prevalence                     0.2            0.4000           0.4000
Detection Rate                 0.2            0.3667           0.3667
Detection Prevalence           0.2            0.4000           0.4000
Balanced Accuracy              1.0            0.9306           0.9306
```

**Positive predictive value** for a category is: if my model predicts that a particular set of inputs matches a particular class, what is the probability that this judgement is correct? For example, we have 12 versicolor predictions (read the “versicolor” Prediction row across and sum up values). 11 of those 12 were actually versicolor, so our positive predictive value is 11/12 = 0.9167.

**Negative predictive value** for a category is: if my model predicts that a particular set of inputs does *not* match a particular class, what is the probability that this judgement is correct? For example, we have 18 predictions which were *not* versicolor (sum up all of the values across the rows *except for the versicolor row*). Of those 18, 1 was actually versicolor (read the *versicolor* column and ignore the point where the prediction was versicolor). Therefore, 17 of our 18 negative predictions for versicolor were correct, so our negative predictive value is 17/18 = 0.9444.
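The same arithmetic can be sketched in base R by typing the confusion matrix in by hand. The variable names here (`cm`, `cls`, `ppv`, `npv`) are just for illustration.

```r
species <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (actual) species.
cm <- matrix(c(6,  0,  0,
               0, 11,  1,
               0,  1, 11),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

cls <- "versicolor"
other_rows <- rownames(cm) != cls

# Positive predictive value: correct versicolor predictions
# divided by all versicolor predictions (the versicolor row).
ppv <- cm[cls, cls] / sum(cm[cls, ])

# Negative predictive value: of the predictions that were not versicolor,
# the share whose actual species was also not versicolor.
negative_predictions <- sum(cm[other_rows, ])
missed_versicolor    <- sum(cm[other_rows, cls])
npv <- (negative_predictions - missed_versicolor) / negative_predictions

round(c(ppv = ppv, npv = npv), 4)
```

Changing `cls` to another species gives that class’s values.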

This is a small data set with relatively little variety and only one real place for ambiguity, so it’s a little boring. So let’s do something a bit more interesting: sentiment analysis.

### Yes, Mr. Sherman, Everything Stinks

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped and has 2,000 reviews in total.

Unlike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.

First up, we load some packages. I’ll use `naivebayes` to perform classification and `tm` for text mining. If you’re a `tidytext` fan, you can certainly use that for this work too.

```r
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if(!require(tm)) {
  install.packages("tm")
  library(tm)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}
```

We’ll next load the data, shuffle it, and build a corpus from the review text. The split into training and test data sets comes a little later.

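```r
df <- readr::read_csv("../data/movie-pang02.csv")
set.seed(1)

# Shuffle the rows. The second shuffle adds no extra randomness, but it
# stays so that the seed reproduces the results shown in this post.
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]
df$class <- as.factor(df$class)

# Build a corpus with one document per review, then print a summary of it.
corpus <- tm::Corpus(tm::VectorSource(df$text))
corpus
```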

I’m going to stop here and lay out a warning: building the corpus from the full data set like this **will** leak information. If your test data set includes words your training data set does not, the trained model will gain knowledge of those additional words and of the fact that they don’t appear in the training set. In a real project, I’d build a corpus off of the training data and then apply those rules to the test set, using Laplace Smoothing or a similar technique to deal with any test words not in the training set.
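As a minimal sketch of that leak-free approach, here is the idea in base R on a few made-up documents rather than the movie reviews. The helpers `tokenize` and `to_features` are hypothetical names for illustration and are not part of `tm`.

```r
# Hypothetical toy documents; these are not from the movie review data.
train_docs <- c("great fun film", "dull boring film", "great cast")
test_docs  <- c("boring plot", "great fun sequel")

# Lower-case a document and split it into words.
tokenize <- function(doc) strsplit(tolower(doc), "[^a-z]+")[[1]]

# The vocabulary comes from the training documents only.
vocab <- sort(unique(unlist(lapply(train_docs, tokenize))))

# Presence/absence features restricted to that vocabulary.
# Words outside the vocabulary are silently dropped.
to_features <- function(docs, vocab) {
  m <- t(sapply(docs, function(d) as.integer(vocab %in% tokenize(d))))
  dimnames(m) <- list(NULL, vocab)
  m
}

x_train <- to_features(train_docs, vocab)
x_test  <- to_features(test_docs, vocab)
x_test
```

The test-only words “plot” and “sequel” never become columns, so nothing about the test set shapes the feature space.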

With that warning said, I’m now going to clean up the data by converting everything to lower-case, removing punctuation and numbers, removing stopwords, and stripping out any extraneous whitespace. This reduces the total document space and gives us a more consistent set of words.

```r
corpus.clean <- corpus %>%
  tm::tm_map(tm::content_transformer(tolower)) %>%
  tm::tm_map(tm::removePunctuation) %>%
  tm::tm_map(tm::removeNumbers) %>%
  tm::tm_map(tm::removeWords, tm::stopwords(kind="en")) %>%
  tm::tm_map(tm::stripWhitespace)
dtm <- tm::DocumentTermMatrix(corpus.clean)
```

Then we turn our words into features using the bag of words technique. It’s not the fanciest or best, but it’s quick-and-easy—sort of like Naive Bayes.

Once we have the document term matrix, we can build out our training and test data. I already shuffled at the beginning, so we split out our elements into training and test, reserving 25% for test.

```r
df.train <- df[1:1500,]
df.test <- df[1501:2000,]
dtm.train <- dtm[1:1500,]
dtm.test <- dtm[1501:2000,]
corpus.clean.train <- corpus.clean[1:1500]
corpus.clean.test <- corpus.clean[1501:2000]
```

After doing this, our training data set includes 38,957 unique terms, but many of these only appear in one review. That’s great for pinpointing a specific document (a particular review), but not as great for classification: they won’t help me pick a good class and they take up memory, so let’s get rid of them. I’ll throw away any term which appears fewer than 5 times across the training documents. This will get me down to 12,144 terms, or just under a third of the original total.

After that, I will rebuild the document term matrices for training and testing, as we want to take advantage of that smaller domain.

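```r
# Terms which occur at least 5 times in the training data.
fiveFreq <- tm::findFreqTerms(dtm.train, 5)

# Rebuild both matrices using only those terms as columns.
dtm.train.nb <- tm::DocumentTermMatrix(corpus.clean.train, control=list(dictionary = fiveFreq))
dtm.test.nb <- tm::DocumentTermMatrix(corpus.clean.test, control=list(dictionary = fiveFreq))
```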

From here, I am going to create a function which helps me determine **whether** a term has appeared in a document, which is more important than **how many times** a term has appeared in the document. This prevents one document making heavy use of a term from biasing us too much toward that document’s class. Going back to our baseball versus business example, it’d be like a single business article writing about “going to the bullpen” over and over, using that as a metaphor for something business-related. Most business documents will not use the term bullpen (whereas plenty of baseball documents will), so a single business document applying a baseball metaphor shouldn’t ruin our model.

After doing that, I’ll run the `naive_bayes` function with Laplace Smoothing turned on (`laplace = 1`) and predict what our test values will look like.

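```r
# Turn term counts into presence/absence: "Yes" if the term appears at all.
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}

# Apply the conversion to every term (column) in each matrix.
trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)

# Train with Laplace Smoothing, predict the test set, and score the results.
classifier <- naivebayes::naive_bayes(trainNB, df.train$class, laplace = 1)
pred <- predict(classifier, newdata=testNB)
conf.mat <- caret::confusionMatrix(pred, df.test$class, positive="Pos")
conf.mat
```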
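To see why the smoothing matters, here is a small sketch of the usual add-one estimate for a two-level (“No”/“Yes”) feature. The counts are made up for illustration, and the formula is the standard Laplace estimate rather than something lifted from the package source.

```r
# Hypothetical counts for one term; these are not from the movie review data.
n_docs      <- c(Neg = 750, Pos = 750)   # reviews per class
n_with_term <- c(Neg = 0,   Pos = 12)    # reviews in each class containing the term

laplace  <- 1
n_levels <- 2   # the feature has two levels: "No" and "Yes"

# Without smoothing, a term never seen in a class gets probability zero,
# which zeroes out the entire product for that class.
p_unsmoothed <- n_with_term / n_docs

# With smoothing, every level gets `laplace` extra pseudo-observations.
p_smoothed <- (n_with_term + laplace) / (n_docs + laplace * n_levels)

rbind(p_unsmoothed, p_smoothed)
```

The smoothed estimate for the negative class is small but not zero, so one unseen term can no longer veto a class on its own.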

Then we can look at the confusion matrix. Here’s how it looks:

```
Confusion Matrix and Statistics

          Reference
Prediction Neg Pos
       Neg 224  54
       Pos  41 181

               Accuracy : 0.81
                 95% CI : (0.7728, 0.8435)
    No Information Rate : 0.53
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.6174

 Mcnemar's Test P-Value : 0.2183

            Sensitivity : 0.7702
            Specificity : 0.8453
         Pos Pred Value : 0.8153
         Neg Pred Value : 0.8058
             Prevalence : 0.4700
         Detection Rate : 0.3620
   Detection Prevalence : 0.4440
      Balanced Accuracy : 0.8077

       'Positive' Class : Pos
```

Overall, our classifier has an accuracy of 81%. But this example lets us look at four important measures: sensitivity, specificity, positive predictive value, and negative predictive value. This is a simple two-class, binary classifier so these definitions are pretty simple. The tricky part is that the confusion matrix in `caret` orders alphabetically, whereas ideally you want the “positive” result first and the “negative” result last. Passing `positive="Pos"` at least tells `caret` which class counts as positive.

In the definitions below, R is the reference (actual) class, P is the predicted class, and a pair such as (Rpos|Ppos) is the count of reviews which are both actually positive and predicted positive.

**Sensitivity** asks: when a review is actually positive, how often does our predictor consider it positive? It is defined as (Rpos|Ppos) / (Rpos). That is, 181/(181+54) or 0.7702.

**Specificity** asks: when a review is actually negative, how often does our predictor consider it negative? It is defined as (Rneg|Pneg) / (Rneg). That is, 224/(224+41) or 0.8453.

**Positive predictive value** looks at all cases where the Prediction was positive (read the “Pos” row) and asks how often that prediction was correct. It is defined as (Ppos|Rpos) / (Ppos). That is, 181/(181+41) or 0.8153.

**Negative predictive value** looks at all cases where the Prediction was negative (read the “Neg” row) and asks how often that prediction was correct. It is defined as (Pneg|Rneg) / (Pneg). That is, 224/(224+54) or 0.8058.
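These four calculations can be reproduced in base R from the confusion matrix above. The matrix is typed in by hand, and the variable names are only for illustration.

```r
# Rows are predictions, columns are the reference (actual) class.
cm <- matrix(c(224, 41,     # Reference = Neg: predicted Neg, predicted Pos
               54, 181),    # Reference = Pos: predicted Neg, predicted Pos
             nrow = 2,
             dimnames = list(Prediction = c("Neg", "Pos"),
                             Reference  = c("Neg", "Pos")))

# Column sums are totals by actual class; row sums are totals by prediction.
sensitivity <- cm["Pos", "Pos"] / sum(cm[, "Pos"])   # 181 / (181 + 54)
specificity <- cm["Neg", "Neg"] / sum(cm[, "Neg"])   # 224 / (224 + 41)
ppv         <- cm["Pos", "Pos"] / sum(cm["Pos", ])   # 181 / (181 + 41)
npv         <- cm["Neg", "Neg"] / sum(cm["Neg", ])   # 224 / (224 + 54)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 4)
```

Sensitivity and specificity divide by column totals, while the two predictive values divide by row totals. That is the whole difference between the two pairs.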

Overall, our Naive Bayes classifier was in the 75-85% range for all five of our major measures. If we need to get to 85-90%, this is a good sign: Naive Bayes is getting us most of the way there, so better classifier algorithms should get us over the top.

### Conclusion

In today’s post, we dug into the `naivebayes` R package and showed how we could solve for Naive Bayes with and without Laplace Smoothing in just a few lines of code.

If you want to learn more, check out Classification with Naive Bayes, a talk I’ve put together on the topic.