This is part four in a series on classification with Naive Bayes.
Classification Of Features With R
So far, we’ve walked through the Naive Bayes class of algorithms by hand, learning the details of how it works. Now that we have a good understanding of the mechanics at that level, let’s let the computer do what it does better than us: mindless calculations.
Of Course It’s The Iris Data Set
Our first example is a classic: the iris data set in R. We are going to use the naivebayes R package to implement Naive Bayes for us and classify the iris data set. To give us an idea of how good the classifier is, I'm going to use 80% of the data for training and reserve 20% for testing. I'll set a particular seed so that if you want to try it at home, you can end up with the same results as me.
Here’s the code, which I’ll follow up with some discussion.
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}
data(iris)
set.seed(1773)
# Shuffle the rows (twice, for good measure) so the 80/20 split below is random.
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]
# 120 rows (80%) for training, 30 rows (20%) for testing.
iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]
# Fit the model: predict Species from every other column.
nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)
# One density plot per feature.
plot(nb)
# Attach predictions to the test set and build a confusion matrix.
iris.output <- cbind(iris.test, prediction = predict(nb, iris.test))
caret::confusionMatrix(iris.output$prediction, iris.output$Species)
All in all, a couple dozen lines of code to do the job. The first two if statements load our packages: naivebayes and caret. I could use caret to split my training and test data, but because it's such a small data set, I figured I'd shuffle it in place and assign the first 80% to iris.train and leave the remaining 20% for iris.test.
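As an aside, if you did want caret to handle the split, a minimal sketch might look like the following. This is not what the post does, and the variable names ending in .alt are mine; createDataPartition also stratifies by class, which is a nice bonus.
# Hedged alternative to the manual shuffle-and-slice above: a stratified 80/20 split.
idx <- caret::createDataPartition(iris$Species, p = 0.8, list = FALSE)
iris.train.alt <- iris[idx, ]
iris.test.alt  <- iris[-idx, ]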
The key function is naive_bayes in the naivebayes package. In this case, we are predicting Species given all of the other inputs on iris.train.
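One small aside: predict on a naive_bayes model returns hard class labels by default, but it also accepts type = "prob" if you want to see the posterior probabilities behind each call. A quick sketch, assuming the nb model fit above:
# Posterior probabilities for the test rows instead of hard class labels.
head(predict(nb, iris.test, type = "prob"))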
If you run the code with the seed I've set, you'll end up with the same four plots I did, one for each feature. Here they are:
[Four density plots from plot(nb), one per feature: sepal length, sepal width, petal length, and petal width, each showing the distribution by species.]
Looking at these images, sepal length and sepal width aren't very helpful for us: what we want is clean separation between the classes, that is, distributions which overlap as little as possible. Petal length and petal width are better. Setosa is clearly different from the others, though there is some overlap between versicolor and virginica, which will lead to some risk of ambiguity.
Maximizing Confusion
Once we have our output, we can quickly generate a confusion matrix using caret. I like using this a lot more than building my own with e.g. table(iris.output$Species, iris.output$prediction). The reason I prefer what caret has to offer is that it also includes statistics like positive predictive value and negative predictive value. These tend to be at least as important as accuracy when performing classification, especially for scenarios where one class is extremely likely and the other extremely unlikely.

Here is the confusion matrix output from caret. After that, I'll explain positive and negative predictive values.
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          6          0         0
  versicolor      0         11         1
  virginica       0          1        11

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.4
    P-Value [Acc > NIR] : 1.181e-09

                  Kappa : 0.8958
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0            0.9167           0.9167
Specificity                    1.0            0.9444           0.9444
Pos Pred Value                 1.0            0.9167           0.9167
Neg Pred Value                 1.0            0.9444           0.9444
Prevalence                     0.2            0.4000           0.4000
Detection Rate                 0.2            0.3667           0.3667
Detection Prevalence           0.2            0.4000           0.4000
Balanced Accuracy              1.0            0.9306           0.9306
Positive predictive value for a category is: if my model predicts that a particular set of inputs matches a particular class, what is the probability that this judgement is correct? For example, we made 12 versicolor predictions (read the "versicolor" Prediction row across and sum up the values). 11 of those 12 were actually versicolor, so our positive predictive value is 11/12 = 0.9167.
Negative predictive value for a category is: if my model predicts that a particular set of inputs does not match a particular class, what is the probability that this judgement is correct? For example, we have 18 predictions which were not versicolor (sum up all of the values across the rows except for the versicolor row). Of those 18, 1 was actually versicolor (read the versicolor column and ignore the point where the prediction was versicolor). Therefore, 17 of our 18 negative predictions for versicolor were correct, so our negative predictive value is 17/18 = 0.9444.
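To make that arithmetic concrete, here is a small sketch which rebuilds the 3x3 matrix by hand and recomputes versicolor's positive and negative predictive values; the counts are taken straight from the output above, and the cm variable name is mine.
# Rebuild the 3x3 confusion matrix from the caret output above.
cm <- matrix(c(6,  0,  0,
               0, 11,  1,
               0,  1, 11),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("setosa", "versicolor", "virginica"),
                             Reference  = c("setosa", "versicolor", "virginica")))
tp <- cm["versicolor", "versicolor"]   # predicted versicolor and actually versicolor
fp <- sum(cm["versicolor", ]) - tp     # predicted versicolor, actually something else
fn <- sum(cm[, "versicolor"]) - tp     # actually versicolor, predicted something else
tn <- sum(cm) - tp - fp - fn           # neither predicted nor actually versicolor
tp / (tp + fp)   # positive predictive value: 11/12 = 0.9167
tn / (tn + fn)   # negative predictive value: 17/18 = 0.9444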
This is a small data set with relatively little variety and only one real place for ambiguity, so it’s a little boring. So let’s do something a bit more interesting: sentiment analysis.
Yes, Mr. Sherman, Everything Stinks
Now we're going to look at movie reviews and predict whether a review is positive or negative based on its words. If you want to play along at home, grab the data set: it's under 3MB zipped and contains 2,000 reviews in total.
Unlike last time, I'm going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.
First up, we load some packages. I'll use naivebayes to perform classification and tm for text mining. If you're a tidytext fan, you can certainly use that for this work too.
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if(!require(tm)) {
  install.packages("tm")
  library(tm)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}
We’ll next load the data and split it into training and test data sets.
df <- readr::read_csv("../data/movie-pang02.csv")
set.seed(1)
# Shuffle the reviews (twice) so the later train/test split is random.
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]
df$class <- as.factor(df$class)
# Build a text-mining corpus with one document per review.
corpus <- tm::Corpus(tm::VectorSource(df$text))
corpus
I'm going to stop here and lay out a warning: building the corpus from the entire data set leaks information. If your test data set includes words your training data set does not, the trained model still ends up aware of those additional words, even though it never saw them during training. In a real project, I'd build the corpus off of the training data alone and then apply those rules to the test set, using Laplace smoothing or a similar technique to deal with any test words not in the training set.
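For reference, here is a rough sketch of the leak-free approach: build the vocabulary from the training documents only, then restrict the test document term matrix to that dictionary. It skips the cleaning steps shown below and the variable names are mine, so treat it as an outline rather than a drop-in replacement.
# Sketch only: fit the vocabulary on the first 1,500 (training) reviews, then
# force the test DTM to use that same dictionary so no test-only terms leak in.
train.corpus <- tm::Corpus(tm::VectorSource(df$text[1:1500]))
test.corpus  <- tm::Corpus(tm::VectorSource(df$text[1501:2000]))
dtm.train.only <- tm::DocumentTermMatrix(train.corpus)
dtm.test.only  <- tm::DocumentTermMatrix(test.corpus,
                                         control = list(dictionary = tm::Terms(dtm.train.only)))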
With that warning said, I’m now going to clean up the data by converting everything to lower-case, removing punctuation and numbers, removing stopwords, and stripping out any extraneous whitespace. This reduces the total document space and gives us a more consistent set of words.
corpus.clean <- corpus %>%
tm::tm_map(tm::content_transformer(tolower)) %>%
tm::tm_map(tm::removePunctuation) %>%
tm::tm_map(tm::removeNumbers) %>%
tm::tm_map(tm::removeWords, tm::stopwords(kind="en")) %>%
tm::tm_map(tm::stripWhitespace)
dtm <- tm::DocumentTermMatrix(corpus.clean)
The DocumentTermMatrix call above turns our words into features using the bag of words technique. It's not the fanciest or best approach, but it's quick and easy, sort of like Naive Bayes itself.
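If you want to see what the bag of words representation actually looks like, you can peek at a small corner of the document term matrix. This is just an aside, and the row and column ranges here are arbitrary.
# Show a handful of documents and terms, plus the matrix's overall sparsity.
tm::inspect(dtm[1:5, 1:8])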
Once we have the document term matrix, we can build out our training and test data. I already shuffled at the beginning, so we split out our elements into training and test, reserving 25% for test.
# First 1,500 reviews (75%) for training; last 500 (25%) for testing.
df.train <- df[1:1500,]
df.test <- df[1501:2000,]
dtm.train <- dtm[1:1500,]
dtm.test <- dtm[1501:2000,]
corpus.clean.train <- corpus.clean[1:1500]
corpus.clean.test <- corpus.clean[1501:2000]
After doing this, our training data set includes 38,957 unique terms, but many of these only appear in one review. That's great for pinpointing a specific document (a particular review), but not as great for classification: they won't help me pick a good class and they take up memory, so let's get rid of them. I'll throw away any term which appears fewer than 5 times in the training data. This gets me down to 12,144 terms, or just under a third of the original total.
After that, I will rebuild the document term matrices for training and testing, as we want to take advantage of that smaller domain.
fiveFreq <- tm::findFreqTerms(dtm.train, 5)
dtm.train.nb <- tm::DocumentTermMatrix(corpus.clean.train, control=list(dictionary = fiveFreq))
dtm.test.nb <- tm::DocumentTermMatrix(corpus.clean.test, control=list(dictionary = fiveFreq))
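As an alternative (not what this post does), tm also offers removeSparseTerms, which filters on the fraction of documents a term appears in rather than a raw frequency. Something like the following would be roughly in the same spirit, though the surviving term count will differ from the findFreqTerms approach above; the dtm.train.sparse name is mine.
# Keep only terms appearing in more than ~0.3% of training documents
# (1 - 5/1500 is about 0.9967, rounded here to 0.997). Similar in spirit to the
# lowfreq = 5 filter above, but not identical.
dtm.train.sparse <- tm::removeSparseTerms(dtm.train, sparse = 0.997)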
From here, I am going to create a function which helps me determine whether a term has appeared in a document, which is more important than how many times a term has appeared in the document. This prevents one document making heavy use of a term from biasing us too much toward that document’s class. Going back to our baseball versus business example, it’d be like a single business article writing about “going to the bullpen” over and over, using that as a metaphor for something business-related. Most business documents will not use the term bullpen (whereas plenty of baseball documents will), so a single business document applying a baseball metaphor shouldn’t ruin our model.
After doing that, I'll run the naive_bayes function with Laplace smoothing turned on (laplace = 1) and predict what our test values will look like.
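(Roughly speaking, Laplace smoothing with laplace = 1 turns a conditional probability estimate of count / n into (count + 1) / (n + k), where k is the number of values the feature can take; here each term is just "Yes" or "No", so k = 2. The practical effect is that a word which never co-occurs with a class in the training data contributes a small probability rather than zeroing out the entire product.)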
# Convert raw term counts into presence/absence: "Yes" if the term appears in
# the document at all, "No" otherwise.
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)
# Train with Laplace smoothing and generate predictions for the test set.
classifier <- naivebayes::naive_bayes(trainNB, df.train$class, laplace = 1)
pred <- predict(classifier, newdata=testNB)
conf.mat <- caret::confusionMatrix(pred, df.test$class, positive="Pos")
conf.mat
Then we can look at the confusion matrix. Here’s how it looks:
Confusion Matrix and Statistics

          Reference
Prediction Neg Pos
       Neg 224  54
       Pos  41 181

               Accuracy : 0.81
                 95% CI : (0.7728, 0.8435)
    No Information Rate : 0.53
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.6174
 Mcnemar's Test P-Value : 0.2183

            Sensitivity : 0.7702
            Specificity : 0.8453
         Pos Pred Value : 0.8153
         Neg Pred Value : 0.8058
             Prevalence : 0.4700
         Detection Rate : 0.3620
   Detection Prevalence : 0.4440
      Balanced Accuracy : 0.8077

       'Positive' Class : Pos
Overall, our classifier has an accuracy of 81%. But this example lets us look at four important measures: sensitivity, specificity, positive predictive value, and negative predictive value. This is a simple two-class, binary classifier, so the definitions are pretty simple. The one tricky part is that caret orders the classes in the confusion matrix alphabetically, whereas ideally you'd want the "positive" result first and the "negative" result last.
Sensitivity asks: of the cases which really are positive, how many did our model call positive? It is TP / (TP + FN), which here is 181 / (181 + 54) = 0.7702.

Specificity asks: of the cases which really are negative, how many did our model call negative? It is TN / (TN + FP), which here is 224 / (224 + 41) = 0.8453.

Positive predictive value looks at all cases where the prediction was positive (read the "Pos" row): of those, how many really are positive? It is TP / (TP + FP), which here is 181 / (181 + 41) = 0.8153.

Negative predictive value looks at all cases where the prediction was negative (read the "Neg" row): of those, how many really are negative? It is TN / (TN + FN), which here is 224 / (224 + 54) = 0.8058.
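If you want to double-check caret's arithmetic, here is a tiny sketch using the cell counts from the 2x2 matrix above; the names are mine.
# Recompute the four headline measures by hand.
tp <- 181; fp <- 41; fn <- 54; tn <- 224
c(sensitivity    = tp / (tp + fn),   # 0.7702
  specificity    = tn / (tn + fp),   # 0.8453
  pos_pred_value = tp / (tp + fp),   # 0.8153
  neg_pred_value = tn / (tn + fn))   # 0.8058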
Overall, our Naive Bayes classifier was in the 75-85% range for all five of our major measures. If we need to get to 85-90%, this is a good sign: Naive Bayes is getting us most of the way there, so better classifier algorithms should get us over the top.
Conclusion
In today's post, we dug into the naivebayes R package and showed how we can apply Naive Bayes, with and without Laplace smoothing, in just a few lines of code.
If you want to learn more, check out Classification with Naive Bayes, a talk I’ve put together on the topic.