Solving Naive Bayes With R

This is part four in a series on classification with Naive Bayes.

Classification Of Features With R

So far, we’ve walked through the Naive Bayes class of algorithms by hand, learning the details of how it works. Now that we have a good understanding of the mechanics at that level, let’s let the computer do what it does better than us: mindless calculations.

Of Course It’s The Iris Data Set

Our first example is a classic: the iris data set in R. We are going to use the naivebayes R package to implement Naive Bayes for us and classify this iris data set. To give us an idea of how good the classifier is, I’m going to use 80% of the data for training and reserve 20% for testing. I’ll set a particular seed so that if you want to try it at home, you can end up with the same results as me.

Here’s the code, which I’ll follow up with some discussion.

if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)

set.seed(1773)
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)

plot(nb)

iris.output <- cbind(iris.test, prediction = predict(nb, iris.test))

caret::confusionMatrix(iris.output$prediction, iris.output$Species)

All in all, a couple dozen lines of code to do the job. The first two if statements load our packages: naivebayes and caret. I could use caret to split my training and test data, but because it’s such a small data set, I figured I’d shuffle it in place and assign the first 80% to iris.train and leave the remaining 20% for iris.test.
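As an aside, the shuffle-then-slice split above can also be written as a single index sample. Here's a base-R sketch of that alternative (caret's createDataPartition would be the stratified version):

```r
# Draw 80% of row indices for training; the rest become the test set
set.seed(1773)
train_idx <- sample(nrow(iris), size = 0.8 * nrow(iris))

iris.train <- iris[train_idx, ]
iris.test  <- iris[-train_idx, ]
```

Note that this plain sample does not guarantee balanced class proportions; for the small iris set it's usually close enough, but a stratified split is safer in general.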

The key function is naive_bayes in the naivebayes package. In this case, we are predicting Species given all of the other inputs on iris.train.

If you use the seed I set above, you'll end up with four plots, one for each feature. Here they are:

[Plots: per-class density curves for each of the four features: sepal length, sepal width, petal length, and petal width]

Looking at these images, sepal length and sepal width aren't very helpful for us: what we want is clear separation between the class distributions. Petal length and petal width are better: setosa is clearly distinct from the other two species, though there is some overlap between versicolor and virginica, which will lead to some risk of ambiguity.

Maximizing Confusion

Once we have our output, we can quickly generate a confusion matrix using caret. I like using this a lot more than building my own with e.g. table(iris.output$Species, iris.output$prediction). The reason I prefer what caret has to offer is that it also includes statistics like positive predictive value and negative predictive value. These tend to be at least as important as accuracy when performing classification, especially for scenarios where one class is extremely likely and the other extremely unlikely.

Here is the confusion matrix output from caret. After that, I’ll explain positive and negative predictive values.

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          6          0         0
  versicolor      0         11         1
  virginica       0          1        11

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 1.181e-09       
                                          
                  Kappa : 0.8958          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0            0.9167           0.9167
Specificity                    1.0            0.9444           0.9444
Pos Pred Value                 1.0            0.9167           0.9167
Neg Pred Value                 1.0            0.9444           0.9444
Prevalence                     0.2            0.4000           0.4000
Detection Rate                 0.2            0.3667           0.3667
Detection Prevalence           0.2            0.4000           0.4000
Balanced Accuracy              1.0            0.9306           0.9306

Positive predictive value for a category is: if my model predicts that a particular set of inputs matches a particular class, what is the probability that this judgement is correct? For example, we have 12 versicolor entries (read the “versicolor” Prediction row across and sum up values). 11 of the 12 were predicted as versicolor, so our positive predictive value is 11/12 = 0.9167.

Negative predictive value for a category is: if my model predicts that a particular set of inputs does not match a particular class, what is the probability that this judgement is correct? For example, we have 18 predictions which were not versicolor (sum up all of the values across the rows except for the versicolor row). Of those 18, 1 was actually versicolor (read the versicolor column and ignore the point where the prediction was versicolor). Therefore, 17 of our 18 negative predictions for versicolor were correct, so our negative predictive value is 17/18 = 0.9444.
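These by-hand calculations are easy to check. Here's a base-R sketch that rebuilds the 3x3 matrix above and recomputes both values for versicolor:

```r
# Rows are predictions, columns are reference (actual) classes,
# matching caret's confusion matrix layout
cm <- matrix(c(6,  0,  0,
               0, 11,  1,
               0,  1, 11),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("setosa", "versicolor", "virginica"),
                             reference  = c("setosa", "versicolor", "virginica")))

# Positive predictive value: correct versicolor predictions
# divided by all versicolor predictions (the versicolor row)
ppv <- cm["versicolor", "versicolor"] / sum(cm["versicolor", ])

# Negative predictive value: correct non-versicolor predictions
# divided by all non-versicolor predictions (the other rows)
not_versi_rows <- rownames(cm) != "versicolor"
not_versi_cols <- colnames(cm) != "versicolor"
npv <- sum(cm[not_versi_rows, not_versi_cols]) / sum(cm[not_versi_rows, ])
```

Both values match the caret output: ppv is 11/12 = 0.9167 and npv is 17/18 = 0.9444.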

This is a small data set with relatively little variety and only one real place for ambiguity, so it's a little boring. Let's do something a bit more interesting: sentiment analysis.

Yes, Mr. Sherman, Everything Stinks

Now we’re going to look at movie reviews and predict whether a review is positive or negative based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped and contains 2,000 reviews in total.

Unlike last time, I’m going to break this out into sections with commentary in between. If you want the full script and notebook, check out the GitHub repo I put together for this talk.

First up, we load some packages. I’ll use naivebayes to perform classification and tm for text mining. If you’re a tidytext fan, you can certainly use that for this work too.

if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if(!require(tm)) {
  install.packages("tm")
  library(tm)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

We’ll next load the data and split it into training and test data sets.

df <- readr::read_csv("../data/movie-pang02.csv")

set.seed(1)
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]
df$class <- as.factor(df$class)

corpus <- tm::Corpus(tm::VectorSource(df$text))
corpus

I’m going to stop here and lay out a warning: building the corpus from the full data set leaks information. If the test set includes words the training set does not, the model still learns about those words when the corpus is built. In a real project, I’d build the corpus off of the training data alone and then apply those rules to the test set, using Laplace Smoothing or a similar technique to deal with any test words not in the training vocabulary.
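To make the leak-free pattern concrete, here's a hypothetical base-R sketch of the idea: the vocabulary comes from the training documents only, and unseen test words are simply dropped (the two toy documents and their words are invented for illustration):

```r
# Hypothetical mini-corpus: the dictionary is built from training docs only
train_docs <- c("great movie great cast", "terrible plot")
test_doc   <- "great effects, terrible dialogue, amazing score"

# Split on non-word characters and keep the unique training vocabulary
train_vocab <- unique(unlist(strsplit(train_docs, "\\W+")))

# Score the test document only against words the model has seen;
# "effects", "dialogue", "amazing", and "score" are discarded, and
# Laplace Smoothing handles zero counts among the words we keep
test_tokens <- unlist(strsplit(test_doc, "\\W+"))
test_tokens_known <- intersect(test_tokens, train_vocab)
```

Note that intersect also de-duplicates, which is fine here since we only care about presence later anyway.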

With that warning said, I’m now going to clean up the data by converting everything to lower-case, removing punctuation and numbers, removing stopwords, and stripping out any extraneous whitespace. This reduces the total document space and gives us a more consistent set of words.

corpus.clean <- corpus %>%
  tm::tm_map(tm::content_transformer(tolower)) %>% 
  tm::tm_map(tm::removePunctuation) %>%
  tm::tm_map(tm::removeNumbers) %>%
  tm::tm_map(tm::removeWords, tm::stopwords(kind="en")) %>%
  tm::tm_map(tm::stripWhitespace)

dtm <- tm::DocumentTermMatrix(corpus.clean)

Then we turn our words into features using the bag of words technique. It’s not the fanciest or best, but it’s quick-and-easy—sort of like Naive Bayes.
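As a quick illustration of bag of words, here's what the counts look like for one toy sentence in base R (a sketch, not part of the pipeline above):

```r
# A toy document; bag of words keeps counts and throws away word order
doc <- "the plot was thin but the acting was great"
tokens <- unlist(strsplit(doc, " "))
bag <- table(tokens)

# "the" and "was" appear twice; every other token appears once
bag
```

The document term matrix we just built is exactly this idea scaled up: one row per review, one column per term in the corpus.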

Once we have the document term matrix, we can build out our training and test data. I already shuffled at the beginning, so we split out our elements into training and test, reserving 25% for test.

df.train <- df[1:1500,]
df.test <- df[1501:2000,]

dtm.train <- dtm[1:1500,]
dtm.test <- dtm[1501:2000,]

corpus.clean.train <- corpus.clean[1:1500]
corpus.clean.test <- corpus.clean[1501:2000]

After doing this, our training data set includes 38,957 unique terms, but many of these only appear in one review. That’s great for pinpointing a specific document (a particular review), but not as great for classification: they won’t help me pick a good class and take up memory, so let’s get rid of them. I’ll throw away any term which appears in fewer than 5 documents. This will get me down to 12,144 terms, or just under a third of the original total.

After that, I will rebuild the document term matrices for training and testing, as we want to take advantage of that smaller domain.

fiveFreq <- tm::findFreqTerms(dtm.train, 5)

dtm.train.nb <- tm::DocumentTermMatrix(corpus.clean.train, control=list(dictionary = fiveFreq))
dtm.test.nb <- tm::DocumentTermMatrix(corpus.clean.test, control=list(dictionary = fiveFreq))

From here, I am going to create a function which tells me whether a term appears in a document at all, which matters more than how many times it appears. This prevents a single document that uses a term heavily from biasing us too much toward that document’s class. Going back to our baseball versus business example, it’d be like a single business article writing about “going to the bullpen” over and over as a metaphor for something business-related. Most business documents will not use the term bullpen (whereas plenty of baseball documents will), so one business document applying a baseball metaphor shouldn’t ruin our model.

After doing that, I’ll run the naive_bayes function with Laplace Smoothing turned on (laplace = 1) and predict what our test values will look like.

# Convert raw term counts into a presence/absence factor
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

trainNB <- apply(dtm.train.nb, 2, convert_count)
testNB <- apply(dtm.test.nb, 2, convert_count)

classifier <- naivebayes::naive_bayes(trainNB, df.train$class, laplace = 1)
pred <- predict(classifier, newdata=testNB)

conf.mat <- caret::confusionMatrix(pred, df.test$class, positive="Pos")
conf.mat

Then we can look at the confusion matrix. Here’s how it looks:

Confusion Matrix and Statistics

          Reference
Prediction Neg Pos
       Neg 224  54
       Pos  41 181
                                          
               Accuracy : 0.81            
                 95% CI : (0.7728, 0.8435)
    No Information Rate : 0.53            
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6174          
 Mcnemar's Test P-Value : 0.2183          
                                          
            Sensitivity : 0.7702          
            Specificity : 0.8453          
         Pos Pred Value : 0.8153          
         Neg Pred Value : 0.8058          
             Prevalence : 0.4700          
         Detection Rate : 0.3620          
   Detection Prevalence : 0.4440          
      Balanced Accuracy : 0.8077          
                                          
       'Positive' Class : Pos     

Overall, our classifier has an accuracy of 81%. But this example lets us look at four important measures: sensitivity, specificity, positive predictive value, and negative predictive value. This is a two-class, binary classifier, so the definitions are pretty simple. One thing to watch: caret orders the classes alphabetically in the confusion matrix, so we pass positive = "Pos" to tell it which class counts as the positive result.

Sensitivity asks: when a review is actually positive, how often does the model predict positive? It is defined as TP / (TP + FN): here, 181 / (181 + 54) = 0.7702.

Specificity asks: when a review is actually negative, how often does the model predict negative? It is defined as TN / (TN + FP): here, 224 / (224 + 41) = 0.8453.

Positive predictive value looks at all cases where the prediction was positive (read the “Pos” row), and is defined as TP / (TP + FP): here, 181 / (181 + 41) = 0.8153.

Negative predictive value looks at all cases where the prediction was negative (read the “Neg” row), and is defined as TN / (TN + FN): here, 224 / (224 + 54) = 0.8058.
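All four measures fall out of the 2x2 matrix with a few lines of base R; here's a sketch reproducing the numbers above:

```r
# Rows are predictions, columns are reference (actual) classes
cm <- matrix(c(224,  54,
                41, 181),
             nrow = 2, byrow = TRUE,
             dimnames = list(prediction = c("Neg", "Pos"),
                             reference  = c("Neg", "Pos")))

tp <- cm["Pos", "Pos"]  # true positives
tn <- cm["Neg", "Neg"]  # true negatives
fp <- cm["Pos", "Neg"]  # false positives
fn <- cm["Neg", "Pos"]  # false negatives

sensitivity <- tp / (tp + fn)   # 181 / 235
specificity <- tn / (tn + fp)   # 224 / 265
ppv <- tp / (tp + fp)           # 181 / 222
npv <- tn / (tn + fn)           # 224 / 278
```

Keeping the denominators straight this way (row sums for predictive values, column sums for sensitivity and specificity) is most of the work of reading a confusion matrix.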

Overall, our Naive Bayes classifier was in the 75-85% range for all five of our major measures. If we need to get to 85-90%, this is a good sign: Naive Bayes is getting us most of the way there, so better classifier algorithms should get us over the top.

Conclusion

In today’s post, we dug into the naivebayes R package and showed how we could solve for Naive Bayes with and without Laplace Smoothing in just a few lines of code.

If you want to learn more, check out Classification with Naive Bayes, a talk I’ve put together on the topic.
