Classification with Naive Bayes

Kevin Feasel (@feaselkl)
http://CSmore.info/on/naivebayes

Who Am I? What Am I Doing Here?

What is Naive Bayes?

Naive Bayes is not an algorithm; it is a class of algorithms. Naive Bayes is very easy to understand and reasonably accurate, making it a great class of algorithms to use when starting a classification project.

What is Classification?

Classification is a machine learning technique in which we wish to predict what class (or category) an item belongs to. Examples of classification include:

  • Will it blend?
  • Do I have this illness?
  • Will this new product be profitable?
  • To which product category does this item belong?
  • Which breed of dog is this?

Labels and Features

Two important definitions:

  • Label: the thing we want to predict. We need to have known good label data to solve a classification problem.
  • Feature: the thing(s) we can look at to determine which class an item belongs to.

Naive Bayes Characteristics

Naive Bayes algorithms follow the general form:

  • Probabilistic -- calculate probabilities of each output category and choose the best one
  • Probabilities derived from Bayes' Theorem
  • Features are assumed to be independent of one another (hence "naive")

Motivation

Today's talk will show how the Naive Bayes class of algorithms works, solve a simplified form by hand, and then use R packages to solve larger-scale problems.

There are several forms of Naive Bayes algorithms that we will not discuss, but they can be quite useful under certain circumstances.

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. R -- Features
  5. R -- Natural Language

Bayes' Theorem

$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$
  • P(B|A) [aka, the posterior probability] is the probability of hypothesis B given data A.
  • P(A|B) is the probability of data A assuming that hypothesis B is true.
  • P(B) [aka, the prior probability] is the likelihood of hypothesis B being true regardless of data.
  • P(A) is the probability of seeing data A.
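
As a quick numeric illustration, here is the theorem in R with made-up numbers (a minimal sketch, not data from the talk):

```r
# Hypothetical values: P(B) = 0.2, P(A|B) = 0.4, P(A) = 0.1
prior      <- 0.2                           # P(B)
likelihood <- 0.4                           # P(A|B)
evidence   <- 0.1                           # P(A)

posterior <- likelihood * prior / evidence  # P(B|A) = 0.8
posterior
```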

Applying Bayes' Theorem

Supposing multiple inputs, where the data $A$ consists of features $x_1, x_2, ..., x_n$, we can combine them like so:

$P(B|A) = \dfrac{P(x_1|B) \cdot P(x_2|B) \cdot \ldots \cdot P(x_n|B) \cdot P(B)}{P(A)}$

We can multiply the individual feature probabilities together because we assume the features are independent from one another.

Given $B_1, B_2, ..., B_N$ as possible classes, we want to find the $B_i$ with the highest probability.
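
In code, this is just a product of per-feature probabilities and the prior for each candidate class, keeping whichever class scores highest. A minimal R sketch with hypothetical probabilities:

```r
# Hypothetical per-feature likelihoods P(x_i | B) for two candidate classes.
likelihoods <- list(
  Yes = c(0.30, 0.50, 0.70),
  No  = c(0.10, 0.20, 0.40)
)
priors <- c(Yes = 0.6, No = 0.4)

# Numerator of Bayes' Theorem for each class; the shared denominator P(A)
# cancels out when all we need is the most probable class.
scores <- sapply(names(priors), function(b) prod(likelihoods[[b]]) * priors[[b]])
names(which.max(scores))   # "Yes"
```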

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. R -- Features
  5. R -- Natural Language

Solving the Problem

Goal: determine, based on input conditions, whether we should go play golf.

Steps:

  1. Find the probability of playing golf (prior probability).
  2. Find the probability of golfing given each variable: Outlook, Temp, Humidity, Wind.
  3. Plug values from new data into our formula.
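
Assuming the classic 14-row weather data set is loaded as a data frame named `golf` (a hypothetical name) with factor columns Outlook, Temp, Humidity, Wind, and the label Play, the first two steps look roughly like this in R:

```r
# Step 1: prior probability of playing golf.
prior <- prop.table(table(golf$Play))

# Step 2: conditional probability of each feature value given the label,
# e.g. P(Outlook | Play); margin = 2 normalizes within each Play column.
cond_outlook  <- prop.table(table(golf$Outlook,  golf$Play), margin = 2)
cond_humidity <- prop.table(table(golf$Humidity, golf$Play), margin = 2)

prior
cond_outlook
```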

Testing a Day

Suppose today = {Sunny, Hot, Normal, False}. Let's compare P(golf) versus P(no golf):

$P(Y|t) = \dfrac{P(O_s|Y) \cdot P(T_h|Y) \cdot P(H_n|Y) \cdot P(W_f|Y) \cdot P(Y)}{P(t)}$

$P(N|t) = \dfrac{P(O_s|N) \cdot P(T_h|N) \cdot P(H_n|N) \cdot P(W_f|N) \cdot P(N)}{P(t)}$

Note the common denominator: because we're comparing P(Yes|today) versus P(No|today), the common denominator cancels out.

Testing a Day

Putting this in numbers:

The probability of playing golf:

$P(Yes|today) = \dfrac{2}{9} \cdot \dfrac{2}{9} \cdot \dfrac{6}{9} \cdot \dfrac{6}{9} \cdot \dfrac{9}{14} = 0.0141$

The probability of not playing golf:

$P(No|today) = \dfrac{3}{5} \cdot \dfrac{2}{5} \cdot \dfrac{1}{5} \cdot \dfrac{2}{5} \cdot \dfrac{5}{14} = 0.0069$

Time to golf!
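
As a sanity check, the same comparison in R using the counts above:

```r
# P(Yes|today) and P(No|today), ignoring the common denominator P(today).
p_yes <- (2/9) * (2/9) * (6/9) * (6/9) * (9/14)   # ~0.0141
p_no  <- (3/5) * (2/5) * (1/5) * (2/5) * (5/14)   # ~0.0069
c(Yes = p_yes, No = p_no)
```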

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. R -- Features
  5. R -- Natural Language

Test Text And Process

Our test text: Threw out the runner

Goal: determine, based on input conditions, whether we should categorize this as a baseball phrase or a business phrase.

  1. Find the probability of a phrase being business or baseball (prior probability).
  2. Find the probability of the words in the phrase being business or baseball.
  3. Plug values from new data into our formula.
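
A rough sketch of the first two steps in R, using obviously made-up training phrases rather than the demo's actual data set:

```r
# Hypothetical training data: two phrases per category.
train <- data.frame(
  text  = c("pitched a tough no hitter",
            "threw out the runners",
            "stock prices fell thirty percent",
            "investors seized a bullish opportunity"),
  label = c("Baseball", "Baseball", "Business", "Business")
)

# Step 1: prior probability of each category.
prior <- prop.table(table(train$label))

# Step 2: word counts per category (bag of words).
tokens         <- strsplit(tolower(train$text), "\\s+")
words_by_label <- split(unlist(tokens), rep(train$label, lengths(tokens)))
word_counts    <- lapply(words_by_label, table)
word_counts
```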

Calculating Values

Calculating the prior probability is easy: the number of phrases labeled "Baseball" divided by the total number of phrases gives the prior probability of the Baseball category: $\dfrac{3}{6}$, or 50%. The same goes for Business.

So what are our features? The answer is, individual words!

Words

To calculate $P(threw|Baseball)$, count how many times "threw" appears in the Baseball texts and divide by the total number of words in the Baseball texts.

The answer here is $\dfrac{1}{18}$.

Missing Words

What about the word "the"? It doesn't appear in any of the baseball texts, so it would have a result of $\dfrac{0}{18}$.

Because we multiply all of the word probabilities together, a single 0 leads us to a total probability of 0%.

New phrases are bound to contain words we never saw in training, so zeroing out the probability isn't a good solution.

Laplace Smoothing

To fix the zero probability problem, we can apply Laplace smoothing: add 1 to each count so it is never zero. Then, add N (the number of unique words) to the denominator.

There are 29 unique words in the entire data set:

a and bullish fell hitter investors no nobody of on opportunity out percent pitched prices runners second seized shares situation stock the third thirty threw tough up were with
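
A minimal sketch of the smoothed estimate, using the counts from this example (18 total words in the Baseball phrases, 29 unique words overall):

```r
# Smoothed P(word | category) = (word count in category + 1) /
#                               (total words in category + unique words overall)
laplace_word_prob <- function(word_count, total_words, vocab_size) {
  (word_count + 1) / (total_words + vocab_size)
}

laplace_word_prob(0, 18, 29)   # "the" in Baseball: 1/47 instead of 0
laplace_word_prob(1, 18, 29)   # "threw" in Baseball: 2/47
```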

Calculating Values -- Final Calculation

$P(Baseball|totr) = \dfrac{2}{47} \cdot \dfrac{3}{47} \cdot \dfrac{1}{47} \cdot \dfrac{1}{47} \cdot \dfrac{3}{6} = 6.15 \times 10^{-7}$

$P(Business|totr) = \dfrac{1}{43} \cdot \dfrac{1}{43} \cdot \dfrac{2}{43} \cdot \dfrac{1}{43} \cdot \dfrac{3}{6} = 2.93 \times 10^{-7}$

Baseball is therefore the best category for our phrase.
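
The same arithmetic in R:

```r
p_baseball <- (2/47) * (3/47) * (1/47) * (1/47) * (3/6)   # ~6.15e-07
p_business <- (1/43) * (1/43) * (2/43) * (1/43) * (3/6)   # ~2.93e-07
p_baseball > p_business                                   # TRUE: choose Baseball
```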

Making Results Better

Ways that we can improve prediction quality:

  1. Remove stopwords (e.g., a, the, on, of, etc.)
  2. Lemmatize words: group together inflections of the same word (runner, runners)
  3. Use n-grams (sequences of multiple words; "threw out the" and "out the runner" are 3-grams)
  4. Use TF-IDF (term frequency - inverse document frequency): penalize words which appear frequently in most of the texts.
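
A sketch of what the first and last of these might look like with the tm package (an assumption: the demo may use a different library, and the phrases here are just samples):

```r
library(tm)

phrases <- c("Threw out the runner", "Stock prices fell thirty percent")
corpus  <- VCorpus(VectorSource(phrases))
corpus  <- tm_map(corpus, content_transformer(tolower))
corpus  <- tm_map(corpus, removeWords, stopwords("english"))   # 1. remove stopwords

# 4. TF-IDF weighting instead of raw term counts.
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm_tfidf)
```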

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. R -- Features
  5. R -- Natural Language

Using the naivebayes Package

The naivebayes package is a fast Naive Bayes solver, with built-in versions of the plot and predict functions.

First, we will use the famous iris data set.
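
A minimal sketch of the kind of thing the demo covers (the actual demo script may differ):

```r
library(naivebayes)

set.seed(1842)                                    # arbitrary seed for reproducibility
train_rows <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_rows, ]
test_data  <- iris[-train_rows, ]

model <- naive_bayes(Species ~ ., data = train_data)
plot(model)                                       # built-in plot method

predictions <- predict(model, newdata = test_data[, 1:4])
table(predictions, test_data$Species)             # confusion matrix
```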

Demo Time

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. R -- Features
  5. R -- Natural Language

R -- Natural Language

We can use the naivebayes package in R to solve Natural Language Processing problems as well. Just like before, we need to featurize our language samples. Also like before, we will use the bag of words technique to build our corpus.

Unlike before, we will perform several data cleanup operations beforehand to normalize our input data set.
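
A sketch of the overall flow, assuming a character vector `texts` and a factor `labels` already exist (hypothetical names; the demo's actual cleanup steps may differ):

```r
library(tm)
library(naivebayes)

# Normalize the raw text.
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Bag of words: one column per term, recorded as presence/absence factors.
dtm      <- DocumentTermMatrix(corpus)
features <- as.data.frame(ifelse(as.matrix(dtm) > 0, "Yes", "No"),
                          stringsAsFactors = TRUE)

# Train with Laplace smoothing; predicting on the training set just to show the call.
model <- naive_bayes(x = features, y = labels, laplace = 1)
predict(model, newdata = features)
```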

Demo Time

Where Next?

After looking at Naive Bayes, you might be interested in a few other algorithms:

  1. Logistic regression (for two-class problems).
  2. Decision trees, random forests, and gradient boosted trees.
  3. Online Passive-Aggressive Algorithms.
  4. Neural networks (multilayer perceptron, convolutional, or recurrent depending on case).
  5. Support Vector Machines.

Wrapping Up

The Naive Bayes class of algorithms is simple to understand and reasonably accurate, making it a good starting point for data analysis. There are a number of superior algorithms for specific problems, but starting with Naive Bayes will give you an idea of whether the problem is solvable and what your expected baseline of success should be.

Wrapping Up

To learn more, go here:
https://CSmore.info/on/naivebayes


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact