Classification with Naive Bayes

Kevin Feasel (@feaselkl)
http://CSmore.info/on/naivebayes

Who Am I? What Am I Doing Here?

What is Naive Bayes?

Naive Bayes is not an algorithm; it is a class of algorithms. Naive Bayes is straightforward to understand and reasonably accurate, making it a great starting point for classification projects.

What is Classification?

Classification is a machine learning technique that predicts what class (or category) an item belongs to. Examples include:

  • Will it blend?
  • Do I have this illness?
  • Will this new product be profitable?
  • To which product category does this item belong?
  • Which breed of dog is this?

Labels and Features

Two important definitions:

  • Label: the thing we want to predict. We need to have known good label data to solve a classification problem.
  • Feature: the thing(s) we can look at to determine which class an item belongs to.

Naive Bayes Characteristics

Naive Bayes algorithms follow the general form:

  • Probabilistic -- calculate probabilities of each output category and choose the best one
  • Probabilities derived from Bayes' Theorem
  • Features are assumed to be independent of one another (hence the "naive" approach)

Motivation

In today's talk, we will cover how Naive Bayes works, solve a simplified problem by hand, and then use Python and scikit-learn to tackle larger-scale problems.

There are several forms of Naive Bayes algorithms that we will not discuss, but they can be quite useful under certain circumstances.

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. Python -- Features
  5. Python -- Natural Language

Bayes' Theorem

$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$
  • P(B|A) [aka, the posterior probability] is the probability of hypothesis B given data A.
  • P(A|B) is the probability of data A assuming that hypothesis B is true.
  • P(B) [aka, the prior probability] is the likelihood of hypothesis B being true regardless of data.
  • P(A) is the probability of seeing data A.

Applying Bayes' Theorem

Supposing the data $A$ consists of multiple inputs $x_1, x_2, \ldots, x_n$, we can combine them like so:

$P(B|A) = \dfrac{P(x_1|B) \cdot P(x_2|B) \cdot \ldots \cdot P(x_n|B) \cdot P(B)}{P(A)}$

This is because we assume that the inputs are independent from one another.

Given $B_1, B_2, ..., B_N$ as possible classes, we want to find the $B_i$ with the highest probability.
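Because $P(A)$ is the same for every candidate class, we can drop the denominator and simply choose the class that maximizes the numerator:

$\hat{B} = \underset{B_i}{\arg\max} \; P(B_i) \cdot P(x_1|B_i) \cdot P(x_2|B_i) \cdot \ldots \cdot P(x_n|B_i)$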

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. Python -- Features
  5. Python -- Natural Language

Solving the Problem

Goal: determine, based on input conditions, whether we should go play golf.

Steps:

  1. Find the probability of playing golf (prior probability).
  2. Find the probability of golfing given each variable: Outlook, Temp, Humidity, Wind.
  3. Plug values from new data into our formula.

Testing a Day

Suppose today = {Sunny, Hot, Normal, False}. Let's compare P(golf) versus P(no golf):

$P(Y|t) = \dfrac{P(O_s|Y) \cdot P(T_h|Y) \cdot P(H_n|Y) \cdot P(W_f|Y) \cdot P(Y)}{P(t)}$

$P(N|t) = \dfrac{P(O_s|N) \cdot P(T_h|N) \cdot P(H_n|N) \cdot P(W_f|N) \cdot P(N)}{P(t)}$

Note the common denominator: because we're comparing P(Yes|today) versus P(No|today), the common denominator cancels out.

Testing a Day

Putting this in numbers:

The probability of playing golf:

$P(Yes|today) = \dfrac{2}{9} \cdot \dfrac{2}{9} \cdot \dfrac{6}{9} \cdot \dfrac{6}{9} \cdot \dfrac{9}{14} = 0.0141$

The probability of not playing golf:

$P(No|today) = \dfrac{3}{5} \cdot \dfrac{2}{5} \cdot \dfrac{1}{5} \cdot \dfrac{2}{5} \cdot \dfrac{5}{14} = 0.0069$

Time to golf!
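As a sanity check, here is a minimal Python sketch of the same hand calculation; the fractions are the conditional probabilities worked out above, and the variable names are purely illustrative.

```python
# Hand-worked Naive Bayes comparison for today = {Sunny, Hot, Normal, False}.
# The fractions come from the play-golf frequency tables above.
p_yes = (2/9) * (2/9) * (6/9) * (6/9) * (9/14)   # P(Sunny|Y) * P(Hot|Y) * P(Normal|Y) * P(False|Y) * P(Y)
p_no  = (3/5) * (2/5) * (1/5) * (2/5) * (5/14)   # P(Sunny|N) * P(Hot|N) * P(Normal|N) * P(False|N) * P(N)

print(f"Score(Yes|today) = {p_yes:.4f}")   # ~0.0141
print(f"Score(No|today)  = {p_no:.4f}")    # ~0.0069
print("Play golf!" if p_yes > p_no else "Stay home.")
```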

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. Python -- Features
  5. Python -- Natural Language

Test Text And Process

Our test text: Threw out the runner

Goal: determine, based on input conditions, whether we should categorize this as a baseball phrase or a business phrase.

  1. Find the probability of a phrase being business or baseball (prior probability).
  2. Find the probability of the words in the phrase being business or baseball.
  3. Plug values from new data into our formula.

Calculating Values

Calculating the prior probability is easy: the count of phrases labeled "Baseball" divided by the total number of phrases gives the prior probability of selecting the Baseball category: $\dfrac{3}{6}$, or 50%. The same goes for Business.

So what are our features? The answer is, individual words!

Words

Calculate $P(threw|Baseball)$ => count how many times "threw" appears in Baseball texts, divided by the number of words in Baseball texts.

The answer here is $\dfrac{1}{18}$.

Missing Words

What about the word "the"? It doesn't appear in any of the baseball texts, so it would have a result of $\dfrac{0}{18}$.

Because we multiply all of the word probabilities together, a single 0 leads us to a total probability of 0%.

But in practice you're liable to see words that never appeared in your training data, so letting a single unseen word zero out the entire probability isn't a good solution.

Laplace Smoothing

To fix the zero probability problem, we can apply Laplace smoothing: add 1 to each count so it is never zero. Then, add N (the number of unique words) to the denominator.
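In formula terms, with $N$ unique words in the vocabulary, the smoothed probability of word $w$ given category $C$ is:

$P(w|C) = \dfrac{count(w, C) + 1}{(\text{total words in } C) + N}$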

There are 29 unique words in the entire data set:

a and bullish fell hitter investors no nobody of on opportunity out percent pitched prices runners second seized shares situation stock the third thirty threw tough up were with

Calculating Values -- Final Calculation

$P(Baseball|totr) = \dfrac{2}{47} \cdot \dfrac{3}{47} \cdot \dfrac{1}{47} \cdot \dfrac{1}{47} \cdot \dfrac{3}{6} = 6.15 \times 10^{-7}$

$P(Business|totr) = \dfrac{1}{43} \cdot \dfrac{1}{43} \cdot \dfrac{2}{43} \cdot \dfrac{1}{43} \cdot \dfrac{3}{6} = 2.93 \times 10^{-7}$

Baseball is therefore the best category for our phrase.
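Here is a small Python sketch of the same smoothed calculation. The word counts (18 baseball words, 14 business words, 29 unique words) mirror the hand-worked values above; everything else is illustrative.

```python
# Laplace-smoothed Naive Bayes for the phrase "threw out the runner".
vocab_size = 29      # unique words across all phrases
prior = 3 / 6        # three of six phrases in each category

# Count of each test-phrase word within the category's texts.
baseball_counts = {"threw": 1, "out": 2, "the": 0, "runner": 0}   # 18 words total
business_counts = {"threw": 0, "out": 0, "the": 1, "runner": 0}   # 14 words total

def smoothed_score(word_counts, total_words):
    score = prior
    for count in word_counts.values():
        score *= (count + 1) / (total_words + vocab_size)
    return score

print(f"Baseball: {smoothed_score(baseball_counts, 18):.2e}")   # ~6.15e-07
print(f"Business: {smoothed_score(business_counts, 14):.2e}")   # ~2.93e-07
```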

Making Results Better

Ways that we can improve prediction quality:

  1. Remove stopwords (e.g., a, the, on, of, etc.)
  2. Lemmatize words: group together inflections of the same word (runner, runners)
  3. Use n-grams (sequences of multiple words; "threw out the" and "out the runner" are 3-grams)
  4. Use TF-IDF (term frequency - inverse document frequency): penalize words which appear frequently in most of the texts.
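As a rough sketch, several of these improvements map directly onto scikit-learn's vectorizer options (lemmatization is the exception and typically requires a separate library such as NLTK or spaCy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stopword removal, n-grams, and TF-IDF weighting in a single step.
vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common words like "a", "the", "on", "of"
    ngram_range=(1, 3)      # include unigrams, bigrams, and trigrams
)
X = vectorizer.fit_transform(["Threw out the runner", "Stock prices fell thirty percent"])
print(vectorizer.get_feature_names_out())
```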

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. Python -- Features
  5. Python -- Natural Language

Naive Bayes with Features

scikit-learn offers three Naive Bayes classifiers:

  • GaussianNB — continuous features (e.g., measurements)
  • MultinomialNB — discrete counts (e.g., word frequencies)
  • BernoulliNB — binary features (e.g., word presence/absence)

We'll start with GaussianNB on the classic iris data set.
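A minimal sketch of what the GaussianNB demo looks like, assuming a standard train/test split on the iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris measurements (continuous features) and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit Gaussian Naive Bayes and score it on the held-out data.
model = GaussianNB()
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```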

Demo Time

Agenda

  1. Solving Naive Bayes
  2. Solving by Hand -- Features
  3. Solving by Hand -- Natural Language
  4. Python -- Features
  5. Python -- Natural Language

Naive Bayes with NLP

To classify text, we need to convert words into features. scikit-learn gives us two approaches:

  • CountVectorizer — bag of words (raw word counts)
  • TfidfVectorizer — TF-IDF (penalizes common words, rewards distinctive ones)

Naive Bayes with NLP

scikit-learn's Pipeline chains the vectorizer and classifier together, reducing the entire workflow to a few lines of code.
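For example, here is a sketch of such a pipeline, using a handful of made-up training phrases rather than the demo's actual data set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chain vectorization and classification into one estimator.
model = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", MultinomialNB())
])

# Placeholder training data: a couple of phrases per category.
phrases = ["threw out the runner", "pitched a no hitter",
           "stock prices fell thirty percent", "investors seized on the opportunity"]
labels = ["Baseball", "Baseball", "Business", "Business"]

model.fit(phrases, labels)
print(model.predict(["the pitcher threw a strike"]))   # likely ['Baseball']
```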

Demo Time

Where Next?

After looking at Naive Bayes, you might be interested in a few other algorithms:

  1. Logistic regression (for two-class problems).
  2. Decision trees, random forests, and gradient boosted trees.
  3. Online Passive-Aggressive Algorithms.
  4. Neural networks (multilayer perceptron, convolutional, or recurrent depending on case).
  5. Support Vector Machines.

Wrapping Up

The Naive Bayes class of algorithms is straightforward to understand and reasonably accurate, making it a good starting point for data analysis.

More specialized algorithms may outperform Naive Bayes for specific problems. But starting with Naive Bayes tells you whether the problem is solvable and establishes your expected baseline of success.

Wrapping Up

To learn more, go here:
https://CSmore.info/on/naivebayes


And for help, contact me:
feasel@catallaxyservices.com | @feaselkl


Catallaxy Services consulting:
https://CSmore.info/on/contact