Naive Bayes is not an algorithm; it is a class of algorithms. Naive Bayes is straightforward to understand and reasonably accurate, making it a great starting point for classification projects.
Classification is a machine learning technique that predicts what class (or category) an item belongs to. Examples include:
Two important definitions:
Naive Bayes algorithms follow the general form:
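That general form is Bayes' theorem, which turns the posterior probability of class $B$ given evidence $A$ into quantities we can estimate from data:

$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$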
Today's talk covers how Naive Bayes works: we'll solve a simplified problem by hand, then use Python and scikit-learn to tackle larger-scale problems.
There are several forms of Naive Bayes algorithms that we will not discuss, but they can be quite useful under certain circumstances.
Supposing multiple inputs, we can combine them together like so:
$P(B|A) = \dfrac{P(x_1|B) \cdot P(x_2|B) \cdot ... \cdot P(x_n|B) \cdot P(B)}{P(A)}$

This works because we assume that the inputs are independent of one another.
Given $B_1, B_2, ..., B_N$ as possible classes, we want to find the $B_i$ with the highest probability.
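Because the denominator is the same for every class, this amounts to choosing the class that maximizes the numerator:

$\hat{B} = \underset{B_i}{\arg\max}\ P(B_i) \cdot \prod_{j=1}^{n} P(x_j|B_i)$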
Goal: determine, based on input conditions, whether we should go play golf.
Steps:
Suppose today = {Sunny, Hot, Normal, False}. Let's compare P(Yes|today) versus P(No|today):
$P(Y|t) = \dfrac{P(O_s|Y) \cdot P(T_h|Y) \cdot P(H_n|Y) \cdot P(W_f|Y) \cdot P(Y)}{P(t)}$
$P(N|t) = \dfrac{P(O_s|N) \cdot P(T_h|N) \cdot P(H_n|N) \cdot P(W_f|N) \cdot P(N)}{P(t)}$
Note the common denominator: because we're comparing P(Yes|today) versus P(No|today), the common denominator cancels out.
Putting this in numbers:
The probability of playing golf:
$P(Yes|today) = \dfrac{2}{9} \cdot \dfrac{2}{9} \cdot \dfrac{6}{9} \cdot \dfrac{6}{9} \cdot \dfrac{9}{14} \approx 0.0141$

The probability of not playing golf:
$P(No|today) = \dfrac{3}{5} \cdot \dfrac{2}{5} \cdot \dfrac{1}{5} \cdot \dfrac{2}{5} \cdot \dfrac{5}{14} \approx 0.0068$

Time to golf!
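The hand calculation above can be sketched in a few lines of Python, with the counts read straight off the classic fourteen-row golf data set:

```python
from fractions import Fraction as F

# today = {Sunny, Hot, Normal, False}; 9 "Yes" rows, 5 "No" rows.
# Each factor is P(feature | class); the last factor is the prior.
p_yes = F(2, 9) * F(2, 9) * F(6, 9) * F(6, 9) * F(9, 14)
p_no = F(3, 5) * F(2, 5) * F(1, 5) * F(2, 5) * F(5, 14)

print(round(float(p_yes), 4))  # 0.0141
print(round(float(p_no), 4))   # 0.0069
print("Play golf!" if p_yes > p_no else "Stay home.")
```

Using `Fraction` keeps the arithmetic exact until the final rounding, which makes it easy to check against the hand-worked numbers.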
Our test text: "Threw out the runner"
Goal: determine, based on input conditions, whether we should categorize this as a baseball phrase or a business phrase.
Calculating the prior probability is easy: the count of Baseball phrases divided by the total number of phrases gives the prior probability of selecting the Baseball category: $\dfrac{3}{6}$, or 50%. The same goes for Business.
So what are our features? The answer is, individual words!
Calculate $P(threw|Baseball)$: count how many times "threw" appears in Baseball texts, then divide by the total number of words in Baseball texts.
The answer here is $\dfrac{1}{18}$.
What about the word "the"? It doesn't appear in any of the baseball texts, so it would have a result of $\dfrac{0}{18}$.
Because we multiply all of the word probabilities together, a single 0 leads us to a total probability of 0%.
But real inputs are liable to contain words we've never seen, so letting a single unseen word zero out the entire probability isn't acceptable.
To fix the zero probability problem, we can apply Laplace smoothing: add 1 to each word count so it is never zero, then add N (the number of unique words in the data set) to each denominator.
There are 29 unique words in the entire data set:
a and bullish fell hitter investors no nobody of on opportunity out percent pitched prices runners second seized shares situation stock the third thirty threw tough up were with
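A minimal sketch of the smoothed calculation, using the counts from above (18 total words in the Baseball texts, 29 unique words in the whole data set):

```python
def smoothed_p(word_count, total_words, vocab_size=29):
    # Laplace smoothing: (count + 1) / (total words + vocabulary size),
    # so no word ever gets a probability of exactly zero.
    return (word_count + 1) / (total_words + vocab_size)

# "threw" appears once in the 18 words of Baseball text.
p_threw = smoothed_p(1, 18)  # 2/47
# "the" appears zero times, but no longer zeroes out the product.
p_the = smoothed_p(0, 18)    # 1/47
```

Multiplying these smoothed per-word probabilities (plus the prior) for each category gives the comparison the slides describe.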
Baseball is therefore the best category for our phrase.
Ways that we can improve prediction quality:
scikit-learn offers several Naive Bayes classifiers; we'll focus on three:
We'll start with GaussianNB on the classic iris data set.
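A minimal GaussianNB sketch on iris (the split seed and exact accuracy are my choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has four continuous measurements per flower, which suits
# GaussianNB's assumption of normally distributed features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {acc:.2%}")
```

Even this untuned model classifies the held-out irises quite well, which is the point: Naive Bayes is a cheap, strong baseline.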
To classify text, we need to convert words into features. scikit-learn gives us two approaches:
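Here is a sketch of both approaches on a tiny two-phrase corpus (the phrases are stand-ins for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["threw out the runner", "stock prices fell"]

# CountVectorizer: one column per unique word, raw counts per document.
counts = CountVectorizer().fit_transform(corpus)

# TfidfVectorizer: counts reweighted so words common to many
# documents contribute less than distinctive words.
tfidf = TfidfVectorizer().fit_transform(corpus)

print(counts.shape, tfidf.shape)  # (2, 7) (2, 7)
```

Both return sparse matrices with one row per document and one column per vocabulary word; only the cell values differ.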
scikit-learn's Pipeline chains the vectorizer and classifier together, reducing the entire workflow to a few lines of code.
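A sketch of such a pipeline (the training phrases below are stand-ins, not the talk's original six):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in training data in the spirit of the example above.
texts = ["pitched a no hitter", "threw out runners at second",
         "stock prices fell thirty percent",
         "investors seized the opportunity"]
labels = ["Baseball", "Baseball", "Business", "Business"]

pipe = Pipeline([
    ("vectorize", CountVectorizer()),  # words -> count features
    ("classify", MultinomialNB()),     # Naive Bayes over word counts
])
pipe.fit(texts, labels)

print(pipe.predict(["threw out the runner"]))  # ['Baseball']
```

The pipeline applies the same vectorization at training and prediction time, so new phrases are handled with one `predict()` call.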
After looking at Naive Bayes, you might be interested in a few other algorithms:
The Naive Bayes class of algorithms is straightforward to understand and reasonably accurate, making it a good starting point for data analysis.
More specialized algorithms may outperform Naive Bayes for specific problems. But starting with Naive Bayes tells you whether the problem is solvable and establishes your expected baseline of success.
To learn more, go here:
https://CSmore.info/on/naivebayes
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact