My goals in this talk:
Classification is a flavor of supervised machine learning: when we train a model, we have known good answers for our data (a label) as well as our features.
 
				The most common type of classification is two-class classification, where our label takes on one of two values.
This can be Yes/No, True/False, In/Out, Win/Loss, 0/1, or whatever.
 
						Multi-class classification means our label may take on one of multiple values. Common examples of multi-class classification include:
The first measure of model quality we will use today is accuracy. We can define accuracy as:
$\dfrac{\sum{(Answers|Correct)}}{\sum{Answers}}$In other words, the sum of correct answers divided by the sum of answers. This ranges from 0-1, where 1 is perfection.
Deep learning models have become popular in the machine learning landscape, and there are great things that we can build using neural networks. Because of their level of complexity, however, we will not go into them today.
A decision tree is one of the simplest algorithms available to us. Think of it as a series of if-else statements leading to a final decision:
 
				Classification and Regression Trees (CART) is a common implementation for decision trees.
The key questions that CART (or any other decision tree implementation) needs to answer are:
Stopping rules tell CART when to stop driving down a particular branch. Common stopping rules include:
The good news is that implementations, such as in the scikit-learn library, automatically have stopping rules built in, so you don't necessarily need to tune them.
How do we choose which feature to split on in a node? We look at feature importance for the remaining data in the tree:
 
					
					In this case, our next split would be on age.
CART has some additional niceties:
The random forest algorithm is an ensemble algorithm, combining together decision trees using a technique called bagging.
 
				Bootstrap aggregation (aka bagging) is a technique used to reduce the variance of a statistical learning method. In this case, bagging takes a variety of training sets from the population, builds separate prediction models for each training set, and averages the resulting predictions.
 
				 
				 
				Another important element of the random forest algorithm is the term "random." Whereas CART looks for the optimal feature based on remaining nodes, random forest chooses the optimal model from a randomly-selected subset of features.
 
						The random forest algorithm depends on bagging for its variety and generates independent trees in parallel.
By contrast, Extreme Gradient Boosting (XGBoost) relies on boosting as a technique.
Boosting is another ensembling technique like bagging. Unlike bagging, boosting is a sequential technique, meaning that it takes information learned from the prior model and uses it to get better.
Adaboost is a classic technique for ensemble learning. In each stage of model training, we create a new weak learner to compensate for the shortcomings of existing weak learners.
Specifically, AdaBoost looks at what data the prior model got wrong, weights the incorrect values more highly than correct values, and trains a new model, focusing on getting high-weight, incorrect values right.
Gradient boosting, meanwhile, identifies missed values based on gradients.
Specifically, gradient boosting looks at how far off the prior model was and adds a modification factor intended to make more answers correct than before.
 
				 
				Suppose we want to detect a rare disease which affects 1 in 10,000 people. Without a treatment of orange juice and toothpaste, these people will disappear in exactly three weeks.
We train a model which is 99.99% accurate at discerning if a person has this disease. Great, right?
Return:  Congratulations!  You don't have the disease!
The model does get 9,999 out of 10,000 predictions correct, but it does a terrible job because it misses every single person with the disease!
The confusion matrix compares reference (i.e., actual) results versus predictions for classification problems.
It gives us information to solve the class imbalance problem.
 
						In normal circumstances, accuracy is a really good baseline indicator for model quality.
 
						If my model predicts a particular class, what is the probability that this judgment is correct?
 
						If my model predicts not a particular class, what is the probability that this judgment is correct?
 
						If an event is positive, how often do we predict positively?
 
						If an event is negative, how often do we predict negatively?
 
						When dealing with class imbalance, we look in particular at sensitivity and specificity.
In our health scenario, a specific test would do a great job of rejecting people who do not have the health condition.
A sensitive test would do a great job of detecting people who have the health condition.
K-Nearest Neighbors is a distance-based algorithm for classifying input data.
The short version is, when we get a new data point, we find the k nearest data points (using some distance measure) and choose the most common fit between these.
 
					
					Here, the left-most new dot will be green (3G) and the right-most new dot will be orange (2O, 1B).
Distance-based classification algorithms are susceptible to bad results brought about by irrelevant attributes. This is unlike regression, where irrelevant (but independent) features cannot harm the model.
If you have multiple attributes which define a data point, but only a couple of them are relevant, two data points with similar important features might be spatially disparate because they differ so much in irrelevant features.
Examples of techniques:
Three phases of understanding logistic regression:
Logistic regression is built around the Sigmoid function. This has the property of separating fairly cleanly into two outcomes: 0 and 1.
 
				Naive Bayes is not an algorithm; it is a class of algorithms. Naive Bayes is very easy to understand and reasonably accurate, making it a great class of algorithms to use when starting a classification project.
Naive Bayes algorithms follow the general form:
Goal: determine, based on input conditions, whether we should golf.
Steps:
Suppose today = {Sunny, Hot, Normal, False}.  Let's compare the P(golf) versus P(no golf):
					$P(Y|t) = \dfrac{P(O_s|Y) \cdot P(T_h|Y) \cdot P(H_n|Y) \cdot P(W_f|Y) \cdot P(Y)}{P(t)}$
					$P(N|t) = \dfrac{P(O_s|N) \cdot P(T_h|N) \cdot P(H_n|N) \cdot P(W_f|N) \cdot P(N)}{P(t)}$
					
Note the common denominator:  because we're comparing P(Yes|today) versus P(No|today), the common denominator cancels out.
Putting this in numbers:
The relative likelihood of playing golf:
$P(Yes|today) = \dfrac{2}{9} \cdot \dfrac{2}{9} \cdot \dfrac{6}{9} \cdot \dfrac{6}{9} \cdot \dfrac{9}{14} = 0.0141$The relative likelihood of not playing golf:
$P(No|today) = \dfrac{3}{5} \cdot \dfrac{2}{5} \cdot \dfrac{1}{5} \cdot \dfrac{2}{5} \cdot \dfrac{5}{14} = 0.0068$Time to golf!
Online Passive-Aggressive Algorithms are a margin-based set of online trainers.
They follow three critical properties.
Training happens one record at a time.
 
				If we predict correctly, don't change the model--just move on to the next prediction.
 
				If our prediction was incorrect, shift the curve until our latest prediction becomes correct.
 
				The curve can shift along any relevant axis.
 
				We only care about two things: the current record and the curve. Prior data points do not matter for the base algorithm, though there are variants which do include a few prior records.
 
				So far, we've focused on two-class classification. This is the most common scenario, and several algorithms--such as Logistic Regression and Passive-Aggressive--depend on this.
Other algorithms, like tree-based algorithms, kNN, and Naive Bayes, can naturally support more than two classes.
To get two-class algorithms to work in a multi-class scenario, we can perform "one-versus-all" comparisons.
Instead of calculating P(A), P(B), and P(C) separately, calculate P(A) versus P(Not A). Then, calculate P(B) versus P(Not B) and then P(C) versus P(Not C). Find which of the three has the highest probability and that's your winner.
Over the course of this talk, we have looked at several classification algorithms and have seen how to use them in the scikit-learn Python package. We've covered two-class and multi-class scenarios and have come up with reasonable guidelines around algorithm choice.
						To learn more, go here:
						
						https://csmore.info/on/classy
					
						And for help, contact me:
						
						feasel@catallaxyservices.com | @feaselkl
					
						Catallaxy Services consulting:
						
						https://CSmore.info/on/contact