Last Updated on November 8, 2019
Probability is a field of mathematics that quantifies uncertainty.
It is undeniably a pillar of the field of machine learning, and many recommend it as a prerequisite subject to study prior to getting started. This is misleading advice, as probability makes more sense to a practitioner once they have the context of the applied machine learning process in which to interpret it.
In this post, you will discover why machine learning practitioners should study probability to improve their skills and capabilities.
After reading this post, you will know:
- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.
Discover Bayesian optimization, Naive Bayes, maximum likelihood, distributions, cross-entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.
Let’s get started.
Overview
This tutorial is divided into seven parts; they are:
- Reasons to NOT Learn Probability
- Class Membership Requires Predicting a Probability
- Some Algorithms Are Designed Using Probability
- Models Are Trained Using a Probabilistic Framework
- Models Can Be Tuned With a Probabilistic Framework
- Probabilistic Measures Are Used to Evaluate Model Skill
- One More Reason
Reasons to NOT Learn Probability
Before we go through the reasons that you should learn probability, let’s start by taking a brief look at the reasons why you should not.
I think you should not study probability if you are just getting started with applied machine learning.
- It’s not required. Having an appreciation for the abstract theory that underlies some machine learning algorithms is not required in order to use machine learning as a tool to solve problems.
- It’s slow. Taking months to years to study an entire related field before starting machine learning will delay you achieving your goals of being able to work through predictive modeling problems.
- It’s a huge field. Not all of probability is relevant to theoretical machine learning, let alone applied machine learning.
I recommend a breadth-first approach to getting started in applied machine learning.
I call this the results-first approach. It is where you start by learning and practicing the steps for working through a predictive modeling problem end-to-end (e.g. how to get results) with a tool (such as scikit-learn and Pandas in Python).
This process then provides the skeleton and context for progressively deepening your knowledge, such as how algorithms work and, eventually, the math that underlies them.
Once you know how to work through a predictive modeling problem, there are good reasons to deepen your understanding of probability. Let’s look at them.
1. Class Membership Requires Predicting a Probability
Classification predictive modeling problems are those where an example is assigned a given label.
An example that you may be familiar with is the iris flowers dataset where we have four measurements of a flower and the goal is to assign one of three different known species of iris flower to the observation.
We can model the problem as directly assigning a class label to each observation.
- Input: Measurements of a flower.
- Output: One iris species.
A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.
- Input: Measurements of a flower.
- Output: Probability of membership to each iris species.
Framing the problem as a prediction of class membership probabilities simplifies the modeling problem and makes it easier for a model to learn. It also allows the model to capture ambiguity in the data, which lets a downstream process, such as the user, interpret the probabilities in the context of the domain.
The probabilities can be transformed into a crisp class label by choosing the class with the largest probability. The probabilities can also be scaled or transformed using a probability calibration process.
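As a minimal sketch of this step, the snippet below converts a set of class membership probabilities into a crisp label. The probabilities are made-up values for a single flower, not the output of a real model.

```python
# Hypothetical predicted probabilities for one iris flower (illustrative values only).
probabilities = {
    "setosa": 0.05,
    "versicolor": 0.70,
    "virginica": 0.25,
}

# A class membership prediction should sum to 1 across the known classes.
total = sum(probabilities.values())

# Crisp class label: the class with the largest predicted probability.
label = max(probabilities, key=probabilities.get)

print(label, probabilities[label])
```

The probabilities themselves are still worth keeping: a prediction of 0.70 carries more doubt than one of 0.99, even though both map to the same label.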
Choosing a class membership framing of the problem, and interpreting the predictions made by the model, both require a basic understanding of probability.
2. Models Are Designed Using Probability
There are algorithms that are specifically designed to harness the tools and methods from probability.
These include individual algorithms, such as the Naive Bayes algorithm, which is constructed using Bayes Theorem with some simplifying assumptions.
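To make the Naive Bayes construction concrete, here is a small sketch that applies Bayes Theorem with the simplifying assumption that features are independent given the class. The classes, words, and probabilities are hypothetical. In practice they would be estimated from training data, and only the observed words are scored here to keep things short.

```python
# Hypothetical prior probabilities of each class.
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical P(word appears | class) for two words.
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.5},
}

def posterior(words):
    """Return P(class | words) using Bayes Theorem and the naive independence assumption."""
    joint = {}
    for cls, prior in priors.items():
        # Numerator of Bayes Theorem: P(class) * product of P(word | class).
        score = prior
        for word in words:
            score *= likelihoods[cls][word]
        joint[cls] = score
    # Denominator: P(words), obtained by summing the numerator over all classes.
    evidence = sum(joint.values())
    return {cls: score / evidence for cls, score in joint.items()}

result = posterior(["offer"])
print(result)
```

The "naive" part is the product inside the loop: multiplying per-word probabilities is only exact if the words are independent given the class.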
The linear regression algorithm can be seen as a probabilistic model that minimizes the mean squared error of predictions, and the logistic regression algorithm can be seen as a probabilistic model that minimizes the negative log likelihood of predicting the positive class label.
- Linear Regression
- Logistic Regression
The use of probability in model design also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGMs for short, which are designed around Bayes Theorem.
A notable graphical model is the Bayesian Belief Network, or Bayes Net, which is capable of capturing the conditional dependencies between variables.
3. Models Are Trained With Probabilistic Frameworks
Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework.
Some examples of general probabilistic modeling frameworks are described below.
Perhaps the most common is the framework of maximum likelihood estimation, sometimes shortened to MLE. This is a framework for estimating model parameters (e.g. weights) given observed data.
This is the framework that underlies the ordinary least squares estimate of a linear regression model and the log loss estimate for logistic regression.
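The link between least squares and maximum likelihood can be checked numerically. The sketch below fits a line with the closed-form ordinary least squares solution, then shows that moving either coefficient away from it increases the negative log-likelihood under a Gaussian noise model. The data values and the fixed noise level `sigma` are assumptions chosen for illustration.

```python
import math

# Small made-up dataset (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_ols(x, y):
    """Closed-form ordinary least squares for a line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def gaussian_nll(b0, b1, x, y, sigma=1.0):
    """Negative log-likelihood of y assuming y ~ Normal(b0 + b1 * x, sigma^2)."""
    nll = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (b0 + b1 * xi)
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2) + residual ** 2 / (2 * sigma ** 2)
    return nll

b0, b1 = fit_ols(x, y)
best = gaussian_nll(b0, b1, x, y)

# Nudging either coefficient away from the OLS solution makes the data less likely.
worse_slope = gaussian_nll(b0, b1 + 0.1, x, y)
worse_intercept = gaussian_nll(b0 + 0.1, b1, x, y)
print(b0, b1, best, worse_slope, worse_intercept)
```

For a fixed `sigma`, the negative log-likelihood is the sum of squared errors plus a constant, so minimizing one minimizes the other.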
The expectation-maximization algorithm, or EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g. fitting a mixture of k Gaussian distributions to k clusters. The k-Means clustering algorithm is closely related and can be viewed as a variant of EM that uses hard cluster assignments.
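A rough sketch of EM for clustering is given below, fitting a two-component Gaussian mixture to one-dimensional data. The data, the starting values, and the small variance floor are all assumptions made for illustration.

```python
import math

# Made-up 1-D data gathered around two centers (illustrative values only).
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]

def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed starting guesses for the two mixture components.
means, variances, weights = [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(20):
    # E-step: soft assignment (responsibility) of each point to each component.
    resp = []
    for x in data:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: re-estimate the parameters from the responsibility-weighted data.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        # Small floor keeps the variance from collapsing to zero.
        variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk + 1e-6
        weights[k] = nk / len(data)

print(means, weights)
```

If each point were assigned entirely to its most probable component instead of being shared, the mean update would match the centroid update in k-Means.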
For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and predicted probability distribution. This is used in classification algorithms like logistic regression as well as deep learning neural networks.
It is common to measure this difference between probability distributions during training using entropy, e.g. via cross-entropy. Entropy, cross-entropy, and differences between distributions measured via KL divergence all come from the field of information theory, which builds directly upon probability theory. For example, the information of an event is calculated directly as the negative log of its probability, and entropy is the expected information across all events in a distribution.
As such, tools from information theory, such as minimizing cross-entropy loss, can be seen as another probabilistic framework for model estimation.
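These quantities are simple to compute for discrete distributions. The sketch below uses two made-up distributions over three events and checks the standard identity that cross-entropy equals entropy plus KL divergence. The function names are my own, and base-2 logarithms are used so the results are in bits.

```python
import math

# Two made-up distributions over the same three events (illustrative values only).
p = [0.10, 0.40, 0.50]  # target or "observed" distribution
q = [0.80, 0.15, 0.05]  # "predicted" distribution

def information(prob):
    """Information (surprise) of a single event, in bits."""
    return -math.log2(prob)

def entropy(p):
    """Expected information of events drawn from p, in bits."""
    return sum(pi * information(pi) for pi in p)

def cross_entropy(p, q):
    """Average bits needed to encode events from p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Extra bits paid for using q in place of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
print(h_p, h_pq, kl_pq)
```

Because the entropy of the target distribution is fixed, minimizing cross-entropy during training is the same as minimizing the KL divergence from the target to the prediction.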
- Minimum Cross-Entropy Loss Estimation
4. Models Are Tuned With a Probabilistic Framework
It is common to tune the hyperparameters of a machine learning model, such as k for kNN or the learning rate in a neural network.
Typical approaches include grid searching ranges of hyperparameters or randomly sampling hyperparameter combinations.
Bayesian optimization is a more efficient approach to hyperparameter optimization that involves a directed search of the space of possible configurations, focusing on those configurations that are most likely to result in better performance. As its name suggests, the approach was devised from and harnesses Bayes Theorem when sampling the space of possible configurations.
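The directed search can be sketched in plain Python. The version below is only one possible setup and rests on several assumptions of mine: a Gaussian process surrogate with a squared-exponential kernel, the expected improvement acquisition function, a single hyperparameter on a fixed grid of candidates, and a made-up `objective` standing in for an expensive model evaluation.

```python
import math

def objective(x):
    """Stand-in for an expensive model evaluation; a made-up score to maximize."""
    return -(x - 0.7) ** 2

def kernel(a, b, length=0.2):
    """Squared-exponential covariance between two hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def posterior(x, xs, ys, noise=1e-4):
    """Surrogate belief about the score at x: posterior mean and standard deviation."""
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [kernel(x, a) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected amount by which a candidate beats the best score seen so far."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

candidates = [i / 50 for i in range(51)]  # grid of hyperparameter values in [0, 1]
xs = [0.0, 1.0]                           # initial configurations
ys = [objective(x) for x in xs]

for _ in range(6):
    best = max(ys)
    untried = [c for c in candidates if c not in xs]
    # Evaluate next the configuration the surrogate considers most promising.
    nxt = max(untried, key=lambda c: expected_improvement(*posterior(c, xs, ys), best))
    xs.append(nxt)
    ys.append(objective(nxt))

best_x = xs[ys.index(max(ys))]
print(best_x, max(ys))
```

Each pass updates the surrogate's belief with the new observation and uses it to choose the next evaluation, which is where the Bayesian update enters the search.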
5. Models Are Evaluated With Probabilistic Measures
For those algorithms where a prediction of probabilities is made, evaluation measures are required to summarize the performance of the model.
There are many measures used to summarize the performance of a model based on predicted probabilities. Common examples include:
- Log Loss (also called cross-entropy).
- Brier Score and the Brier Skill Score.
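Both measures can be computed in a few lines. The sketch below uses made-up labels and predicted probabilities for a binary problem. The Brier Skill Score here takes a forecast that always predicts the base rate as its reference, which is a common choice but an assumption on my part.

```python
import math

# Made-up true labels and predicted probabilities of the positive class (illustrative only).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.3, 0.1]

def log_loss(y_true, y_prob):
    """Average negative log-likelihood of the true labels (binary cross-entropy)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and the 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def brier_skill_score(y_true, y_prob):
    """Improvement over a reference forecast that always predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [base_rate] * len(y_true))
    return 1.0 - brier_score(y_true, y_prob) / reference

ll = log_loss(y_true, y_prob)
bs = brier_score(y_true, y_prob)
bss = brier_skill_score(y_true, y_prob)
print(ll, bs, bss)
```

Lower is better for log loss and the Brier Score. For the Brier Skill Score, values above zero indicate an improvement over the reference forecast.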
For binary classification tasks where a single probability score is predicted, Receiver Operating Characteristic, or ROC, curves can be constructed to explore the different cut-offs that can be used when interpreting a prediction, each of which results in a different trade-off. The area under the ROC curve, or ROC AUC, can also be calculated as an aggregate measure. A related method that focuses on the positive class is the Precision-Recall Curve and its area under the curve.
- ROC Curve and ROC AUC
- Precision-Recall Curve and AUC
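The trade-off between cut-offs can be seen in a short sketch. The labels and probabilities below are made up, the function names are my own, and the ROC AUC is computed from its pairwise interpretation (the chance that a random positive example is scored above a random negative one) instead of by integrating the curve.

```python
# Made-up true labels and predicted probabilities of the positive class (illustrative only).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

def counts_at(threshold):
    """True positives and false positives when predicting 1 for probability >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp, fp

def roc_point(threshold):
    """One point on the ROC curve: (false positive rate, true positive rate)."""
    tp, fp = counts_at(threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

def precision_recall_point(threshold):
    """One point on the precision-recall curve: (recall, precision)."""
    tp, fp = counts_at(threshold)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    return tp / sum(y_true), precision

def roc_auc():
    """Fraction of positive/negative pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

for threshold in (0.25, 0.5, 0.75):
    print(threshold, roc_point(threshold), precision_recall_point(threshold))

auc = roc_auc()
print(auc)
```

Lowering the cut-off catches more positives at the cost of more false alarms, which is the trade-off that both curves summarize.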
Choice and interpretation of these scoring methods require a foundational understanding of probability theory.
One More Reason
If I could give one more reason, it would be: Because it is fun.
Seriously.
Learning probability, at least the way I teach it with practical examples and executable code, is a lot of fun. Once you can see how the operations work on real data, it is hard to avoid developing a strong intuition for a subject that is often quite unintuitive.
Do you have more reasons why it is critical for an intermediate machine learning practitioner to learn probability?
Let me know in the comments below.
Summary
In this post, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability.
Specifically, you learned:
- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.