Skip to content
Generic filters
Exact matches only

5 Reasons to Learn Probability for Machine Learning

Last Updated on November 8, 2019

Probability is a field of mathematics that quantifies uncertainty.

It is undeniably a pillar of the field of machine learning, and many recommend it as a prerequisite subject to study prior to getting started. This is misleading advice, as probability makes more sense to a practitioner once they have the context of the applied machine learning process in which to interpret it.

In this post, you will discover why machine learning practitioners should study probabilities to improve their skills and capabilities.

After reading this post, you will know:

  • Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
  • Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
  • The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Discover bayes opimization, naive bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

5 Reasons to Learn Probability for Machine Learning

5 Reasons to Learn Probability for Machine Learning
Photo by Marco Verch, some rights reserved.


This tutorial is divided into seven parts; they are:

  1. Reasons to NOT Learn Probability
  2. Class Membership Requires Predicting a Probability
  3. Some Algorithms Are Designed Using Probability
  4. Models Are Trained Using a Probabilistic Framework
  5. Models Can Be Tuned With a Probabilistic Framework
  6. Probabilistic Measures Are Used to Evaluate Model Skill
  7. One More Reason

Reasons to NOT Learn Probability

Before we go through the reasons that you should learn probability, let’s start off by taking a small look at the reason why you should not.

I think you should not study probability if you are just getting started with applied machine learning.

  • It’s not required. Having an appreciation for the abstract theory that underlies some machine learning algorithms is not required in order to use machine learning as a tool to solve problems.
  • It’s slow. Taking months to years to study an entire related field before starting machine learning will delay you achieving your goals of being able to work through predictive modeling problems.
  • It’s a huge field. Not all of probability is relevant to theoretical machine learning, let alone applied machine learning.

I recommend a breadth-first approach to getting started in applied machine learning.

I call this the results-first approach. It is where you start by learning and practicing the steps for working through a predictive modeling problem end-to-end (e.g. how to get results) with a tool (such as scikit-learn and Pandas in Python).

This process then provides the skeleton and context for progressively deepening your knowledge, such as how algorithms work and, eventually, the math that underlies them.

After you know how to work through a predictive modeling problem, let’s look at why you should deepen your understanding of probability.

Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

1. Class Membership Requires Predicting a Probability

Classification predictive modeling problems are those where an example is assigned a given label.

An example that you may be familiar with is the iris flowers dataset where we have four measurements of a flower and the goal is to assign one of three different known species of iris flower to the observation.

We can model the problem as directly assigning a class label to each observation.

  • Input: Measurements of a flower.
  • Output: One iris species.

A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.

  • Input: Measurements of a flower.
  • Output: Probability of membership to each iris species.

Framing the problem as a prediction of class membership simplifies the modeling problem and makes it easier for a model to learn. It allows the model to capture ambiguity in the data, which allows a process downstream, such as the user to interpret the probabilities in the context of the domain.

The probabilities can be transformed into a crisp class label by choosing the class with the largest probability. The probabilities can also be scaled or transformed using a probability calibration process.

This choice of a class membership framing of the problem interpretation of the predictions made by the model requires a basic understanding of probability.

2. Models Are Designed Using Probability

There are algorithms that are specifically designed to harness the tools and methods from probability.

These range from individual algorithms, like Naive Bayes algorithm, which is constructed using Bayes Theorem with some simplifying assumptions.

The linear regression algorithm can be seen as a probabilistic model that minimizes the mean squared error of predictions, and the logistic regression algorithm can be seen as a probabilistic model that minimizes the negative log likelihood of predicting the positive class label.

  • Linear Regression
  • Logistic Regression

It also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGM for short, and designed around Bayes Theorem.

A notable graphical model is Bayesian Belief Networks or Bayes Nets, which are capable of capturing the conditional dependencies between variables.

3. Models Are Trained With Probabilistic Frameworks

Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework.

Some examples of general probabilsitic modeling frameworks are:

Perhaps the most common is the framework of maximum likelihood estimation, sometimes shorted as MLE. This is a framework for estimating model parameters (e.g. weights) given observed data.

This is the framework that underlies the ordinary least squares estimate of a linear regression model and the log loss estimate for logistic regression.

The expectation-maximization algorithm, or EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g. estimating k means for k clusters, also known as the k-Means clustering algorithm.

For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and predicted probability distribution. This is used in classification algorithms like logistic regression as well as deep learning neural networks.

It is common to measure this difference in probability distribution during training using entropy, e.g. via cross-entropy. Entropy, and differences between distributions measured via KL divergence, and cross-entropy are from the field of information theory that directly build upon probability theory. For example, entropy is calculated directly as the negative log of the probability.

As such, these tools from information theory such as minimising cross-entropy loss can be seen as another probabilistic framework for model estimation.

  • Minimum Cross-Entropy Loss Estimation

4. Models Are Tuned With a Probabilistic Framework

It is common to tune the hyperparameters of a machine learning model, such as k for kNN or the learning rate in a neural network.

Typical approaches include grid searching ranges of hyperparameters or randomly sampling hyperparameter combinations.

Bayesian optimization is a more efficient to hyperparameter optimization that involves a directed search of the space of possible configurations based on those configurations that are most likely to result in better performance. As its name suggests, the approach was devised from and harnesses Bayes Theorem when sampling the space of possible configurations.

For more on Bayesian optimization, see the tutorial:

5. Models Are Evaluated With Probabilistic Measures

For those algorithms where a prediction of probabilities is made, evaluation measures are required to summarize the performance of the model.

There are many measures used to summarize the performance of a model based on predicted probabilities. Common examples include:

  • Log Loss (also called cross-entropy).
  • Brier Score, and the Brier Skill Score

For more on metrics for evaluating predicted probabilities, see the tutorial:

For binary classification tasks where a single probability score is predicted, Receiver Operating Characteristic, or ROC, curves can be constructed to explore different cut-offs that can be used when interpreting the prediction that, in turn, result in different trade-offs. The area under the ROC curve, or ROC AUC, can also be calculated as an aggregate measure. A related method that couses on the positive class is the Precision-Recall Curve and area under curve.

  • ROC Curve and ROC AUC
  • Precision-Recall Curve and AUC

For more on these curves and when to use them see the tutorial:

Choice and interpretation of these scoring methods require a foundational understanding of probability theory.

One More Reason

If I could give one more reason, it would be: Because it is fun.


Learning probability, at least the way I teach it with practical examples and executable code, is a lot of fun. Once you can see how the operations work on real data, it is hard to avoid developing a strong intuition for a subject that is often quite unintuitive.

Do you have more reasons why it is critical for an intermediate machine learning practitioner to learn probability?

Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.





In this post, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability.

Specifically, you learned:

  • Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
  • Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
  • The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Probability for Machine Learning!

Probability for Machine Learning

Develop Your Understanding of Probability

…with just a few lines of python code

Discover how in my new Ebook:
Probability for Machine Learning

It provides self-study tutorials and end-to-end projects on:
Bayes Theorem, Bayesian Optimization, Distributions, Maximum Likelihood, Cross-Entropy, Calibrating Models
and much more…

Finally Harness Uncertainty in Your Projects

Skip the Academics. Just Results.

See What’s Inside

error: Content is protected !!