Last Updated on August 8, 2019
It is important to present both the expected skill of a machine learning model and confidence intervals for that skill.
Confidence intervals provide a range of model skill and a likelihood that the model's skill will fall within that range when making predictions on new data. For example, a 95% likelihood that classification accuracy lies between 70% and 75%.
A robust way to calculate confidence intervals for machine learning algorithms is to use the bootstrap. This is a general technique for estimating statistics that can be used to calculate empirical confidence intervals, regardless of the distribution of skill scores (e.g. non-Gaussian).
In this post, you will discover how to use the bootstrap to calculate confidence intervals for the performance of your machine learning algorithms.
After reading this post, you will know:
- How to estimate confidence intervals of a statistic using the bootstrap.
- How to apply this method to evaluate machine learning algorithms.
- How to implement the bootstrap method for estimating confidence intervals in Python.
Let’s get started.
- Update June/2017: Fixed a bug where the wrong values were provided to numpy.percentile(). Thanks Elie Kawerk.
- Update March/2018: Updated link to dataset file.
What You Will Learn
Bootstrap Confidence Intervals
Calculating confidence intervals with the bootstrap involves two steps:
- Calculate a Population of Statistics
- Calculate Confidence Intervals
1. Calculate a Population of Statistics
The first step is to use the bootstrap procedure to resample the original data a number of times and calculate the statistic of interest.
The dataset is sampled with replacement. This means that each time an item is selected from the original dataset, it is not removed, allowing that item to possibly be selected again for the sample.
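To make sampling with replacement concrete, here is a minimal sketch using only the Python standard library. The ten-item dataset and the fixed seed are assumptions made for illustration; the point is simply that some items tend to appear more than once in the sample while others are left out.

```python
import random
from collections import Counter

# Hypothetical toy dataset of ten items.
data = list(range(10))

rng = random.Random(1)  # fixed seed so the draw is repeatable

# Draw a bootstrap sample the same size as the data, with replacement.
sample = rng.choices(data, k=len(data))

counts = Counter(sample)
repeated = sorted(item for item, c in counts.items() if c > 1)
left_out = sorted(set(data) - set(sample))

print("sample:   ", sample)
print("repeated: ", repeated)
print("left out: ", left_out)
```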
The statistic is calculated on the sample and is stored so that we build up a population of the statistic of interest.
The number of bootstrap repeats controls the variance of the estimate: more repeats give a more stable estimate, and hundreds or thousands are often used.
We can demonstrate this step with the following pseudocode.
statistics = []
for i in bootstraps:
    sample = select_sample_with_replacement(data)
    stat = calculate_statistic(sample)
    statistics.append(stat)
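The pseudocode for this step could be made runnable along the following lines. This is a sketch under stated assumptions: the helper name bootstrap_statistics, the made-up measurements, and the choice of the mean as the statistic are all illustrative and not part of the original tutorial.

```python
import random
import statistics

def bootstrap_statistics(data, statistic, n_bootstraps, seed=None):
    """Return a list with one value of `statistic` per bootstrap sample."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_bootstraps):
        # Resample with replacement, same size as the original data.
        sample = rng.choices(data, k=len(data))
        stats.append(statistic(sample))
    return stats

# Hypothetical data: 50 made-up measurements.
data_rng = random.Random(0)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

means = bootstrap_statistics(data, statistics.mean, n_bootstraps=1000, seed=1)
print(len(means), min(means), max(means))
```

Any function that maps a list of values to a single number could be passed in place of statistics.mean, for example statistics.median.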
2. Calculate Confidence Interval
Now that we have a population of the statistics of interest, we can calculate the confidence intervals.
This is done by first ordering the statistics, then selecting the values at the percentiles that bound the chosen confidence level. The confidence level in this case is called alpha. Note that many statistics texts instead use alpha for the significance level, which would be 0.05 for a 95% interval.
For example, if we were interested in a confidence interval of 95%, then alpha would be 0.95 and we would select the value at the 2.5th percentile as the lower bound and the value at the 97.5th percentile as the upper bound on the statistic of interest.
With 1,000 statistics calculated from 1,000 bootstrap samples, the lower bound would be the 25th value and the upper bound would be the 975th value, assuming the list of statistics was ordered.
In this, we are calculating a non-parametric confidence interval that does not make any assumption about the functional form of the distribution of the statistic. This confidence interval is often called the empirical confidence interval.
We can demonstrate this with pseudocode below.
ordered = sort(statistics)
lower = percentile(ordered, (1-alpha)/2)
upper = percentile(ordered, alpha+((1-alpha)/2))
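As a hedged illustration, the same calculation can be written with NumPy. The function name percentile_interval and the simulated population of statistics are assumptions for this sketch. One detail worth noting is that numpy.percentile() expects percentages between 0 and 100 rather than fractions, so the values from the pseudocode are multiplied by 100.

```python
import numpy

def percentile_interval(values, alpha=0.95):
    """Empirical interval covering the central `alpha` fraction of values."""
    # numpy.percentile expects percentages in [0, 100], not fractions.
    lower_p = (1.0 - alpha) / 2.0 * 100
    upper_p = (alpha + (1.0 - alpha) / 2.0) * 100
    lower, upper = numpy.percentile(values, [lower_p, upper_p])
    return lower, upper

# Hypothetical population of 1,000 bootstrap statistics.
rng = numpy.random.default_rng(1)
values = rng.normal(loc=0.70, scale=0.02, size=1000)

lower, upper = percentile_interval(values, alpha=0.95)
print("95%% interval: %.3f to %.3f" % (lower, upper))
```

There is no need to sort the values first, because numpy.percentile() handles the ordering internally.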
Bootstrap Model Performance
The bootstrap can be used to evaluate the performance of machine learning algorithms.
The size of the sample taken each iteration may be limited to 60% or 80% of the available data. Because sampling is done with replacement and the sample may be smaller than the dataset, some observations will not be included in the sample. These are called out-of-bag (OOB) samples.
A model can then be trained on the data sample in each bootstrap iteration and evaluated on the out-of-bag samples to give a performance statistic, which can be collected and from which confidence intervals may be calculated.
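To get a feel for how many rows end up out of bag, the sketch below compares a simple probability calculation with one simulated draw. It assumes a hypothetical dataset of 1,000 rows; the function names are illustrative. The chance that a given row is never picked in k draws from n rows is (1 - 1/n)^k, which works out to roughly 61% of rows for a half-size sample and roughly 37% for a full-size sample.

```python
import random

def expected_oob_fraction(n_rows, n_sample):
    """Chance that a given row is never drawn in n_sample draws with replacement."""
    return (1.0 - 1.0 / n_rows) ** n_sample

def simulated_oob_fraction(n_rows, n_sample, seed=None):
    """Fraction of rows left out of one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_rows) for _ in range(n_sample)}
    return 1.0 - len(drawn) / n_rows

n_rows = 1000  # hypothetical dataset size
for fraction in (0.5, 0.8, 1.0):
    n_sample = int(n_rows * fraction)
    print(fraction,
          round(expected_oob_fraction(n_rows, n_sample), 3),
          round(simulated_oob_fraction(n_rows, n_sample, seed=1), 3))
```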
We can demonstrate this process with the following pseudocode.
statistics = []
for i in bootstraps:
    train, test = select_sample_with_replacement(data, size)
    model = train_model(train)
    stat = evaluate_model(model, test)
    statistics.append(stat)
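One way this loop might look in runnable form is sketched below with scikit-learn. The synthetic data from make_classification(), the 80% sample size, the small number of iterations, and the use of row indices to find the out-of-bag rows are all assumptions chosen to keep the sketch short and quick to run.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rng = numpy.random.default_rng(1)
n_bootstraps = 100              # kept small so the sketch runs quickly
n_size = int(len(X) * 0.8)      # sample 80% of the rows each iteration

scores = []
for _ in range(n_bootstraps):
    # Row indices drawn with replacement form the training sample.
    train_idx = rng.integers(0, len(X), size=n_size)
    # Rows whose index was never drawn are the out-of-bag test set.
    test_idx = numpy.setdiff1d(numpy.arange(len(X)), train_idx)
    model = DecisionTreeClassifier(random_state=1)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("mean OOB accuracy: %.3f" % numpy.mean(scores))
```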
Calculate Classification Accuracy Confidence Interval
This section demonstrates how to use the bootstrap to calculate an empirical confidence interval for a machine learning algorithm on a real-world dataset using the Python machine learning library scikit-learn.
This section assumes you have Pandas, NumPy, Matplotlib, and scikit-learn installed.
First, download the Pima Indians dataset and place it in your current working directory with the filename “pima-indians-diabetes.data.csv”.
We will load the dataset using Pandas.
from pandas import read_csv
# load dataset
data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values
Next, we will configure the bootstrap. We will use 1,000 bootstrap iterations and select a sample that is 50% of the size of the dataset.
# configure bootstrap
n_iterations = 1000
n_size = int(len(data) * 0.50)
Next, we will iterate over the bootstrap.
The sample will be selected with replacement using the resample() function from scikit-learn. Any rows that were not included in the sample are retrieved and used as the test dataset. Next, a decision tree classifier is fit on the sample and evaluated on the test set; the classification accuracy is calculated and added to a list of scores collected across all the bootstraps.
# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = numpy.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = DecisionTreeClassifier()
    model.fit(train[:,:-1], train[:,-1])
    # evaluate model
    predictions = model.predict(test[:,:-1])
    score = accuracy_score(test[:,-1], predictions)
    # report and store the score for this iteration
    print(score)
    stats.append(score)
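The list comprehension used to build the test set compares the contents of every row against the training rows, which can become slow on larger datasets. A possible alternative, offered here only as a sketch and not as the approach used in this tutorial, is to resample row indices and use a boolean mask to pick out the rows that were never drawn. The small random array below is a hypothetical stand-in for the values array.

```python
import numpy

# Hypothetical stand-in for the `values` array: 20 rows, last column is the label.
rng = numpy.random.default_rng(1)
values = numpy.column_stack([rng.normal(size=(20, 3)), rng.integers(0, 2, size=20)])

n_size = int(len(values) * 0.50)

# Draw row indices with replacement, then look the rows up.
train_idx = rng.integers(0, len(values), size=n_size)
# A boolean mask marks the rows that were never drawn.
oob_mask = numpy.ones(len(values), dtype=bool)
oob_mask[train_idx] = False

train = values[train_idx]
test = values[oob_mask]
print(train.shape, test.shape)
```

Unlike the content-based comparison, this version keeps a row in the test set when it merely has the same values as a training row, so the two approaches can differ on datasets with duplicate rows.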
Once the scores are collected, a histogram is created to give an idea of the distribution of scores. We would generally expect this distribution to be roughly Gaussian, perhaps with some skew, with the scores spread around the mean.
Finally, we can calculate the empirical confidence intervals using the percentile() NumPy function. A 95% confidence interval is used, so the values at the 2.5 and 97.5 percentiles are selected.
Putting this all together, the complete example is listed below.
import numpy
from pandas import read_csv
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load dataset
data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values
# configure bootstrap
n_iterations = 1000
n_size = int(len(data) * 0.50)
# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = numpy.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = DecisionTreeClassifier()
    model.fit(train[:,:-1], train[:,-1])
    # evaluate model
    predictions = model.predict(test[:,:-1])
    score = accuracy_score(test[:,-1], predictions)
    print(score)
    stats.append(score)
# plot scores
pyplot.hist(stats)
pyplot.show()
# confidence intervals
alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, numpy.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, numpy.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
Running the example prints the classification accuracy for each bootstrap iteration. Your specific results may vary given the random nature of the resampling and of the learning algorithm.
A histogram of the 1,000 accuracy scores is created showing a Gaussian-like distribution.
Finally, the confidence interval is reported, showing that there is a 95% likelihood that the interval from 64.4% to 73.0% covers the true skill of the model.
…
0.646288209607
0.682203389831
0.668085106383
0.673728813559
0.686021505376
95.0 confidence interval 64.4% and 73.0%
This same method can be used to calculate confidence intervals for any other error score, such as root mean squared error for regression algorithms.
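As a rough sketch of what that might look like for regression, the example below bootstraps the root mean squared error of a straight-line fit. Everything specific here is an assumption for illustration: the noisy synthetic line, the use of numpy.polyfit() as the model, and the 500 iterations.

```python
import numpy

rng = numpy.random.default_rng(1)

# Hypothetical regression data: a noisy straight line.
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=200)

n_iterations = 500
n_size = int(len(x) * 0.50)
scores = []
for _ in range(n_iterations):
    # Draw training rows with replacement; the rest are out of bag.
    train_idx = rng.integers(0, len(x), size=n_size)
    oob_mask = numpy.ones(len(x), dtype=bool)
    oob_mask[train_idx] = False
    # Fit a straight line on the bootstrap sample.
    slope, intercept = numpy.polyfit(x[train_idx], y[train_idx], deg=1)
    # Score it on the out-of-bag rows with root mean squared error.
    errors = y[oob_mask] - (slope * x[oob_mask] + intercept)
    scores.append(numpy.sqrt(numpy.mean(errors ** 2)))

alpha = 0.95
lower = numpy.percentile(scores, (1.0 - alpha) / 2.0 * 100)
upper = numpy.percentile(scores, (alpha + (1.0 - alpha) / 2.0) * 100)
print("%.1f%% interval for RMSE: %.3f to %.3f" % (alpha * 100, lower, upper))
```

Because an error score such as RMSE is not bounded above by 1.0, the clamping with min(1.0, ...) used for accuracy is left out here.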
Summary
In this post, you discovered how to use the bootstrap to calculate confidence intervals for machine learning algorithms.
Specifically, you learned:
- How to calculate the bootstrap estimate of confidence intervals of a statistic from a dataset.
- How to apply the bootstrap to evaluate machine learning algorithms.
- How to calculate bootstrap confidence intervals for machine learning algorithms in Python.
Do you have any questions about confidence intervals?
Ask your questions in the comments below.