
Compare Models And Select The Best Using The Caret R Package

Last Updated on December 13, 2019

The Caret R package allows you to easily construct many different model types and tune their parameters.

After creating and tuning many model types, you may want to know and select the best model so that you can use it to make predictions, perhaps in an operational environment.

In this post, you will discover how to compare the results of multiple models using the caret R package.

Discover how to prepare data, fit machine learning models and evaluate their predictions in R with my new book, including 14 step-by-step tutorials, 3 projects, and full source code.

Let’s get started.

Compare Machine Learning Models

While working on a problem, you will settle on one or a handful of well-performing models. After tuning the parameters of each, you will want to compare the models and discover which are the best and worst performing.

It is useful to get an idea of the spread of the models: perhaps one can be improved, or you can stop working on one that is clearly performing worse than the others.

In the example below we compare three sophisticated machine learning models on the Pima Indians diabetes dataset. The dataset summarizes a collection of medical reports and indicates the onset of diabetes in patients within five years.

The dataset is provided by the mlbench R package as PimaIndiansDiabetes, where you can learn more about it in the package documentation.

The three models constructed and tuned are Learning Vector Quantization (LVQ), Stochastic Gradient Boosting (also known as Gradient Boosted Machine or GBM), and Support Vector Machine (SVM). Each model is automatically tuned and is evaluated using 3 repeats of 10-fold cross validation.
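By default, train() evaluates only a small grid of candidate values for each algorithm's tuning parameters. If you want a broader search, a minimal sketch is to pass the tuneLength argument (the value 5 here is arbitrary, and this snippet assumes the control object defined in the listing below):

# ask caret to evaluate a larger grid of candidate tuning parameters
modelGbmWide <- train(diabetes~., data=PimaIndiansDiabetes, method="gbm", trControl=control, verbose=FALSE, tuneLength=5)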

The random number seed is set before each algorithm is trained to ensure that each algorithm gets the same data partitions and repeats. This allows us to compare apples to apples in the final results. Alternatively, we could ignore this concern and increase the number of repeats to 30 or 100, using randomness to control for variation in the data partitioning.
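As a sketch, that alternative training scheme differs only in the repeats argument of trainControl() (30 repeats of 10-fold cross validation yields 300 resamples per model):

# more repeats: rely on averaging over many partitions rather than fixed seeds
control <- trainControl(method="repeatedcv", number=10, repeats=30)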

Need more Help with R for Machine Learning?

Take my free 14-day email course and discover how to use R on your project (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Start Your FREE Mini-Course Now!

Once the models are trained and an optimal parameter configuration found for each, the accuracy results from each of the best models are collected. Each “winning” model has 30 results (3 repeats of 10-fold cross validation). The objective of comparing results is to compare the accuracy distributions (30 values) between the models.
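As an aside, if you want to inspect the raw values for a single model, each trained model keeps its per-resample scores in the resample element (a sketch, assuming the code in the listing below has been run):

# per-resample Accuracy and Kappa for the tuned GBM model
head(modelGbm$resample)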

The comparison is done in three ways: the distributions are summarized in terms of percentiles, as box plots, and finally as dot plots.

# load the libraries
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare the training scheme: 3 repeats of 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the LVQ model
set.seed(7)
modelLvq <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", trControl=control)
# train the GBM model
set.seed(7)
modelGbm <- train(diabetes~., data=PimaIndiansDiabetes, method="gbm", trControl=control, verbose=FALSE)
# train the SVM model
set.seed(7)
modelSvm <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=control)
# collect the resampling results from the three models
results <- resamples(list(LVQ=modelLvq, GBM=modelGbm, SVM=modelSvm))
# summarize the distributions of accuracy and kappa
summary(results)
# box plots of the results
bwplot(results)
# dot plots of the results
dotplot(results)



Below is the table of results from summarizing the distributions for each model.

Models: LVQ, GBM, SVM
Number of resamples: 30

Accuracy
      Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LVQ 0.5921  0.6623 0.6928 0.6935  0.7273 0.7922    0
GBM 0.7013  0.7403 0.7662 0.7665  0.7890 0.8442    0
SVM 0.6711  0.7403 0.7582 0.7651  0.7890 0.8961    0

Kappa
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LVQ 0.03125  0.1607 0.2819 0.2650  0.3845 0.5103    0
GBM 0.32690  0.3981 0.4638 0.4663  0.5213 0.6426    0
SVM 0.21870  0.3889 0.4167 0.4520  0.5003 0.7638    0


[Figure: Box Plot Comparing Model Results using the Caret R Package]

[Figure: Dotplot Comparing Model Results using the Caret R Package]
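Both plots are lattice plots, so you can pass standard lattice options through to them. For example, this sketch gives the Accuracy and Kappa panels independent axis ranges:

# let each metric use its own axis scale
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)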

If you needed to make strong claims about which algorithm was better, you could also use statistical hypothesis tests to show that the differences in the results are significant: for example, a Student's t-test if the results are normally distributed, or a rank-sum test such as the Wilcoxon test if the distribution is unknown.
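Caret can run these comparisons for you: calling diff() on the resamples object computes pairwise differences between the models, and summary() reports the estimated differences alongside p-values from paired t-tests:

# pairwise statistical comparison of the resampled distributions
diffs <- diff(results)
# lower diagonal: estimated differences; upper diagonal: p-values
summary(diffs)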

Summary

In this post you discovered how you can use the caret R package to compare the results from multiple different models, even after their parameters have been optimized. You saw three ways the results can be compared: as a table of percentiles, as box plots, and as dot plots.

The examples in this post are standalone and you can easily copy-and-paste them into your own project and adapt them for your problem.
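For example, if GBM were chosen as the best model, making predictions takes one more line. This sketch predicts on the first few rows of the training data purely for illustration; in practice you would use new, unseen data:

# predict class labels with the selected model (illustrative only)
predictions <- predict(modelGbm, newdata=PimaIndiansDiabetes[1:5, ])
print(predictions)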

Discover Faster Machine Learning in R!

Master Machine Learning With R

Develop Your Own Models in Minutes

…with just a few lines of R code

Discover how in my new Ebook:
Machine Learning Mastery With R

Covers self-study tutorials and end-to-end projects like:
Loading data, visualization, building models, tuning, and much more…

Finally Bring Machine Learning To Your Own Projects

Skip the Academics. Just Results.

See What’s Inside
