Data Preparation for Gradient Boosting with XGBoost in Python

Last Updated on May 1, 2020

XGBoost is a popular implementation of Gradient Boosting because of its speed and performance.

Internally, XGBoost represents all problems as a regression predictive modeling problem that only takes numerical values as input. If your data is in a different form, it must be prepared into the expected format.

In this post, you will discover how to prepare your data for use with gradient boosting and the XGBoost library in Python.

After reading this post you will know:

  • How to encode string output variables for classification.
  • How to prepare categorical input variables using one hot encoding.
  • How to automatically handle missing data with XGBoost.

Discover how to configure, fit, tune and evaluate gradient boosting models with XGBoost in my new book, with 15 step-by-step tutorial lessons, and full Python code.

Let’s get started.

  • Update Sept/2016: I updated a few small typos in the impute example.
  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
  • Update Jan/2017: Updated breast cancer example to convert input data to strings.
  • Update Oct/2019: Updated usage of OneHotEncoder to suppress warnings.
  • Update Dec/2019: Updated example to fix bug in API usage in the multi-class example.
  • Update May/2020: Updated to use SimpleImputer in new v0.22 API.

Photo by Ed Dunens, some rights reserved.


Label Encode String Class Values

The iris flowers classification problem is an example of a problem that has a string class value.

This is a prediction problem where given measurements of iris flowers in centimeters, the task is to predict to which species a given flower belongs.

Download the dataset and place it in your current working directory with the filename “iris.csv“.

Below is a sample of the raw dataset.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa

XGBoost cannot model this problem as-is because it requires that the output variable be numeric.

We can easily convert the string values to integer values using the LabelEncoder. The three class values (Iris-setosa, Iris-versicolor, Iris-virginica) are mapped to the integer values (0, 1, 2).

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

We save the label encoder as a separate object so that we can transform both the training and later the test and validation datasets using the same encoding scheme.
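
For example, a minimal sketch of reusing the fitted encoder; the label values below are just illustrative iris species strings, not part of the example that follows.

# reuse the same fitted encoder on new labels (illustrative values)
new_labels = ['Iris-virginica', 'Iris-setosa']
encoded = label_encoder.transform(new_labels)
# map integer predictions back to the original string class names
decoded = label_encoder.inverse_transform(encoded)
print(encoded, decoded)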

Below is a complete example demonstrating how to load the iris dataset. Notice that Pandas is used to load the data in order to handle the string class values.

# multiclass classification
import pandas
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data
data = pandas.read_csv('iris.csv', header=None)
dataset = data.values
# split data into X and y
X = dataset[:,0:4]
Y = dataset[:,4]
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Running the example produces the following output:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 92.00%

Notice how the XGBoost model is configured to automatically model the multiclass classification problem using the multi:softprob objective, a variation on the softmax loss function that models class probabilities. This suggests that, internally, the output class is automatically converted into a one hot type encoding.
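
If you need the per-class probabilities that the multi:softprob objective produces, you can call predict_proba() on the fitted model. A minimal sketch, assuming the model and X_test from the example above:

# per-class probabilities from the fitted multiclass model
probabilities = model.predict_proba(X_test)
print(probabilities.shape)  # (n_rows, 3), one column per encoded class
print(probabilities[0])     # probabilities for the first test row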

One Hot Encode Categorical Data

Some datasets only contain categorical data, for example the breast cancer dataset.

This dataset describes the technical details of breast cancer biopsies and the prediction task is to predict whether or not the patient has a recurrence of cancer.

Download the dataset and place it in your current working directory with the filename “breast-cancer.csv“.

Below is a sample of the raw dataset.

'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'

We can see that all 9 input variables are categorical and described in string format. The problem is a binary classification prediction problem and the output class values are also described in string format.

We can reuse the same approach from the previous section and convert the string class values to integer values to model the prediction using the LabelEncoder. For example:

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

We can use this same approach on each input feature in X, but this is only a starting point.

# encode string input values as integers
features = []
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    features.append(feature)
# stack the encoded features so we get one row per sample and one column per feature
encoded_x = numpy.array(features)
encoded_x = encoded_x.transpose()

XGBoost may assume that encoded integer values for each input variable have an ordinal relationship. For example that ‘left-up’ encoded as 0 and ‘left-low’ encoded as 1 for the breast-quad variable have a meaningful relationship as integers. In this case, this assumption is untrue.

Instead, we must map these integer values onto new binary variables, one new variable for each categorical value.

For example, the breast-quad variable has the values:

left-up
left-low
right-up
right-low
central

We can model this as 5 binary variables as follows:

left-up, left-low, right-up, right-low, central
1,0,0,0,0
0,1,0,0,0
0,0,1,0,0
0,0,0,1,0
0,0,0,0,1

This is called one hot encoding. We can one hot encode all of the categorical input variables using the OneHotEncoder class in scikit-learn.

We can one hot encode each feature after we have label encoded it. First we must transform the feature array into a 2-dimensional NumPy array where each integer value is a feature vector of length 1.

feature = feature.reshape(X.shape[0], 1)

We can then create the OneHotEncoder and encode the feature array.

onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
feature = onehot_encoder.fit_transform(feature)

Finally, we can build up the input dataset by concatenating the one hot encoded features, one by one, adding them on as new columns (axis=1). We end up with an input vector comprised of 43 binary input variables.

# encode string input values as integers
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)

Ideally, we might experiment with not one hot encoding some of the input attributes, as we could instead encode them with an explicit ordinal relationship, for example the first column age with values like '40-49' and '50-59'. This is left as an exercise, if you are interested in extending this example.
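
As a hedged starting point for that exercise, the age column could be encoded with scikit-learn's OrdinalEncoder, stating the category order explicitly. The category strings below are an assumption; the exact values depend on how they appear after loading (they may retain their surrounding quote characters), so adjust the list to match your data.

# sketch: encode the age column with an explicit ordinal relationship
from sklearn.preprocessing import OrdinalEncoder
# assumed ordering of the age bins; adjust to the exact strings in your loaded data
age_order = [['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']]
ordinal_encoder = OrdinalEncoder(categories=age_order)
age_encoded = ordinal_encoder.fit_transform(X[:, 0].reshape(-1, 1))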

Below is the complete example with label and one hot encoded input variables and label encoded output variable.

# binary classification, breast cancer dataset, label and one hot encoded
import numpy
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# load data
data = read_csv('breast-cancer.csv', header=None)
dataset = data.values
# split data into X and y
X = dataset[:,0:9]
X = X.astype(str)
Y = dataset[:,9]
# encode string input values as integers
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(encoded_x, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Running this example we get the following output:

('X shape: : ', (285, 43))
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 71.58%

Again we can see that the XGBoost framework chose the ‘binary:logistic‘ objective automatically, the right objective for this binary classification problem.
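
If you prefer not to rely on the automatic choice, the objective can also be set explicitly when constructing the model. A small sketch:

# set the objective explicitly instead of relying on the automatic default
model = XGBClassifier(objective='binary:logistic')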

Support for Missing Data

XGBoost can automatically learn how to best handle missing data.

In fact, XGBoost was designed to work with sparse data, like the one hot encoded data from the previous section, and missing data is handled the same way that sparse or zero values are handled, by minimizing the loss function.

For more information on the technical details for how missing values are handled in XGBoost, see Section 3.4 “Sparsity-aware Split Finding” in the paper XGBoost: A Scalable Tree Boosting System.
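
Practically, the value that XGBoost should treat as missing can be stated explicitly via the missing parameter of XGBClassifier. A minimal sketch, assuming NaN is used as the marker in the input array:

# tell XGBoost which value in the input marks missing data
import numpy
from xgboost import XGBClassifier
model = XGBClassifier(missing=numpy.nan)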

The Horse Colic dataset is a good example to demonstrate this capability as it contains a large percentage of missing data, approximately 30%.

Download the dataset and place it in your current working directory with the filename “horse-colic.csv“.

Below is a sample of the raw dataset.

2 1 530101 38.50 66 28 3 3 ? 2 5 4 4 ? ? ? 3 5 45.00 8.40 ? ? 2 2 11300 00000 00000 2
1 1 534817 39.2 88 20 ? ? 4 1 3 4 2 ? ? ? 4 2 50 85 2 2 3 2 02208 00000 00000 2
2 1 530334 38.30 40 24 1 1 3 1 3 3 1 ? ? ? 1 1 33.00 6.70 ? ? 1 2 00000 00000 00000 1
1 9 5290409 39.10 164 84 4 1 6 2 2 4 4 1 2 5.00 3 ? 48.00 7.20 3 5.30 2 1 02208 00000 00000 1
2 1 530255 37.30 104 35 ? ? 6 2 ? ? ? ? ? ? ? ? 74.00 7.40 ? ? 2 2 04300 00000 00000 2

The values are separated by whitespace and we can easily load it using the Pandas function read_csv.

dataframe = read_csv("horse-colic.csv", delim_whitespace=True, header=None)

Once loaded, we can see that the missing data is marked with a question mark character (‘?’). We can change these missing values to the sparse value expected by XGBoost which is the value zero (0).

# set missing values to 0
X[X == '?'] = 0

Because the missing data was marked as strings, those columns with missing data were all loaded as string data types. We can now convert the entire set of input data to numerical values.

# convert to numeric
X = X.astype('float32')

Finally, this is a binary classification problem although the class values are marked with the integers 1 and 2. We model binary classification problems in XGBoost as logistic 0 and 1 values. We can easily convert the Y dataset to 0 and 1 integers using the LabelEncoder, as we did in the iris flowers example.

# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

The full code listing is provided below for completeness.

# binary classification, missing data
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data
dataframe = read_csv("horse-colic.csv", delim_whitespace=True, header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# set missing values to 0
X[X == '?'] = 0
# convert to numeric
X = X.astype('float32')
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Running this example produces the following output.

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 83.84%

We can tease out the effect of XGBoost's automatic handling of missing values by marking the missing values with a non-zero value, such as 1.
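
The change is a single line in the example above:

# mark missing values with a non-zero value instead of 0
X[X == '?'] = 1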

Re-running the example demonstrates a drop in accuracy for the model.

We can also impute the missing data with a specific value.

It is common to use a mean or a median for the column. We can easily impute the missing data using the scikit-learn SimpleImputer class.

# impute missing values as the mean
imputer = SimpleImputer()
imputed_x = imputer.fit_transform(X)

Below is the full example with missing data imputed with the mean value from each column.

# binary classification, missing data, impute with mean
import numpy
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
# load data
dataframe = read_csv("horse-colic.csv", delim_whitespace=True, header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# set missing values to NaN
X[X == '?'] = numpy.nan
# convert to numeric
X = X.astype('float32')
# impute missing values as the mean
imputer = SimpleImputer()
imputed_x = imputer.fit_transform(X)
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(imputed_x, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Running this example, we see results equivalent to fixing the value to one (1). This suggests that, at least in this case, we are better off marking the missing values with a distinct value of zero (0) rather than a valid value (1) or an imputed value.

It is a good lesson to try both approaches (automatic handling and imputing) on your data when you have missing values.
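
As a hedged sketch of that comparison, the code below reuses the loading and encoding steps from the examples above and swaps only the missing-value strategy; the evaluate() helper is purely illustrative.

# sketch: compare missing-value strategies on the same train/test split
# (assumes numpy, read_csv, XGBClassifier, train_test_split, accuracy_score,
#  LabelEncoder and SimpleImputer are imported as in the examples above)
def evaluate(x_values, y_values):
    x_train, x_test, y_train, y_test = train_test_split(x_values, y_values, test_size=0.33, random_state=7)
    model = XGBClassifier()
    model.fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))

dataframe = read_csv('horse-colic.csv', delim_whitespace=True, header=None)
dataset = dataframe.values
X, Y = dataset[:,0:27], dataset[:,27]
label_encoded_y = LabelEncoder().fit_transform(Y)

# strategy 1: mark missing values with zero and let XGBoost handle them
x_zero = X.copy()
x_zero[x_zero == '?'] = 0
print('zero marker: %.2f%%' % (evaluate(x_zero.astype('float32'), label_encoded_y) * 100.0))

# strategy 2: impute missing values with the column mean
x_nan = X.copy()
x_nan[x_nan == '?'] = numpy.nan
imputed_x = SimpleImputer().fit_transform(x_nan.astype('float32'))
print('mean imputed: %.2f%%' % (evaluate(imputed_x, label_encoded_y) * 100.0))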

Summary

In this post you discovered how you can prepare your machine learning data for gradient boosting with XGBoost in Python.

Specifically, you learned:

  • How to prepare string class values for binary classification using label encoding.
  • How to prepare categorical input variables using a one hot encoding to model them as binary variables.
  • How XGBoost automatically handles missing data and how you can mark and impute missing values.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.
