Learnings from Distributed XGBoost on Amazon SageMaker
What I learned from distributed training with XGBoost on Amazon SageMaker.
Overview
XGBoost is a popular library for gradient-boosted decision trees. Its implementation allows practitioners to distribute training across multiple compute instances (or workers), which is especially useful for large training sets.
One tool used at Zalando for deploying production machine learning models is Amazon's managed service, SageMaker. XGBoost is included in SageMaker as a built-in algorithm, meaning that a prebuilt Docker container is available. This container also supports distributed training, making it easy to scale training jobs across many instances.
Despite SageMaker handling the infrastructure side of things, I found that distributed training with XGBoost and SageMaker is not as easy as simply increasing the number of instances. I ran into a few small "gotchas!" while attempting some simple trainings. This post steps through my failed attempts and ends with a genuine distributed training with XGBoost in Amazon SageMaker.
Experiment Setup
I wanted to get an intuitive idea of how the XGBoost training time scales as the number of instances increases, as well as how it scales as the data size increases. I am not especially interested in producing the "best" model to solve a problem, per se, but there is a natural trade-off between training time and model accuracy that should be considered.
For a data set, I used the Fashion MNIST by Zalando Research. The problem itself is to classify small images (28x28 pixels) of clothing as being from 1 of 10 different classes (t-shirts, trousers, pullovers, etc). The data set has 60,000 images for a training set and 10,000 images for a validation set.
To increase the training size, I duplicate the training data, which lets me measure how the model training time scales as the computational resources change. The number of times the training data is duplicated is referred to as the "replication factor". For a typical ML project, you probably don't want to duplicate the training set outright. Although doing so improves the training and validation accuracies here, it is likely not as efficient as changing hyperparameters (you might instead create new images with added noise to improve regularization). For reference, the size on disk of the training data for different replication factors is provided below, followed by a sketch of how the replication can be done.
- Replication factor 1: 0.63 GB, 60,000 images
- Replication factor 2: 1.24 GB, 120,000 images
- Replication factor 4: 2.48 GB, 240,000 images
- Replication factor 8: 4.95 GB, 480,000 images
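As a rough sketch, the replicated data sets could be produced with something like the following (the file names and the use of pandas here are my own assumptions, not part of the original setup):
import pandas as pd

# Hypothetical sketch: build a replicated copy of the Fashion MNIST training CSV.
# 'fashion_mnist_train.csv' is an assumed file name with the label in the first
# column and 784 pixel values per row.
replication_factor = 4

train_df = pd.read_csv('fashion_mnist_train.csv', header=None)

# Concatenate the training set with itself 'replication_factor' times.
replicated_df = pd.concat([train_df] * replication_factor, ignore_index=True)

# The built-in XGBoost container expects CSV input without a header row.
replicated_df.to_csv('fashion_mnist_train_x4.csv', header=False, index=False)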
I wanted to use hyperparameters that would give reasonably good accuracy, so I ran a hyperparameter tuning job in SageMaker, with one instance per training. I tuned all of the tunable hyperparameters except num_round, which was fixed to 100. This hyperparameter sets the number of decision trees used; increasing it increases both training time and accuracy. My hyperparameters were as follows:
hps = {'alpha': 0.0,
'colsample_bylevel': 0.4083530569296091,
'colsample_bytree': 0.8040025839325579,
'eta': 0.11764087266272522,
'gamma': 0.43319156621549954,
'lambda': 37.547406128070286,
'max_delta_step': 10,
'max_depth': 6,
'min_child_weight': 5.076838893848415,
'num_round': 100, # Not tuned: kept fixed
'subsample': 0.8915771964367318,
'num_class': 10, # Not tuned: defined by Fashion MNIST
'objective': 'multi:softmax' # Not tuned: defined by Fashion MNIST
}
There are additional hyperparameters beyond those listed above which are not tunable. I kept those at their default values (which, as you will see, can cause some unexpected results). The full list of hyperparameters offered by XGBoost differs from the list offered by the SageMaker container, as SageMaker adds a few additional hyperparameters that do not control model performance. The objective "multi:softmax" produces a metric called merror, which is defined as #(wrong cases)/#(all cases).
Lastly, the tools. I wrote all of the code for my experiments in Python 3.7 using the Amazon SageMaker Python SDK. I used the SageMaker docker container version 0.90-1 for XGBoost, the URI of which can be found by using the SageMaker Python SDK:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost', repo_version='0.90-1')
For each of the SageMaker training jobs, I used the ml.m5.xlarge instance type.
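Putting these pieces together, a training job could be configured roughly as follows with the SageMaker Python SDK (version 1.x); the execution role and the S3 output path below are placeholders I've assumed for illustration:
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
container = get_image_uri(session.boto_region_name, 'xgboost', repo_version='0.90-1')

# 'role' and the output path are placeholders; 'hps' is the dictionary defined above.
xgb = sagemaker.estimator.Estimator(
    image_name=container,
    role='<your-sagemaker-execution-role>',
    train_instance_count=2,              # number of workers to distribute over
    train_instance_type='ml.m5.xlarge',
    output_path='s3://<your-bucket>/xgboost-output',
    sagemaker_session=session)

xgb.set_hyperparameters(**hps)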
Failed Attempt: Naive Distributed Computing
My first attempt was to check how the training time scaled as the number of instances increases. I expected to see a roughly linear improvement: if the number of instances doubles, then the training time should be cut in half.
I used a naive approach: other than the settings mentioned in the "Experiment Setup" section, I used default values, including a replication factor of 1. What I found was very different from my expectations:
There are two things to note here. First, going from 1 to 2 instances increases the training time, even though I expected it to be cut in half. Second, beyond 2 instances the training time stays relatively flat.
Going from 1 to 2 instances triggers an internal switch from non-distributed to distributed training in XGBoost. There is a hyperparameter called tree_method which sets the algorithm used for computing splits at a node in a decision tree. The default for tree_method is "auto". For one instance, the greedy algorithm called "exact" is used. For more than one instance, an algorithm called "approx" is used, which approximates the greedy algorithm. The logic behind "auto" (as well as the other algorithm choices) is explained in the XGBoost documentation, and the implementations of "exact" and "approx" are described in the XGBoost paper by Chen and Guestrin.
The second thing to note is that after two instances, the training time remains flat as more instances are added. This is because each instance is training on the same data. In SageMaker, unless otherwise specified, the entire training data set is sent to every instance; this setting is called "FullyReplicated". However, XGBoost expects each instance to receive a subset of the full data set. Another way to think of this is that the training data is replicated once per instance, and each instance trains on its own full copy.
The data distribution can be corrected by sharding the training data (dividing it into different files, one for each instance) and defining an s3_input object such as
import sagemaker
s3_input_train = sagemaker.s3_input(s3_data=s3_location,
content_type='csv',
distribution='ShardedByS3Key')
and then starting a training job by passing the s3_input objects for training (and a similar one for validation):
xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})
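For the sharding itself, I assume a simple approach along these lines: shuffle the training CSV, split it into one file per instance, and upload all pieces under the same S3 prefix so that "ShardedByS3Key" can assign one object to each instance. The file names, bucket, and prefix are illustrative:
import numpy as np
import pandas as pd
import sagemaker

n_instances = 4
session = sagemaker.Session()

train_df = pd.read_csv('fashion_mnist_train.csv', header=None)

# Shuffle, then split into one roughly equal piece per instance.
shuffled = train_df.sample(frac=1.0, random_state=42)
for i, shard in enumerate(np.array_split(shuffled, n_instances)):
    shard.to_csv('train_shard_{}.csv'.format(i), header=False, index=False)
    # All shards live under one prefix; ShardedByS3Key then hands one object to each instance.
    session.upload_data('train_shard_{}.csv'.format(i),
                        bucket='<your-bucket>',
                        key_prefix='fashion-mnist/train-sharded')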
Take-Aways
- XGBoost takes different default actions for the hyperparameter tree_method when moving from non-distributed to distributed training. We should be mindful of this when estimating training times and when tuning hyperparameters.
- XGBoost expects the data to be split across the instances, but SageMaker by default sends the entire data set to each instance. We need to set the data distribution to "ShardedByS3Key" in SageMaker to match the expectations of XGBoost.
Failed Attempt: Using the Greedy Algorithm
To correct my previous failed attempt at distributed training, I made two changes to my experiment (both are sketched in code after the list):
- I set the value of the hyperparameter tree_method to "exact", so that each training job uses the same value of tree_method.
- I set the data distribution for SageMaker to "ShardedByS3Key", and divided my training set randomly so that each instance gets a different piece of the training set.
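In code, the two changes amount to something like the following, building on the estimator and sharded inputs sketched earlier (s3_sharded_location and s3_input_validation are assumed to point at the sharded training prefix and the validation data):
# Change 1: pin tree_method so every training job uses the same algorithm.
hps['tree_method'] = 'exact'
xgb.set_hyperparameters(**hps)

# Change 2: let SageMaker send a different shard of the training data to each instance.
s3_input_train = sagemaker.s3_input(s3_data=s3_sharded_location,
                                    content_type='csv',
                                    distribution='ShardedByS3Key')

xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})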
In addition to checking the expected scaling with the number of instances (i.e. doubling the number of instances cuts the training time in half), I also increased the replication factor to get a sense of how the training time scales with the size of the training set. I expected something similar: if the size of the training data doubles, then the training time should double.
The first plot shows the training times for each of the 4 replication factors. The second plot is the same as the first, but on a log scale. The dotted lines indicate my expected training time (i.e. doubling the number of instances should halve the training time).
This actually looks pretty good! The training times match well with what one might expect. The trainings with higher replication factors take longer as there is more data to process; in fact, it's about a factor of 2 increase in the training time when the training data size is doubled. It's also worth pointing out that more training data results in better scalability: with lower replication factors, the training time plateaus (actually, it even increases a little) as the number of instances grows. This suggests that the overhead costs are eating the benefits of distributing the workload.
At first glance, everything seems to be ok: more training data implies longer training times, more compute resources implies shorter run times. But a check of the training and validation errors shows that something is not right:
As the number of instances increases, the training and validation errors increase. This is an artifact of the hyperparameter tree_method. For distributed training, XGBoost does NOT implement the "exact" algorithm. However, SageMaker has no problem letting us select this value for distributed trainings. In this situation, the training data is divided among the instances, and then each instance trains its own XGBoost model, ignoring all the other instances. Once each instance is finished, the model from the first instance is saved and the others are discarded.
The timing and error graphs reflect this behavior: as the number of instances increases, the training data on any given instance is smaller, resulting in faster trainings but worse error. A cheaper way to replicate this experiment is to throw away a percentage of the training data and then train with only one instance.
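As a rough sketch of that cheaper check (the file names are again assumptions), one could subsample the training CSV to the fraction a single worker would have received and train on one instance as before:
import pandas as pd

n_instances = 4  # pretend the work had been distributed over this many workers

train_df = pd.read_csv('fashion_mnist_train.csv', header=None)

# Keep only the fraction of the data that a single worker would have seen.
subset_df = train_df.sample(frac=1.0 / n_instances, random_state=42)
subset_df.to_csv('fashion_mnist_train_subset.csv', header=False, index=False)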
Take-Aways
- Don't use "exact" as the value of tree_method with distributed XGBoost, because it's not actually implemented on the XGBoost side. Use "approx" or "hist" instead.
Successful Attempt: Distributed XGBoost with SageMaker
After the learning from the previous attempt, I repeated the experiment, this time using "approx" for tree_method. This introduces a new hyperparameter, called sketch_eps, for which I used the default value.
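Concretely, the only change compared to the previous attempt is the value of tree_method (sketch_eps keeps its default, so it does not need to be set explicitly); a minimal sketch, reusing the estimator and inputs from before:
# Use the approximate split-finding algorithm, which XGBoost does support in
# distributed training; sketch_eps is left at its default value.
hps['tree_method'] = 'approx'
xgb.set_hyperparameters(**hps)

xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})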
The scaling looks good here and is similar to the previous attempt, albeit with longer training times. A check of the training and validation errors is more satisfying:
From the training and validation errors, we do see some noise. Note that there is randomness when using XGBoost: the piece of the training set given to each instance was selected randomly for each training, and node splitting in a decision tree is also randomized (see hyperparameters like subsample or colsample_bylevel).
Take-Aways
- Using many instances with a "low" amount of training data is a waste of computational resources. For example, using a replication factor of 1, the training time of using 10 instances is not much better than using 3 or 4.
- When the training data is sufficiently large, doubling the number of instances approximately halves the training time.
- The scaling in training data size is about what we expect: doubling the training data approximately doubles the training time.
Conclusion
Amazon SageMaker makes it easy to scale XGBoost algorithms across many instances. But with so many "knobs" to play with, it's easy to create an inefficient machine learning project.
We're hiring! Do you like working in an ever-evolving organization such as Zalando? Consider joining our teams as a Machine Learning Engineer!