This page contains some tips and tricks for getting the best results out of Optimal Trees.
OptimalTrees.jl is set up to easily train trees in parallel across multiple processes or machines. For details see the IAIBase documentation on parallelization.
Whenever OptimalTrees is training trees it will automatically parallelize the training across all worker processes in the Julia session. Increasing the number of workers leads to a roughly linear speedup in training, so training with three workers (four processes in total) will give a roughly 3x speedup.
As mentioned in the parameter tuning guide, it is often important to select a value for the
criterion parameter. Optimal Classification Trees use
:misclassification as the default training criterion, which works well in most cases where the goal is to predict the correct class. However, this criterion may not give the best solution if the goal of the model is to predict probabilities as accurately as possible.
To illustrate this, consider an example where the label probability distribution is proportional to the feature:
```julia
using StableRNGs  # for consistent RNG output across all Julia versions
rng = StableRNG(1)
X = rand(rng, 1000, 1)
y = [rand(rng) < X[i, 1] for i in 1:size(X, 1)]
```
Now, we train with the default :misclassification criterion:
```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(random_seed=1),
    max_depth=1:5,
)
IAI.fit!(grid, X, y)
```
We observe that the tree only has one split at
x1 < 0.5121.
For comparison, we will train again with criterion=:gini (:entropy would also work):
```julia
grid2 = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid2, X, y)
```
```
Fitted OptimalTreeClassifier:
  1) Split: x1 < 0.5121
    2) Split: x1 < 0.2778
      3) Predict: false (91.44%), [235,22], 257 points, error 0.1566
      4) Predict: false (65.27%), [156,83], 239 points, error 0.4534
    5) Split: x1 < 0.8217
      6) Predict: true (71.92%), [89,228], 317 points, error 0.4039
      7) Predict: true (95.19%), [9,178], 187 points, error 0.09162
```
We see that with :gini as the training criterion we find a tree with more splits. Note that the first split is the same, and that both leaves on the lower side of this first split predict false, while those on the upper side predict true. The new splits further refine the predicted probability, which is consistent with how the data was generated.
Comparing the trees, we can understand how the different values of
criterion affect the output. After the first split, the tree trained with
:misclassification does not split any further, as these splits would not change the predicted label for any point, and thus make no difference to the overall misclassification. The tree chooses not to include these splits as they increase the complexity for no improvement in training score. On the other hand, the tree trained with
:gini does improve its training score by splitting further, as the score is calculated using the probabilities rather than the predicted label.
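To make this concrete, here is a small standalone sketch (written in Python for illustration; the leaf counts are hypothetical and these helper functions are not part of the IAI interface) showing a split that leaves the misclassification score unchanged while improving the gini score:

```python
# Each leaf is a tuple of class counts (false, true); both child leaves
# below keep the same majority class as the parent, so the split cannot
# change any predicted label -- only the predicted probabilities.

def misclassification(counts):
    # Fraction of points not in the leaf's majority class
    return 1 - max(counts) / sum(counts)

def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted(score, leaves):
    # Average the per-leaf score, weighted by the number of points per leaf
    n = sum(sum(c) for c in leaves)
    return sum(sum(c) / n * score(c) for c in leaves)

parent = [(80, 20)]                # one leaf: 80 false, 20 true
children = [(50, 2), (30, 18)]     # a split into two leaves, both majority false

print(weighted(misclassification, parent), weighted(misclassification, children))
print(weighted(gini, parent), weighted(gini, children))
```

The misclassification score is identical before and after the split, so a tree trained on it has no incentive to add the split, while the gini score strictly improves.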
We can compare the AUC of each method:
```julia
IAI.score(grid, X, y, criterion=:auc), IAI.score(grid2, X, y, criterion=:auc)
```
As we would expect, the tree trained with
:gini has significantly higher AUC, as a result of having more refined probability estimates. This demonstrates the importance of choosing a value for
criterion that is aligned with how you intend to evaluate and use the model.
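The effect can be reproduced in a standalone sketch (in Python, purely illustrative and not the IAI API): we mimic the data above, where the label probability is proportional to the feature, and compare a coarse two-level probability model (like the one-split tree) against a finer four-level one (like the deeper tree):

```python
import random

random.seed(1)
# Label is true with probability equal to the feature value
data = [(i / 1000, random.random() < i / 1000) for i in range(1000)]

def auc(scored):
    # Probability that a random positive outranks a random negative,
    # counting ties as half a win
    pos = [s for s, y in scored if y]
    neg = [s for s, y in scored if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One probability per side of a single split at 0.5
coarse = [(0.3 if x < 0.5 else 0.7, y) for x, y in data]
# One probability per quartile of the feature
fine = [((int(x * 4) + 0.5) / 4, y) for x, y in data]

print(auc(coarse), auc(fine))
```

The finer probability estimates rank the points better, so they achieve a higher AUC even though both models would often predict the same labels.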
Imbalances in class labels can cause difficulties during model fitting. We will use the Climate Model Simulation Crashes dataset as an example:
```julia
using CSV, DataFrames
df = DataFrame(CSV.File("pop_failures.dat", delim=" ", ignorerepeated=true))
X = df[:, 3:20]
y = df[:, 21]
```
Taking a look at the target variable, we see the data is very unbalanced (91% of values are 1):
```julia
using Statistics
mean(y)
```
Let's see what happens if we try to fit a model to this data:
```julia
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=123)
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
We see that the training process could not find any model more predictive than simply guessing randomly, due to the class imbalance.
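To see why such a model looks deceptively good on accuracy-style measures, consider a standalone sketch (in Python, with hypothetical counts mirroring the roughly 91% imbalance above): a model that always predicts the majority class scores high accuracy, but ranks points no better than random guessing:

```python
labels = [1] * 91 + [0] * 9
scores = [0.91] * 100          # constant predicted probability for every point

# Accuracy of thresholding the constant score at 0.5: just the majority rate
accuracy = sum((s >= 0.5) == (y == 1) for s, y in zip(scores, labels)) / 100

# AUC: probability a random positive outranks a random negative (ties = 0.5)
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
          for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, auc)           # high accuracy, but AUC is only 0.5
```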
The IAIBase documentation outlines multiple strategies that we can use to try to improve performance on unbalanced data. First, we can try using the
:autobalance option for
sample_weight to automatically adjust and balance the label distribution:
```julia
IAI.fit!(grid, train_X, train_y, sample_weight=:autobalance)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
We can see this has improved the out-of-sample performance significantly.
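The idea behind this kind of reweighting can be sketched as follows (a hypothetical inverse-frequency weighting scheme, illustrated in Python; not necessarily IAI's exact implementation): weight each class inversely to its frequency so that every class carries equal total weight in the training score:

```python
from collections import Counter

def balanced_weights(y):
    counts = Counter(y)
    n, k = len(y), len(counts)
    # Each class c gets weight n / (k * count_c)
    return {cls: n / (k * c) for cls, c in counts.items()}

y = [1] * 91 + [0] * 9
w = balanced_weights(y)

# Total weight contributed by each class is now equal (n / k = 50 each)
print(w[1] * 91, w[0] * 9)
```

With these weights, errors on the rare class cost as much in aggregate as errors on the common class, so the model can no longer do well by ignoring the minority class.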
Another approach we can use to improve the performance in an unbalanced scenario is to use an alternative scoring criterion when training the model. Typically we see better performance on unbalanced data when using either gini impurity or entropy as the scoring criterion:
```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
This approach has also increased the out-of-sample performance significantly.
When running Optimal Regression Trees with linear regression predictions in the leaves (
:regression_sparsity set to
:all), the linear regression models are fit using regularization to limit the degree of overfitting. By default, the function that is minimized during training is

```math
\operatorname{error}(T, X, y) + \lambda \sum_{t} \lVert \beta_t \rVert_1
```
where $T$ is the tree, $X$ and $y$ are the training features and labels, respectively, $t$ indexes the leaves in the tree, $\beta_t$ is the vector of regression coefficients in leaf $t$, and $\lambda$ is the regularization strength (controlled by the regression_lambda parameter). In this way, the regularization applied in each leaf is a lasso penalty, and these penalties are summed over the leaves to give the overall penalty. We are therefore penalizing the total complexity of the regression equations in the tree.
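As a concrete illustration of this objective, the following standalone sketch (a hypothetical helper written in Python; not IAI's internals) evaluates the penalized training score for a fixed toy tree with two leaves:

```python
def objective(leaves, lam):
    # leaves: one (residuals, beta) pair per leaf t of the tree
    # Squared training error, summed over all points in all leaves
    error = sum(r ** 2 for residuals, _ in leaves for r in residuals)
    # Lasso penalty: sum of absolute coefficients, summed over leaves
    penalty = sum(abs(b) for _, beta in leaves for b in beta)
    return error + lam * penalty

leaves = [
    ([0.1, -0.2], [10.0]),   # leaf 1: residuals on its points, and beta_1
    ([0.0, 0.3], [11.0]),    # leaf 2: residuals and beta_2
]

# Larger lam penalizes coefficients harder; lam = 0 recovers the plain error
print(objective(leaves, 0.001))
```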
This regularization scheme is generally sufficient for fitting the regression equations in each leaf, as it only adds those regression coefficients that significantly improve the training error. However, there are classes of problems where this regularization limits the quality of the trees that can be found.
To illustrate this, consider the following univariate piecewise linear function:

```math
f(x) = \begin{cases} 10x, & x \le 0 \\ 11x, & x > 0 \end{cases}
```
Note that this is exactly a regression tree with a single split and univariate regression predictions in each leaf.
We can generate data according to this function:
```julia
using DataFrames
x = -2:0.025:2
X = DataFrame(x=x)
y = map(v -> v > 0 ? 11v : 10v, x)
```
We will apply Optimal Regression Trees to learn this function, with the hope that the splits in the tree will allow us to model the breakpoints.
```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        max_depth=1,
        minbucket=10,
        regression_sparsity=:all,
        regression_lambda=0.001,
    ),
)
IAI.fit!(grid, X, y)
```