Tips and Tricks

This page contains some tips and tricks for getting the best results out of Optimal Trees.

Parallelization

OptimalTrees.jl is set up to easily train trees in parallel across multiple processes or machines. For details see the IAIBase documentation on parallelization.

Whenever OptimalTrees is training trees it will automatically parallelize the training across all worker processes in the Julia session. Increasing the number of workers leads to a roughly linear speedup in training, so training with three workers (four processes) will give a roughly 4x speedup.

Choosing `criterion` for Classification Trees

As mentioned in the parameter tuning guide, it is often important to select a value for the criterion parameter. Optimal Classification Trees use :misclassification as the default training criterion, which works well in most cases where the goal is to predict the correct class. However, this criterion may not give the best solution if goal of the model is to predict probabilities as accurately as possible.

To illustrate this, consider an example where the label probability distribution is proportional to the feature x1:

using StableRNGs  # for consistent RNG output across all Julia versions
rng = StableRNG(1)
X = rand(rng, 1000, 1)
y = [rand(rng) < X[i, 1] for i in 1:size(X, 1)]

Now, we train with the default :misclassification criterion:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(random_seed=1),
    max_depth=1:5,
)
IAI.fit!(grid, X, y)

Optimal Trees Visualization

We observe that the tree only has one split at x1 < 0.5121.

For comparison, we will train again with :gini (:entropy would also work):

grid2 = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid2, X, y)

Optimal Trees Visualization

We see that with :gini as the training criterion we find a tree with more splits. Note that the first split is the same, and that both leaves on the lower side of this first split predict true while those on the upper side predict false. The new splits further refine the predicted probability. This is consistent with how the data was generated.

Comparing the trees, we can understand how the different values of criterion affect the output. After the first split, the tree trained with :misclassification does not split any further, as these splits would not change the predicted label for any point, and thus make no difference to the overall misclassification. The tree chooses not to include these splits as they increase the complexity for no improvement in training score. On the other hand, the tree trained with :gini does improve its training score by splitting further, as the score is calculated using the probabilities rather than the predicted label.

We can compare the AUC of each method:

IAI.score(grid, X, y, criterion=:auc), IAI.score(grid2, X, y, criterion=:auc)

(0.7970557749950976, 0.8568106963770465)

As we would expect, the tree trained with :gini has significantly higher AUC, as a result of having more refined probability estimates. This demonstrates the importance of choosing a value for criterion that is aligned with how you intend to evaluate and use the model.

Unbalanced Data

Imbalances in class labels can cause difficulties during model fitting. We will use the Climate Model Simulation Crashes dataset as an example:

using CSV, DataFrames
df = CSV.read("pop_failures.dat", DataFrame, delim=" ", ignorerepeated=true)
X = df[:, 3:20]
y = df[:, 21]

Taking a look at the target variable, we see the data is very unbalanced (91% of values are 1):

using Statistics
mean(y)

0.9148148148148149

Let's see what happens if we try to fit to this data

(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=1)
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)

0.6894305019305019

We see that the performance of the model is not particularly strong, and it is possible that this is due to the class imbalance.

The IAIBase documentation outlines multiple strategies that we can use to try to improve performance on unbalanced data. First, we can try using an alternative scoring criterion when training the model. Typically we see better performance on unbalanced data when using either gini impurity or entropy as the scoring criterion:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)

0.8436293436293436

We can see this has improved the out-of-sample performance significantly.

Another approach we can use to improve the performance in an unbalanced scenario is to use the :autobalance option for sample_weight to automatically adjust and balance the label distribution:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y, sample_weight=:autobalance)
IAI.score(grid, test_X, test_y, criterion=:auc)

0.8315637065637066

This approach has also increased the out-of-sample performance significantly.

One thing to note when using :autobalance (or any sample weight adjustment) is that it can affect interpretation of the predicted probabilities returned by predict_proba or get_classification_proba. As a concrete example, let us look at the tree that resulted from training with autobalanced sample weights:

lnr = IAI.get_learner(grid)

Optimal Trees Visualization

We see that predicted probabilities of having label 1 in the leaves of the tree are 99.08%, 98.82% and 61.33%. However, when using predict_proba or get_classification_proba, we see different numbers:

IAI.get_classification_proba(lnr, 2)

Dict{Int64, Float64} with 2 entries:
  0 => 0.0910047
  1 => 0.908995

IAI.get_classification_proba(lnr, 4)

Dict{Int64, Float64} with 2 entries:
  0 => 0.114041
  1 => 0.885959

IAI.get_classification_proba(lnr, 5)

Dict{Int64, Float64} with 2 entries:
  0 => 0.872067
  1 => 0.127933

The reason for this difference is that the probabilities shown in the tree visualization are calculated based solely on the number of points of each label in the leaves, whereas the probabilities used for predictions are calculated in the weighted space after the labels have been rebalanced to have roughly equal importance. This discrepancy will be present any time that sample weights are used during the tree training process.

If desired, refit_leaves! can be used to replace the weighted probability predictions inside the tree with the unweighted probabilities seen in the visualization:

IAI.refit_leaves!(lnr, X, y)

Optimal Trees Visualization

Now, the predicted probabilities are the same as those shown in the tree:

IAI.get_classification_proba(lnr, 2)

Dict{Int64, Float64} with 2 entries:
  0 => 0.0130719
  1 => 0.986928

Different Regularization Schemes for Regression Trees

When running Optimal Regression Trees with linear regression predictions in the leaves (via regression_features), the linear regression models are fit using regularization to limit the degree of overfitting. By default, the function that is minimized during training is

\[\min \left\{ \text{error}(T, X, y) + \texttt{cp} * \text{complexity}(T) + \sum_{t} \| \boldsymbol\beta_t \|_1 \right\}\]

where $T$ is the tree, $X$ and $y$ are the training features and labels, respectively, $t$ are the leaves in the tree, and $\beta_t$ is the vector of regression coefficients in leaf $t$. In this way, the regularization applied in each leaf is a lasso penalty, and these are summed over the leaves to get the overall penalty. We are therefore penalizing the total complexity of the regression equations in the tree.

This regularization scheme is generally sufficient for fitting the regression equations in each leaf, as it only adds those regression coefficients that significantly improve the training error. However, there are classes of problems where this regularization limits the quality of the trees that can be found.

To illustrate this, consider the following univariate piecewise linear function:

\[y = \begin{cases} 10x & x < 0 \\ 11x & x \geq 0 \end{cases}\]

Note that this is exactly a regression tree with a single split and univariate regression predictions in each leaf.

We can generate data according to this function:

using DataFrames
x = -2:0.025:2
X = DataFrame(x=x)
y = map(v -> v > 0 ? 11v : 10v, x)

We will apply Optimal Regression Trees to learn this function, with the hope that the splits in the tree will allow us to model the breakpoints.

grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        max_depth=1,
        minbucket=10,
        regression_features=All(),
        regression_lambda=0.01,
    ),
)
IAI.fit!(grid, X, y)

Optimal Trees Visualization

We see that the trained tree has no splits, preferring to just fit a single linear regression model across the entire domain, with the coefficient being roughly the average of 10 and 11. However, we know that the ideal model is really a tree with a single split at $x = 0$, with each leaf containing the appropriate coefficient (10 and 11, respectively)

The root cause of the tree having no splits is the regularization scheme we have applied: the regularization penalty applied to the tree with no splits is roughly 10.5, whereas the penalty applied to the ideal tree would be 21. This means that if all else is equal, the tree with no splits would be preferred by the training process and selected before the ideal tree. We can therefore see that splitting in the tree to refine the estimates of coefficients (e.g. refining 10.5 to 10 and 11) is actually penalized under the regularization scheme used by default.

We can resolve this by using an alternative regularization scheme that penalizes the average complexity of the regression equations in the tree instead of the total complexity:

\[\min \left\{ \text{error}(T, X, y) + \texttt{cp} * \text{complexity}(T) + \sum_{t} \frac{n_t}{n} \| \boldsymbol\beta_t \|_1 \right\}\]

where $n$ is the number of training points, and $n_t$ is the number of training points in leaf $t$. This new regularization scheme penalizes the objective by the average lasso penalty in each leaf, weighted by the number of points contained in each leaf to give more weight to those leaves with more points. We can see that under this alternative regularization scheme, both the tree with no splits and the ideal tree would have very similar regression penalties, and so the ideal tree could now be selected as it has a better training error.

To see this in action, we set :regression_weighted_betas to true to enable this alternative regularization scheme:

grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        max_depth=1,
        minbucket=10,
        regression_features=All(),
        regression_lambda=0.001,
        regression_weighted_betas=true,
    ),
)
IAI.fit!(grid, X, y)

Optimal Trees Visualization

We can see that indeed this regularization scheme enabled us to find the ideal tree for this problem.

In general, each of the regularization schemes can be better suited to different problems, so it can be valuable to try out both approaches to see which works best for a given problem.

Categorical Features with Many Levels

Sometimes the input data has categorical features with many levels (10+). We illustrate with an example where using the feature directly may be harmful, as it may result in overfitting and reduced interpretability.

In this example, we generate the predictor X which is a single categoric feature with 40 different levels, and the outcome y is a function of whether X is in some of the levels plus some noise.

using CategoricalArrays
using DataFrames
using StableRNGs

function make_data(n)
  rng = StableRNG(1)
  X = DataFrame(x1=rand(rng, 1:40, n))
  y = (X.x1 .>= 20) + 0.5 * randn(rng, n)
  X.x1 = categorical(X.x1)
  X, y
end
X, y = make_data(200)

If we train an Optimal Regression Tree directly with this categoric feature:

grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=1),
    max_depth=1,
)
IAI.fit!(grid, X, y)

Optimal Trees Visualization

The tree we get does not recover the splits exactly. The split includes 40 and excludes 9 incorrectly. This is likely a result of the many options for splitting, where the model has too much freedom to overfit to the noisy data.

We evaluate the tree both in-sample and out-of-sample (using a newly-generated and larger dataset):

ins = IAI.score(grid, X, y)
test_X, test_y = make_data(10000)
oos = IAI.score(grid, test_X, test_y)
ins, oos

(0.485867893882, 0.401890859498)

We see that the out-of-sample performance is 0.08 lower than in-sample, a strong indication of overfitting.

One remedy to this problem is reducing the number of levels in the categoric feature. For instance, it may be possible to combine the levels into similar groups. In this case, we consider grouping the levels based on the first digit in the level, before refitting the tree:

X.x1 = CategoricalVector(floor.(unwrap.(X.x1) ./ 10))
test_X.x1 = CategoricalVector(floor.(unwrap.(test_X.x1) ./ 10))

IAI.fit!(grid, X, y)

Fitted OptimalTreeRegressor:
  1) Split: x1 in [0.0,1.0,4.0] or is missing
    2) Predict: 0.02152, 89 points, error 0.2498
    3) Predict: 1.029, 111 points, error 0.2807

Optimal Trees Visualization

We recovered the correct split, as the tree is no longer allowed to mix-and-match levels but rather has to send all levels in a group on the same side of the split. This means it cannot overfit and choose to send 9 in a different direction to levels 1–8.

Again, we can inspect the in-sample and out-of-sample performances:

ins = IAI.score(grid, X, y)
oos = IAI.score(grid, test_X, test_y)
ins, oos

(0.484297709017, 0.455619234199)

We see that the out-of-sample performance is a lot higher than using all the levels directly, bridging the overfitting gap.

Suggestions for handling categorical features with many levels

It is important to inspect the data and check if there are categorical features that have many levels, as this is often problematic. It could be an ID feature, which should be removed. There could be many levels due to typos or variations of the same levels, in which case these should be corrected.
If there are categoric features with many levels, they are often too granular for the model to learn anything. A model trained with these features can overfit, and may not be very useful for prediction. Additionally, we need to think about the potential for unseen levels when making predictions.
As a remedy, we can combine some levels into higher-level groups. For example, we can map ZIP codes to Counties or States, or map States to broader regions. Another simple method where there is not an obvious grouping is that we can group the rare levels together as "Other".
We can also try running a model without this feature. Sometimes categoric features with many levels can dominate the tree-fitting process and drive other features out, leading to overfitting and less meaningful trees. It is often the case that the performance is the same or higher with these features removed.

Hierarchical Features

One special type of feature that can be encountered is a hierachical feature. These features are like categoric features, but there is some hierarchical ordering embedded into the categories. For instance, medical diagnoses are often captured as ICD-10 codes, which have a hierarchical structure whereby the code conveys more and more specific information as we read it left-to-right. As an example, consider the code S86.011, which has five levels of embedded information:

S indicates "Injuries, poisoning and certain other consequences of external causes related to single body regions."
S86 indicates "Injury of muscle, fascia and tendon at lower leg."
S86.0 indicates "Injury of Achilles tendon."
S86.01 indicates "Strain of Achilles tendon."
S86.011 indicates "Strain of the right Achilles tendon."

If we were to simply use these codes as a normal categorical feature, we might run into a couple of problems. First of all, there are approximate 70,000 possible codes, which as described above may lead to overfitting and poor generalization. Additionally, the model would not have any sense of the proximity of different codes, and rather would be able to mix-and-match levels to optimize in-sample performance. For instance, W34.00 (accidental discharge from a firearm) should likely be seen as close to W32.0 (accidental discharge from a handgun), whereas S86.011 should not be close to either of these.

To resolve this, we might treat each level of the hierarchy as a separate feature, so for S86.011 above, we might create five features: S, 86, 0, 1, 1. This means that the model is now able to split on each level of injury separately. For instance, it might choose to split on the broad type of injury at the first level, followed by refining splits on the second and subsequent levels.

However, there are a number of flaws with this plan. Firstly, consider that for the same top-level value, the second-level values do not necessarily have the same meaning (e.g. the 86 of S86 shares no meaning with the 86 of W86), and for this reason it makes no sense to split on them at the same time. Secondly, there is no reason that the model would be required to split on the top-level value before the second-level value, and so on. This means that it may use the splits in unintuitive ways that do not reflect the hierarchical nature.

To remedy this, we have found the following transformation of the features useful:

create a feature for the top-level value
for each value of the top-level value, create a feature for the second-level value that has the value of the second-level value if the top-level value matches, and NA otherwise
for each combination of the top- and second-level values, create a feature for the third level value that has the value of the third-level value if the first two levels match, and NA otherwise.
etc.

For instance, the code S86.011 would have the following values:

S for the top-level feature (level1)
86 for the second-level feature corresponding to S (S_level2)
NA for the second-level features corresponding to all other top-level values (e.g. W_level2 etc.)
0 for the third-level feature corresponding to S86 (S_86_level3)
NA for the third-level features correspoinding to all other pairs of the top two levels
etc.

This focuses the information in the higher-level values, and incentivizes the model to use these values rather than the very specific information far down the hierarchy. If there is indeed value in splitting further, the model is free to use the engineered features to refine the predictions according to the hierarchy in subsequent splits.

Note that this does not force the model to split in the hierarchical ordering, it is merely a way of recoding the features in a way that incentivizes the model to do so.

Demonstration of Hierchical Splitting

To illustrate this point, we will generate data with a categorical code with three levels of underlying hierarchy. We also generate a target outcome that depends on the first two levels (specificially, it depends on whether the first level is 1 or 2, and if so, whether the first two levels are equal):

function make_data(n; seed)
  rng = StableRNG(seed)
  X = DataFrame(level1=rand(rng, 1:8, n),
                level2=rand(rng, 1:6, n),
                level3=rand(rng, 1:4, n))
  X.code = string.(X.level1, X.level2, X.level3)
  y = (X.level1 .<= 2) .* (1 .+ (X.level1 .== X.level2)) .+ randn(rng, n)
  transform!(X, names(X) .=> categorical, renamecols=false)
  X, y
end
train_X, train_y = make_data(2000, seed=1)
test_X, test_y = make_data(2000, seed=2)

First, we try running a model with just the code as a categorical feature:

grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=1234, missingdatamode=:separate_class),
    max_depth=3,
)
IAI.fit!(grid, select(train_X, :code), train_y)

Optimal Trees Visualization

We see that the model uses this feature and makes some non-intuitive groupings. It correctly identified that the first level being 1 or 2 is important, but also includes in this group a few codes that have values of 3 or more in the first level. We also see that it only has a single split, and has failed to capture the further level of interaction between the first two levels when the first level is 1 or 2.

We can also measure the performance both in- and out-of-sample:

IAI.score(grid, train_X, train_y), IAI.score(grid, test_X, test_y)

(0.205216486581, 0.131049835581)

Unsurprisingly, we see that the model overfits to the training set, and does not generalize well to new data.

Next, we will try training the model is each of the three levels as a separate feature:

IAI.fit!(grid, select(train_X, Not(:code)), train_y)

Optimal Trees Visualization

We can see that the model is now picking up on the interaction between the first two levels. However, it does not manage to identify the true interaction of these levels, as it simply predicts a higher outcome when both levels are 1 or 2, whereas the true relationship also requires that they are equal. This demonstrates the limitation mentioned above, that the model treats all values at the second level as having the same meaning, when in reality their meaning is conditional on the value of the first level.

We notice that the splits happened to show up in hierarchical order in this case. It is important to note that there is no reason to expect this will be the case.

IAI.score(grid, train_X, train_y), IAI.score(grid, test_X, test_y)

(0.203495241727, 0.197044577253)

The out-of-sample performance is much better than before.

Finally, we will try with the hierarchical features engineering we presented above. For simplicity, in this demonstration we will only do this for the second-level feature:

function engineer_hierarchy!(X)
  for k in levels(X.level1)
    new_feature = ifelse.(X.level1 .== k, string.(X.level2), "NA")
    X[!, "$(k)_level2"] = categorical(new_feature)
  end
  X
end

engineer_hierarchy!(train_X)
engineer_hierarchy!(test_X)

2000×12 DataFrame
  Row │ level1  level2  level3  code  1_level2  2_level2  3_level2  4_level2   ⋯
      │ Cat…    Cat…    Cat…    Cat…  Cat…      Cat…      Cat…      Cat…       ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │ 6       6       3       663   NA        NA        NA        NA         ⋯
    2 │ 1       5       4       154   5         NA        NA        NA
    3 │ 2       5       4       254   NA        5         NA        NA
    4 │ 6       1       3       613   NA        NA        NA        NA
    5 │ 2       3       2       232   NA        3         NA        NA         ⋯
    6 │ 2       4       4       244   NA        4         NA        NA
    7 │ 4       3       2       432   NA        NA        NA        3
    8 │ 1       5       4       154   5         NA        NA        NA
  ⋮   │   ⋮       ⋮       ⋮      ⋮       ⋮         ⋮         ⋮         ⋮       ⋱
 1994 │ 1       2       4       124   2         NA        NA        NA         ⋯
 1995 │ 8       5       3       853   NA        NA        NA        NA
 1996 │ 4       1       1       411   NA        NA        NA        1
 1997 │ 3       6       1       361   NA        NA        6         NA
 1998 │ 4       6       2       462   NA        NA        NA        6          ⋯
 1999 │ 5       5       1       551   NA        NA        NA        NA
 2000 │ 8       4       3       843   NA        NA        NA        NA
                                                 4 columns and 1985 rows omitted

We have created 8 additional features corresponding to the second level, one for each value of the first level. We can see that indeed the new feature takes the value of the second level when the first level matches, otherwise we insert the value NA.

We can now fit a model with these new features in place of the previous level2 feature:

IAI.fit!(grid, select(train_X, Not([:code, :level2])), train_y)

Optimal Trees Visualization

We can see that the tree now identifies the correct interactions between the features, and we also capture the hierarchical nature of the relationship.

IAI.score(grid, train_X, train_y), IAI.score(grid, test_X, test_y)

(0.216841288078, 0.21140573409)

We also see improvements in the performance. Due to the simplicity of this setup, the improvement over the previous approach was not very large, but in real world applications with complicated hierarchical relationships, the difference can be substantial.