Tips and Tricks

This page contains some tips and tricks for getting the best results out of Optimal Trees.

Parallelization

OptimalTrees.jl is set up to easily train trees in parallel across multiple processes or machines. For details see the IAIBase documentation on parallelization.

Whenever OptimalTrees is training trees, it will automatically parallelize the training across all worker processes in the Julia session. Increasing the number of workers gives a roughly linear speedup in training, so training with three workers should be roughly three times as fast as training on a single process.
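
As a minimal sketch of the setup (the worker count here is illustrative; see the IAIBase parallelization documentation for the full details), worker processes are added before loading IAI:

using Distributed
addprocs(3)  # add three worker processes (illustrative count)
using IAI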

Unbalanced Data

Imbalances in class labels can cause difficulties during model fitting. We will use the Climate Model Simulation Crashes dataset as an example:

using CSV, DataFrames
df = CSV.read("pop_failures.dat", DataFrame, delim=" ", ignorerepeated=true)
X = df[:, 3:20]
y = df[:, 21]

Taking a look at the target variable, we see the data is very unbalanced (91% of values are 1):

using Statistics
mean(y)
0.9148148148148149
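
Because the labels are 0/1, the mean gives the proportion of 1s directly. To see the raw label counts instead, one option (assuming StatsBase is available) is:

using StatsBase
countmap(y)  # count of each label value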

Let's see what happens if we try to fit a model to this data:

(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=123)
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
0.841457528957529

The IAIBase documentation outlines multiple strategies that we can use to try to improve performance on unbalanced data. First, we can try the :autobalance option for sample_weight, which automatically adjusts the sample weights to balance the contribution of each label:

IAI.fit!(grid, train_X, train_y, sample_weight=:autobalance)
IAI.score(grid, test_X, test_y, criterion=:auc)
0.875

We can see this has improved the out-of-sample performance significantly.
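
Sample weights can also be supplied explicitly instead of using :autobalance. As a rough sketch, weighting each sample inversely to its class frequency achieves a similar balancing effect (these weights are illustrative, not necessarily the exact ones :autobalance computes):

# Weight each sample inversely to the frequency of its class
w0 = 1 / mean(train_y .== 0)
w1 = 1 / mean(train_y .== 1)
weights = [yi == 1 ? w1 : w0 for yi in train_y]
IAI.fit!(grid, train_X, train_y, sample_weight=weights)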

Another way to improve performance on unbalanced data is to use an alternative scoring criterion when training the model. We typically see better results on unbalanced data when using either gini impurity or entropy as the training criterion:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
0.8810328185328186

This approach has also increased the out-of-sample performance significantly.
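
To try entropy instead, the same grid search can be run with criterion=:entropy; this is left as a sketch since the resulting score depends on the run:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
        criterion=:entropy,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)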