Tips and Tricks
This page contains some tips and tricks for getting the best results out of Optimal Trees.
Parallelization

OptimalTrees.jl is set up to easily train trees in parallel across multiple processes or machines. For details, see the IAIBase documentation on parallelization.
Whenever OptimalTrees is training trees, it will automatically parallelize the training across all worker processes in the Julia session. Increasing the number of workers leads to a roughly linear speedup in training, so training with three worker processes will give a roughly 3x speedup.
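For example, worker processes can be added with Julia's standard `Distributed` library before training (a minimal sketch; the worker count here is illustrative, and the exact package-loading step may differ, so consult the IAIBase parallelization documentation):

```julia
using Distributed
addprocs(3)                      # add 3 worker processes (4 processes total)
@everywhere using OptimalTrees   # make the package available on every process
```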
Unbalanced Data

Imbalances in class labels can cause difficulties during model fitting. We will use the Climate Model Simulation Crashes dataset as an example:
```julia
using CSV, DataFrames
df = CSV.read("pop_failures.dat", DataFrame, delim=" ", ignorerepeated=true)
X = df[:, 3:20]
y = df[:, 21]
```
Taking a look at the target variable, we see the data is very unbalanced (91% of values are 1):
```julia
using Statistics
mean(y)
```
Let's see what happens if we try to fit a model to this data:
```julia
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=123)
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
The IAIBase documentation outlines multiple strategies that we can use to try to improve performance on unbalanced data. First, we can try the `:autobalance` option for `sample_weight` to automatically adjust and balance the label distribution:
```julia
IAI.fit!(grid, train_X, train_y, sample_weight=:autobalance)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
We can see this has improved the out-of-sample performance significantly.
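Conceptually, autobalancing assigns each point a weight inversely proportional to the frequency of its class, so that each class contributes equally in total. A rough sketch of such weights in plain Julia (this is an illustration of the idea, not the exact weights computed internally by IAI):

```julia
# Count occurrences of each label, then weight each point so that
# every class carries the same total weight.
counts = Dict(c => count(==(c), y) for c in unique(y))
sample_weight = [length(y) / (length(counts) * counts[c]) for c in y]
```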
Another approach we can use to improve performance in an unbalanced scenario is to use an alternative scoring criterion when training the model. Typically, we see better performance on unbalanced data when using either Gini impurity or entropy as the scoring criterion:
```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=123,
        criterion=:gini,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```
This approach has also increased the out-of-sample performance significantly.
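For reference, Gini impurity and entropy are the standard impurity measures over the class proportions at a node. A quick sketch in plain Julia (the function names here are illustrative, not part of the IAI API):

```julia
# Impurity measures for a vector of class proportions p (summing to 1)
gini(p) = 1 - sum(abs2, p)                           # 1 - Σ pₖ²
entropy(p) = -sum(pk * log(pk) for pk in p if pk > 0)  # -Σ pₖ log pₖ

gini([0.91, 0.09])     # ≈ 0.164, low impurity for an unbalanced node
entropy([0.5, 0.5])    # ≈ 0.693, maximal for two classes (natural log)
```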