# Tips and Tricks

This page contains some tips and tricks for getting the best results out of Optimal Trees.

## Parallelization

OptimalTrees.jl is set up to easily train trees in parallel across multiple processes or machines. For details see the IAIBase documentation on parallelization.

Whenever OptimalTrees is training trees it will automatically parallelize the training across all worker processes in the Julia session. Increasing the number of workers leads to a roughly linear speedup in training, so training with three workers (four processes) will give a roughly 4x speedup.

## Unbalanced Data

Imbalances in class labels can cause difficulties during model fitting. We will use the Climate Model Simulation Crashes dataset as an example:

```
using CSV
df = CSV.read("pop_failures.dat", delim=" ", ignorerepeated=true)
X = df[:, 3:20]
y = df[:, 21]
```

Taking a look at the target variable, we see the data is very unbalanced (91% of values are 1):

```
using Statistics
mean(y)
```

`0.9148148148148149`

Let's see what happens if we try to fit to this data

```
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
seed=123)
grid = IAI.GridSearch(
IAI.OptimalTreeClassifier(
random_seed=123,
),
max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```

`0.7654440154440154`

The IAIBase documentation outlines multiple strategies that we can use to try to improve performance on unbalanced data. First, we can try using the `:autobalance`

option for `sample_weight`

to automatically adjust and balance the label distribution:

```
IAI.fit!(grid, train_X, train_y, sample_weight=:autobalance)
IAI.score(grid, test_X, test_y, criterion=:auc)
```

`0.875`

We can see this has improved the out-of-sample performance significantly.

Another approach we can use to improve the performance in an unbalanced scenario is to use an alternative scoring criterion when training the model. Typically we see better performance on unbalanced data when using either gini impurity or entropy as the scoring criterion:

```
grid = IAI.GridSearch(
IAI.OptimalTreeClassifier(
random_seed=123,
criterion=:gini,
),
max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.score(grid, test_X, test_y, criterion=:auc)
```

`0.9085424710424711`

This approach has also increased the out-of-sample performance significantly.

## Different Regularization Schemes for Regression Trees

When running Optimal Regression Trees with linear regression predictions in the leaves (`:regression_sparsity`

set to `:all`

), the linear regression models are fit using regularization to limit the degree of overfitting. By default, the function that is minimized during training is

where $T$ is the tree, $X$ and $y$ are the training features and labels, respectively, $t$ are the leaves in the tree, and $\beta_t$ is the vector of regression coefficients in leaf $t$. In this way, the regularization applied in each leaf is a lasso penalty, and these are summed over the leaves to get the overall penalty. We are therefore penalizing the *total* complexity of the regression equations in the tree.

This regularization scheme is generally sufficient for fitting the regression equations in each leaf, as it only adds those regression coefficients that significantly improve the training error. However, there are classes of problems where this regularization limits the quality of the trees that can be found.

To illustrate this, consider the following univariate piecewise linear function:

Note that this is exactly a regression tree with a single split and univariate regression predictions in each leaf.

We can generate data according to this function:

```
using DataFrames
x = -2:0.025:2
X = DataFrame(x=x)
y = map(v -> v > 0 ? 11v : 10v, x)
```

We will apply Optimal Regression Trees to learn this function, with the hope that the splits in the tree will allow us to model the breakpoints.

```
grid = IAI.GridSearch(
IAI.OptimalTreeRegressor(
random_seed=1,
max_depth=1,
minbucket=10,
regression_sparsity=:all,
regression_lambda=0.001,
),
)
IAI.fit!(grid, X, y)
```

We see that the trained tree has no splits, preferring to just fit a single linear regression model across the entire domain, with the coefficient being roughly the average of 10 and 11. However, we know that the ideal model is really a tree with a single split at $x = 0$, with each leaf containing the appropriate coefficient (10 and 11, respectively)

The root cause of the tree having no splits is the regularization scheme we have applied: the regularization penalty applied to the tree with no splits is roughly 10.5, whereas the penalty applied to the ideal tree would be 21. This means that if all else is equal, the tree with no splits would be preferred by the training process and selected before the ideal tree. We can therefore see that splitting in the tree to refine the estimates of coefficients (e.g. refining 10.5 to 10 and 11) is actually penalized under the regularization scheme used by default.

We can resolve this by using an alternative regularization scheme that penalizes the *average* complexity of the regression equations in the tree instead of the total complexity:

where $n$ is the number of training points, and $n_t$ is the number of training points in leaf $t$. This new regularization scheme penalizes the objective by the average lasso penalty in each leaf, weighted by the number of points contained in each leaf to give more weight to those leaves with more points. We can see that under this alternative regularization scheme, both the tree with no splits and the ideal tree would have very similar regression penalties, and so the ideal tree could now be selected as it has a better training error.

To see this in action, we set `:regression_weighted_betas`

to `true`

to enable this alternative regularization scheme:

```
grid = IAI.GridSearch(
IAI.OptimalTreeRegressor(
random_seed=1,
max_depth=1,
minbucket=10,
regression_sparsity=:all,
regression_lambda=0.001,
regression_weighted_betas=true,
),
)
IAI.fit!(grid, X, y)
```

We can see that indeed this regularization scheme enabled us to find the ideal tree for this problem.

In general, each of the regularization schemes can be better suited to different problems, so it can be valuable to try out both approaches to see which works best for a given problem.