# Quick Start Guide: Optimal Regression Trees

In this example we will use Optimal Regression Trees (ORT) on the yacht hydrodynamics dataset. First we load in the data and split into training and test datasets:

```julia
using CSV, DataFrames
df = CSV.read(
    "yacht_hydrodynamics.data", DataFrame,
    delim=' ',            # file uses ' ' as separator rather than ','
    ignorerepeated=true,  # columns are sometimes separated by more than one ' '
    header=[:position, :prismatic, :length_displacement, :beam_draught,
            :length_beam, :froude, :resistance],
)
```
```
308×7 DataFrames.DataFrame. Omitted printing of 3 columns
│ Row │ position │ prismatic │ length_displacement │ beam_draught │
│     │ Float64  │ Float64   │ Float64             │ Float64      │
├─────┼──────────┼───────────┼─────────────────────┼──────────────┤
│ 1   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 2   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 3   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 4   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 5   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 6   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
│ 7   │ -2.3     │ 0.568     │ 4.78                │ 3.99         │
⋮
│ 301 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 302 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 303 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 304 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 305 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 306 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 307 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
│ 308 │ -2.3     │ 0.6       │ 4.34                │ 4.23         │
```
```julia
X = df[:, 1:(end - 1)]
y = df[:, end]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:regression, X, y, seed=1)
```
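
As a quick sanity check, the sizes of the two splits can be inspected. A sketch, assuming `split_data` uses its default 70/30 train/test proportion (which is consistent with the 92-element prediction vector shown later):

```julia
# Sanity check on the split (assumes the default 70/30 train/test proportion)
size(train_X, 1)  # number of training rows (roughly 70% of the 308 rows)
length(test_y)    # number of test rows (roughly 30% of the 308 rows)
```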

### Optimal Regression Trees

We will use a `GridSearch` to fit an `OptimalTreeRegressor`:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
    ),
    max_depth=1:5,
)
IAI.fit!(grid, train_X, train_y)
IAI.get_learner(grid)
```
Optimal Trees Visualization
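
Since the grid search tried `max_depth` values from 1 to 5, it can be useful to see which depth was selected. A sketch, assuming the `IAI.get_best_params` accessor from the IAI API:

```julia
# Inspect the parameter combination chosen by the grid search
IAI.get_best_params(grid)  # e.g. a mapping from :max_depth to the chosen value
```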

We can make predictions on new data using `predict`:

``IAI.predict(grid, test_X)``
```
92-element Array{Float64,1}:
0.5922368421052632
2.2875
22.067500000000003
0.5922368421052632
0.5922368421052632
4.673928571428571
57.50000000000001
0.5922368421052632
0.5922368421052632
12.98
⋮
8.080666666666668
0.5922368421052632
0.5922368421052632
0.5922368421052632
8.080666666666668
33.5576923076923
0.5922368421052632
2.2875
4.673928571428571
```

We can evaluate the quality of the tree using `score` with any of the supported loss functions. For example, the $R^2$ on the training set:

``IAI.score(grid, train_X, train_y, criterion=:mse)``
``0.9941445278755715``

Or on the test set:

``IAI.score(grid, test_X, test_y, criterion=:mse)``
``0.9917006908856699``
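
For regression, the score reported under the `:mse` criterion corresponds to $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$, so the test score above can be reproduced by hand. A hedged sketch, assuming `score` normalizes the model's squared error against the mean-prediction baseline:

```julia
using Statistics: mean

# Recompute the test R² directly from the model's predictions
pred = IAI.predict(grid, test_X)
sse = sum((test_y .- pred) .^ 2)          # squared error of the tree
sst = sum((test_y .- mean(test_y)) .^ 2)  # squared error of predicting the mean
r2 = 1 - sse / sst                        # should match the score above
```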

### Optimal Regression Trees with Hyperplanes

To use Optimal Regression Trees with hyperplane splits (ORT-H), you should set the `hyperplane_config` parameter:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        hyperplane_config=(sparsity=:all,),
    ),
    max_depth=1:4,
)
IAI.fit!(grid, train_X, train_y)
IAI.get_learner(grid)
```
Optimal Trees Visualization

Now we can find the performance on the test set with hyperplanes:

``IAI.score(grid, test_X, test_y, criterion=:mse)``
``0.9877064480736487``

It looks like the addition of hyperplane splits did not improve performance here. The main variable affecting the target appears to be `froude`, so allowing multiple variables per split is perhaps not that useful for this dataset.

### Optimal Regression Trees with Linear Predictions

To use Optimal Regression Trees with linear regression in the leaves (ORT-L), you should set the `regression_sparsity` parameter to `:all` and use the `regression_lambda` parameter to control the degree of regularization.

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        max_depth=2,
        regression_sparsity=:all,
    ),
    regression_lambda=[0.0005, 0.001, 0.005],
)
IAI.fit!(grid, train_X, train_y)
IAI.get_learner(grid)
```
Optimal Trees Visualization
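
As with the previous models, we can check which `regression_lambda` value the grid search selected and evaluate the resulting ORT-L tree on the test set. A sketch following the same pattern as above (`IAI.get_best_params` is assumed from the IAI API; the resulting score is not reproduced here):

```julia
# Inspect the selected regularization strength and score the final model
IAI.get_best_params(grid)
IAI.score(grid, test_X, test_y, criterion=:mse)
```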